Re: seeing this message repeatedly.

2016-09-03 Thread kant kodali
I don't think my driver program, which is running on my local machine, can
connect to the worker/executor machines, because the Spark UI lists private IPs for
the worker machines. I can connect to the master from the driver because of this
setting: export SPARK_PUBLIC_DNS="52.44.36.224". I'm really not sure how to fix this
or what I am missing.
Any help would be great. Thanks!
 





On Sat, Sep 3, 2016 5:39 PM, kant kodali kanth...@gmail.com
wrote:
Hi Guys,
I am running my driver program on my local machine and my Spark cluster is on
AWS. The big question is that I don't know what the right settings are to get around
this public and private IP thing on AWS. My spark-env.sh currently has the
following lines:
export SPARK_PUBLIC_DNS="52.44.36.224"
export SPARK_WORKER_CORES=12
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
I am seeing the lines below when I run my driver program on my local machine.
Not sure what is going on?


16/09/03 17:32:15 INFO DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at start at Consumer.java:41)
16/09/03 17:32:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 50 tasks
16/09/03 17:32:30 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/09/03 17:32:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
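
For reference, warnings like the above usually mean the executors cannot connect back to the driver (or the driver cannot reach the addresses the workers advertise). Below is a minimal PySpark sketch of the driver-side settings that are commonly pinned when a local driver talks to a remote standalone cluster; the master address is taken from this thread, while the driver host and the port numbers are placeholders, and the same ports still have to be opened in the AWS security group (SPARK_PUBLIC_DNS / SPARK_LOCAL_IP on each worker control what the workers advertise):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://52.44.36.224:7077")           # standalone master, public address
        .setAppName("local-driver-connectivity-check")
        .set("spark.driver.host", "<laptop-public-ip>")   # address the workers use to call back
        .set("spark.driver.port", "51000")                # pin the ports so they can be opened
        .set("spark.blockManager.port", "51100"))         # in the firewall / security group

sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).count())  # stays stuck with the same warning if executors still cannot connect
sc.stop()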

Re: Passing Custom App Id for consumption in History Server

2016-09-03 Thread ayan guha
How about this:

1. You create a primary key in your custom system.
2. Schedule the job with that custom primary key as the job name.
3. After setting up the Spark context (inside the job), get the application id,
then save the mapping of app name & app id from the Spark job to your custom
database through some web service (see the sketch below).
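
A rough PySpark sketch of steps 2 and 3 (the metadata-service URL and payload are hypothetical; sc.applicationId is the only Spark-specific piece):

import json
import urllib2  # Python 2, matching the era of this thread

from pyspark import SparkContext

sc = SparkContext(appName="my-scheduler-key-12345")  # step 2: custom primary key as the app name

mapping = {"appName": sc.appName, "appId": sc.applicationId}  # step 3: read the generated app id
req = urllib2.Request("http://metadata-service.example.com/jobs",  # hypothetical web service
                      json.dumps(mapping),
                      {"Content-Type": "application/json"})
urllib2.urlopen(req)  # persist the name -> id mapping outside Spark
sc.stop()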



On Sun, Sep 4, 2016 at 12:30 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> Default implementation is to add milliseconds. For mesos it is
> framework-id. If you are using mesos, you can assume that your framework id
> used to register your app is same as app-id.
> As you said, you have a system application to schedule spark jobs, you can
> keep track of framework-ids submitted by your application, you can use the
> same info.
>
> On Fri, Sep 2, 2016 at 6:29 PM, Amit Shanker 
> wrote:
>
>> Currently Spark sets current time in Milliseconds as the app Id. Is there
>> a way one can pass in the app id to the spark job, so that it uses this
>> provided app id instead of generating one using time?
>>
>> Lets take the following scenario : I have a system application which
>> schedules spark jobs, and records the metadata for that job (say job
>> params, cores, etc). In this system application, I want to link every job
>> with its corresponding UI (history server). The only way I can do this is
>> if I have the app Id of that job stored in this system application. And the
>> only way one can get the app Id is by using the
>> SparkContext.getApplicationId() function - which needs to be run from
>> inside the job. So, this make it difficult to convey this piece of
>> information from spark to a system outside spark.
>>
>> Thanks,
>> Amit Shanker
>>
>
>


-- 
Best Regards,
Ayan Guha


seeing this message repeatedly.

2016-09-03 Thread kant kodali

Hi Guys,
I am running my driver program on my local machine and my Spark cluster is on
AWS. The big question is that I don't know what the right settings are to get around
this public and private IP thing on AWS. My spark-env.sh currently has the
following lines:
export SPARK_PUBLIC_DNS="52.44.36.224"
export SPARK_WORKER_CORES=12
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

I am seeing the lines below when I run my driver program on my local machine.
Not sure what is going on?


16/09/03 17:32:15 INFO DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at start at Consumer.java:41)
16/09/03 17:32:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 50 tasks
16/09/03 17:32:30 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/09/03 17:32:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Re: any idea what this error could be?

2016-09-03 Thread Fridtjof Sander
I see. The default scala version changed to 2.11 with Spark 2.0.0 afaik, so 
that's probably the version you get when downloading prepackaged binaries. Glad 
I could help ;)

On 3 September 2016 23:59:51 MESZ, kant kodali wrote:
>@Fridtjof you are right!
>changing it to this fixed it!
>compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
>compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
>
>
>
>
>
>
>
>On Sat, Sep 3, 2016 12:30 PM, kant kodali kanth...@gmail.com
>wrote:
>I increased the memory but nothing has changed; I still get the same error.
>@Fridtjof, on my driver side I am using the following dependencies:
>compile group: 'org.apache.spark', name: 'spark-core_2.10', version: '2.0.0'
>compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version: '2.0.0'
>On the executor side I don't know what jars are being used, but I have
>installed Spark using this tarball: spark-2.0.0-bin-hadoop2.7.tgz
> 
>
>
>
>
>
>On Sat, Sep 3, 2016 4:20 AM, Fridtjof Sander
>fridtjof.san...@googlemail.com
>wrote:
>There is an InvalidClassException complaining about non-matching
>serialVersionUIDs. Shouldn't that be caused by different jars on
>executors and
>driver?
>
>
>On 03.09.2016 1:04 PM, "Tal Grynbaum" wrote:
>My guess is that you're running out of memory somewhere.  Try to
>increase the
>driver memory and/or executor memory.   
>
>
>On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:
>I am running this on aws.
>
> 
>
>
>
>
>
>On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com
>wrote:
>I am running Spark in standalone mode. I get this error when I run my driver
>program. I am using Spark 2.0.0. Any idea what this error could be?
>
>
>Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
>16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
>16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
>16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
>16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
>16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
>16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kantkodali); groups with view permissions: Set(); users with modify permissions: Set(kantkodali); groups with modify permissions: Set()
>16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on port 62256.
>16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
>16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
>16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
>16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
>16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
>16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
>16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.191:4040
>16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://52.43.37.223:7077...
>16/09/02 23:44:46 INFO TransportClientFactory: Successfully created connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
>16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 52.43.37.223:7077
>org.apache.spark.SparkException: Exception thrown in awaitResult
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>    at 

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
@Fridtjof you are right!
changing it to this fixed it!
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'







On Sat, Sep 3, 2016 12:30 PM, kant kodali kanth...@gmail.com
wrote:
I increased the memory but nothing has changed; I still get the same error.
@Fridtjof, on my driver side I am using the following dependencies:
compile group: 'org.apache.spark', name: 'spark-core_2.10', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version: '2.0.0'
On the executor side I don't know what jars are being used, but I have installed
Spark using this tarball: spark-2.0.0-bin-hadoop2.7.tgz
 





On Sat, Sep 3, 2016 4:20 AM, Fridtjof Sander fridtjof.san...@googlemail.com
wrote:
There is an InvalidClassException complaining about non-matching
serialVersionUIDs. Shouldn't that be caused by different jars on executors and
driver?


On 03.09.2016 1:04 PM, "Tal Grynbaum" wrote:
My guess is that you're running out of memory somewhere.  Try to increase the
driver memory and/or executor memory.   


On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:
I am running this on aws.

 





On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com
wrote:
I am running Spark in standalone mode. I get this error when I run my driver
program. I am using Spark 2.0.0. Any idea what this error could be?


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kantkodali); groups with view permissions: Set(); users with modify permissions: Set(kantkodali); groups with modify permissions: Set()
16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on port 62256.
16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.191:4040
16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://52.43.37.223:7077...
16/09/02 23:44:46 INFO TransportClientFactory: Successfully created connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 52.43.37.223:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID = -5447855329526097695
    at 

Creating RDD using swebhdfs with truststore

2016-09-03 Thread Sourav Mazumder
Hi,

I am trying to create a RDD by using swebhdfs to a remote hadoop cluster
which is protected by Knox and uses SSL.

The code looks like this -

sc.textFile("swebhdfs:/host:port/gateway/default/webhdfs/v1/").count.

I'm passing the truststore and trustorepassword through extra java options
while starting the spark shell as -

spark-shell --conf
"spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=truststor.jks
-Djavax.net.ssl.trustStorePassword=" --conf
"spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=truststore.jks
-Djavax.net.ssl.trustStorePassword="

But I'm always getting the error that -

Name: javax.net.ssl.SSLHandshakeException
Message: Remote host closed connection during handshake

Am I passing the truststore and truststore password in the right way?

Regards,

Sourav


Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Mich Talebzadeh
Mine is Hbase-0.98,

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 September 2016 at 20:51, Benjamin Kim  wrote:

> I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these
> versions?
>
> On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh 
> wrote:
>
> I am trying to find a solution for this
>
> ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
>
> I am using Spark 2 and Hive 2!
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 3 September 2016 at 20:31, Benjamin Kim  wrote:
>
>> Mich,
>>
>> I’m in the same boat. We can use Hive but not Spark.
>>
>> Cheers,
>> Ben
>>
>> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> You can create Hive external  tables on top of existing Hbase table using
>> the property
>>
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>
>> Example
>>
>> hive> show create table hbase_table;
>> OK
>> CREATE TABLE `hbase_table`(
>>   `key` int COMMENT '',
>>   `value1` string COMMENT '',
>>   `value2` int COMMENT '',
>>   `value3` int COMMENT '')
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
>> STORED BY
>>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES (
>>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>>   'serialization.format'='1')
>> TBLPROPERTIES (
>>   'transient_lastDdlTime'='1472370939')
>>
>>  Then try to access this Hive table from Spark which is giving me grief
>> at the moment :(
>>
>> scala> HiveContext.sql("use test")
>> res9: org.apache.spark.sql.DataFrame = []
>> scala> val hbase_table= spark.table("hbase_table")
>> 16/09/02 23:31:07 ERROR log: error in initSerDe:
>> java.lang.ClassNotFoundException Class 
>> org.apache.hadoop.hive.hbase.HBaseSerDe
>> not found
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 September 2016 at 23:08, KhajaAsmath Mohammed <
>> mdkhajaasm...@gmail.com> wrote:
>>
>>> Hi Kim,
>>>
>>> I am also looking for same information. Just got the same requirement
>>> today.
>>>
>>> Thanks,
>>> Asmath
>>>
>>> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  wrote:
>>>
 I was wondering if anyone has tried to create Spark SQL tables on top
 of HBase tables so that data in HBase can be accessed using Spark
 Thriftserver with SQL statements? This is similar what can be done using
 Hive.

 Thanks,
 Ben


 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>
>


Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these versions?

> On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh  
> wrote:
> 
> I am trying to find a solution for this
> 
> ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> I am using Spark 2 and Hive 2!
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 3 September 2016 at 20:31, Benjamin Kim  > wrote:
> Mich,
> 
> I’m in the same boat. We can use Hive but not Spark.
> 
> Cheers,
> Ben
> 
>> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh > > wrote:
>> 
>> Hi,
>> 
>> You can create Hive external  tables on top of existing Hbase table using 
>> the property
>> 
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> 
>> Example
>> 
>> hive> show create table hbase_table;
>> OK
>> CREATE TABLE `hbase_table`(
>>   `key` int COMMENT '',
>>   `value1` string COMMENT '',
>>   `value2` int COMMENT '',
>>   `value3` int COMMENT '')
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
>> STORED BY
>>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES (
>>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>>   'serialization.format'='1')
>> TBLPROPERTIES (
>>   'transient_lastDdlTime'='1472370939')
>> 
>>  Then try to access this Hive table from Spark which is giving me grief at 
>> the moment :(
>> 
>> scala> HiveContext.sql("use test")
>> res9: org.apache.spark.sql.DataFrame = []
>> scala> val hbase_table= spark.table("hbase_table")
>> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
>> java.lang.ClassNotFoundException Class 
>> org.apache.hadoop.hive.hbase.HBaseSerDe not found
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 September 2016 at 23:08, KhajaAsmath Mohammed > > wrote:
>> Hi Kim,
>> 
>> I am also looking for same information. Just got the same requirement today.
>> 
>> Thanks,
>> Asmath
>> 
>> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim > > wrote:
>> I was wondering if anyone has tried to create Spark SQL tables on top of 
>> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
>> with SQL statements? This is similar what can be done using Hive.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> 
>> 
>> 
>> 
> 
> 



Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Mich Talebzadeh
I am trying to find a solution for this

ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found

I am using Spark 2 and Hive 2!

HTH
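
One thing worth checking: a ClassNotFoundException for HBaseSerDe usually means the Hive HBase storage handler and the HBase client jars are not on Spark's classpath. A rough sketch of the workaround (jar names and paths are assumptions -- use whatever your Hive/HBase installation ships):

# launch with the handler and HBase jars, e.g.:
#   pyspark --jars /opt/hive/lib/hive-hbase-handler-2.0.0.jar,\
#                  /opt/hbase/lib/hbase-client-1.2.0.jar,\
#                  /opt/hbase/lib/hbase-common-1.2.0.jar,\
#                  /opt/hbase/lib/hbase-protocol-1.2.0.jar
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("use test")
spark.table("hbase_table").show(5)  # should now resolve the HBase-backed Hive table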



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 3 September 2016 at 20:31, Benjamin Kim  wrote:

> Mich,
>
> I’m in the same boat. We can use Hive but not Spark.
>
> Cheers,
> Ben
>
> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> You can create Hive external  tables on top of existing Hbase table using
> the property
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>
> Example
>
> hive> show create table hbase_table;
> OK
> CREATE TABLE `hbase_table`(
>   `key` int COMMENT '',
>   `value1` string COMMENT '',
>   `value2` int COMMENT '',
>   `value3` int COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1472370939')
>
>  Then try to access this Hive table from Spark which is giving me grief at
> the moment :(
>
> scala> HiveContext.sql("use test")
> res9: org.apache.spark.sql.DataFrame = []
> scala> val hbase_table= spark.table("hbase_table")
> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException
> Class org.apache.hadoop.hive.hbase.HBaseSerDe not found
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 23:08, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi Kim,
>>
>> I am also looking for same information. Just got the same requirement
>> today.
>>
>> Thanks,
>> Asmath
>>
>> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  wrote:
>>
>>> I was wondering if anyone has tried to create Spark SQL tables on top of
>>> HBase tables so that data in HBase can be accessed using Spark Thriftserver
>>> with SQL statements? This is similar what can be done using Hive.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>


Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
Mich,

I’m in the same boat. We can use Hive but not Spark.

Cheers,
Ben

> On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> You can create Hive external  tables on top of existing Hbase table using the 
> property
> 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> 
> Example
> 
> hive> show create table hbase_table;
> OK
> CREATE TABLE `hbase_table`(
>   `key` int COMMENT '',
>   `value1` string COMMENT '',
>   `value2` int COMMENT '',
>   `value3` int COMMENT '')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
>   'hbase.columns.mapping'=':key,a:b,a:c,d:e',
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'transient_lastDdlTime'='1472370939')
> 
>  Then try to access this Hive table from Spark which is giving me grief at 
> the moment :(
> 
> scala> HiveContext.sql("use test")
> res9: org.apache.spark.sql.DataFrame = []
> scala> val hbase_table= spark.table("hbase_table")
> 16/09/02 23:31:07 ERROR log: error in initSerDe: 
> java.lang.ClassNotFoundException Class 
> org.apache.hadoop.hive.hbase.HBaseSerDe not found
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 23:08, KhajaAsmath Mohammed  > wrote:
> Hi Kim,
> 
> I am also looking for same information. Just got the same requirement today.
> 
> Thanks,
> Asmath
> 
> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  > wrote:
> I was wondering if anyone has tried to create Spark SQL tables on top of 
> HBase tables so that data in HBase can be accessed using Spark Thriftserver 
> with SQL statements? This is similar what can be done using Hive.
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 
> 



Re: any idea what this error could be?

2016-09-03 Thread kant kodali
I increased the memory but nothing has changed; I still get the same error.
@Fridtjof, on my driver side I am using the following dependencies:
compile group: 'org.apache.spark', name: 'spark-core_2.10', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.10', version: '2.0.0'
On the executor side I don't know what jars are being used, but I have installed
Spark using this tarball: spark-2.0.0-bin-hadoop2.7.tgz
 





On Sat, Sep 3, 2016 4:20 AM, Fridtjof Sander fridtjof.san...@googlemail.com
wrote:
There is an InvalidClassException complaining about non-matching
serialVersionUIDs. Shouldn't that be caused by different jars on executors and
driver?


On 03.09.2016 1:04 PM, "Tal Grynbaum" wrote:
My guess is that you're running out of memory somewhere.  Try to increase the
driver memory and/or executor memory.   


On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:
I am running this on aws.

 





On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com
wrote:
I am running Spark in standalone mode. I get this error when I run my driver
program. I am using Spark 2.0.0. Any idea what this error could be?


Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kantkodali); groups with view permissions: Set(); users with modify permissions: Set(kantkodali); groups with modify permissions: Set()
16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on port 62256.
16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.191:4040
16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://52.43.37.223:7077...
16/09/02 23:44:46 INFO TransportClientFactory: Successfully created connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 52.43.37.223:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID = -5447855329526097695
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:

Re: Help with Jupyter Notebook Setup on CDH using Anaconda

2016-09-03 Thread Marco Mistroni
Hi
  please paste the exception.
For Spark with Jupyter, you might want to sign up for this:
it'll give you Jupyter and Spark... and presumably the spark-csv package is already
part of it?

https://community.cloud.databricks.com/login.html

hth
 marco
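
One workaround that often helps when --packages is not picked up behind a notebook is to pass the coordinates through PYSPARK_SUBMIT_ARGS (or --jars pointing at the locally installed spark-csv jar) before the SparkContext is created. A rough Python sketch, assuming the notebook creates its own context; the CSV path is a placeholder:

import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell")

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-schema-test")
sqlContext = SQLContext(sc)
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/tmp/sample.csv"))  # placeholder path
df.printSchema()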



On Sat, Sep 3, 2016 at 8:10 PM, Arif,Mubaraka  wrote:

> On the on-premise *Cloudera Hadoop 5.7.2* I have installed the anaconda
> package and trying to *setup Jupyter notebook *to work with spark1.6.
>
>
>
> I have ran into problems when I trying to use the package
> *com.databricks:spark-csv_2.10:1.4.0* for *reading and inferring the
> schema of the csv file using python spark*.
>
>
>
> I have installed the* jar file - spark-csv_2.10-1.4.0.jar *in
> */var/opt/teradata/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jar* and c
> *onfigurations* are set as  :
>
>
>
> export PYSPARK_DRIVER_PYTHON=/var/opt/teradata/cloudera/parcels/
> Anaconda-4.0.0/bin/jupyter
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False
> --NotebookApp.ip='*' --NotebookApp.port=8083"
> export PYSPARK_PYTHON=/var/opt/teradata/cloudera/parcels/
> Anaconda-4.0.0/bin/python
>
>
>
> When I run pyspark from the command line with packages option, like :
>
>
>
> *$pyspark --packages com.databricks:spark-csv_2.10:1.4.0 *
>
>
>
> It throws the error and fails to recognize the added dependency.
>
>
>
> Any ideas on how to resolve this error is much appreciated.
>
>
>
> Also, any ideas on the experience in installing and running Jupyter
> notebook with anaconda and spark please share.
>
>
>
> thanks,
>
> Muby
>
>
>
>
> - To
> unsubscribe e-mail: user-unsubscr...@spark.apache.org


Help with Jupyter Notebook Setup on CDH using Anaconda

2016-09-03 Thread Arif,Mubaraka




On the on-premise Cloudera Hadoop 5.7.2 cluster I have installed the Anaconda package and am trying to
set up Jupyter notebook to work with Spark 1.6.

I have run into problems when trying to use the package
com.databricks:spark-csv_2.10:1.4.0 for
reading and inferring the schema of a CSV file using Python Spark.

I have installed the jar file spark-csv_2.10-1.4.0.jar
in /var/opt/teradata/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/jar
and the configurations are set as:
 
export PYSPARK_DRIVER_PYTHON=/var/opt/teradata/cloudera/parcels/Anaconda-4.0.0/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8083"
export PYSPARK_PYTHON=/var/opt/teradata/cloudera/parcels/Anaconda-4.0.0/bin/python
 
When I run pyspark from the command line with the --packages option, like:

$pyspark --packages com.databricks:spark-csv_2.10:1.4.0

it throws an error and fails to recognize the added dependency.

Any ideas on how to resolve this error are much appreciated.

Also, please share any experience with installing and running Jupyter notebook with Anaconda and Spark.
 
thanks,
Muby
 
 




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-03 Thread Diwakar Dhanuskodi
Please run with -X and post the logs here. We can get the exact error from them.

On Sat, Sep 3, 2016 at 7:24 PM, Marco Mistroni  wrote:

> hi all
>
>  i am getting failures when building spark 2.0 on Ubuntu 16.06
> Here's details of what i have installed on the ubuntu host
> -  java 8
> - scala 2.11
> - git
>
> When i launch the command
>
> ./build/mvn  -Pyarn -Phadoop-2.7  -DskipTests clean package
>
> everything compiles sort of fine and at the end i get this exception
>
> INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 01:05 h
> [INFO] Finished at: 2016-09-03T13:25:27+00:00
> [INFO] Final Memory: 57M/208M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) on project spark-streaming_2.11: Execution
> scala-test-compile-first of goal 
> net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
> failed. CompileFailed -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/
> PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-streaming_2.11
>
> could anyone help?
>
> kr
>


Re: Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Gavin Yue
Any shuffling? 


> On Sep 3, 2016, at 5:50 AM, Сергей Романов  wrote:
> 
> Same problem happens with CSV data file, so it's not parquet-related either.
> 
> 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> 
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
> >>> import timeit
> >>> from pyspark.sql.types import *
> >>> schema = StructType([StructField('dd_convs', FloatType(), True)])
> >>> for x in range(50, 70): print x, 
> >>> timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
> >>> schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
> 50 0.372850894928
> 51 0.376906871796
> 52 0.381325960159
> 53 0.385444164276
> 54 0.38685192
> 55 0.388918161392
> 56 0.397624969482
> 57 0.391713142395
> 58 2.62714004517
> 59 2.68421196938
> 60 2.74627685547
> 61 2.81081581116
> 62 3.43532109261
> 63 3.07742786407
> 64 3.03904604912
> 65 3.01616096497
> 66 3.06293702126
> 67 3.09386610985
> 68 3.27610206604
> 69 3.2041969299
> 
> Saturday, September 3, 2016, 15:40 +03:00, from Сергей Романов:
> 
> Hi,
> 
> I had narrowed down my problem to a very simple case. I'm sending 27kb 
> parquet in attachment. (file:///data/dump/test2 in example)
> 
> Please, can you take a look at it? Why there is performance drop after 57 sum 
> columns?
> 
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
> 
> Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
> SparkSession available as 'spark'.
> >>> import timeit
> >>> for x in range(70): print x, 
> >>> timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs']
> >>>  * x) ).collect, number=1)
> ... 
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> 0 1.05591607094
> 1 0.200426101685
> 2 0.203800916672
> 3 0.176458120346
> 4 0.184863805771
> 5 0.232321023941
> 6 0.216032981873
> 7 0.201778173447
> 8 0.292424917221
> 9 0.228524923325
> 10 0.190534114838
> 11 0.197028160095
> 12 0.270443916321
> 13 0.429781913757
> 14 0.270851135254
> 15 0.776989936829
> 16 0.27879181
> 17 0.227638959885
> 18 0.212944030762
> 19 0.2144780159
> 20 0.22200012207
> 21 0.262261152267
> 22 0.254227876663
> 23 0.275084018707
> 24 0.292124032974
> 25 0.280488014221
> 16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
> 26 0.290093898773
> 27 0.238478899002
> 28 0.246420860291
> 29 0.241401195526
> 30 0.255286931992
> 31 0.42702794075
> 32 0.327946186066
> 33 0.434395074844
> 34 0.314198970795
> 35 0.34576010704
> 36 0.278323888779
> 37 0.289474964142
> 38 0.290827989578
> 39 0.376291036606
> 40 0.347742080688
> 41 0.363158941269
> 42 0.318687915802
> 43 0.376327991486
> 44 0.374994039536
> 45 0.362971067429
> 46 0.425967931747
> 47 0.370860099792
> 48 0.443903923035
> 49 0.374128103256
> 50 0.378985881805
> 51 0.476850986481
> 52 0.451028823853
> 53 0.432540893555
> 54 0.514838933945
> 55 0.53990483284
> 56 0.449142932892
> 57 0.465240001678 // 5x slower after 57 columns
> 58 2.40412116051
> 59 2.41632795334
> 60 2.41812801361
> 61 2.55726218224
> 62 2.55484509468
> 63 2.56128406525
> 64 2.54642391205
> 65 2.56381797791
> 66 2.56871509552
> 67 2.66187620163
> 68 2.63496208191
> 69 2.8154599
> 
> 
> 
> Sergei Romanov
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> Sergei Romanov
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Kapil Malik
Thanks Raghavendra :)
Will look into Analyzer as well.


Kapil Malik
*Sr. Principal Engineer | Data Platform, Technology*
M: +91 8800836581 | T: 0124-433 | EXT: 20910
ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
Gurgaon | Haryana | India

*Disclaimer:* This communication is for the sole use of the addressee and
is confidential and privileged information. If you are not the intended
recipient of this communication, you are prohibited from disclosing it and
are required to delete it forthwith. Please note that the contents of this
communication do not necessarily represent the views of Jasper Infotech
Private Limited ("Company"). E-mail transmission cannot be guaranteed to be
secure or error-free as information could be intercepted, corrupted, lost,
destroyed, arrive late or incomplete, or contain viruses. The Company,
therefore, does not accept liability for any loss caused due to this
communication. *Jasper Infotech Private Limited, Registered Office: 1st
Floor, Plot 238, Okhla Industrial Estate, New Delhi - 110020 INDIA CIN:
U72300DL2007PTC168097*


On Sat, Sep 3, 2016 at 7:27 PM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:

> Kapil -- I afraid you need to plugin your own SessionCatalog as
> ResolveRelations class depends on that. To keep up with consistent design
> you may like to implement ExternalCatalog as well.
> You can also look to plug in your own Analyzer class to give your more
> flexibility. Ultimately that is where all Relations get resolved from
> SessionCatalog.
>
> On Sat, Sep 3, 2016 at 5:49 PM, Kapil Malik 
> wrote:
>
>> Hi all,
>>
>> I have a Spark SQL 1.6 application in production which does following on
>> executing sqlContext.sql(...) -
>> 1. Identify the table-name mentioned in query
>> 2. Use an external database to decide where's the data located, in which
>> format (parquet or csv or jdbc) etc.
>> 3. Load the dataframe
>> 4. Register it as temp table (for future calls to this table)
>>
>> This is achieved by extending HiveContext, and correspondingly
>> HiveCatalog. I have my own implementation of trait "Catalog", which
>> over-rides the "lookupRelation" method to do the magic behind the scenes.
>>
>> However, in spark 2.0, I can see following -
>> SessionCatalog - which contains lookupRelation method, but doesn't have
>> any interface / abstract class to it.
>> ExternalCatalog - which deals with CatalogTable instead of Df /
>> LogicalPlan.
>> Catalog - which also doesn't expose any method to lookup Df / LogicalPlan.
>>
>> So apparently it looks like I need to extend SessionCatalog only.
>> However, just wanted to get a feedback on if there's a better /
>> recommended approach to achieve this.
>>
>>
>> Thanks and regards,
>>
>>
>> Kapil Malik
>> *Sr. Principal Engineer | Data Platform, Technology*
>> M: +91 8800836581 | T: 0124-433 | EXT: 20910
>> ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
>> Gurgaon | Haryana | India
>>
>> *Disclaimer:* This communication is for the sole use of the addressee
>> and is confidential and privileged information. If you are not the intended
>> recipient of this communication, you are prohibited from disclosing it and
>> are required to delete it forthwith. Please note that the contents of this
>> communication do not necessarily represent the views of Jasper Infotech
>> Private Limited ("Company"). E-mail transmission cannot be guaranteed to be
>> secure or error-free as information could be intercepted, corrupted, lost,
>> destroyed, arrive late or incomplete, or contain viruses. The Company,
>> therefore, does not accept liability for any loss caused due to this
>> communication. *Jasper Infotech Private Limited, Registered Office: 1st
>> Floor, Plot 238, Okhla Industrial Estate, New Delhi - 110020 INDIA CIN:
>> U72300DL2007PTC168097*
>>
>>
>


Re: Passing Custom App Id for consumption in History Server

2016-09-03 Thread Raghavendra Pandey
The default implementation is to use the current time in milliseconds. For Mesos it is
the framework-id. If you are using Mesos, you can assume that the framework id
used to register your app is the same as the app-id.
Since, as you said, you have a system application to schedule Spark jobs, you can
keep track of the framework-ids submitted by your application and use the
same info.

On Fri, Sep 2, 2016 at 6:29 PM, Amit Shanker 
wrote:

> Currently Spark sets current time in Milliseconds as the app Id. Is there
> a way one can pass in the app id to the spark job, so that it uses this
> provided app id instead of generating one using time?
>
> Lets take the following scenario : I have a system application which
> schedules spark jobs, and records the metadata for that job (say job
> params, cores, etc). In this system application, I want to link every job
> with its corresponding UI (history server). The only way I can do this is
> if I have the app Id of that job stored in this system application. And the
> only way one can get the app Id is by using the
> SparkContext.getApplicationId() function - which needs to be run from
> inside the job. So, this make it difficult to convey this piece of
> information from spark to a system outside spark.
>
> Thanks,
> Amit Shanker
>


Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
If the file is not splittable (can I assume the log file is splittable,
though?), can you advise on how Spark handles such a case? If Spark can't split it,
what is the widely used practice?
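
For what it's worth, a rough PySpark sketch of the common pattern: a splittable text file is spread across executors by the input format itself, while a non-splittable file (e.g. a single gzip) is read by one task and then repartitioned so the rest of the job runs in parallel. Paths and partition counts below are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="large-file-demo")

# splittable input (plain text / CSV / TSV): minPartitions hints how finely to split it
lines = sc.textFile("hdfs:///data/big.log", minPartitions=64)

# non-splittable input (a single .gz): read as one partition, then fan out for downstream work
gz = sc.textFile("hdfs:///data/big.log.gz").repartition(64)

print(lines.getNumPartitions(), gz.getNumPartitions())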

On 3 Sep 2016 7:29 pm, "Raghavendra Pandey" 
wrote:

If your file format is splittable say TSV, CSV etc, it will be distributed
across all executors.

On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar  wrote:

> Hi All,
>
>
>
> Would like to gain some understanding on the questions listed below,
>
>
>
> 1.   When processing a large file with Apache Spark, with, say,
> sc.textFile("somefile.xml"), does it split it for parallel processing
> across executors or, will it be processed as a single chunk in a single
> executor?
>
> 2.   When using dataframes, with implicit XMLContext from Databricks
> is there any optimization prebuilt for such large file processing?
>
>
>
> Please help!!!
>
>
>
> http://stackoverflow.com/questions/39305310/does-spark-proce
> ss-large-file-in-the-single-worker
>
>
>
> Regards,
>
> Somasundaram S
>


Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-03 Thread Cody Koeninger
Not built in, you're going to have to do some work.
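
For the "resume at 12PM from the last offset" part, the piece that is built in is the fromOffsets argument of the direct stream: each time the context is (re)started you decide which topic-partitions to include and from which offsets. A rough PySpark sketch for Spark 1.6 / Kafka 0.9; topic name, partitions, broker and offsets are placeholders, and the offsets are assumed to be stored by your own code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

sc = SparkContext(appName="pausable-direct-stream")
ssc = StreamingContext(sc, 10)

# include only the partitions wanted for this run, starting from offsets you persisted yourself
fromOffsets = {
    TopicAndPartition("events", 0): long(12345),  # Python 2 long offsets
    TopicAndPartition("events", 2): long(67890),
}

stream = KafkaUtils.createDirectStream(
    ssc,
    ["events"],
    {"metadata.broker.list": "broker1:9092"},
    fromOffsets=fromOffsets)

stream.pprint()
ssc.start()
# to "pause", stop gracefully, persist the latest offsets, and start a new run later:
# ssc.stop(stopSparkContext=True, stopGraceFully=True)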

On Sep 2, 2016 16:33, "sagarcasual ."  wrote:

> Hi Cody, thanks for the reply.
> I am using Spark 1.6.1 with Kafka 0.9.
> When I want to stop streaming, stopping the context sounds ok, but for
> temporarily excluding partitions is there any way I can supply
> topic-partition info on the fly at the beginning of every pull dynamically.
> Will streaminglistener be of any help?
>
> On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger 
> wrote:
>
>> If you just want to pause the whole stream, just stop the app and then
>> restart it when you're ready.
>>
>> If you want to do some type of per-partition manipulation, you're
>> going to need to write some code.  The 0.10 integration makes the
>> underlying kafka consumer pluggable, so you may be able to wrap a
>> consumer to do what you need.
>>
>> On Fri, Sep 2, 2016 at 12:28 PM, sagarcasual . 
>> wrote:
>> > Hello, this is for
>> > Pausing spark kafka streaming (direct) or exclude/include some
>> partitions on
>> > the fly per batch
>> > =
>> > I have following code that creates a direct stream using Kafka
>> connector for
>> > Spark.
>> >
>> > final JavaInputDStream msgRecords =
>> > KafkaUtils.createDirectStream(
>> > jssc, String.class, String.class, StringDecoder.class,
>> > StringDecoder.class,
>> > KafkaMessage.class, kafkaParams, topicsPartitions,
>> > message -> {
>> > return KafkaMessage.builder()
>> > .
>> > .build();
>> > }
>> > );
>> >
>> > However I want to handle a situation, where I can decide that this
>> streaming
>> > needs to pause for a while on conditional basis, is there any way to
>> achieve
>> > this? Say my Kafka is undergoing some maintenance, so between 10AM to
>> 12PM
>> > stop processing, and then again pick up at 12PM from the last offset,
>> how do
>> > I do it?
>> >
>> > Also, assume all of a sudden we want to take one-or-more of the
>> partitions
>> > for a pull and add it back after some pulls, how do I achieve that?
>> >
>> > -Regards
>> > Sagar
>> >
>>
>
>


Re: how to pass trustStore path into pyspark ?

2016-09-03 Thread Raghavendra Pandey
Did you try passing them in spark-env.sh?
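
A rough sketch of what tends to work for PySpark (paths and the password below are placeholders): executor-side JVM options can go into SparkConf, but the driver JVM's options have to be supplied before it starts, e.g. via spark-defaults.conf, spark-env.sh, or PYSPARK_SUBMIT_ARGS:

import os
from pyspark import SparkConf, SparkContext

# driver-side options must be in place before the driver JVM is launched
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-java-options '-Djavax.net.ssl.trustStore=/path/truststore.jks "
    "-Djavax.net.ssl.trustStorePassword=changeit' pyspark-shell")  # placeholder password

conf = (SparkConf()
        .set("spark.executor.extraJavaOptions",
             "-Djavax.net.ssl.trustStore=/path/truststore.jks "
             "-Djavax.net.ssl.trustStorePassword=changeit"))

sc = SparkContext(conf=conf)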

On Sat, Sep 3, 2016 at 2:42 AM, Eric Ho  wrote:

> I'm trying to pass a trustStore pathname into pyspark.
> What env variable and/or config file or script I need to change to do this
> ?
> I've tried setting JAVA_OPTS env var but to no avail...
> any pointer much appreciated...  thx
>
> --
>
> -eric ho
>
>


Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Raghavendra Pandey
If your file format is splittable say TSV, CSV etc, it will be distributed
across all executors.

On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar <
somasundar.se...@tigeranalytics.com> wrote:

> Hi All,
>
>
>
> Would like to gain some understanding on the questions listed below,
>
>
>
> 1.   When processing a large file with Apache Spark, with, say,
> sc.textFile("somefile.xml"), does it split it for parallel processing
> across executors or, will it be processed as a single chunk in a single
> executor?
>
> 2.   When using dataframes, with implicit XMLContext from Databricks
> is there any optimization prebuilt for such large file processing?
>
>
>
> Please help!!!
>
>
>
> http://stackoverflow.com/questions/39305310/does-spark-
> process-large-file-in-the-single-worker
>
>
>
> Regards,
>
> Somasundaram S
>


Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Raghavendra Pandey
Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the
ResolveRelations class depends on that. To keep the design consistent,
you may want to implement ExternalCatalog as well.
You can also look at plugging in your own Analyzer class to give you more
flexibility. Ultimately that is where all relations get resolved from the
SessionCatalog.

On Sat, Sep 3, 2016 at 5:49 PM, Kapil Malik 
wrote:

> Hi all,
>
> I have a Spark SQL 1.6 application in production which does following on
> executing sqlContext.sql(...) -
> 1. Identify the table-name mentioned in query
> 2. Use an external database to decide where's the data located, in which
> format (parquet or csv or jdbc) etc.
> 3. Load the dataframe
> 4. Register it as temp table (for future calls to this table)
>
> This is achieved by extending HiveContext, and correspondingly
> HiveCatalog. I have my own implementation of trait "Catalog", which
> over-rides the "lookupRelation" method to do the magic behind the scenes.
>
> However, in spark 2.0, I can see following -
> SessionCatalog - which contains lookupRelation method, but doesn't have
> any interface / abstract class to it.
> ExternalCatalog - which deals with CatalogTable instead of Df /
> LogicalPlan.
> Catalog - which also doesn't expose any method to lookup Df / LogicalPlan.
>
> So apparently it looks like I need to extend SessionCatalog only.
> However, just wanted to get a feedback on if there's a better /
> recommended approach to achieve this.
>
>
> Thanks and regards,
>
>
> Kapil Malik
> *Sr. Principal Engineer | Data Platform, Technology*
> M: +91 8800836581 | T: 0124-433 | EXT: 20910
> ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
> Gurgaon | Haryana | India
>
> *Disclaimer:* This communication is for the sole use of the addressee and
> is confidential and privileged information. If you are not the intended
> recipient of this communication, you are prohibited from disclosing it and
> are required to delete it forthwith. Please note that the contents of this
> communication do not necessarily represent the views of Jasper Infotech
> Private Limited ("Company"). E-mail transmission cannot be guaranteed to be
> secure or error-free as information could be intercepted, corrupted, lost,
> destroyed, arrive late or incomplete, or contain viruses. The Company,
> therefore, does not accept liability for any loss caused due to this
> communication. *Jasper Infotech Private Limited, Registered Office: 1st
> Floor, Plot 238, Okhla Industrial Estate, New Delhi - 110020 INDIA CIN:
> U72300DL2007PTC168097*
>
>


Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-03 Thread Marco Mistroni
hi all

 i am getting failures when building spark 2.0 on Ubuntu 16.06
Here's details of what i have installed on the ubuntu host
-  java 8
- scala 2.11
- git

When i launch the command

./build/mvn  -Pyarn -Phadoop-2.7  -DskipTests clean package

everything compiles sort of fine and at the end i get this exception

INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 01:05 h
[INFO] Finished at: 2016-09-03T13:25:27+00:00
[INFO] Final Memory: 57M/208M
[INFO]

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile
(scala-test-compile-first) on project spark-streaming_2.11: Execution
scala-test-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:testCompile failed.
CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn  -rf :spark-streaming_2.11

could anyone help?

kr


Re[7]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов

An even simpler case:

>>> df = sc.parallelize([1] for x in xrange(760857)).toDF()
>>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * 
>>> x)).collect, number=1)
50 1.91226291656
51 1.50933384895
52 1.582903862
53 1.90537405014
54 1.84442877769
55 1.9177978
56 1.50977802277
57 1.5907189846
// after 57 rows it's 2x slower

58 3.22199988365
59 2.96345090866
60 2.8297970295
61 2.87895679474
62 2.92077898979
63 2.95195293427
64 4.10550689697
65 4.14798402786
66 3.13437199593
67 3.11248207092
68 3.18963003159
69 3.18774986267
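
One thing that may be worth comparing besides the timings is whether the physical plan changes at the 58-column mark (for example whether whole-stage code generation is still applied). A quick sketch against the same df as above:

df.groupBy().sum(*(['_1'] * 57)).explain()  # note whether WholeStageCodegen appears here
df.groupBy().sum(*(['_1'] * 58)).explain()  # ...and whether the plan differs in the slow case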


>Saturday, September 3, 2016, 15:50 +03:00, from Сергей Романов:
>
>Same problem happens with CSV data file, so it's not parquet-related either.
>
>Welcome to
>    __
> / __/__  ___ _/ /__
>    _\ \/ _ \/ _ `/ __/  '_/
>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>  /_/
>
>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>SparkSession available as 'spark'.
 import timeit
 from pyspark.sql.types import *
 schema = StructType([StructField('dd_convs', FloatType(), True)])
 for x in range(50, 70): print x, 
 timeit.timeit(spark.read.csv('file:///data/dump/test_csv', 
 schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>50 0.372850894928
>51 0.376906871796
>52 0.381325960159
>53 0.385444164276
>54 0.38685192
>55 0.388918161392
>56 0.397624969482
>57 0.391713142395
>58 2.62714004517
>59 2.68421196938
>60 2.74627685547
>61 2.81081581116
>62 3.43532109261
>63 3.07742786407
>64 3.03904604912
>65 3.01616096497
>66 3.06293702126
>67 3.09386610985
>68 3.27610206604
>69 3.2041969299 Суббота,  3 сентября 2016, 15:40 +03:00 от Сергей Романов < 
>romano...@inbox.ru.INVALID >:
>>
>>Hi,
>>I had narrowed my problem down to a very simple case. I'm sending a 27kb parquet
>>file as an attachment (file:///data/dump/test2 in the example).
>>Please, can you take a look at it? Why is there a performance drop after 57 sum
>>columns?
>>Welcome to
>>      ____              __
>>     / __/__  ___ _____/ /__
>>    _\ \/ _ \/ _ `/ __/  '_/
>>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>>      /_/
>>
>>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>>SparkSession available as 'spark'.
> import timeit
 for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>>... 
>>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>SLF4J: Defaulting to no-operation (NOP) logger implementation
>>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>>details.
>>0 1.05591607094
>>1 0.200426101685
>>2 0.203800916672
>>3 0.176458120346
>>4 0.184863805771
>>5 0.232321023941
>>6 0.216032981873
>>7 0.201778173447
>>8 0.292424917221
>>9 0.228524923325
>>10 0.190534114838
>>11 0.197028160095
>>12 0.270443916321
>>13 0.429781913757
>>14 0.270851135254
>>15 0.776989936829
>>16 0.27879181
>>17 0.227638959885
>>18 0.212944030762
>>19 0.2144780159
>>20 0.22200012207
>>21 0.262261152267
>>22 0.254227876663
>>23 0.275084018707
>>24 0.292124032974
>>25 0.280488014221
>>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
>>since it was too large. This behavior can be adjusted by setting 
>>'spark.debug.maxToStringFields' in SparkEnv.conf.
>>26 0.290093898773
>>27 0.238478899002
>>28 0.246420860291
>>29 0.241401195526
>>30 0.255286931992
>>31 0.42702794075
>>32 0.327946186066
>>33 0.434395074844
>>34 0.314198970795
>>35 0.34576010704
>>36 0.278323888779
>>37 0.289474964142
>>38 0.290827989578
>>39 0.376291036606
>>40 0.347742080688
>>41 0.363158941269
>>42 0.318687915802
>>43 0.376327991486
>>44 0.374994039536
>>45 0.362971067429
>>46 0.425967931747
>>47 0.370860099792
>>48 0.443903923035
>>49 0.374128103256
>>50 0.378985881805
>>51 0.476850986481
>>52 0.451028823853
>>53 0.432540893555
>>54 0.514838933945
>>55 0.53990483284
>>56 0.449142932892
>>57 0.465240001678 // 5x slower after 57 columns
>>58 2.40412116051
>>59 2.41632795334
>>60 2.41812801361
>>61 2.55726218224
>>62 2.55484509468
>>63 2.56128406525
>>64 2.54642391205
>>65 2.56381797791
>>66 2.56871509552
>>67 2.66187620163
>>68 2.63496208191
>>69 2.8154599

>>
>>Sergei Romanov
>>
>>-
>>To unsubscribe e-mail:  user-unsubscr...@spark.apache.org
>Sergei Romanov
>
>-
>To unsubscribe e-mail:  user-unsubscr...@spark.apache.org



Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов

The same problem happens with a CSV data file, so it's not parquet-related either.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('dd_convs', FloatType(), True)])
>>> for x in range(50, 70): print x, timeit.timeit(spark.read.csv('file:///data/dump/test_csv', schema=schema).groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
50 0.372850894928
51 0.376906871796
52 0.381325960159
53 0.385444164276
54 0.38685192
55 0.388918161392
56 0.397624969482
57 0.391713142395
58 2.62714004517
59 2.68421196938
60 2.74627685547
61 2.81081581116
62 3.43532109261
63 3.07742786407
64 3.03904604912
65 3.01616096497
66 3.06293702126
67 3.09386610985
68 3.27610206604
69 3.2041969299 Saturday, 3 September 2016, 15:40 +03:00 from Sergei Romanov 
:
>
>Hi,
>I had narrowed my problem down to a very simple case. I'm sending a 27kb parquet
>file as an attachment (file:///data/dump/test2 in the example).
>Please, can you take a look at it? Why is there a performance drop after 57 sum
>columns?
>Welcome to
>      ____              __
>     / __/__  ___ _____/ /__
>    _\ \/ _ \/ _ `/ __/  '_/
>   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
>      /_/
>
>Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
>SparkSession available as 'spark'.
 import timeit
 for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
>... 
>SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>SLF4J: Defaulting to no-operation (NOP) logger implementation
>SLF4J: See  http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
>details.
>0 1.05591607094
>1 0.200426101685
>2 0.203800916672
>3 0.176458120346
>4 0.184863805771
>5 0.232321023941
>6 0.216032981873
>7 0.201778173447
>8 0.292424917221
>9 0.228524923325
>10 0.190534114838
>11 0.197028160095
>12 0.270443916321
>13 0.429781913757
>14 0.270851135254
>15 0.776989936829
>16 0.27879181
>17 0.227638959885
>18 0.212944030762
>19 0.2144780159
>20 0.22200012207
>21 0.262261152267
>22 0.254227876663
>23 0.275084018707
>24 0.292124032974
>25 0.280488014221
>16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
>since it was too large. This behavior can be adjusted by setting 
>'spark.debug.maxToStringFields' in SparkEnv.conf.
>26 0.290093898773
>27 0.238478899002
>28 0.246420860291
>29 0.241401195526
>30 0.255286931992
>31 0.42702794075
>32 0.327946186066
>33 0.434395074844
>34 0.314198970795
>35 0.34576010704
>36 0.278323888779
>37 0.289474964142
>38 0.290827989578
>39 0.376291036606
>40 0.347742080688
>41 0.363158941269
>42 0.318687915802
>43 0.376327991486
>44 0.374994039536
>45 0.362971067429
>46 0.425967931747
>47 0.370860099792
>48 0.443903923035
>49 0.374128103256
>50 0.378985881805
>51 0.476850986481
>52 0.451028823853
>53 0.432540893555
>54 0.514838933945
>55 0.53990483284
>56 0.449142932892
>57 0.465240001678 // 5x slower after 57 columns
>58 2.40412116051
>59 2.41632795334
>60 2.41812801361
>61 2.55726218224
>62 2.55484509468
>63 2.56128406525
>64 2.54642391205
>65 2.56381797791
>66 2.56871509552
>67 2.66187620163
>68 2.63496208191
>69 2.8154599

>
>Sergei Romanov
>
>-
>To unsubscribe e-mail:  user-unsubscr...@spark.apache.org
Sergei Romanov


bad.csv.tgz
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re[5]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов

Hi,
I had narrowed my problem down to a very simple case. I'm sending a 27kb parquet
file as an attachment (file:///data/dump/test2 in the example).
Please, can you take a look at it? Why is there a performance drop after 57 sum
columns?
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>> import timeit
>>> for x in range(70): print x, timeit.timeit(spark.read.parquet('file:///data/dump/test2').groupBy().sum(*(['dd_convs'] * x) ).collect, number=1)
... 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
0 1.05591607094
1 0.200426101685
2 0.203800916672
3 0.176458120346
4 0.184863805771
5 0.232321023941
6 0.216032981873
7 0.201778173447
8 0.292424917221
9 0.228524923325
10 0.190534114838
11 0.197028160095
12 0.270443916321
13 0.429781913757
14 0.270851135254
15 0.776989936829
16 0.27879181
17 0.227638959885
18 0.212944030762
19 0.2144780159
20 0.22200012207
21 0.262261152267
22 0.254227876663
23 0.275084018707
24 0.292124032974
25 0.280488014221
16/09/03 15:31:28 WARN Utils: Truncated the string representation of a plan 
since it was too large. This behavior can be adjusted by setting 
'spark.debug.maxToStringFields' in SparkEnv.conf.
26 0.290093898773
27 0.238478899002
28 0.246420860291
29 0.241401195526
30 0.255286931992
31 0.42702794075
32 0.327946186066
33 0.434395074844
34 0.314198970795
35 0.34576010704
36 0.278323888779
37 0.289474964142
38 0.290827989578
39 0.376291036606
40 0.347742080688
41 0.363158941269
42 0.318687915802
43 0.376327991486
44 0.374994039536
45 0.362971067429
46 0.425967931747
47 0.370860099792
48 0.443903923035
49 0.374128103256
50 0.378985881805
51 0.476850986481
52 0.451028823853
53 0.432540893555
54 0.514838933945
55 0.53990483284
56 0.449142932892
57 0.465240001678 // 5x slower after 57 columns
58 2.40412116051
59 2.41632795334
60 2.41812801361
61 2.55726218224
62 2.55484509468
63 2.56128406525
64 2.54642391205
65 2.56381797791
66 2.56871509552
67 2.66187620163
68 2.63496208191
69 2.8154599


Sergei Romanov


part-r-0-d06531d2-d435-4ff0-8804-0dea49953be4.snappy.parquet
Description: Binary data

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Kapil Malik
Hi all,

I have a Spark SQL 1.6 application in production which does the following on
executing sqlContext.sql(...) -
1. Identify the table-name mentioned in query
2. Use an external database to decide where's the data located, in which
format (parquet or csv or jdbc) etc.
3. Load the dataframe
4. Register it as temp table (for future calls to this table)

This is achieved by extending HiveContext, and correspondingly HiveCatalog.
I have my own implementation of trait "Catalog", which over-rides the
"lookupRelation" method to do the magic behind the scenes.

However, in spark 2.0, I can see following -
SessionCatalog - which contains a lookupRelation method, but doesn't have any
interface / abstract class for it.
ExternalCatalog - which deals with CatalogTable instead of a DataFrame / LogicalPlan.
Catalog - which also doesn't expose any method to look up a DataFrame / LogicalPlan.

So apparently it looks like I need to extend SessionCatalog only.
However, I just wanted to get feedback on whether there's a better / recommended
approach to achieve this.


Thanks and regards,


Kapil Malik
*Sr. Principal Engineer | Data Platform, Technology*
M: +91 8800836581 | T: 0124-433 | EXT: 20910
ASF Centre A | 1st Floor | Udyog Vihar Phase IV |
Gurgaon | Haryana | India

*Disclaimer:* This communication is for the sole use of the addressee and
is confidential and privileged information. If you are not the intended
recipient of this communication, you are prohibited from disclosing it and
are required to delete it forthwith. Please note that the contents of this
communication do not necessarily represent the views of Jasper Infotech
Private Limited ("Company"). E-mail transmission cannot be guaranteed to be
secure or error-free as information could be intercepted, corrupted, lost,
destroyed, arrive late or incomplete, or contain viruses. The Company,
therefore, does not accept liability for any loss caused due to this
communication. *Jasper Infotech Private Limited, Registered Office: 1st
Floor, Plot 238, Okhla Industrial Estate, New Delhi - 110020 INDIA CIN:
U72300DL2007PTC168097*


Re[4]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов

Hi, Mich,
I don't think it is related to Hive or parquet partitioning. The same issue happens
while working with a non-partitioned parquet file using the Python DataFrame API.
Please take a look at the following example:
$ hdfs dfs -ls /user/test   // I had copied partition dt=2016-07-28 to another standalone path.
Found 1 items
-rw-r--r--   2 hdfs supergroup   33568823 2016-09-03 11:11 /user/test/part-0

$ ./pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 57) ).collect()
*response time over 3 runs skipping the first run*
0.8370630741119385
0.22276782989501953
0.7722570896148682

>>> spark.read.parquet('hdfs://spark-master1.uslicer.net:8020/user/test').groupBy().sum(*(['dd_convs'] * 58) ).collect()
*response time over 3 runs skipping the first run*
4.40575098991394
4.320873022079468
3.2484099864959717

$ hdfs dfs -get /user/test/part-0 .
$ parquet-meta part-0 
file:   file:/data/dump/part-0 
creator:    parquet-mr 

file schema:    hive_schema 
-
actual_dsp_fee: OPTIONAL FLOAT R:0 D:1
actual_pgm_fee: OPTIONAL FLOAT R:0 D:1
actual_ssp_fee: OPTIONAL FLOAT R:0 D:1
advertiser_id:  OPTIONAL INT32 R:0 D:1
advertiser_spent:   OPTIONAL DOUBLE R:0 D:1
anomaly_clicks: OPTIONAL INT64 R:0 D:1
anomaly_conversions_filtered:   OPTIONAL INT64 R:0 D:1
anomaly_conversions_unfiltered: OPTIONAL INT64 R:0 D:1
anomaly_decisions:  OPTIONAL FLOAT R:0 D:1
bid_price:  OPTIONAL FLOAT R:0 D:1
campaign_id:    OPTIONAL INT32 R:0 D:1
click_prob: OPTIONAL FLOAT R:0 D:1
clicks: OPTIONAL INT64 R:0 D:1
clicks_static:  OPTIONAL INT64 R:0 D:1
conv_prob:  OPTIONAL FLOAT R:0 D:1
conversion_id:  OPTIONAL INT64 R:0 D:1
conversions:    OPTIONAL INT64 R:0 D:1
creative_id:    OPTIONAL INT32 R:0 D:1
dd_convs:   OPTIONAL INT64 R:0 D:1
decisions:  OPTIONAL FLOAT R:0 D:1
dmp_liveramp_margin:    OPTIONAL FLOAT R:0 D:1
dmp_liveramp_payout:    OPTIONAL FLOAT R:0 D:1
dmp_nielsen_margin: OPTIONAL FLOAT R:0 D:1
dmp_nielsen_payout: OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_margin: OPTIONAL FLOAT R:0 D:1
dmp_rapleaf_payout: OPTIONAL FLOAT R:0 D:1
e:  OPTIONAL FLOAT R:0 D:1
expected_cpa:   OPTIONAL FLOAT R:0 D:1
expected_cpc:   OPTIONAL FLOAT R:0 D:1
expected_payout:    OPTIONAL FLOAT R:0 D:1
first_impressions:  OPTIONAL INT64 R:0 D:1
fraud_clicks:   OPTIONAL INT64 R:0 D:1
fraud_impressions:  OPTIONAL INT64 R:0 D:1
g:  OPTIONAL FLOAT R:0 D:1
impressions:    OPTIONAL FLOAT R:0 D:1
line_item_id:   OPTIONAL INT32 R:0 D:1
mail_type:  OPTIONAL BINARY O:UTF8 R:0 D:1
noads:  OPTIONAL FLOAT R:0 D:1
predict_version:    OPTIONAL INT64 R:0 D:1
publisher_id:   OPTIONAL INT32 R:0 D:1
publisher_revenue:  OPTIONAL DOUBLE R:0 D:1
pvc:    OPTIONAL INT64 R:0 D:1
second_price:   OPTIONAL FLOAT R:0 D:1
thirdparty_margin:  OPTIONAL FLOAT R:0 D:1
thirdparty_payout:  OPTIONAL FLOAT R:0 D:1

row group 1:    RC:769163 TS:40249546 OFFSET:4 
-
actual_dsp_fee:  FLOAT SNAPPY DO:0 FPO:4 SZ:1378278/1501551/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_pgm_fee:  FLOAT SNAPPY DO:0 FPO:1378282 SZ:39085/42374/1.08 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
actual_ssp_fee:  FLOAT SNAPPY DO:0 FPO:1417367 SZ:1426888/1553337/1.09 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_id:   INT32 SNAPPY DO:0 FPO:2844255 SZ:339061/572962/1.69 VC:769163 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
advertiser_spent:    DOUBLE SNAPPY DO:0 FPO:3183316 SZ:2731185/3788429/1.39 VC:769163 ENC:PLAIN,RLE,PLAIN_DICTIONARY,BIT_PACKED

Re: any idea what this error could be?

2016-09-03 Thread Fridtjof Sander
There is an InvalidClassException complaining about non-matching
serialVersionUIDs. Shouldn't that be caused by different jars on executors
and driver?

Am 03.09.2016 1:04 nachm. schrieb "Tal Grynbaum" :

> My guess is that you're running out of memory somewhere.  Try to increase
> the driver memory and/or executor memory.
>
> On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:
>
>> I am running this on aws.
>>
>>
>>
>> On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com wrote:
>>
>>> I am running spark in stand alone mode. I guess this error when I run my
>>> driver program..I am using spark 2.0.0. any idea what this error could be?
>>>
>>>
>>>
>>>   Using Spark's default log4j profile: 
>>> org/apache/spark/log4j-defaults.properties
>>> 16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
>>> 16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
>>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
>>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
>>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
>>> 16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication 
>>> disabled; ui acls disabled; users  with view permissions: Set(kantkodali); 
>>> groups with view permissions: Set(); users  with modify permissions: 
>>> Set(kantkodali); groups with modify permissions: Set()
>>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on 
>>> port 62256.
>>> 16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
>>> 16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
>>> 16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at 
>>> /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
>>> 16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 
>>> 2004.6 MB
>>> 16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
>>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on 
>>> port 4040.
>>> 16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
>>> http://192.168.0.191:4040
>>> 16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to 
>>> master spark://52.43.37.223:7077...
>>> 16/09/02 23:44:46 INFO TransportClientFactory: Successfully created 
>>> connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
>>> 16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to 
>>> connect to master 52.43.37.223:7077
>>> org.apache.spark.SparkException: Exception thrown in awaitResult
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>>> at 
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>> at 
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>>> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>>> at 
>>> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
>>> at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.RuntimeException: java.io.InvalidClassException: 
>>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream 
>>> classdesc serialVersionUID = -2221986757032131007, local class 
>>> serialVersionUID = -5447855329526097695
>>> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>>> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:
>>>
>>>


Re: any idea what this error could be?

2016-09-03 Thread Tal Grynbaum
My guess is that you're running out of memory somewhere.  Try to increase
the driver memory and/or executor memory.
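
If you do want to try that, here is a minimal sketch (the 4g values are
placeholders, not recommendations) of where such settings usually go. Executor
memory can be set when the SparkSession is built; driver memory normally has to
be supplied at launch time (e.g. spark-submit --driver-memory 4g) because the
driver JVM is already running by the time this code executes:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-tuning-sketch")
         .config("spark.executor.memory", "4g")  # placeholder value, tune for your cluster
         .getOrCreate())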

On Sat, Sep 3, 2016, 11:42 kant kodali  wrote:

> I am running this on aws.
>
>
>
> On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com wrote:
>
>> I am running spark in stand alone mode. I guess this error when I run my
>> driver program..I am using spark 2.0.0. any idea what this error could be?
>>
>>
>>
>>   Using Spark's default log4j profile: 
>> org/apache/spark/log4j-defaults.properties
>> 16/09/02 23:44:44 INFO SparkContext: Running Spark version 2.0.0
>> 16/09/02 23:44:44 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls to: kantkodali
>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls to: kantkodali
>> 16/09/02 23:44:45 INFO SecurityManager: Changing view acls groups to:
>> 16/09/02 23:44:45 INFO SecurityManager: Changing modify acls groups to:
>> 16/09/02 23:44:45 INFO SecurityManager: SecurityManager: authentication 
>> disabled; ui acls disabled; users  with view permissions: Set(kantkodali); 
>> groups with view permissions: Set(); users  with modify permissions: 
>> Set(kantkodali); groups with modify permissions: Set()
>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'sparkDriver' on 
>> port 62256.
>> 16/09/02 23:44:45 INFO SparkEnv: Registering MapOutputTracker
>> 16/09/02 23:44:45 INFO SparkEnv: Registering BlockManagerMaster
>> 16/09/02 23:44:45 INFO DiskBlockManager: Created local directory at 
>> /private/var/folders/_6/lfxt933j3bd_xhq0m7dwm8s0gn/T/blockmgr-b56eea49-0102-4570-865a-1d3d230f0ffc
>> 16/09/02 23:44:45 INFO MemoryStore: MemoryStore started with capacity 2004.6 
>> MB
>> 16/09/02 23:44:45 INFO SparkEnv: Registering OutputCommitCoordinator
>> 16/09/02 23:44:45 INFO Utils: Successfully started service 'SparkUI' on port 
>> 4040.
>> 16/09/02 23:44:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
>> http://192.168.0.191:4040
>> 16/09/02 23:44:45 INFO StandaloneAppClient$ClientEndpoint: Connecting to 
>> master spark://52.43.37.223:7077...
>> 16/09/02 23:44:46 INFO TransportClientFactory: Successfully created 
>> connection to /52.43.37.223:7077 after 70 ms (0 ms spent in bootstraps)
>> 16/09/02 23:44:46 WARN StandaloneAppClient$ClientEndpoint: Failed to connect 
>> to master 52.43.37.223:7077
>> org.apache.spark.SparkException: Exception thrown in awaitResult
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>> at 
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at 
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>> at 
>> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:109)
>> at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.RuntimeException: java.io.InvalidClassException: 
>> org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream 
>> classdesc serialVersionUID = -2221986757032131007, local class 
>> serialVersionUID = -5447855329526097695
>> at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:
>>
>>


Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
Hi All,



I would like to gain some understanding of the questions listed below:



1.   When processing a large file with Apache Spark with, say,
sc.textFile("somefile.xml"), does it split it for parallel processing
across executors, or will it be processed as a single chunk in a single
executor?

2.   When using DataFrames with the implicit XMLContext from Databricks, is
there any optimization prebuilt for such large file processing?



Please help!!!



http://stackoverflow.com/questions/39305310/does-spark-process-large-file-in-the-single-worker



Regards,

Somasundaram S
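

For reference, a minimal sketch touching both questions (run from the PySpark
shell so that sc and sqlContext exist; paths and the rowTag value are
placeholders, and the spark-xml package is assumed to be on the classpath for
the second part):

# 1. For splittable (e.g. uncompressed HDFS) input, textFile splits the file
#    into partitions, roughly one per HDFS block by default, and those
#    partitions can be processed by different executors.
rdd = sc.textFile("hdfs:///path/to/somefile.xml", minPartitions=32)
print(rdd.getNumPartitions())

# 2. With the Databricks spark-xml reader, records delimited by rowTag are
#    likewise parsed in parallel across partitions.
df = (sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("hdfs:///path/to/somefile.xml"))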


Need a help in row repetation

2016-09-03 Thread Selvam Raman
I have my dataset as a DataFrame, using Spark version 1.5.0.


cola,colb,colc,cold,cole,colf,colg,colh,coli -> columns in a row

Among the above columns, the date fields are (colc,colf,colh,coli).

Scenario 1: (colc - 2016, colf - 2016, colh - 2016, coli - 2016)
If all the years are the same, no logic is needed; the record just remains as it is.


Scenario 2: (colc - 2016, colf - 2017, colh - 2016, coli - 2018) -> unique values are
2016, 2017, 2018
If the years (in the date fields) are different, then we need to repeat the
record once per distinct year (i.e. the above row has three distinct years, so we
need to repeat the same row twice, for three copies in total).

Please give me any suggestion in terms of DataFrames.



-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
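

Not an authoritative answer, but a minimal sketch of one way to get the
repetition described above with the DataFrame API (df and the column names are
taken from the question, and the date columns are assumed to be real date or
timestamp types so that .year is available on the Python side; only a UDF plus
explode is used, so it should work on 1.5):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

date_cols = ['colc', 'colf', 'colh', 'coli']  # the date columns from the question

# Collect the distinct years of the date columns into one array per row.
distinct_years = F.udf(
    lambda *ds: sorted({d.year for d in ds if d is not None}),
    ArrayType(IntegerType()))

# explode() emits one copy of the row per element of the array, so a row whose
# dates all fall in one year stays a single row, while a row spanning three
# distinct years comes out three times (the original plus two repeats).
result = df.withColumn(
    'row_year', F.explode(distinct_years(*[F.col(c) for c in date_cols])))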


Hive connection issues in spark-shell

2016-09-03 Thread Diwakar Dhanuskodi
Hi,

I recently built Spark using Maven. Now when starting spark-shell, it
can't connect to Hive and I get the error below.

I couldn't find the datanucleus jars in the built library, but the datanucleus
jars are available in the hive/lib folder.

java.lang.ClassNotFoundException:
org.datanucleus.api.jdo.JDOPersistenceManagerFactory
at
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:66)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:199)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
at
org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
at org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)
at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:97)
at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.spark.repl.Main$.createSQLContext(Main.scala:89)
at $line4.$read$$iw$$iw.(:12)
at $line4.$read$$iw.(:21)
at $line4.$read.(:23)
at $line4.$read$.(:27)
at $line4.$read$.()
at $line4.$eval$.$print$lzycompute(:7)
at $line4.$eval$.$print(:6)
at $line4.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Sorry, my bad. In both runs I included -Dscala-2.11.

On Sat, Sep 3, 2016 at 12:39 PM, Nachiketa 
wrote:

> I think the difference was the -Dscala2.11 to the command line.
>
> I have seen this show up when I miss that.
>
> Regards,
> Nachiketa
>
> On Sat 3 Sep, 2016, 12:14 PM Diwakar Dhanuskodi, <
> diwakar.dhanusk...@gmail.com> wrote:
>
>> Hi,
>>
>> Just re-ran again without killing zinc server process
>>
>> /make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
>> -Pyarn -Dmaven.version=3.0.4 -Dscala-2.11 -X -rf :spark-sql_2.11
>>
>> Build is success. Not sure how it worked with just re-running command
>> again.
>>
>> On Sat, Sep 3, 2016 at 11:44 AM, Diwakar Dhanuskodi <
>> diwakar.dhanusk...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> java version 7
>>>
>>> mvn command
>>> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
>>> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4
>>>
>>>
>>> yes, I executed script to change scala version to 2.11
>>> killed  "com.typesafe zinc.Nailgun" process
>>>
>>> re-ran mvn with below command again
>>>
>>> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
>>> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4 -X -rf :spark-sql_2.11
>>>
>>> Getting same error
>>>
>>> [warn] /home/cloudera/Downloads/spark-1.6.2/sql/core/src/main/
>>> scala/org/apache/spark/sql/sources/interfaces.scala:911: method isDir
>>> in class FileStatus is deprecated: see corresponding Javadoc for more
>>> information.
>>> [warn] status.isDir,
>>> [warn]^
>>> [error] missing or invalid dependency detected while loading class file
>>> 'WebUI.class'.
>>> [error] Could not access term eclipse in package org,
>>> [error] because it (or its dependencies) are missing. Check your build
>>> definition for
>>> [error] missing or conflicting dependencies. (Re-run with
>>> `-Ylog-classpath` to see the problematic classpath.)
>>> [error] A full rebuild may help if 'WebUI.class' was compiled against an
>>> incompatible version of org.
>>> [error] missing or invalid dependency detected while loading class file
>>> 'WebUI.class'.
>>> [error] Could not access term jetty in value org.eclipse,
>>> [error] because it (or its dependencies) are missing. Check your build
>>> definition for
>>> [error] missing or conflicting dependencies. (Re-run with
>>> `-Ylog-classpath` to see the problematic classpath.)
>>> [error] A full rebuild may help if 'WebUI.class' was compiled against an
>>> incompatible version of org.eclipse.
>>> [warn] 17 warnings found
>>> [error] two errors found
>>> [debug] Compilation failed (CompilerInterface)
>>> [error] Compile failed at Sep 3, 2016 11:28:34 AM [21.611s]
>>> [INFO] 
>>> 
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM .. SUCCESS
>>> [5.583s]
>>> [INFO] Spark Project Test Tags ... SUCCESS
>>> [4.189s]
>>> [INFO] Spark Project Launcher  SUCCESS
>>> [12.226s]
>>> [INFO] Spark Project Networking .. SUCCESS
>>> [13.386s]
>>> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS
>>> [6.723s]
>>> [INFO] Spark Project Unsafe .. SUCCESS
>>> [21.231s]
>>> [INFO] Spark Project Core  SUCCESS
>>> [3:46.334s]
>>> [INFO] Spark Project Bagel ... SUCCESS
>>> [7.032s]
>>> [INFO] Spark Project GraphX .. SUCCESS
>>> [19.558s]
>>> [INFO] Spark Project Streaming ... SUCCESS
>>> [50.452s]
>>> [INFO] Spark Project Catalyst  SUCCESS
>>> [1:14.172s]
>>> [INFO] Spark Project SQL . FAILURE
>>> [23.222s]
>>> [INFO] Spark Project ML Library .. SKIPPED
>>> [INFO] Spark Project Tools ... SKIPPED
>>> [INFO] Spark Project Hive  SKIPPED
>>> [INFO] Spark Project Docker Integration Tests  SKIPPED
>>> [INFO] Spark Project REPL  SKIPPED
>>> [INFO] Spark Project YARN Shuffle Service  SKIPPED
>>> [INFO] Spark Project YARN  SKIPPED
>>> [INFO] Spark Project Assembly  SKIPPED
>>> [INFO] Spark Project External Twitter  SKIPPED
>>> [INFO] Spark Project External Flume Sink . SKIPPED
>>> [INFO] Spark Project External Flume .. SKIPPED
>>> [INFO] Spark Project External Flume Assembly . SKIPPED
>>> [INFO] Spark Project External MQTT ... SKIPPED
>>> [INFO] Spark Project External MQTT Assembly .. SKIPPED
>>> [INFO] Spark Project External ZeroMQ . SKIPPED
>>> [INFO] Spark Project External Kafka .. 

Re: Spark build 1.6.2 error

2016-09-03 Thread Nachiketa
I think the difference was the -Dscala-2.11 on the command line.

I have seen this show up when I miss that.

Regards,
Nachiketa

On Sat 3 Sep, 2016, 12:14 PM Diwakar Dhanuskodi, <
diwakar.dhanusk...@gmail.com> wrote:

> Hi,
>
> Just re-ran again without killing zinc server process
>
> /make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
> -Pyarn -Dmaven.version=3.0.4 -Dscala-2.11 -X -rf :spark-sql_2.11
>
> Build is success. Not sure how it worked with just re-running command
> again.
>
> On Sat, Sep 3, 2016 at 11:44 AM, Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com> wrote:
>
>> Hi,
>>
>> java version 7
>>
>> mvn command
>> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
>> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4
>>
>>
>> yes, I executed script to change scala version to 2.11
>> killed  "com.typesafe zinc.Nailgun" process
>>
>> re-ran mvn with below command again
>>
>> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
>> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4 -X -rf :spark-sql_2.11
>>
>> Getting same error
>>
>> [warn]
>> /home/cloudera/Downloads/spark-1.6.2/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:911:
>> method isDir in class FileStatus is deprecated: see corresponding Javadoc
>> for more information.
>> [warn] status.isDir,
>> [warn]^
>> [error] missing or invalid dependency detected while loading class file
>> 'WebUI.class'.
>> [error] Could not access term eclipse in package org,
>> [error] because it (or its dependencies) are missing. Check your build
>> definition for
>> [error] missing or conflicting dependencies. (Re-run with
>> `-Ylog-classpath` to see the problematic classpath.)
>> [error] A full rebuild may help if 'WebUI.class' was compiled against an
>> incompatible version of org.
>> [error] missing or invalid dependency detected while loading class file
>> 'WebUI.class'.
>> [error] Could not access term jetty in value org.eclipse,
>> [error] because it (or its dependencies) are missing. Check your build
>> definition for
>> [error] missing or conflicting dependencies. (Re-run with
>> `-Ylog-classpath` to see the problematic classpath.)
>> [error] A full rebuild may help if 'WebUI.class' was compiled against an
>> incompatible version of org.eclipse.
>> [warn] 17 warnings found
>> [error] two errors found
>> [debug] Compilation failed (CompilerInterface)
>> [error] Compile failed at Sep 3, 2016 11:28:34 AM [21.611s]
>> [INFO]
>> 
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM .. SUCCESS
>> [5.583s]
>> [INFO] Spark Project Test Tags ... SUCCESS
>> [4.189s]
>> [INFO] Spark Project Launcher  SUCCESS
>> [12.226s]
>> [INFO] Spark Project Networking .. SUCCESS
>> [13.386s]
>> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS
>> [6.723s]
>> [INFO] Spark Project Unsafe .. SUCCESS
>> [21.231s]
>> [INFO] Spark Project Core  SUCCESS
>> [3:46.334s]
>> [INFO] Spark Project Bagel ... SUCCESS
>> [7.032s]
>> [INFO] Spark Project GraphX .. SUCCESS
>> [19.558s]
>> [INFO] Spark Project Streaming ... SUCCESS
>> [50.452s]
>> [INFO] Spark Project Catalyst  SUCCESS
>> [1:14.172s]
>> [INFO] Spark Project SQL . FAILURE
>> [23.222s]
>> [INFO] Spark Project ML Library .. SKIPPED
>> [INFO] Spark Project Tools ... SKIPPED
>> [INFO] Spark Project Hive  SKIPPED
>> [INFO] Spark Project Docker Integration Tests  SKIPPED
>> [INFO] Spark Project REPL  SKIPPED
>> [INFO] Spark Project YARN Shuffle Service  SKIPPED
>> [INFO] Spark Project YARN  SKIPPED
>> [INFO] Spark Project Assembly  SKIPPED
>> [INFO] Spark Project External Twitter  SKIPPED
>> [INFO] Spark Project External Flume Sink . SKIPPED
>> [INFO] Spark Project External Flume .. SKIPPED
>> [INFO] Spark Project External Flume Assembly . SKIPPED
>> [INFO] Spark Project External MQTT ... SKIPPED
>> [INFO] Spark Project External MQTT Assembly .. SKIPPED
>> [INFO] Spark Project External ZeroMQ . SKIPPED
>> [INFO] Spark Project External Kafka .. SKIPPED
>> [INFO] Spark Project Examples  SKIPPED
>> [INFO] Spark Project External Kafka Assembly . SKIPPED
>> [INFO]
>> 
>> [INFO] BUILD FAILURE

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Hi,

I just re-ran it again without killing the zinc server process:

./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive -Pyarn
-Dmaven.version=3.0.4 -Dscala-2.11 -X -rf :spark-sql_2.11

The build succeeded. Not sure how it worked by just re-running the command
again.

On Sat, Sep 3, 2016 at 11:44 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:

> Hi,
>
> java version 7
>
> mvn command
> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4
>
>
> yes, I executed script to change scala version to 2.11
> killed  "com.typesafe zinc.Nailgun" process
>
> re-ran mvn with below command again
>
> ./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
> -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4 -X -rf :spark-sql_2.11
>
> Getting same error
>
> [warn] /home/cloudera/Downloads/spark-1.6.2/sql/core/src/main/
> scala/org/apache/spark/sql/sources/interfaces.scala:911: method isDir in
> class FileStatus is deprecated: see corresponding Javadoc for more
> information.
> [warn] status.isDir,
> [warn]^
> [error] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> [error] Could not access term eclipse in package org,
> [error] because it (or its dependencies) are missing. Check your build
> definition for
> [error] missing or conflicting dependencies. (Re-run with
> `-Ylog-classpath` to see the problematic classpath.)
> [error] A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.
> [error] missing or invalid dependency detected while loading class file
> 'WebUI.class'.
> [error] Could not access term jetty in value org.eclipse,
> [error] because it (or its dependencies) are missing. Check your build
> definition for
> [error] missing or conflicting dependencies. (Re-run with
> `-Ylog-classpath` to see the problematic classpath.)
> [error] A full rebuild may help if 'WebUI.class' was compiled against an
> incompatible version of org.eclipse.
> [warn] 17 warnings found
> [error] two errors found
> [debug] Compilation failed (CompilerInterface)
> [error] Compile failed at Sep 3, 2016 11:28:34 AM [21.611s]
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM .. SUCCESS [5.583s]
> [INFO] Spark Project Test Tags ... SUCCESS [4.189s]
> [INFO] Spark Project Launcher  SUCCESS
> [12.226s]
> [INFO] Spark Project Networking .. SUCCESS
> [13.386s]
> [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.723s]
> [INFO] Spark Project Unsafe .. SUCCESS
> [21.231s]
> [INFO] Spark Project Core  SUCCESS
> [3:46.334s]
> [INFO] Spark Project Bagel ... SUCCESS
> [7.032s]
> [INFO] Spark Project GraphX .. SUCCESS
> [19.558s]
> [INFO] Spark Project Streaming ... SUCCESS
> [50.452s]
> [INFO] Spark Project Catalyst  SUCCESS
> [1:14.172s]
> [INFO] Spark Project SQL . FAILURE
> [23.222s]
> [INFO] Spark Project ML Library .. SKIPPED
> [INFO] Spark Project Tools ... SKIPPED
> [INFO] Spark Project Hive  SKIPPED
> [INFO] Spark Project Docker Integration Tests  SKIPPED
> [INFO] Spark Project REPL  SKIPPED
> [INFO] Spark Project YARN Shuffle Service  SKIPPED
> [INFO] Spark Project YARN  SKIPPED
> [INFO] Spark Project Assembly  SKIPPED
> [INFO] Spark Project External Twitter  SKIPPED
> [INFO] Spark Project External Flume Sink . SKIPPED
> [INFO] Spark Project External Flume .. SKIPPED
> [INFO] Spark Project External Flume Assembly . SKIPPED
> [INFO] Spark Project External MQTT ... SKIPPED
> [INFO] Spark Project External MQTT Assembly .. SKIPPED
> [INFO] Spark Project External ZeroMQ . SKIPPED
> [INFO] Spark Project External Kafka .. SKIPPED
> [INFO] Spark Project Examples  SKIPPED
> [INFO] Spark Project External Kafka Assembly . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 7:45.641s
> [INFO] Finished at: Sat Sep 03 11:28:34 IST 2016
> [INFO] Final Memory: 49M/415M
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> 

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Hi,

java version 7

mvn command
./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
-Phive-thriftserver -Pyarn -Dmaven.version=3.0.4


Yes, I executed the script to change the Scala version to 2.11 and
killed the "com.typesafe zinc.Nailgun" process.

I re-ran mvn with the command below:

./make-distribution.sh --name custom-spark --tgz  -Phadoop-2.6 -Phive
-Phive-thriftserver -Pyarn -Dmaven.version=3.0.4 -X -rf :spark-sql_2.11

I am getting the same error:

[warn]
/home/cloudera/Downloads/spark-1.6.2/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala:911:
method isDir in class FileStatus is deprecated: see corresponding Javadoc
for more information.
[warn] status.isDir,
[warn]^
[error] missing or invalid dependency detected while loading class file
'WebUI.class'.
[error] Could not access term eclipse in package org,
[error] because it (or its dependencies) are missing. Check your build
definition for
[error] missing or conflicting dependencies. (Re-run with `-Ylog-classpath`
to see the problematic classpath.)
[error] A full rebuild may help if 'WebUI.class' was compiled against an
incompatible version of org.
[error] missing or invalid dependency detected while loading class file
'WebUI.class'.
[error] Could not access term jetty in value org.eclipse,
[error] because it (or its dependencies) are missing. Check your build
definition for
[error] missing or conflicting dependencies. (Re-run with `-Ylog-classpath`
to see the problematic classpath.)
[error] A full rebuild may help if 'WebUI.class' was compiled against an
incompatible version of org.eclipse.
[warn] 17 warnings found
[error] two errors found
[debug] Compilation failed (CompilerInterface)
[error] Compile failed at Sep 3, 2016 11:28:34 AM [21.611s]
[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [5.583s]
[INFO] Spark Project Test Tags ... SUCCESS [4.189s]
[INFO] Spark Project Launcher  SUCCESS [12.226s]
[INFO] Spark Project Networking .. SUCCESS [13.386s]
[INFO] Spark Project Shuffle Streaming Service ... SUCCESS [6.723s]
[INFO] Spark Project Unsafe .. SUCCESS [21.231s]
[INFO] Spark Project Core  SUCCESS
[3:46.334s]
[INFO] Spark Project Bagel ... SUCCESS [7.032s]
[INFO] Spark Project GraphX .. SUCCESS [19.558s]
[INFO] Spark Project Streaming ... SUCCESS [50.452s]
[INFO] Spark Project Catalyst  SUCCESS
[1:14.172s]
[INFO] Spark Project SQL . FAILURE [23.222s]
[INFO] Spark Project ML Library .. SKIPPED
[INFO] Spark Project Tools ... SKIPPED
[INFO] Spark Project Hive  SKIPPED
[INFO] Spark Project Docker Integration Tests  SKIPPED
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Shuffle Service  SKIPPED
[INFO] Spark Project YARN  SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Flume Sink . SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External Flume Assembly . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project External MQTT Assembly .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] Spark Project External Kafka Assembly . SKIPPED
[INFO]

[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 7:45.641s
[INFO] Finished at: Sat Sep 03 11:28:34 IST 2016
[INFO] Final Memory: 49M/415M
[INFO]

[ERROR] Failed to execute goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
on project spark-sql_2.11: Execution scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
-> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
(scala-compile-first) on project spark-sql_2.11: Execution
scala-compile-first of goal
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)