Re: Spark handling spill overs

2016-05-12 Thread Mich Talebzadeh
Spill-overs are a common issue for in-memory computing systems; after all,
memory is limited. In Spark, where RDDs are immutable, if an RDD is created
with a size greater than half of the node's RAM, then a transformation that
generates the consequent RDD' can potentially fill all of the node's memory,
which can cause a spill-over into swap space.
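
As a minimal sketch (not from the thread) of one way to keep such a large RDD from pushing a node into swap, a disk-backed storage level lets Spark spill the excess to its own local directories; the input path below is purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spill-sketch"))

    // Hypothetical large input; the path is illustrative only.
    val big = sc.textFile("hdfs:///some/large/input")
      .map(line => (line.take(8), line.length.toLong))

    // MEMORY_AND_DISK_SER keeps what fits in memory (serialized) and spills
    // the remainder to Spark's local dirs rather than relying on OS swap.
    big.persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(big.reduceByKey(_ + _).count())
    sc.stop()
  }
}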

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 May 2016 at 00:38, Takeshi Yamamuro  wrote:

> Hi,
>
> Which version of Spark you use?
> The recent one cannot handle this kind of spilling, see:
> http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> .
>
> // maropu
>
> On Fri, May 13, 2016 at 8:07 AM, Ashok Kumar  > wrote:
>
>> Hi,
>>
>> How one can avoid having Spark spill over after filling the node's memory.
>>
>> Thanks
>>
>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Mich Talebzadeh
Is this a standalone setup on a single host where the executor runs inside the
driver?

Also run

free -t

to see the virtual memory usage, which is basically the swap space:

free -t
             total       used       free     shared    buffers     cached
Mem:      24546308   24268760     277548          0    1088236   15168668
-/+ buffers/cache:    8011856   16534452
Swap:      2031608        304    2031304
Total:    26577916   24269064    2308852


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 May 2016 at 07:36, Jone Zhang  wrote:

> mich, Do you want this
>
> ==
> [running]mqq@10.205.3.29:/data/home/hive/conf$ ps aux | grep SparkPi
> mqq  20070  3.6  0.8 10445048 267028 pts/16 Sl+ 13:09   0:11
> /data/home/jdk/bin/java
> -Dlog4j.configuration=file:///data/home/spark/conf/log4j.properties
> -cp
> /data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*:/data/home/spark/conf/:/data/home/spark/lib/spark-assembly-1.4.1-hadoop2.5.1_150903.jar:/data/home/spark/lib/datanucleus-api-jdo-3.2.6.jar:/data/home/spark/lib/datanucleus-core-3.2.10.jar:/data/home/spark/lib/datanucleus-rdbms-3.2.9.jar:/data/home/hadoop/conf/:/data/home/hadoop/conf/:/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master
> yarn-cluster --class org.apache.spark.examples.SparkPi --queue spark
> --num-executors 4
> /data/home/spark/lib/spark-examples-1.4.1-hadoop2.5.1.jar 1
> mqq  22410  0.0  0.0 110600  1004 pts/8S+   13:14   0:00 grep
> SparkPi
> [running]mqq@10.205.3.29:/data/home/hive/conf$ top -p 20070
>
> top - 13:14:48 up 504 days, 19:17, 19 users,  load average: 1.41, 1.10,
> 0.99
> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
> Cpu(s): 18.1%us,  2.7%sy,  0.0%ni, 74.4%id,  4.5%wa,  0.0%hi,  0.2%si,
> 0.0%st
> Mem:  32740732k total, 31606288k used,  113k free,   475908k buffers
> Swap:  2088952k total,61076k used,  2027876k free, 27594452k cached
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 20070 mqq   20   0 10.0g 260m  32m S  0.0  0.8   0:11.38 java
>
> ==
>
> Harsh, physical cpu cores is 1, virtual cpu cores is 4
>
> Thanks.
>
> 2016-05-13 13:08 GMT+08:00, Harsh J :
> > How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq
> >
> > You can also confirm the above by running the pmap utility on your
> process
> > and most of the virtual memory would be under 'anon'.
> >
> > On Fri, 13 May 2016 09:11 jone,  wrote:
> >
> >> The virtual memory is 9G When i run org.apache.spark.examples.SparkPi
> >> under yarn-cluster model,which using default configurations.
> >>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
> >> COMMAND
> >>
> >> 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85
> >> java
> >>  I am curious why is so high?
> >>
> >> Thanks.
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Jone Zhang
mich, Do you want this
==
[running]mqq@10.205.3.29:/data/home/hive/conf$ ps aux | grep SparkPi
mqq  20070  3.6  0.8 10445048 267028 pts/16 Sl+ 13:09   0:11
/data/home/jdk/bin/java
-Dlog4j.configuration=file:///data/home/spark/conf/log4j.properties
-cp 
/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*:/data/home/spark/conf/:/data/home/spark/lib/spark-assembly-1.4.1-hadoop2.5.1_150903.jar:/data/home/spark/lib/datanucleus-api-jdo-3.2.6.jar:/data/home/spark/lib/datanucleus-core-3.2.10.jar:/data/home/spark/lib/datanucleus-rdbms-3.2.9.jar:/data/home/hadoop/conf/:/data/home/hadoop/conf/:/data/home/spark/lib/*:/data/home/hadoop/share/hadoop/common/*:/data/home/hadoop/share/hadoop/common/lib/*:/data/home/hadoop/share/hadoop/yarn/*:/data/home/hadoop/share/hadoop/yarn/lib/*:/data/home/hadoop/share/hadoop/hdfs/*:/data/home/hadoop/share/hadoop/hdfs/lib/*:/data/home/hadoop/share/hadoop/tools/*:/data/home/hadoop/share/hadoop/mapreduce/*
-XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master
yarn-cluster --class org.apache.spark.examples.SparkPi --queue spark
--num-executors 4
/data/home/spark/lib/spark-examples-1.4.1-hadoop2.5.1.jar 1
mqq  22410  0.0  0.0 110600  1004 pts/8S+   13:14   0:00 grep SparkPi
[running]mqq@10.205.3.29:/data/home/hive/conf$ top -p 20070

top - 13:14:48 up 504 days, 19:17, 19 users,  load average: 1.41, 1.10, 0.99
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 18.1%us,  2.7%sy,  0.0%ni, 74.4%id,  4.5%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  32740732k total, 31606288k used,  113k free,   475908k buffers
Swap:  2088952k total,61076k used,  2027876k free, 27594452k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
20070 mqq   20   0 10.0g 260m  32m S  0.0  0.8   0:11.38 java
==

Harsh, physical cpu cores is 1, virtual cpu cores is 4

Thanks.

2016-05-13 13:08 GMT+08:00, Harsh J :
> How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq
>
> You can also confirm the above by running the pmap utility on your process
> and most of the virtual memory would be under 'anon'.
>
> On Fri, 13 May 2016 09:11 jone,  wrote:
>
>> The virtual memory is 9G When i run org.apache.spark.examples.SparkPi
>> under yarn-cluster model,which using default configurations.
>>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
>> COMMAND
>>
>> 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85
>> java
>>  I am curious why is so high?
>>
>> Thanks.
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
Rakesh
Have you used awaitTermination() on your ssc?
If not, add this and see if it changes the behavior.
I am guessing this issue may be related to the yarn deployment mode.
Also try setting the deployment mode to yarn-client.

Thanks
Deepak
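
For reference, a minimal sketch of the setup being suggested in this thread (spark.streaming.stopGracefullyOnShutdown plus awaitTermination), with an illustrative job body modelled on the one quoted below; this is not the original poster's code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object GracefulShutdownSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-shutdown-sketch")
      // Ask Spark's shutdown hook to stop the StreamingContext gracefully.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(5))

    // Same shape as the job in the thread: a constant RDD replayed each batch.
    val rdd = ssc.sparkContext.parallelize(1 to 300)
    val dstream = new ConstantInputDStream(ssc, rdd)
    dstream.foreachRDD { batch => println(s"batch size: ${batch.count()}") }

    ssc.start()
    // Block the driver until the context is stopped, e.g. by the shutdown
    // hook after a SIGTERM from YARN.
    ssc.awaitTermination()
  }
}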


On Fri, May 13, 2016 at 10:17 AM, Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:

> Ping!!
> Has anybody tested graceful shutdown of a spark streaming in yarn-cluster
> mode?It looks like a defect to me.
>
>
> On Thu, May 12, 2016 at 12:53 PM Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> We are on spark 1.5.1
>> Above change was to add a shutdown hook.
>> I am not adding shutdown hook in code, so inbuilt shutdown hook is being
>> called.
>> Driver signals that it is going to do a graceful shutdown, but the executor
>> sees that the Driver is dead and shuts down abruptly.
>> Could this issue be related to yarn? I see correct behavior locally. I
>> did "yarn kill " to kill the job.
>>
>>
>> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma 
>> wrote:
>>
>>> This is happening because spark context shuts down without shutting down
>>> the ssc first.
>>> This was behavior till spark 1.4 ans was addressed in later releases.
>>> https://github.com/apache/spark/pull/6307
>>>
>>> Which version of spark are you on?
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, May 12, 2016 at 12:14 PM, Rakesh H (Marketing Platform-BLR) <
>>> rakes...@flipkart.com> wrote:
>>>
 Yes, it seems to be the case.
 In this case executors should have continued logging values till 300,
 but they are shutdown as soon as i do "yarn kill .."

 On Thu, May 12, 2016 at 12:11 PM Deepak Sharma 
 wrote:

> So in your case , the driver is shutting down gracefully , but the
> executors are not.
> IS this the problem?
>
> Thanks
> Deepak
>
> On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> Yes, it is set to true.
>> Log of driver :
>>
>> 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: 
>> SIGTERM
>> 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
>> stop(stopGracefully=true) from shutdown hook
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
>> gracefully
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
>> blocks to be consumed for job generation
>> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received 
>> blocks to be consumed for job generation
>>
>> Log of executor:
>> 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
>> xx.xx.xx.xx:x disassociated! Shutting down.
>> 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association 
>> with remote system [xx.xx.xx.xx:x] has failed, address is now gated 
>> for [5000] ms. Reason: [Disassociated]
>> 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
>> 204 //This is value i am logging
>> 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
>> 205
>> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
>> 206
>>
>>
>>
>>
>>
>>
>> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
>> wrote:
>>
>>> Hi Rakesh
>>> Did you try setting spark.streaming.stopGracefullyOnShutdown to
>>> true for your Spark configuration instance?
>>> If not try this , and let us know if this helps.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR)
>>>  wrote:
>>>
 Issue i am having is similar to the one mentioned here :

 http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn

 I am creating a rdd from sequence of 1 to 300 and creating
 streaming RDD out of it.

 val rdd = ssc.sparkContext.parallelize(1 to 300)
 val dstream = new ConstantInputDStream(ssc, rdd)
 dstream.foreachRDD{ rdd =>
   rdd.foreach{ x =>
 log(x)
 Thread.sleep(50)
   }
 }


 When i kill this job, i expect elements 1 to 300 to be logged
 before shutting down. It is indeed the case when i run it locally. It 
 wait
 for the job to finish before shutting down.

 But when i launch the job in custer with "yarn-cluster" mode, it
 abruptly shuts down.
 Executor prints following log

 ERROR executor.CoarseGrainedExecutorBackend:
 Driver xx.xx.xx.xxx:y disassociated! Shutting down.

  and then it shuts down. It is not a graceful shutdown.

Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Harsh J
How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq

You can also confirm the above by running the pmap utility on your process
and most of the virtual memory would be under 'anon'.

On Fri, 13 May 2016 09:11 jone,  wrote:

> The virtual memory is 9G When i run org.apache.spark.examples.SparkPi
> under yarn-cluster model,which using default configurations.
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
> COMMAND
>
> 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85
> java
>  I am curious why is so high?
>
> Thanks.
>


Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Mich Talebzadeh
can you please do the following:

jps|grep SparkSubmit|

and send the output of

ps aux|grep pid
top -p PID

and the output of

free

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 May 2016 at 04:40, jone  wrote:

> The virtual memory is 9G When i run org.apache.spark.examples.SparkPi
> under yarn-cluster model,which using default configurations.
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+
> COMMAND
>
> 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85
> java
>  I am curious why is so high?
>
> Thanks.
>


Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you
are running into this?

On Thursday, May 12, 2016, Dood@ODDO  wrote:

> Hello all,
>
> I have been programming for years but this has me baffled.
>
> I have an RDD[(String,Int)] that I return from a function after extensive
> manipulation of an initial RDD of a different type. When I return this RDD
> and initiate the .collectAsMap() on it from the caller, I get an empty
> Map().
>
> If I copy and paste the code from the function into the caller (same exact
> code) and produce the same RDD and call collectAsMap() on it, I get the Map
> with all the expected information in it.
>
> What gives?
>
> Does Spark defy programming principles or am I crazy? ;-)
>
> Thanks!
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
Ping!!
Has anybody tested graceful shutdown of a Spark streaming job in yarn-cluster
mode? It looks like a defect to me.

On Thu, May 12, 2016 at 12:53 PM Rakesh H (Marketing Platform-BLR) <
rakes...@flipkart.com> wrote:

> We are on spark 1.5.1
> Above change was to add a shutdown hook.
> I am not adding shutdown hook in code, so inbuilt shutdown hook is being
> called.
> Driver signals that it is going to do a graceful shutdown, but the executor sees
> that the Driver is dead and shuts down abruptly.
> Could this issue be related to yarn? I see correct behavior locally. I did
> "yarn kill " to kill the job.
>
>
> On Thu, May 12, 2016 at 12:28 PM Deepak Sharma 
> wrote:
>
>> This is happening because spark context shuts down without shutting down
>> the ssc first.
>> This was behavior till spark 1.4 ans was addressed in later releases.
>> https://github.com/apache/spark/pull/6307
>>
>> Which version of spark are you on?
>>
>> Thanks
>> Deepak
>>
>> On Thu, May 12, 2016 at 12:14 PM, Rakesh H (Marketing Platform-BLR) <
>> rakes...@flipkart.com> wrote:
>>
>>> Yes, it seems to be the case.
>>> In this case executors should have continued logging values till 300,
>>> but they are shutdown as soon as i do "yarn kill .."
>>>
>>> On Thu, May 12, 2016 at 12:11 PM Deepak Sharma 
>>> wrote:
>>>
 So in your case , the driver is shutting down gracefully , but the
 executors are not.
 IS this the problem?

 Thanks
 Deepak

 On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
 rakes...@flipkart.com> wrote:

> Yes, it is set to true.
> Log of driver :
>
> 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: 
> SIGTERM
> 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
> stop(stopGracefully=true) from shutdown hook
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
> gracefully
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
> blocks to be consumed for job generation
> 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received 
> blocks to be consumed for job generation
>
> Log of executor:
> 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
> xx.xx.xx.xx:x disassociated! Shutting down.
> 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association 
> with remote system [xx.xx.xx.xx:x] has failed, address is now gated 
> for [5000] ms. Reason: [Disassociated]
> 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
> 204 //This is value i am logging
> 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
> 205
> 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
> 206
>
>
>
>
>
>
> On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
> wrote:
>
>> Hi Rakesh
>> Did you try setting spark.streaming.stopGracefullyOnShutdown to
>> true for your Spark configuration instance?
>> If not try this , and let us know if this helps.
>>
>> Thanks
>> Deepak
>>
>> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
>> rakes...@flipkart.com> wrote:
>>
>>> Issue i am having is similar to the one mentioned here :
>>>
>>> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>>>
>>> I am creating a rdd from sequence of 1 to 300 and creating streaming
>>> RDD out of it.
>>>
>>> val rdd = ssc.sparkContext.parallelize(1 to 300)
>>> val dstream = new ConstantInputDStream(ssc, rdd)
>>> dstream.foreachRDD{ rdd =>
>>>   rdd.foreach{ x =>
>>> log(x)
>>> Thread.sleep(50)
>>>   }
>>> }
>>>
>>>
>>> When i kill this job, i expect elements 1 to 300 to be logged before
>>> shutting down. It is indeed the case when i run it locally. It wait for 
>>> the
>>> job to finish before shutting down.
>>>
>>> But when i launch the job in custer with "yarn-cluster" mode, it
>>> abruptly shuts down.
>>> Executor prints following log
>>>
>>> ERROR executor.CoarseGrainedExecutorBackend:
>>> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>>>
>>>  and then it shuts down. It is not a graceful shutdown.
>>>
>>> Anybody knows how to do it in yarn ?
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>


 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>


Re: sbt for Spark build with Scala 2.11

2016-05-12 Thread Luciano Resende
Spark has moved to build using Scala 2.11 by default in master/trunk.

As for the 2.0.0-SNAPSHOT, it is actually the version of master/trunk and
you might be missing some modules/profiles for your build. What command did
you use to build ?

On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello All,
>
> I built Spark from the source code available at
> https://github.com/apache/spark/. Although I haven't specified the
> "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
> see that it ended up using Scala 2.11. Now, for my application sbt, what
> should be the spark version? I tried the following
>
> val spark = "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
> val sparksql = "org.apache.spark" % "spark-sql_2.11" % "2.0.0-SNAPSHOT"
>
> and scalaVersion := "2.11.8"
>
> But this setting of spark version gives sbt error
>
> unresolved dependency: org.apache.spark#spark-core_2.11;2.0.0-SNAPSHOT
>
> I guess this is because the repository doesn't contain 2.0.0-SNAPSHOT.
> Does this mean, the only option is to put all the required jars in the lib
> folder (unmanaged dependencies)?
>
> Regards,
> Raghava.
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
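
One hedged way to make the locally built SNAPSHOT resolvable from sbt is to install the Spark build into the local Maven repository first (for example, mvn -DskipTests install from the Spark source tree) and then point sbt at it. A sketch of the relevant build.sbt lines, with versions taken from the thread:

scalaVersion := "2.11.8"

resolvers += Resolver.mavenLocal

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0-SNAPSHOT" % "provided"
)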


sbt for Spark build with Scala 2.11

2016-05-12 Thread Raghava Mutharaju
Hello All,

I built Spark from the source code available at
https://github.com/apache/spark/. Although I haven't specified the
"-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
see that it ended up using Scala 2.11. Now, for my application sbt, what
should be the spark version? I tried the following

val spark = "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"
val sparksql = "org.apache.spark" % "spark-sql_2.11" % "2.0.0-SNAPSHOT"

and scalaVersion := "2.11.8"

But this setting of spark version gives sbt error

unresolved dependency: org.apache.spark#spark-core_2.11;2.0.0-SNAPSHOT

I guess this is because the repository doesn't contain 2.0.0-SNAPSHOT. Does
this mean, the only option is to put all the required jars in the lib
folder (unmanaged dependencies)?

Regards,
Raghava.


High virtual memory consumption on spark-submit client.

2016-05-12 Thread jone
The virtual memory is 9G when I run org.apache.spark.examples.SparkPi under yarn-cluster mode, using the default configurations.
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
 4519 mqq   20   0 9041m 248m  26m S  0.3  0.8   0:19.85 java  
 I am curious why it is so high?
Thanks.


Why spark 1.6.1 run so slow?

2016-05-12 Thread sunday2000
Hi,
   When we use Spark 1.6.1 to word count a file of 25M bytes, with 2 nodes in 
cluster mode, it costs 10 seconds to finish the task.
  
   Why so slow? Could you tell why?

Confused - returning RDDs from functions

2016-05-12 Thread Dood

Hello all,

I have been programming for years but this has me baffled.

I have an RDD[(String,Int)] that I return from a function after 
extensive manipulation of an initial RDD of a different type. When I 
return this RDD and initiate the .collectAsMap() on it from the caller, 
I get an empty Map().


If I copy and paste the code from the function into the caller (same 
exact code) and produce the same RDD and call collectAsMap() on it, I 
get the Map with all the expected information in it.


What gives?

Does Spark defy programming principles or am I crazy? ;-)

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
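
For readers following along, a hypothetical minimal version of the pattern described above (the data and function are invented, not the poster's code; in this sketch the returned RDD collects as expected, which is the behaviour Holden calls expected):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReturnRddSketch {
  // Build and return an RDD[(String, Int)] after transforming an input RDD
  // of a different type, as described in the question.
  def buildCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("return-rdd-sketch").setMaster("local[2]"))
    val lines = sc.parallelize(Seq("a b a", "c b"))

    // Calling collectAsMap() on the RDD returned from the function.
    val counts = buildCounts(lines).collectAsMap()
    println(counts)  // Map(a -> 2, b -> 2, c -> 1)
    sc.stop()
  }
}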



Re: Joining a RDD to a Dataframe

2016-05-12 Thread Cyril Scetbon
Nobody has the answer ? 

Another thing I've seen is that if I have no documents at all : 

scala> df.select(explode(df("addresses.id")).as("aid")).collect
res27: Array[org.apache.spark.sql.Row] = Array()

Then

scala> df.select(explode(df("addresses.id")).as("aid"), df("id"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
(adresses);
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)

Is there a better way to query nested objects and to join between a DF 
containing nested objects and another regular data frame (yes, it's the current 
case)?

> On May 9, 2016, at 00:42, Cyril Scetbon  wrote:
> 
> Hi Ashish,
> 
> The issue is not related to converting a RDD to a DF. I did it. I was just 
> asking if I should do it differently.
> 
> The issue regards the exception when using array_contains with a sql.Column 
> instead of a value.
> 
> I found another way to do it using explode as follows : 
> 
> df.select(explode(df("addresses.id")).as("aid"), df("id")).join(df_input, 
> $"aid" === df_input("id")).select(df("id"))
> 
> However, I'm wondering if it does almost the same or if the query is 
> different and worst in term of performance.
> 
> If someone can comment on it and maybe give me advices.
> 
> Thank you.
> 
>> On May 8, 2016, at 22:12, Ashish Dubey > > wrote:
>> 
>> Is there any reason you dont want to convert this - i dont think join b/w 
>> RDD and DF is supported.
>> 
>> On Sat, May 7, 2016 at 11:41 PM, Cyril Scetbon > > wrote:
>> Hi,
>> 
>> I have a RDD built during a spark streaming job and I'd like to join it to a 
>> DataFrame (E/S input) to enrich it.
>> It seems that I can't join the RDD and the DF without converting first the 
>> RDD to a DF (Tell me if I'm wrong). Here are the schemas of both DF :
>> 
>> scala> df
>> res32: org.apache.spark.sql.DataFrame = [f1: string, addresses: 
>> array>, id: string]
>> 
>> scala> df_input
>> res33: org.apache.spark.sql.DataFrame = [id: string]
>> 
>> scala> df_input.collect
>> res34: Array[org.apache.spark.sql.Row] = Array([idaddress2], [idaddress12])
>> 
>> I can get ids I want if I know the value to look for in addresses.id 
>>  using :
>> 
>> scala> df.filter(array_contains(df("addresses.id "), 
>> "idaddress2")).select("id").collect
>> res35: Array[org.apache.spark.sql.Row] = Array([], [YY])
>> 
>> However when I try to join df_input and df and to use the previous filter as 
>> the join condition I get an exception :
>> 
>> scala> df.join(df_input, array_contains(df("adresses.id 
>> "), df_input("id")))
>> java.lang.RuntimeException: Unsupported literal type class 
>> org.apache.spark.sql.Column id
>> at 
>> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:50)
>> at 
>> org.apache.spark.sql.functions$.array_contains(functions.scala:2452)
>> ...
>> 
>> It seems that array_contains only supports static arguments and does not 
>> replace a sql.Column by its value.
>> 
>> What's the best way to achieve what I want to do ? (Also speaking in term of 
>> performance)
>> 
>> Thanks
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> 
>> 
>> 
> 
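
To make the explode-based approach above concrete, here is a self-contained sketch with invented rows (the struct fields beyond addresses.id, and all values, are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

object ExplodeJoinSketch {
  case class Address(id: String)
  case class Person(f1: String, addresses: Seq[Address], id: String)
  case class Lookup(id: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("explode-join-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      Person("x", Seq(Address("idaddress2")), "XX"),
      Person("y", Seq(Address("idaddress12")), "YY"))).toDF()
    val df_input = sc.parallelize(Seq(Lookup("idaddress2"))).toDF()

    // Flatten the nested address ids, then join on the flattened column.
    df.select(explode(df("addresses.id")).as("aid"), df("id"))
      .join(df_input, $"aid" === df_input("id"))
      .select(df("id"))
      .show()

    sc.stop()
  }
}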



Re:Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in the previous mail was wrong. 


Here is the real link:


http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html











At 2016-05-13 09:49:05, "李明伟"  wrote:

It seems we hit the same issue.


There was a bug on 1.5.1 about memory leak. But I am using 1.6.1


Here is the link about the bug in 1.5.1 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.


If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
To unsubscribe from Will the HiveContext cause memory leak ?, click here.
NAML




 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26947.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
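
For context, a hypothetical sketch of the streaming-plus-SQL pattern Simon describes (socket source, window, a DataFrame per window, groupBy/agg, then show); all names, ports and the query itself are invented:

import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSqlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sql-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val hiveContext = new HiveContext(ssc.sparkContext)
    import hiveContext.implicits._

    // Key/value pairs arriving on a socket, as in the description above.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(",", 2))
      .filter(_.length == 2)
      .map(a => (a(0), a(1)))

    pairs.window(Seconds(30)).foreachRDD { rdd =>
      // A DataFrame is created per window and queried; only show() is called.
      val df = rdd.toDF("key", "value")
      df.groupBy("key").count().show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}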

Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to a specific bug. 

Can you send the correct link ?

Thanks 

> On May 12, 2016, at 6:50 PM, "kramer2...@126.com"  wrote:
> 
> It seems we hit the same issue.
> 
> There was a bug on 1.5.1 about memory leak. But I am using 1.6.1
> 
> Here is the link about the bug in 1.5.1 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> 
> 
> 
> 
> 
> At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" <[hidden 
> email]> wrote:
> I read with Spark-Streaming from a Port. The incoming data consists of key 
> and value pairs. Then I call forEachRDD on each window. There I create a 
> Dataset from the window and do some SQL Querys on it. On the result i only do 
> show, to see the content. It works well, but the memory usage increases. When 
> it reaches the maximum nothing works anymore. When I use more memory. The 
> Program runs some time longer, but the problem persists. Because I run a 
> Programm which writes to the Port, I can control perfectly how much Data 
> Spark has to Process. When I write every one ms one key and value Pair the 
> Problem is the same as when i write only every second a key and value pair to 
> the port. 
> 
> When I dont create a Dataset in the foreachRDD and only count the Elements in 
> the RDD, then everything works fine. I also use groupBy agg functions in the 
> querys. 
> 
> If you reply to this email, your message will be added to the discussion 
> below:
> http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
> To unsubscribe from Will the HiveContext cause memory leak ?, click here.
> NAML
> 
> 
>  
> 
> 
> View this message in context: Re:Re: Re:Re: Will the HiveContext cause memory 
> leak ?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue.


There was a bug on 1.5.1 about memory leak. But I am using 1.6.1


Here is the link about the bug in 1.5.1 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark






At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]" 
 wrote:
I read with Spark-Streaming from a Port. The incoming data consists of key and 
value pairs. Then I call forEachRDD on each window. There I create a Dataset 
from the window and do some SQL Querys on it. On the result i only do show, to 
see the content. It works well, but the memory usage increases. When it reaches 
the maximum nothing works anymore. When I use more memory. The Program runs 
some time longer, but the problem persists. Because I run a Programm which 
writes to the Port, I can control perfectly how much Data Spark has to Process. 
When I write every one ms one key and value Pair the Problem is the same as 
when i write only every second a key and value pair to the port.

When I dont create a Dataset in the foreachRDD and only count the Elements in 
the RDD, then everything works fine. I also use groupBy agg functions in the 
querys.


If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26940.html
To unsubscribe from Will the HiveContext cause memory leak ?, click here.
NAML



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Will-the-HiveContext-cause-memory-leak-tp26921p26946.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Telmo Rodrigues
Thank you Takeshi.

After executing df3.explain(true) I realised that the Optimiser batches are
being performed, along with the predicate push-down.

I think that only the analyser batches are executed when creating the data
frame via context.sql(query). It seems that the optimiser batches are
executed when some action like collect or explain takes place.

scala> d3.explain(true)
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 't.id to id#0L
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 'u.id to id#1L
16/05/13 02:08:12 DEBUG Analyzer$ResolveReferences: Resolving 't.id to id#0L
16/05/13 02:08:12 DEBUG SQLContext$$anon$1:
== Parsed Logical Plan ==
'Project [unresolvedalias(*)]
+- 'Filter ('t.id = 1)
   +- 'Join Inner, Some(('t.id = 'u.id))
  :- 'UnresolvedRelation `t`, None
  +- 'UnresolvedRelation `u`, None

== Analyzed Logical Plan ==
id: bigint, id: bigint
Project [id#0L,id#1L]
+- Filter (id#0L = cast(1 as bigint))
   +- Join Inner, Some((id#0L = id#1L))
  :- Subquery t
  :  +- Relation[id#0L] JSONRelation
  +- Subquery u
 +- Relation[id#1L] JSONRelation

== Optimized Logical Plan ==
Project [id#0L,id#1L]
+- Join Inner, Some((id#0L = id#1L))
   :- Filter (id#0L = 1)
   :  +- Relation[id#0L] JSONRelation
   +- Relation[id#1L] JSONRelation

== Physical Plan ==
Project [id#0L,id#1L]
+- BroadcastHashJoin [id#0L], [id#1L], BuildRight
   :- Filter (id#0L = 1)
   :  +- Scan JSONRelation[id#0L] InputPaths: file:/persons.json,
PushedFilters: [EqualTo(id,1)]
   +- Scan JSONRelation[id#1L] InputPaths: file:/cars.json
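
Relatedly, the analyzed and optimized plans can be inspected without triggering a job, via the DataFrame's queryExecution (a developer API in Spark 1.6). A small sketch assuming the same tables t and u as above:

// Assumes the same setup as above: tables "t" and "u" registered from JSON files.
val d3 = sqlContext.sql("select * from t join u on t.id = u.id where t.id = 1")

// Creating the DataFrame runs only the parser/analyzer; the optimizer and
// physical planner run lazily when these plans (or an action) are evaluated.
val analyzed  = d3.queryExecution.analyzed       // post-analysis logical plan
val optimized = d3.queryExecution.optimizedPlan  // after optimizer batches (e.g. predicate pushdown)
println(optimized.numberedTreeString)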


2016-05-12 16:34 GMT+01:00 Takeshi Yamamuro :

> Hi,
>
> What's the result of `df3.explain(true)`?
>
> // maropu
>
> On Thu, May 12, 2016 at 10:04 AM, Telmo Rodrigues <
> telmo.galante.rodrig...@gmail.com> wrote:
>
>> I'm building spark from branch-1.6 source with mvn -DskipTests package
>> and I'm running the following code with spark shell.
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>
>> import sqlContext.implicits._
>>
>> val df = sqlContext.read.json("persons.json")
>> val df2 = sqlContext.read.json("cars.json")
>>
>> df.registerTempTable("t")
>> df2.registerTempTable("u")
>>
>> val d3 = sqlContext.sql("select * from t join u on t.id = u.id where t.id = 1")
>>
>> With the log4j root category level WARN, the last printed messages refers
>> to the Batch Resolution rules execution.
>>
>> === Result of Batch Resolution ===
>> !'Project [unresolvedalias(*)]  Project [id#0L,id#1L]
>> !+- 'Filter ('t.id = 1) +- Filter (id#0L = cast(1 as
>> bigint))
>> !   +- 'Join Inner, Some(('t.id = 'u.id))  +- Join Inner,
>> Some((id#0L = id#1L))
>> !  :- 'UnresolvedRelation `t`, None   :- Subquery t
>> !  +- 'UnresolvedRelation `u`, None   :  +- Relation[id#0L]
>> JSONRelation
>> ! +- Subquery u
>> !+- Relation[id#1L]
>> JSONRelation
>>
>>
>> I think that only the analyser rules are being executed.
>>
>> The optimiser rules should not to run in this case?
>>
>> 2016-05-11 19:24 GMT+01:00 Michael Armbrust :
>>
>>>
 logical plan after optimizer execution:

 Project [id#0L,id#1L]
 !+- Filter (id#0L = cast(1 as bigint))
 !   +- Join Inner, Some((id#0L = id#1L))
 !  :- Subquery t
 !  :  +- Relation[id#0L] JSONRelation
 !  +- Subquery u
 !  +- Relation[id#1L] JSONRelation

>>>
>>> I think you are mistaken.  If this was the optimized plan there would be
>>> no subqueries.
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: XML Processing using Spark SQL

2016-05-12 Thread Mail.com
Hi Arun,

Could you try using Stax or JaxB.

Thanks,
Pradeep
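
A hedged sketch of what the StAX route could look like for self-closing Record elements (the element and attribute names here are assumptions based on the sample quoted below, and the sketch is untested against a real export):

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ArrayBuffer

object HealthKitStaxSketch {
  // Pull a few attributes out of every <Record .../> element.
  def parse(path: String): Seq[(String, String, String, String)] = {
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(path))
    val rows = ArrayBuffer[(String, String, String, String)]()
    while (reader.hasNext) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT &&
          reader.getLocalName == "Record") {
        rows += ((reader.getAttributeValue(null, "type"),
                  reader.getAttributeValue(null, "sourceName"),
                  reader.getAttributeValue(null, "sourceVersion"),
                  reader.getAttributeValue(null, "value")))
      }
    }
    reader.close()
    rows
  }
}

// e.g. sc.parallelize(HealthKitStaxSketch.parse("export.xml"))
//        .toDF("type", "sourceName", "sourceVersion", "value")  // needs sqlContext.implicits._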

> On May 12, 2016, at 8:35 PM, Hyukjin Kwon  wrote:
> 
> Hi Arunkumar,
> 
> 
> I guess your records are self-closing ones.
> 
> There is an issue open here, https://github.com/databricks/spark-xml/issues/92
> 
> This is about XmlInputFormat.scala and it seems a bit tricky to handle the 
> case so I left open until now.
> 
> 
> Thanks!
> 
> 
> 2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :
>> Hello,
>> 
>> Greetings.
>> 
>> I'm trying to process a xml file exported from Health Kit application using 
>> Spark SQL for learning purpose. The sample record data is like the below:
>> 
>>  > sourceVersion="9.3" device="<, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3>" 
>> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23 
>> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>> 
>>  > sourceVersion="9.3.1" device="<, name:iPhone, 
>> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1>" 
>> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24 
>> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>> 
>> I want to have the column name of my table as the field value like type, 
>> sourceName, sourceVersion and the row entries as their respective values 
>> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>> 
>> I took a look at the Spark-XML, but didn't get any information in my case 
>> (my xml is not well formed with the tags). Is there any other option to 
>> convert the record that I have mentioned above into a schema format for 
>> playing with Spark SQL?
>> 
>> Thanks in Advance.
>> 
>> Thank You,
>> Arun Chandrasekar
>> chan.arunku...@gmail.com
> 


Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar,


I guess your records are self-closing ones.

There is an issue open here,
https://github.com/databricks/spark-xml/issues/92

This is about XmlInputFormat.scala and it seems a bit tricky to handle the
case so I left open until now.


Thanks!


2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :

> Hello,
>
> Greetings.
>
> I'm trying to process a xml file exported from Health Kit application
> using Spark SQL for learning purpose. The sample record data is like the
> below:
>
>   sourceVersion="9.3" device="<, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3>"
> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23
> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>
>   sourceVersion="9.3.1" device="<, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1>"
> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24
> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>
> I want to have the column name of my table as the field value like type,
> sourceName, sourceVersion and the row entries as their respective values
> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>
> I took a look at the Spark-XML ,
> but didn't get any information in my case (my xml is not well formed with
> the tags). Is there any other option to convert the record that I have
> mentioned above into a schema format for playing with Spark SQL?
>
> Thanks in Advance.
>
> *Thank You*,
> Arun Chandrasekar
> chan.arunku...@gmail.com
>


RE: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Yong Zhang
Not sure what you mean? You want to have exactly one query that runs fine in 
both SQLContext and HiveContext? The query parsers are different; why do you 
want to have this feature? Do I understand your question correctly?
Yong

Date: Thu, 12 May 2016 13:09:34 +0200
Subject: SQLContext and HiveContext parse a query string differently ?
From: inv...@gmail.com
To: user@spark.apache.org

Hi,

I just want to figure out why the two contexts behave differently, even on a 
simple query. In a nutshell, I have a query in which there is a String 
containing a single quote and a cast to Array/Map. I have tried all the 
combinations of the different types of SQL context and query call API (sql, 
df.select, df.selectExpr). I can't find one that rules them all.

Here is the code for reproducing the problem:
-
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {

  val sc  = new SparkContext("local[2]", "test", new SparkConf)
  val hiveContext = new HiveContext(sc)
  val sqlContext  = new SQLContext(sc)

  val context = hiveContext
  //  val context = sqlContext

  import context.implicits._

  val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
  df.registerTempTable("tbl")
  df.printSchema()

  // case 1
  context.sql("select cast(a as array<string>) from tbl").show()
  // HiveContext => org.apache.spark.sql.AnalysisException: cannot recognize input
  //   near 'array' '<' 'string' in primitive type specification; line 1 pos 17
  // SQLContext => OK

  // case 2
  context.sql("select 'a\\'b'").show()
  // HiveContext => OK
  // SQLContext => failure: ``union'' expected but ErrorToken(unclosed string literal) found

  // case 3
  df.selectExpr("cast(a as array<string>)").show() // OK with HiveContext and SQLContext

  // case 4
  df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext => failure: end of input expected
}
-
Any clarification / workaround is highly appreciated.
-- 
Hao Ren
Data Engineer @ leboncoin
Paris, France
  

Re: Spark handling spill overs

2016-05-12 Thread Takeshi Yamamuro
Hi,

Which version of Spark do you use?
The recent one cannot handle this kind of spilling, see:
http://spark.apache.org/docs/latest/tuning.html#memory-management-overview.

// maropu

On Fri, May 13, 2016 at 8:07 AM, Ashok Kumar 
wrote:

> Hi,
>
> How one can avoid having Spark spill over after filling the node's memory.
>
> Thanks
>
>
>
>


-- 
---
Takeshi Yamamuro


Spark handling spill overs

2016-05-12 Thread Ashok Kumar
Hi,
How can one avoid having Spark spill over after filling the node's memory?
Thanks




Re: apache spark on gitter?

2016-05-12 Thread Xinh Huynh
I agree that it can help build a community and be a place for real-time
conversations.

Xinh

On Thu, May 12, 2016 at 12:28 AM, Paweł Szulc  wrote:

> Hi,
>
> well I guess the advantage of gitter over a mailing list is the same as with
> IRC. It's not actually a replacement, because the mailing list is also important.
> But it is a lot easier to build a community around a tool with the ad-hoc ability
> to connect with each other.
>
> I have gitter running constantly; I visit my favorite OSS projects on
> it from time to time to read what has recently happened. It allows me to
> stay in touch with the project and help fellow developers with problems
> they have.
> One might argue that you can achieve the same with a mailing list; well, it's
> hard for me to put this into words. A mailing list is more of an async
> nature (which is good!) but sometimes you need a more "real-time"
> experience. You know, engaging in the conversation in the given moment, not
> a conversation that might last a few days :)
>
> TLDR: It is not a replacement, it's a supplement to build the community
> around OSS. Worth having for real-time conversations.
>
> On Wed, May 11, 2016 at 10:24 PM, Xinh Huynh  wrote:
>
>> Hi Pawel,
>>
>> I'd like to hear more about your idea. Could you explain more why you
>> would like to have a gitter channel? What are the advantages over a mailing
>> list (like this one)? Have you had good experiences using gitter on other
>> open source projects?
>>
>> Xinh
>>
>> On Wed, May 11, 2016 at 11:10 AM, Sean Owen  wrote:
>>
>>> I don't know of a gitter channel and I don't use it myself, FWIW. I
>>> think anyone's welcome to start one.
>>>
>>> I hesitate to recommend this, simply because it's preferable to have
>>> one place for discussion rather than split it over several, and, we
>>> have to keep the @spark.apache.org mailing lists as the "forums of
>>> records" for project discussions.
>>>
>>> If something like gitter doesn't attract any chat, then it doesn't add
>>> any value. If it does though, then suddenly someone needs to subscribe
>>> to user@ and gitter to follow all of the conversations.
>>>
>>> I think there is a bit of a scalability problem on the user@ list at
>>> the moment, just because it covers all of Spark. But adding a
>>> different all-Spark channel doesn't help that.
>>>
>>> Anyway maybe that's "why"
>>>
>>>
>>> On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc 
>>> wrote:
>>> > no answer, but maybe one more time, a gitter channel for spark users
>>> would
>>> > be a good idea!
>>> >
>>> > On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc 
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I was wondering - why Spark does not have a gitter channel?
>>> >>
>>> >> --
>>> >> Regards,
>>> >> Paul Szulc
>>> >>
>>> >> twitter: @rabbitonweb
>>> >> blog: www.rabbitonweb.com
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Paul Szulc
>>> >
>>> > twitter: @rabbitonweb
>>> > blog: www.rabbitonweb.com
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Regards,
> Paul Szulc
>
> twitter: @rabbitonweb
> blog: www.rabbitonweb.com
>


Re: How to get and save core dump of native library in executors

2016-05-12 Thread prateek arora
ubuntu 14.04

On Thu, May 12, 2016 at 2:40 PM, Ted Yu  wrote:

> Which OS are you using ?
>
> See http://en.linuxreviews.org/HOWTO_enable_core-dumps
>
> On Thu, May 12, 2016 at 2:23 PM, prateek arora  > wrote:
>
>> Hi
>>
>> I am running my spark application with some third party native libraries .
>> but it crashes some time and show error " Failed to write core dump. Core
>> dumps have been disabled. To enable core dumping, try "ulimit -c
>> unlimited"
>> before starting Java again " .
>>
>> Below are the log :
>>
>>  A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGSEGV (0xb) at pc=0x7fd44b491fb9, pid=20458, tid=140549318547200
>> #
>> # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
>> 1.7.0_67-b01)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
>> linux-amd64 compressed oops)
>> # Problematic frame:
>> # V  [libjvm.so+0x650fb9]  jni_SetByteArrayRegion+0xa9
>> #
>> # Failed to write core dump. Core dumps have been disabled. To enable core
>> dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> #
>>
>> /yarn/nm/usercache/master/appcache/application_1462930975871_0004/container_1462930975871_0004_01_66/hs_err_pid20458.log
>> #
>> # If you would like to submit a bug report, please visit:
>> #   http://bugreport.sun.com/bugreport/crash.jsp
>> #
>>
>>
>> so how can i enable core dump and save it some place ?
>>
>> Regards
>> Prateek
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to get and save core dump of native library in executors

2016-05-12 Thread Ted Yu
Which OS are you using ?

See http://en.linuxreviews.org/HOWTO_enable_core-dumps

On Thu, May 12, 2016 at 2:23 PM, prateek arora 
wrote:

> Hi
>
> I am running my spark application with some third party native libraries .
> but it crashes some time and show error " Failed to write core dump. Core
> dumps have been disabled. To enable core dumping, try "ulimit -c unlimited"
> before starting Java again " .
>
> Below are the log :
>
>  A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7fd44b491fb9, pid=20458, tid=140549318547200
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
> 1.7.0_67-b01)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x650fb9]  jni_SetByteArrayRegion+0xa9
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> #
>
> /yarn/nm/usercache/master/appcache/application_1462930975871_0004/container_1462930975871_0004_01_66/hs_err_pid20458.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
>
>
> so how can i enable core dump and save it some place ?
>
> Regards
> Prateek
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to get and save core dump of native library in executors

2016-05-12 Thread prateek arora
Hi

I am running my spark application with some third-party native libraries,
but it sometimes crashes and shows the error " Failed to write core dump. Core
dumps have been disabled. To enable core dumping, try "ulimit -c unlimited"
before starting Java again ". 

Below are the log :

 A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7fd44b491fb9, pid=20458, tid=140549318547200
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x650fb9]  jni_SetByteArrayRegion+0xa9
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
#
/yarn/nm/usercache/master/appcache/application_1462930975871_0004/container_1462930975871_0004_01_66/hs_err_pid20458.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#


So how can I enable core dumps and save them somewhere?

Regards
Prateek




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-and-save-core-dump-of-native-library-in-executors-tp26945.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Mich Talebzadeh
yep the same error I got

root
 |-- a: array (nullable = true)
 ||-- element: integer (containsNull = false)
 |-- b: integer (nullable = false)
NoViableAltException(35@[])
at
org.apache.hadoop.hive.ql.parse.HiveParser.primitiveType(HiveParser.java:38886)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.castExpression(HiveParser_IdentifiersParser.java:4336)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6235)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceFieldExpression(HiveParser_IdentifiersParser.java:6383)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnaryPrefixExpression(HiveParser_IdentifiersParser.java:6768)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnarySuffixExpression(HiveParser_IdentifiersParser.java:6828)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseXorExpression(HiveParser_IdentifiersParser.java:7012)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceStarExpression(HiveParser_IdentifiersParser.java:7172)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedencePlusExpression(HiveParser_IdentifiersParser.java:7332)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAmpersandExpression(HiveParser_IdentifiersParser.java:7483)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseOrExpression(HiveParser_IdentifiersParser.java:7634)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8164)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9177)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAndExpression(HiveParser_IdentifiersParser.java:9296)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceOrExpression(HiveParser_IdentifiersParser.java:9455)
at
org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.expression(HiveParser_IdentifiersParser.java:6105)
at
org.apache.hadoop.hive.ql.parse.HiveParser.expression(HiveParser.java:45846)
at
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectItem(HiveParser_SelectClauseParser.java:2907)
at
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectList(HiveParser_SelectClauseParser.java:1373)
at
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:1128)
at
org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:45817)
at
org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:41495)
at
org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:41402)
at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:40413)
at
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:40283)
at
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1590)
at
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at
scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:22

XML Processing using Spark SQL

2016-05-12 Thread Arunkumar Chandrasekar
Hello,

Greetings.

I'm trying to process a xml file exported from Health Kit application using
Spark SQL for learning purpose. The sample record data is like the below:

 

 .

I want to have the column name of my table as the field value like type,
sourceName, sourceVersion and the row entries as their respective values
like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..

I took a look at the Spark-XML ,
but didn't get any information in my case (my xml is not well formed with
the tags). Is there any other option to convert the record that I have
mentioned above into a schema format for playing with Spark SQL?

Thanks in Advance.

*Thank You*,
Arun Chandrasekar
chan.arunku...@gmail.com


Re: kryo

2016-05-12 Thread Ted Yu
This should be related:

https://github.com/JodaOrg/joda-time/issues/307

Do you have more of the stack trace ?

Cheers

On Thu, May 12, 2016 at 12:39 PM, Younes Naguib <
younes.nag...@tritondigital.com> wrote:

> Thanks,
>
> I used that.
>
> Now I seem to have the following problem:
>
> java.lang.NullPointerException
>
> at
> org.joda.time.tz.CachedDateTimeZone.getInfo(CachedDateTimeZone.java:143)
>
> at
> org.joda.time.tz.CachedDateTimeZone.getOffset(CachedDateTimeZone.java:103)
>
> at
> org.joda.time.DateTimeZone.convertUTCToLocal(DateTimeZone.java:925)
>
>
>
>
>
> Any ideas?
>
>
>
> Thanks
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* May-11-16 5:32 PM
> *To:* Younes Naguib
> *Cc:* user@spark.apache.org
> *Subject:* Re: kryo
>
>
>
> Have you seen this thread ?
>
>
>
>
> http://search-hadoop.com/m/q3RTtpO0qI3cp06/JodaDateTimeSerializer+spark&subj=Re+NPE+when+using+Joda+DateTime
>
>
>
> On Wed, May 11, 2016 at 2:18 PM, Younes Naguib <
> younes.nag...@tritondigital.com> wrote:
>
> Hi all,
>
> I'm trying to use spark.serializer.
> I set it in the spark-defaults.conf, but I started getting issues with
> datetimes.
>
> As I understand, I need to disable it.
> Is there any way to keep using kryo?
>
> It seems I can use JodaDateTimeSerializer for datetimes, just not sure
> how to set it and register it in the spark-defaults conf.
>
> Thanks,
>
> *Younes Naguib* 
>
>
>


RE: kryo

2016-05-12 Thread Younes Naguib
Thanks,
I used that.
Now I seem to have the following problem:
java.lang.NullPointerException
at 
org.joda.time.tz.CachedDateTimeZone.getInfo(CachedDateTimeZone.java:143)
at 
org.joda.time.tz.CachedDateTimeZone.getOffset(CachedDateTimeZone.java:103)
at org.joda.time.DateTimeZone.convertUTCToLocal(DateTimeZone.java:925)


Any ideas?

Thanks


From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: May-11-16 5:32 PM
To: Younes Naguib
Cc: user@spark.apache.org
Subject: Re: kryo

Have you seen this thread ?

http://search-hadoop.com/m/q3RTtpO0qI3cp06/JodaDateTimeSerializer+spark&subj=Re+NPE+when+using+Joda+DateTime

On Wed, May 11, 2016 at 2:18 PM, Younes Naguib 
mailto:younes.nag...@tritondigital.com>> wrote:
Hi all,

I'm trying to use spark.serializer.
I set it in the spark-defaults.conf, but I started getting issues with datetimes.

As I understand, I need to disable it.
Is there any way to keep using kryo?

It seems I can use JodaDateTimeSerializer for datetimes, just not sure how to
set it and register it in the spark-defaults conf.

Thanks,
Younes Naguib 
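
For what it's worth, a minimal sketch of one way to wire this up, assuming the
de.javakaffee kryo-serializers artifact (which provides JodaDateTimeSerializer)
is on the classpath; the registrator class name is illustrative:

import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.DateTime

// Custom registrator: tells Kryo to use the Joda serializer for DateTime values.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[DateTime], new JodaDateTimeSerializer())
  }
}

// Then in spark-defaults.conf (class name illustrative):
//   spark.serializer       org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator com.example.MyKryoRegistrator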



S3A Creating Task Per Byte (pyspark / 1.6.1)

2016-05-12 Thread Aaron Jackson
I'm using the spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's
in s3.  I've done this previously with spark 1.5 with no issue.  Attempting
to load and count a single file as follows:

dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv')
dataFrame.count()

But when it attempts to load, it creates 279K tasks.  When I look at the
tasks, the # of tasks is identical to the # of bytes in the file.  Has
anyone seen anything like this or have any ideas why it's getting that
granular?


mllib random forest - executor heartbeat timed out

2016-05-12 Thread vtkmh
Hello,

I have a random forest that works fine with 20 trees on 5e6 LabeledPoints
for training and 300 features...  but when I try to scale it up just a bit
to 60 or 100 trees and 10e6 training points, it consistently gets
ExecutorLostFailure's due to "no recent heartbeats" with timeout of 120s. 
(more detail on error below)

what I'm hoping for, in addition to understanding why it's failing... 
what's a practical approach to making this work?
- change some spark parameters? i.e. increase
spark.executor.heartbeatInterval? (see the sketch after this list)
- larger cluster? 
- (last resort) smaller or fewer trees?

any advice appreciated.
thank you!
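
On the first option, a sketch of how the timeout-related settings could be
raised; the values here are illustrative, not recommendations:

import org.apache.spark.SparkConf

// Illustrative values: give executors more slack before the driver declares them lost.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "60s")
  .set("spark.network.timeout", "600s")

// or equivalently on the command line:
//   spark-submit --conf spark.executor.heartbeatInterval=60s \
//                --conf spark.network.timeout=600s ...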

// rf params
val numClasses  = 2
val numTrees= 100
val featureSubsetStrategy   = "auto"
val maxDepth= 21
val maxBins = 16
val subsamplingRate = 0.8 
val minInstancesPerNode = 9
val maxMemoryInMB   = 3000
val useNodeIdCache  = true

// error details
16/05/12 15:55:38 WARN HeartbeatReceiver: Removing executor 4 with no recent
heartbeats: 126112 ms exceeds timeout 120000 ms
16/05/12 15:55:38 ERROR YarnScheduler: Lost executor 4 on
ip-10-101-xx-xx.us-west-2.compute.internal: Executor heartbeat timed out
after 126112 ms
16/05/12 15:55:38 WARN TaskSetManager: Lost task 72.3 in stage 38.0 (TID
29394, ip-10-101-xx-xx.us-west-2.compute.internal): ExecutorLostFailure
(executor 4 exited caused by one of the run


16/05/12 15:55:38 INFO YarnClientSchedulerBackend: Requesting to kill
executor(s) 4
16/05/12 15:55:38 INFO YarnScheduler: Cancelling stage 38
16/05/12 15:55:38 INFO YarnScheduler: Stage 38 was cancelled
16/05/12 15:55:38 INFO DAGScheduler: ShuffleMapStage 38 (mapPartitions at
DecisionTree.scala:604) failed in 3732.026 s
16/05/12 15:55:38 INFO DAGScheduler: Executor lost: 4 (epoch 18)
16/05/12 15:55:38 INFO DAGScheduler: Job 20 failed: collectAsMap at
DecisionTree.scala:651, took 3739.423425 s
16/05/12 15:55:38 INFO BlockManagerMasterEndpoint: Trying to remove executor
4 from BlockManagerMaster.
16/05/12 15:55:38 INFO BlockManagerMasterEndpoint: Removing block manager
BlockManagerId(4, ip-10-101-xx-xx.us-west-2.compute.internal, 41007)
16/05/12 15:55:38 INFO BlockManagerMaster: Removed 4 successfully in
removeExecutor
16/05/12 15:55:38 INFO DAGScheduler: Host added was in lost list earlier:
ip-10-101-xx-xx.us-west-2.compute.internal
org.apache.spark.SparkException: Job aborted due to stage failure: Task 72
in stage 38.0 failed 4 times, most recent failure: Lost task 72.3 in stage
38.0 (TID 29394, ip-10-101-xx-xx.us-west-2.compute.internal):
ExecutorLostFailure (executor 4 exited caused by one of the running tasks)
Reason: Executor heartbeat timed out after 126112 ms
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$collectAsMap$1.apply(PairRDDFunctions.scala:741)
at
org.apache.spark.rdd.PairRDDF

Re: Reliability of JMS Custom Receiver in Spark Streaming JMS

2016-05-12 Thread Sourav Mazumder
Any inputs on this issue ?

Regards,
Sourav

On Tue, May 10, 2016 at 6:17 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> Hi,
>
> Need to get bit more understanding of reliability aspects of the Custom
> Receivers in the context of the code in spark-streaming-jms
> https://github.com/mattf/spark-streaming-jms.
>
> Based on the documentation in
> http://spark.apache.org/docs/latest/streaming-custom-receivers.html#receiver-reliability,
> I understand that if the store api is called with multiple records the
> message is reliably stored as it is a blocking call. On the other hand if
> the store api is called with a single record then it is not reliable as the
> call is returned back to the calling program before the message is stored
> appropriately.
>
> Given that I have few questions
>
> 1. Which are the store APIs that relate to multiple records ? Are they the
> ones which use scala.collection.mutable.ArrayBuffer,
> scala.collection.Iterator and java.util.Iterator
> in the parameter signature?
>
> 2. Is there a sample code which can show how to create multiple records
> like that and send the same to appropriate store API ?
>
> 3. If I take the example of spark-streaming-jms, the onMessage method of
> JMSReceiver class calls store API with one JMSEvent. Does that mean that
> this code does not guarantee the reliability of storage of the message
> received even if storage level specified to MEMORY_AND_DISK_SER_2 ?
>
> Regards,
> Sourav
>
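
On question 2 above, a hypothetical sketch (not the actual spark-streaming-jms
code) of a receiver that buffers incoming messages and hands them to the
multi-record store(ArrayBuffer[T]) variant, which blocks until the block is stored:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical sketch: collect messages into blocks and store them with the
// blocking, multi-record store() call (the reliable path).
class BufferedReceiver(batchSize: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  private val buffer = new ArrayBuffer[String]()

  override def onStart(): Unit = { /* connect to the source and call onMessage(...) */ }
  override def onStop(): Unit = flush()

  def onMessage(msg: String): Unit = synchronized {
    buffer += msg
    if (buffer.size >= batchSize) flush()
  }

  private def flush(): Unit = synchronized {
    if (buffer.nonEmpty) {
      store(ArrayBuffer(buffer: _*))   // blocking: returns only after the block is stored
      buffer.clear()
    }
  }
}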


LinearRegressionWithSGD fails on 12Mb data

2016-05-12 Thread RainDev
I'm using Spark 1.6.1 along with scala 2.11.7 on my Ubuntu 14.04 with the
following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G". My
data is organized in 20 json-like files; every file is about 8-15 MB,
containing categorical and numerical values. I parse this data using the
DataFrame facilities, then scale one numerical feature and create dummy
variables for the categorical features. So from the initial 14 keys of my
json-like file I get about 200-240 features in the final LabeledPoint. The
final data is sparse and every file contains about 2-3 of
observations. I try to run two types of algorithms on the data:
LinearRegressionWithSGD or LassoWithSGD, since the data is sparse and
regularization might be required. The following questions emerged while
running my tests:
1. Most IMPORTANT: For data larger than 11MB  LinearRegressionWithSGD fails
with the following error: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 58
in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in stage
346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver exited
caused by one of the running tasks) Reason: Executor heartbeat timed out
after 179307 ms
I faced the same problem with a file of 11MB (for a file of 5MB the algorithm
works well), and after trying a lot of debug options (testing different
settings for driver.memory and executor.memory, making sure that the cache is
cleared properly, proper use of coalesce()) I found out that setting the
StepSize of Gradient Descent to 1 resolves this bug (while for the 5MB
file StepSize = 0.4 doesn't fail and gives better results).
So I tried to increase the StepSize for the 12MB file (setting
StepSize to 1.5 and 2) but it didn't work. If I take only 10 MB of the file
instead of the whole file, the algorithm doesn't fail. This is frustrating,
since I need to build the model on the whole file, which still seems far
from Big Data sizes. If I cannot run Linear Regression on 12 MB, could I
run it on larger sets? I noticed that during the StandardScaler
preprocessing step and the counts in the Linear Regression step, a collect()
is performed, which can cause the bug. So the scalability of Linear
Regression is in question because, as far as I understand it, collect()
runs on the driver and so the point of distributed computation is lost.
the following parameters are set :
 val algorithme = new LinearRegressionWithSGD() //LassoWithSGD() 
algorithme.setIntercept(true)
algorithme.optimizer
  .setNumIterations(100)
  .setStepSize(1)

2. The second question concerns the creation of the SparkContext. Since I didn't
know the nature of the bug mentioned above, I tried to create the
SparkContext for every file and not for the ensemble of files. I still don't
know the best option: whether it makes sense to create a SparkContext for
every file of 10-20 MB, or whether one context can handle all 20 files of that
size (i.e. iterate over the 20 files inside one SparkContext).

3. How to use coalesce() is not very clear to me. For now I apply
coalesce(sqlContext.sparkContext.defaultParallelism) only to one dataframe
(the biggest one), but should I call this method on the smaller dataframes as
well? Can this method improve performance in my case (i.e. 14 columns and 3
rows in the initial dataframe)?

4. The question of using multiple collect() calls. While parsing the json file,
applying collect() seems inevitable for obtaining the LabeledPoints in the
end, and I'd like to know in which cases I should avoid it and in which
cases there is no danger in using collect() (see the sketch at the end of this
message).

Thank you in advance for your help, and let me know if you need the sample
of data to reproduce the bug if my description is not sufficient
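
On question 4, a minimal sketch (column positions are hypothetical, and df
stands for the already-preprocessed DataFrame) of building the LabeledPoints
with a distributed map over the rows instead of collect(), so nothing has to
come back to the driver:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical layout: column 0 holds the label, the remaining columns the features.
val labeledPoints = df.rdd.map { row =>
  val label = row.getDouble(0)
  val features = (1 until row.length).map(i => row.getDouble(i)).toArray
  LabeledPoint(label, Vectors.dense(features))
}
labeledPoints.cache()   // stays distributed; no collect() on the driver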



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LinearRegressionWithSGD-fails-on-12Mb-data-tp26942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Takeshi Yamamuro
Hi,

What's the result of `df3.explain(true)`?

// maropu

On Thu, May 12, 2016 at 10:04 AM, Telmo Rodrigues <
telmo.galante.rodrig...@gmail.com> wrote:

> I'm building spark from branch-1.6 source with mvn -DskipTests package and
> I'm running the following code with spark shell.
>
> *val* sqlContext *=* *new* org.apache.spark.sql.*SQLContext*(sc)
>
> *import* *sqlContext.implicits._*
>
>
> *val df = sqlContext.read.json("persons.json")*
>
> *val df2 = sqlContext.read.json("cars.json")*
>
>
> *df.registerTempTable("t")*
>
> *df2.registerTempTable("u")*
>
>
> *val d3 =sqlContext.sql("select * from t join u on t.id  =
> u.id  where t.id  = 1")*
>
> With the log4j root category level WARN, the last printed messages refers
> to the Batch Resolution rules execution.
>
> === Result of Batch Resolution ===
> !'Project [unresolvedalias(*)]  Project [id#0L,id#1L]
> !+- 'Filter ('t.id = 1) +- Filter (id#0L = cast(1 as
> bigint))
> !   +- 'Join Inner, Some(('t.id = 'u.id))  +- Join Inner, Some((id#0L
> = id#1L))
> !  :- 'UnresolvedRelation `t`, None   :- Subquery t
> !  +- 'UnresolvedRelation `u`, None   :  +- Relation[id#0L]
> JSONRelation
> ! +- Subquery u
> !+- Relation[id#1L]
> JSONRelation
>
>
> I think that only the analyser rules are being executed.
>
> Shouldn't the optimiser rules also run in this case?
>
> 2016-05-11 19:24 GMT+01:00 Michael Armbrust :
>
>>
>>> logical plan after optimizer execution:
>>>
>>> Project [id#0L,id#1L]
>>> !+- Filter (id#0L = cast(1 as bigint))
>>> !   +- Join Inner, Some((id#0L = id#1L))
>>> !  :- Subquery t
>>> !  :  +- Relation[id#0L] JSONRelation
>>> !  +- Subquery u
>>> !  +- Relation[id#1L] JSONRelation
>>>
>>
>> I think you are mistaken.  If this was the optimized plan there would be
>> no subqueries.
>>
>
>


-- 
---
Takeshi Yamamuro
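
For reference, a short sketch of how the analyzed versus optimized plans can be
inspected for the d3 DataFrame built in the code above:

// explain(true) prints the parsed, analyzed, optimized and physical plans.
d3.explain(true)

// The individual plans are also available programmatically:
println(d3.queryExecution.analyzed)
println(d3.queryExecution.optimizedPlan)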


Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Tom Ellis
I would also like a copy, Mich, please send it through, thanks!

On Thu, 12 May 2016 at 15:14 Alonso Isidoro  wrote:

> Me too, send me the guide.
>
> Sent from my iPhone
>
> On 12 May 2016, at 12:11, Ashok Kumar wrote:
>
> Hi Dr Mich,
>
> I will be very keen to have a look at it and review if possible.
>
> Please forward me a copy
>
> Thanking you warmly
>
>
> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi Al,,
>
>
> Following the threads in spark forum, I decided to write up on
> configuration of Spark including allocation of resources and configuration
> of driver, executors, threads, execution of Spark apps and general
> troubleshooting taking into account the allocation of resources for Spark
> applications and OS tools at the disposal.
>
> Since the most widespread configuration as I notice is with "Spark
> Standalone Mode", I have decided to write these notes starting with
> Standalone and later on moving to Yarn
>
>
>- *Standalone *– a simple cluster manager included with Spark that
>makes it easy to set up a cluster.
>- *YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
> Regards.
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Alonso Isidoro
Me too, send me the guide.

Sent from my iPhone

> On 12 May 2016, at 12:11, Ashok Kumar wrote:
> 
> Hi Dr Mich,
> 
> I will be very keen to have a look at it and review if possible.
> 
> Please forward me a copy
> 
> Thanking you warmly
> 
> 
> On Thursday, 12 May 2016, 11:08, Mich Talebzadeh  
> wrote:
> 
> 
> Hi Al,,
> 
> 
> Following the threads in spark forum, I decided to write up on configuration 
> of Spark including allocation of resources and configuration of driver, 
> executors, threads, execution of Spark apps and general troubleshooting 
> taking into account the allocation of resources for Spark applications and OS 
> tools at the disposal.
> 
> Since the most widespread configuration as I notice is with "Spark Standalone 
> Mode", I have decided to write these notes starting with Standalone and later 
> on moving to Yarn
> 
> Standalone – a simple cluster manager included with Spark that makes it easy 
> to set up a cluster.
> YARN – the resource manager in Hadoop 2.
> 
> I would appreciate if anyone interested in reading and commenting to get in 
> touch with me directly on mich.talebza...@gmail.com so I can send the 
> write-up for their review and comments.
> 
> Just to be clear this is not meant to be any commercial proposition or 
> anything like that. As I seem to get involved with members troubleshooting 
> issues and threads on this topic, I thought it is worthwhile writing a note 
> about it to summarise the findings for the benefit of the community.
> 
> Regards.
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
> 


Re: Spark 1.6.0: substring on df.select

2016-05-12 Thread Sun Rui
Alternatively, you may try the built-in function:
regexp_extract
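
A minimal sketch of that approach; the pattern below keeps everything after the
last '/', and the column and DataFrame names are taken from the thread:

import org.apache.spark.sql.functions.{col, regexp_extract}

// Extract the last path segment, e.g. "/client/service/version/method" -> "method"
val withMethod = df.withColumn("method", regexp_extract(col("col1"), "([^/]+)$", 1))
withMethod.show()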

> On May 12, 2016, at 20:27, Ewan Leith  wrote:
> 
> You could use a UDF pretty easily, something like this should work, the 
> lastElement function could be changed to do pretty much any string 
> manipulation you want.
>  
> import org.apache.spark.sql.functions.udf
>  
> def lastElement(input: String) = input.split("/").last
>  
> val lastElementUdf = udf(lastElement(_:String))
>  
> df.select(lastElementUdf ($"col1")).show()
>  
> Ewan
>  
>  
> From: Bharathi Raja [mailto:raja...@yahoo.com.INVALID] 
> Sent: 12 May 2016 11:40
> To: Raghavendra Pandey ; Bharathi Raja 
> 
> Cc: User 
> Subject: RE: Spark 1.6.0: substring on df.select
>  
> Thanks Raghav. 
>  
> I have 5+ million records. I feel creating multiple columns is not an optimal
> way.
>  
> Please suggest any other alternate solution.
> Can’t we do any string operation in DF.Select?
>  
> Regards,
> Raja
>  
> From: Raghavendra Pandey 
> Sent: 11 May 2016 09:04 PM
> To: Bharathi Raja 
> Cc: User 
> Subject: Re: Spark 1.6.0: substring on df.select
>  
> You can create a column with count of /.  Then take max of it and create that 
> many columns for every row with null fillers.
> 
> Raghav 
> 
> On 11 May 2016 20:37, "Bharathi Raja"  > wrote:
> Hi,
>  
> I have a dataframe column col1 with values something like 
> “/client/service/version/method”. The number of “/” are not constant.
> Could you please help me to extract all methods from the column col1?
>  
> In Pig i used SUBSTRING with LAST_INDEX_OF(“/”).
>  
> Thanks in advance.
> Regards,
> Raja



Re: groupBy and store in parquet

2016-05-12 Thread Michal Vince

Hi Xinh

sorry for my late reply

it`s slow because of two reasons (at least to my knowledge)

1. lots of IOs - writing as json, then reading and writing again as parquet

2. because of nested rdd I can`t run the cycle and filter by event_type 
in parallel - this applies to your solution (3rd step)


I ended up with the suggestion you proposed - in realtime I partition by 
event_type and store as jsons (which is pretty fast), and another 
job which runs less frequently reads the jsons and stores them as parquet
(see the sketch below).
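
A rough sketch of that second job, assuming the streaming job writes each event
type under its own directory (jsonRoot and parquetRoot are illustrative names):

import org.apache.hadoop.fs.{FileSystem, Path}

// One directory per event_type under jsonRoot; each is read back as JSON so the
// schema is inferred per type (no null padding), then appended as Parquet.
val jsonRoot = "/data/events/json"
val parquetRoot = "/data/events/parquet"

val fs = FileSystem.get(sc.hadoopConfiguration)
val typeDirs = fs.listStatus(new Path(jsonRoot)).filter(_.isDirectory).map(_.getPath)

typeDirs.foreach { dir =>
  sqlContext.read.json(dir.toString)
    .write
    .mode(org.apache.spark.sql.SaveMode.Append)
    .parquet(s"$parquetRoot/${dir.getName}")
}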



thank you very much

best regards

Michal



On 05/05/2016 06:02 PM, Xinh Huynh wrote:

Hi Michal,

Why is your solution so slow? Is it from the file IO caused by storing 
in a temp file as JSON and then reading it back in and writing it as 
Parquet? How are you getting "events" in the first place?


Do you have the original Kafka messages as an RDD[String]? Then how about:

1. Start with eventsAsRDD : RDD[String] (before converting to DF)
2. eventsAsRDD.map() --> use a RegEx to parse out the event_type of 
each event

 For example, search the string for {"event_type"="[.*]"}
3. Now, filter by each event_type to create a separate RDD for each 
type, and convert those to DF. You only convert to DF for events of 
the same type, so you avoid the NULLs.


Xinh


On Thu, May 5, 2016 at 2:52 AM, Michal Vince > wrote:


Hi Xinh

For (1) the biggest problem are those null columns. e.g. DF will
have ~1000 columns so every partition of that DF will have ~1000
columns, one of the partitioned columns can have 996 null columns
which is big waste of space (in my case more than 80% in avg)

for (2) I can`t really change anything as the source belongs to
the 3rd party


Miso


On 05/04/2016 05:21 PM, Xinh Huynh wrote:

Hi Michal,

For (1), would it be possible to partitionBy two columns to
reduce the size? Something like partitionBy("event_type", "date").

For (2), is there a way to separate the different event types
upstream, like on different Kafka topics, and then process them
separately?

Xinh

On Wed, May 4, 2016 at 7:47 AM, Michal Vince
mailto:vince.mic...@gmail.com>> wrote:

Hi guys

I`m trying to store kafka stream with ~5k events/s as
efficiently as possible in parquet format to hdfs.

I can`t make any changes to kafka (belongs to 3rd party)


Events in kafka are in json format, but the problem is there
are many different event types (from different subsystems
with different number of fields, different size etc..) so it
doesn`t make any sense to store them in the same file


I was trying to read data to DF and then repartition it by
event_type and store


events.write.partitionBy("event_type").format("parquet").mode(org.apache.spark.sql.SaveMode.Append).save(tmpFolder)

which is quite fast but have 2 drawbacks that I`m aware of

1. output folder has only one partition which can be huge

2. all DFs created like this share the same schema, so even
dfs with few fields have tons of null fields


My second try is bit naive and really really slow (you can
see why in code) - filter DF by event type and store them
temporarily as json (to get rid of null fields)

val event_types = events.select($"event_type").distinct().collect() // get event_types in this batch

for (row <- event_types) {
  val currDF = events.filter($"event_type" === row.get(0))
  val tmpPath = tmpFolder + row.get(0)
  currDF.write.format("json").mode(org.apache.spark.sql.SaveMode.Append).save(tmpPath)
  sqlContext.read.json(tmpPath).write.format("parquet").save(basePath)
}
hdfs.delete(new Path(tmpFolder), true)


Do you have any suggestions for any better solution to this?

thanks










Re: Is this possible to do in spark ?

2016-05-12 Thread Mathieu Longtin
Make a function (or lambda) that reads the text file. Make an RDD with the
list of X/Y directories, then map that RDD through the file reading function. Same
with your X/Y/Z directories. You then have RDDs with the content of each file
as a record. Work with those as needed.
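
A minimal sketch of that idea (the directory list, file layout and parsing are
illustrative, and it assumes the files live on a filesystem every executor can
reach):

import scala.io.Source

// One entry per system directory; each task reads its own a.txt and b.txt.
val dirs = Seq("/X/Y", "/X2/Y2" /* ... */)

val perSystem = sc.parallelize(dirs).map { dir =>
  val serial = Source.fromFile(s"$dir/a.txt").getLines().next().trim
  val kv = Source.fromFile(s"$dir/Z/b.txt").getLines()
    .map(_.split(","))
    .collect { case Array(k, v, _*) => (k.trim, v.trim) }
    .toMap
  (serial, kv)   // e.g. ("12345", Map("a" -> "1", "b" -> "1", "c" -> "0"))
}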

On Wed, May 11, 2016 at 2:36 PM Pradeep Nayak  wrote:

> Hi -
>
> I have a very unique problem which I am trying to solve and I am not sure
> if spark would help here.
>
> I have a directory: /X/Y/a.txt and in the same structure /X/Y/Z/b.txt.
>
> a.txt contains a unique serial number, say:
> 12345
>
> and b.txt contains key value pairs.
> a,1
> b,1,
> c,0 etc.
>
> Everyday you receive data for a system Y. so there are multiple a.txt and
> b.txt for a serial number.  The serial number doesn't change and that the
> key. So there are multiple systems and the data of a whole year is
> available and its huge.
>
> I am trying to generate a report of unique serial numbers where the value
> of the option a has changed to 1 over the last few months. Lets say the
> default is 0. Also figure how many times it was toggled.
>
>
> I am not sure how to read two text files in spark at the same time and
> associated them with the serial number. Is there a way of doing this in
> place given that we know the directory structure ? OR we should be
> transforming the data anyway to solve this ?
>
-- 
Mathieu Longtin
1-514-803-8977


Efficient for loops in Spark

2016-05-12 Thread flyinggip
Hi there, 

I'd like to write some iterative computation, i.e., computation that can be
done via a for loop. I understand that in Spark foreach is a better choice.
However, foreach and foreachPartition seem to be for self-contained
computation that only involves the corresponding Row or Partition,
respectively. But in my application each computational task does not only
involve one partition, but also other partitions. It's just that every task
has a specific way of using the corresponding partition and the other
partitions. An example application will be cross-validation in machine
learning, where each fold corresponds to a partition, e.g., the whole data
is divided into 5 folds, then for task 1, I use fold 1 for testing and folds
2,3,4,5 for training; for task 2, I use fold 2 for testing and folds 1,3,4,5
for training; etc. 

In this case, if I were to use foreachPartition, it seems that I need to
duplicate the data the number of times equal to the number of folds (or
iterations of my for loop). More generally, I would need to still prepare a
partition for every distributed task and that partition would need to
include all the data needed for the task, which could be a huge waste of
space. 

Are there any other solutions? Thanks.

f. 
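
For the cross-validation example specifically, one option is MLUtils.kFold,
which hands back the (training, validation) RDD pairs without materialising k
copies of the data up front. A sketch, assuming an existing RDD[LabeledPoint]
called data and leaving the actual model training as a placeholder:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

def crossValidate(data: RDD[LabeledPoint], numFolds: Int, seed: Int): Seq[Double] = {
  // Each element is a (training, validation) pair built from complementary folds.
  val folds = MLUtils.kFold(data, numFolds, seed)
  folds.map { case (training, validation) =>
    // train on 'training', evaluate on 'validation'; the metric below is a placeholder
    validation.count().toDouble
  }.toSeq
}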



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-for-loops-in-Spark-tp26939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Ewan Leith
You could use a UDF pretty easily, something like this should work, the 
lastElement function could be changed to do pretty much any string manipulation 
you want.

import org.apache.spark.sql.functions.udf

def lastElement(input: String) = input.split("/").last

val lastElementUdf = udf(lastElement(_:String))

df.select(lastElementUdf ($"col1")).show()

Ewan


From: Bharathi Raja [mailto:raja...@yahoo.com.INVALID]
Sent: 12 May 2016 11:40
To: Raghavendra Pandey ; Bharathi Raja 

Cc: User 
Subject: RE: Spark 1.6.0: substring on df.select

Thanks Raghav.

I have 5+ million records. I feel creating multiple columns is not an optimal way.

Please suggest any other alternate solution.
Can’t we do any string operation in DF.Select?

Regards,
Raja

From: Raghavendra Pandey
Sent: 11 May 2016 09:04 PM
To: Bharathi Raja
Cc: User
Subject: Re: Spark 1.6.0: substring on df.select


You can create a column with count of /.  Then take max of it and create that 
many columns for every row with null fillers.

Raghav
On 11 May 2016 20:37, "Bharathi Raja" 
mailto:raja...@yahoo.com.invalid>> wrote:
Hi,

I have a dataframe column col1 with values something like 
“/client/service/version/method”. The number of “/” are not constant.
Could you please help me to extract all methods from the column col1?

In Pig i used SUBSTRING with LAST_INDEX_OF(“/”).

Thanks in advance.
Regards,
Raja



SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Hao Ren
HI,

I just want to figure out why the two contexts behave differently even on
a simple query.
In a nutshell, I have a query in which there is a String containing a single
quote and a cast to Array/Map.
I have tried all combinations of the different types of sql context and query
call APIs (sql, df.select, df.selectExpr).
I can't find one that rules them all.

Here is the code for reproducing the problem.
-

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object Test extends App {

  val sc  = new SparkContext("local[2]", "test", new SparkConf)
  val hiveContext = new HiveContext(sc)
  val sqlContext  = new SQLContext(sc)

  val context = hiveContext
  //  val context = sqlContext

  import context.implicits._

  val df = Seq((Seq(1, 2), 2)).toDF("a", "b")
  df.registerTempTable("tbl")
  df.printSchema()

  // case 1
  context.sql("select cast(a as array) from tbl").show()
  // HiveContext => org.apache.spark.sql.AnalysisException: cannot
recognize input near 'array' '<' 'string' in primitive type
specification; line 1 pos 17
  // SQLContext => OK

  // case 2
  context.sql("select 'a\\'b'").show()
  // HiveContext => OK
  // SQLContext => failure: ``union'' expected but ErrorToken(unclosed
string literal) found

  // case 3
  df.selectExpr("cast(a as array)").show() // OK with
HiveContext and SQLContext

  // case 4
  df.selectExpr("'a\\'b'").show() // HiveContext, SQLContext =>
failure: end of input expected
}

-

Any clarification / workaround is highly appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Bharathi Raja
Thanks Raghav. 

I have 5+ million records. I feel creating multiple columns is not an optimal way.

Please suggest any other alternate solution.
Can’t we do any string operation in DF.Select?

Regards,
Raja

From: Raghavendra Pandey
Sent: 11 May 2016 09:04 PM
To: Bharathi Raja
Cc: User
Subject: Re: Spark 1.6.0: substring on df.select

You can create a column with count of /.  Then take max of it and create that 
many columns for every row with null fillers. 
Raghav 
On 11 May 2016 20:37, "Bharathi Raja"  wrote:
Hi,
 
I have a dataframe column col1 with values something like 
“/client/service/version/method”. The number of “/” are not constant. 
Could you please help me to extract all methods from the column col1?
 
In Pig i used SUBSTRING with LAST_INDEX_OF(“/”).
 
Thanks in advance.
Regards,
Raja



Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Ashok Kumar
Hi Dr Mich,
I will be very keen to have a look at it and review if possible.
Please forward me a copy
Thanking you warmly 

On Thursday, 12 May 2016, 11:08, Mich Talebzadeh 
 wrote:
 

 Hi Al,,

Following the threads in spark forum, I decided to write up on configuration of 
Spark including allocation of resources and configuration of driver, executors, 
threads, execution of Spark apps and general troubleshooting taking into 
account the allocation of resources for Spark applications and OS tools at the 
disposal.
Since the most widespread configuration as I notice is with "Spark Standalone 
Mode", I have decided to write these notes starting with Standalone and later 
on moving to Yarn
   
   - Standalone – a simple cluster manager included with Spark that makes it 
easy to set up a cluster.
   - YARN – the resource manager in Hadoop 2.

I would appreciate if anyone interested in reading and commenting to get in 
touch with me directly on mich.talebza...@gmail.com so I can send the write-up 
for their review and comments.
Just to be clear this is not meant to be any commercial proposition or anything 
like that. As I seem to get involved with members troubleshooting issues and 
threads on this topic, I thought it is worthwhile writing a note about it to 
summarise the findings for the benefit of the community.
Regards.
Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 

  

My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Mich Talebzadeh
Hi Al,,


Following the threads in spark forum, I decided to write up on
configuration of Spark including allocation of resources and configuration
of driver, executors, threads, execution of Spark apps and general
troubleshooting taking into account the allocation of resources for Spark
applications and OS tools at the disposal.

Since the most widespread configuration as I notice is with "Spark
Standalone Mode", I have decided to write these notes starting with
Standalone and later on moving to Yarn


   -

   *Standalone *– a simple cluster manager included with Spark that makes
   it easy to set up a cluster.
   -

   *YARN* – the resource manager in Hadoop 2.


I would appreciate if anyone interested in reading and commenting to get in
touch with me directly on mich.talebza...@gmail.com so I can send the
write-up for their review and comments.


Just to be clear this is not meant to be any commercial proposition or
anything like that. As I seem to get involved with members troubleshooting
issues and threads on this topic, I thought it is worthwhile writing a note
about it to summarise the findings for the benefit of the community.


Regards.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Why spark give out this error message?

2016-05-12 Thread sunday2000
Hi,
   Do you know what this message means?
  
 org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
localhost/127.0.0.1:50606
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:50606

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-12 Thread Renato Marroquín Mogrovejo
Hi Amit,

This is very interesting indeed because I have got similar results. I tried
doing a filter + groupBy using a DataSet with a function, and using the
inner RDD of the DF (RDD[Row]). I used the inner RDD of a DataFrame because
apparently there is no straightforward way to create an RDD of Parquet
data without creating a sqlContext. If anybody has some code to share with
me, please share (:
I used 1GB of parquet data and when doing the operations with the RDD it
was much faster. After looking at the execution plans, it is clear why
DataSets do worse. To use them, an extra map operation is done to map Row
objects into the defined case class. Then the DataSet uses the whole query
optimization platform (Catalyst, and moving objects in and out of Tungsten).
Thus, I think for operations that are too "simple", it is more expensive to
use the entire DS/DF infrastructure than the inner RDD.
IMHO if you have complex SQL queries, it makes sense you use DS/DF but if
you don't, then probably using RDDs directly is still faster.


Renato M.

2016-05-11 20:17 GMT+02:00 Amit Sela :

> Some how missed that ;)
> Anything about Datasets slowness ?
>
> On Wed, May 11, 2016, 21:02 Ted Yu  wrote:
>
>> Which release are you using ?
>>
>> You can use the following to disable UI:
>> --conf spark.ui.enabled=false
>>
>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela  wrote:
>>
>>> I've ran a simple WordCount example with a very small List as
>>> input lines and ran it in standalone (local[*]), and Datasets is very slow..
>>> We're talking ~700 msec for RDDs while Datasets takes ~3.5 sec.
>>> Is this just start-up overhead ? please note that I'm not timing the
>>> context creation...
>>>
>>> And in general, is there a way to run with local[*] "lightweight" mode
>>> for testing ? something like without the WebUI server for example (and
>>> anything else that's not needed for testing purposes)
>>>
>>> Thanks,
>>> Amit
>>>
>>
>>


Re: ML regression - spark context dies without error

2016-05-12 Thread AlexModestov
Hello,
I have the same problem... Sometimes I get the error: "Py4JError: Answer
from Java side is empty"
Sometimes my code works fine but sometimes not...
Did you find why it might come? What was the reason?
Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ML-regression-spark-context-dies-without-error-tp22633p26938.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Need for advice - performance improvement and out of memory resolution

2016-05-12 Thread AlexModestov
Hello.
I'm sorry but did you find the answer?
I have a similar error and I cannot solve it... No one answered me...
The Spark driver dies and I get the error "Answer from Java side is empty".
I thought it was because I made a mistake in this conf-file.

I use Sparkling Water 1.6.3, Spark 1.6.
I use Oracle Java 8 or OpenJDK 7.
Every time I get this error when I transform a Spark DataFrame into an H2O
DataFrame.

ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
746, in send_command
raise Py4JError("Answer from Java side is empty")
Py4JError: Answer from Java side is empty
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
  File ".../Spark1.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line
690, in start
self.socket.connect((self.address, self.port))
  File "/usr/local/anaconda/lib/python2.7/socket.py", line 228, in meth
return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):

My conf-file:
spark.serializer org.apache.spark.serializer.KryoSerializer 
spark.kryoserializer.buffer.max 1500mb
spark.driver.memory 65g
spark.driver.extraJavaOptions -XX:-PrintGCDetails -XX:PermSize=35480m
-XX:-PrintGCTimeStamps -XX:-PrintTenuringDistribution  
spark.python.worker.memory 65g
spark.local.dir /data/spark-tmp
spark.ext.h2o.client.log.dir /data/h2o
spark.logConf false
spark.master local[*]
spark.driver.maxResultSize 0
spark.eventLog.enabled True
spark.eventLog.dir /data/spark_log

In the code I use "persist" data (amount of data is 5.7 GB).
I guess that there is enough memory.
Could anyone help me?
Thanks!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Need-for-advice-performance-improvement-and-out-of-memory-resolution-tp24886p26937.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Submitting Job to YARN-Cluster using Spark Job Server

2016-05-12 Thread ashesh_28
Hi Guys , 

Have any of you tried this mechanism before?
I am able to run it locally and get the output. But how do I submit the
job to the YARN cluster using Spark-JobServer?

Any documentation ?

Regards
Ashesh



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Job-to-YARN-Cluster-using-Spark-Job-Server-tp26936.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



parallelism of task executor worker threads during s3 reads

2016-05-12 Thread sanusha

I am using a spark cluster on Amazon (launched using
spark-1.6-prebuilt-with-hadoop-2.6 spark-ec2 script)
to run a scala driver application to read S3 object content in parallel. 

I have tried “s3n://bucket” with sc.textFile as well as set up an RDD with
the S3 keys and then used 
java aws sdk to map it to s3client.getObject(bucket,key) call on each key to
read content. 

In both cases, it gets quite slow as the number of objects increases to even
a few thousand.  For example, 
for 8000 objects it took almost 4mins. In contrast, a custom c++ s3 http
client with multiple threads to 
fetch them in parallel on a VM with single cpu, took 1min with 8 threads and
12secs if I use 80 threads.

I observed two issues in both s3n based read, and s3client based calls:
(a) Each slave in the spark cluster has 2 vcpus. The job [ref: NOTE] is
partitioned into #tasks, either = #objects,
or = 2x #nodes. But on each node, it does not always start 2 or more executor
worker threads as I expected.
Instead a single worker thread executes the tasks serially, one after the
other. 

(b) Each executor worker thread seems to go through each object key
serially, and the thread ends up almost 
always waiting on socket streaming. 

Has anyone seen this or am I doing something wrong? How do I make sure all
the cpus are used? And how to instruct the executors to start multiple
threads to process data in parallel within the partition/node/task? 

NOTE: since sc.textFile as well as map are transformation, I use
resultRDD.count to kick off the actual read and the times are for the count
job.
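
For what it's worth, a sketch of one way to push the parallelism up explicitly
(bucket, keys and the partition count are illustrative; one S3 client is created
per partition, and the usual AWS SDK v1 calls are assumed):

import scala.io.Source
import com.amazonaws.services.s3.AmazonS3Client

val bucket = "my-bucket"
val keys: Seq[String] = Seq("prefix/obj-0001", "prefix/obj-0002" /* ... */)
val numPartitions = 64   // several times the total vcpu count, so every core stays busy

val contents = sc.parallelize(keys, numPartitions).mapPartitions { iter =>
  val s3 = new AmazonS3Client()          // credentials come from the default provider chain
  iter.map { key =>
    val body = Source.fromInputStream(s3.getObject(bucket, key).getObjectContent).mkString
    (key, body)
  }
}
contents.count()   // triggers the actual reads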




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/parallelism-of-task-executor-worker-threads-during-s3-reads-tp26935.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: apache spark on gitter?

2016-05-12 Thread Paweł Szulc
Hi,

well I guess the advantage of gitter over a mailing list is the same as with
IRC. It's not actually a replacement, because the mailing list is also important.
But it is a lot easier to build a community around a tool with the ad-hoc
ability to connect with each other.

I have gitter running constantly; I visit my favorite OSS projects on it
from time to time to read what has recently happened. It allows me to stay
in touch with the project and help fellow developers with problems they
have.
One might argue that you can achieve the same with a mailing list; well, it's
hard for me to put this into words. A mailing list is more of an async
nature (which is good!), but sometimes you need a more "real-time"
experience. You know, engaging in the conversation in the given moment, not
a conversation that might last a few days :)

TLDR: It is not a replacement, it's supplement to build the community
around OSS. Worth having for real-time conversations.

On Wed, May 11, 2016 at 10:24 PM, Xinh Huynh  wrote:

> Hi Pawel,
>
> I'd like to hear more about your idea. Could you explain more why you
> would like to have a gitter channel? What are the advantages over a mailing
> list (like this one)? Have you had good experiences using gitter on other
> open source projects?
>
> Xinh
>
> On Wed, May 11, 2016 at 11:10 AM, Sean Owen  wrote:
>
>> I don't know of a gitter channel and I don't use it myself, FWIW. I
>> think anyone's welcome to start one.
>>
>> I hesitate to recommend this, simply because it's preferable to have
>> one place for discussion rather than split it over several, and, we
>> have to keep the @spark.apache.org mailing lists as the "forums of
>> records" for project discussions.
>>
>> If something like gitter doesn't attract any chat, then it doesn't add
>> any value. If it does though, then suddenly someone needs to subscribe
>> to user@ and gitter to follow all of the conversations.
>>
>> I think there is a bit of a scalability problem on the user@ list at
>> the moment, just because it covers all of Spark. But adding a
>> different all-Spark channel doesn't help that.
>>
>> Anyway maybe that's "why"
>>
>>
>> On Wed, May 11, 2016 at 6:26 PM, Paweł Szulc 
>> wrote:
>> > no answer, but maybe one more time, a gitter channel for spark users
>> would
>> > be a good idea!
>> >
>> > On Mon, May 9, 2016 at 1:45 PM, Paweł Szulc 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I was wondering - why Spark does not have a gitter channel?
>> >>
>> >> --
>> >> Regards,
>> >> Paul Szulc
>> >>
>> >> twitter: @rabbitonweb
>> >> blog: www.rabbitonweb.com
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Paul Szulc
>> >
>> > twitter: @rabbitonweb
>> > blog: www.rabbitonweb.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
Regards,
Paul Szulc

twitter: @rabbitonweb
blog: www.rabbitonweb.com


Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
We are on spark 1.5.1
The change above was to add a shutdown hook.
I am not adding a shutdown hook in code, so the built-in shutdown hook is being
called.
The driver signals that it is going to do a graceful shutdown, but the executor
sees that the driver is dead and shuts down abruptly.
Could this issue be related to yarn? I see correct behavior locally. I did
"yarn kill " to kill the job.


On Thu, May 12, 2016 at 12:28 PM Deepak Sharma 
wrote:

> This is happening because the spark context shuts down without shutting down
> the ssc first.
> This was the behavior till spark 1.4 and was addressed in later releases.
> https://github.com/apache/spark/pull/6307
>
> Which version of spark are you on?
>
> Thanks
> Deepak
>
> On Thu, May 12, 2016 at 12:14 PM, Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> Yes, it seems to be the case.
>> In this case executors should have continued logging values till 300, but
>> they are shutdown as soon as i do "yarn kill .."
>>
>> On Thu, May 12, 2016 at 12:11 PM Deepak Sharma 
>> wrote:
>>
>>> So in your case , the driver is shutting down gracefully , but the
>>> executors are not.
>>> IS this the problem?
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) <
>>> rakes...@flipkart.com> wrote:
>>>
 Yes, it is set to true.
 Log of driver :

 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking 
 stop(stopGracefully=true) from shutdown hook
 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator 
 gracefully
 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waiting for all received 
 blocks to be consumed for job generation
 16/05/12 10:18:29 INFO scheduler.JobGenerator: Waited for all received 
 blocks to be consumed for job generation

 Log of executor:
 16/05/12 10:18:29 ERROR executor.CoarseGrainedExecutorBackend: Driver 
 xx.xx.xx.xx:x disassociated! Shutting down.
 16/05/12 10:18:29 WARN remote.ReliableDeliverySupervisor: Association with 
 remote system [xx.xx.xx.xx:x] has failed, address is now gated for 
 [5000] ms. Reason: [Disassociated]
 16/05/12 10:18:29 INFO storage.DiskBlockManager: Shutdown hook called
 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
 204 //This is value i am logging
 16/05/12 10:18:29 INFO util.ShutdownHookManager: Shutdown hook called
 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
 205
 16/05/12 10:18:29 INFO processors.StreamJobRunner$: VALUE -> 
 206






 On Thu, May 12, 2016 at 11:45 AM Deepak Sharma 
 wrote:

> Hi Rakesh
> Did you tried setting *spark.streaming.stopGracefullyOnShutdown to
> true *for your spark configuration instance?
> If not try this , and let us know if this helps.
>
> Thanks
> Deepak
>
> On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) <
> rakes...@flipkart.com> wrote:
>
>> Issue i am having is similar to the one mentioned here :
>>
>> http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn
>>
>> I am creating a rdd from sequence of 1 to 300 and creating streaming
>> RDD out of it.
>>
>> val rdd = ssc.sparkContext.parallelize(1 to 300)
>> val dstream = new ConstantInputDStream(ssc, rdd)
>> dstream.foreachRDD{ rdd =>
>>   rdd.foreach{ x =>
>> log(x)
>> Thread.sleep(50)
>>   }
>> }
>>
>>
>> When i kill this job, i expect elements 1 to 300 to be logged before
>> shutting down. It is indeed the case when i run it locally. It wait for 
>> the
>> job to finish before shutting down.
>>
>> But when i launch the job in custer with "yarn-cluster" mode, it
>> abruptly shuts down.
>> Executor prints following log
>>
>> ERROR executor.CoarseGrainedExecutorBackend:
>> Driver xx.xx.xx.xxx:y disassociated! Shutting down.
>>
>>  and then it shuts down. It is not a graceful shutdown.
>>
>> Anybody knows how to do it in yarn ?
>>
>>
>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>

>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
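
For reference, a sketch of the pieces usually involved in a graceful stop on
YARN: the stopGracefullyOnShutdown setting plus an explicit
ssc.stop(stopSparkContext = true, stopGracefully = true). The marker-file
polling below is just one illustrative way to trigger the stop from outside
instead of relying on yarn kill:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
val ssc = new StreamingContext(conf, Seconds(10))

// ... build the DStreams here, then:
ssc.start()

// Poll for an external marker file (hypothetical path) and stop gracefully when it appears.
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop-streaming-job")
var stopped = false
while (!stopped) {
  stopped = ssc.awaitTerminationOrTimeout(10000)   // true once the context has terminated
  if (!stopped && fs.exists(marker)) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}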