Re: MatchError in JsonRDD.toLong
Hi again, On Fri, Jan 16, 2015 at 4:25 PM, Tobias Pfeiffer t...@preferred.jp wrote: Now I'm wondering where this comes from (I haven't touched this component in a while, nor upgraded Spark etc.) [...] So the reason that the error is showing up now is that suddenly data from a different dataset is showing up in my test dataset... don't ask me... anyway, this different dataset contains data like

{"Click":"nonclicked", "Impression":1, "DisplayURL":4401798909506983219, "AdId":21215341, ...}
{"Click":"nonclicked", "Impression":1, "DisplayURL":14452800566866169008, "AdId":10587781, ...}

and the DisplayURL seems to be too long for Long, while it is still inferred as a Long column. So, what to do about this? Is jsonRDD inherently incapable of handling those long numbers, or is it just an issue in the schema inference and I should file a JIRA issue? Thanks Tobias
RE: MatchError in JsonRDD.toLong
Hi Tobias, Can you provide how you create the JsonRDD? Thanks, Daoyuan
Re: MatchError in JsonRDD.toLong
Hi, On Fri, Jan 16, 2015 at 5:55 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Can you provide how you create the JsonRDD? This should be reproducible in the Spark shell:

import org.apache.spark.sql._
val sqlc = new SQLContext(sc)

val rdd = sc.parallelize(
  """{"Click":"nonclicked", "Impression":1, "DisplayURL":4401798909506983219, "AdId":21215341}""" ::
  """{"Click":"nonclicked", "Impression":1, "DisplayURL":14452800566866169008, "AdId":10587781}""" :: Nil)

// works fine
val json = sqlc.jsonRDD(rdd)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect

// -> MatchError
val json2 = sqlc.jsonRDD(rdd, 0.1)
json2.registerTempTable("test2")
sqlc.sql("SELECT * FROM test2").collect

I guess the issue in the latter case is that the column is inferred as Long when some rows actually are too big for Long... Thanks Tobias
RE: MatchError in JsonRDD.toLong
The second parameter of jsonRDD is the sampling ratio when we infer schema. Thanks, Daoyuan
RE: MatchError in JsonRDD.toLong
And you can use jsonRDD(json: RDD[String], schema: StructType) to specify your schema explicitly. For numbers larger than Long can hold, we can use DecimalType. Thanks, Daoyuan
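For reference, a minimal sketch of the explicit-schema approach suggested above. The column names are taken from the sample records earlier in the thread, and the exact spelling of the unbounded decimal type (DecimalType.Unlimited vs. DecimalType()) varies slightly between Spark versions:

import org.apache.spark.sql._

// Explicit schema: DisplayURL can exceed Long, so declare it as a decimal.
val schema = StructType(Seq(
  StructField("Click", StringType, nullable = true),
  StructField("Impression", IntegerType, nullable = true),
  StructField("DisplayURL", DecimalType.Unlimited, nullable = true),
  StructField("AdId", LongType, nullable = true)
))

// rdd is the RDD[String] of JSON records from the repro above.
val json = sqlc.jsonRDD(rdd, schema)
json.registerTempTable("test")
sqlc.sql("SELECT * FROM test").collect()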
MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. Input data has only two fields `userId` and `productId`. I have **no product ratings**, just info on what products users have bought, that's all. So to train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

( http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$ )

This API requires a `Rating` object: Rating(user: Int, product: Int, rating: Double) On the other hand, the documentation on `trainImplicit` says: *Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, **preference**) pairs.* When I set rating / preferences to `1` as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()

And then train ALS:

val model = ALS.trainImplicit(ratings, rank, numIter)

I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1 value:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

So my question is: to what value should I set `rating` in Rating(user: Int, product: Int, rating: Double) for implicit training (in the `ALS.trainImplicit` method)?

**Update** With:

val alpha = 40
val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products. Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

Which is still a big error, I guess. Also I get a strange baseline improvement where the baseline model is simply mean (1).
Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
Hi all! In Spark SQL 1.2.0 I create a hive table with custom parquet inputformat and outputformat, like this:

CREATE TABLE test(
  id string,
  msg string)
CLUSTERED BY (id)
SORTED BY (id ASC) INTO 10 BUCKETS
ROW FORMAT SERDE 'com.a.MyParquetHiveSerDe'
STORED AS INPUTFORMAT 'com.a.MyParquetInputFormat'
OUTPUTFORMAT 'com.a.MyParquetOutputFormat';

And in the spark shell the plan of select * from test is:

[== Physical Plan ==]
[!OutputFaker [id#5,msg#6]]
[ ParquetTableScan [id#12,msg#13], (ParquetRelation hdfs://hadoop/user/hive/warehouse/test.db/test, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.hive.HiveContext@6d15a113, []), []]

Not HiveTableScan! So it doesn't execute my custom inputformat! Why? How can it execute my custom inputformat? Thanks!
terasort on spark
Hi all, I tried to run a terasort benchmark on my spark cluster, but I found it is hard to find a standard spark terasort program except a PR from rxin and ewan higgs: https://github.com/apache/spark/pull/1242 https://github.com/ehiggs/spark/tree/terasort The example which rxin provided comes without a validation test, so I tried higgs's example, but I sadly found I always get an error when validating: assertion failed: current partition min < last partition max It seems to require that the min array in partition 2 be bigger than the max array in partition 1, but the code here is confusing:

println(s"lastMax" + lastMax.toSeq.map(x => if (x < 0) 256 + x else x))
println(s"min" + min.toSeq.map(x => if (x < 0) 256 + x else x))
println(s"max" + max.toSeq.map(x => if (x < 0) 256 + x else x))

Has anyone ever run the terasort example successfully? Or where can I get a standard terasort application?
Re: How to define SparkContext with Cassandra connection for spark-jobserver?
Thank you Abhishek. The code works.
Fwd: Problems with TeraValidate
+spark-user

-- Forwarded message --
From: lonely Feb lonely8...@gmail.com
Date: 2015-01-16 19:09 GMT+08:00
Subject: Re: Problems with TeraValidate
To: Ewan Higgs ewan.hi...@ugent.be

thx a lot. btw, here is my output:

1. when dataset is 1000g:

num records: 100 checksum: 12aa5028310ea763e
part 0
lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
part 1
lastMax ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)

Exception in thread "main" java.lang.AssertionError: assertion failed: current partition min < last partition max
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:117)
at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:111)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.examples.terasort.TeraValidate$.validate(TeraValidate.scala:111)
at org.apache.spark.examples.terasort.TeraValidate$.main(TeraValidate.scala:59)
at org.apache.spark.examples.terasort.TeraValidate.main(TeraValidate.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2.
when dataset is 200m:

num records: 200 checksum: ca93e5d2fad40
part 0
lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
part 1
lastMax ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)
min ArrayBuffer(82, 24, 27, 218, 62, 68, 174, 208, 69, 78)
max ArrayBuffer(146, 177, 217, 195, 175, 144, 239, 81, 29, 252)

Exception in thread "main" java.lang.AssertionError: assertion failed: current partition min < last partition max
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:117)
at org.apache.spark.examples.terasort.TeraValidate$$anonfun$validate$3.apply(TeraValidate.scala:111)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.examples.terasort.TeraValidate$.validate(TeraValidate.scala:111)
at org.apache.spark.examples.terasort.TeraValidate$.main(TeraValidate.scala:59)
at org.apache.spark.examples.terasort.TeraValidate.main(TeraValidate.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I suspect something is wrong with the clone function.

2015-01-16 19:02 GMT+08:00 Ewan Higgs ewan.hi...@ugent.be:
Hi lonely, I am looking at this now. If you need to validate a terasort benchmark as soon as possible, I would use Hadoop's TeraValidate. I'll let you know when I have a fix. Yours, Ewan Higgs

On 16/01/15 09:47, lonely Feb wrote:
Hi, I run your terasort program on my spark cluster. When the dataset is small (below 1000g) everything goes fine, but when the dataset is over 1000g, TeraValidate always fails the assertion with: current partition min < last partition max. E.g. the output is:

num records: 100 checksum: 12aa5028310ea763e
part 0
lastMax ArrayBuffer(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
part 1
lastMax ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)
min ArrayBuffer(0, 4, 25, 150, 6, 136, 39, 39, 214, 164)
max ArrayBuffer(255, 255, 96, 244, 80, 50, 31, 158, 43, 113)

Exception in thread "main" java.lang.AssertionError: assertion failed: current partition min < last partition max at
Re: Spark SQL Parquet - data are reading very very slow
I have seen similar results. I have looked at the code and I think there are a couple of contributors: Encoding/decoding java Strings to UTF8 bytes is quite expensive. I'm not sure what you can do about that. But there are options for optimization due to the repeated decoding of the same String values.

As Spark queries process each row from Parquet they make a call to convert the Binary representation for each String column into a Java String. However in many (probably most) circumstances the underlying Binary classes from Parquet will have come from a Dictionary, for example when column cardinality is low. Therefore Spark is converting the same byte array to a copy of the same Java String over and over again. This is bad due to extra cpu, extra memory used for these strings, and probably results in more expensive grouping comparisons.

I tested a simple hack to cache the last Binary->String conversion per column and this led to a 25% performance improvement for the queries I used. Admittedly this was over a data set with lots of runs of the same Strings in the queried columns.

I haven't looked at the code to write Parquet files in Spark but I imagine similar duplicate String->Binary conversions could be happening. These costs are quite important for the type of data that I expect will be stored in Parquet, which will often have denormalized tables and probably lots of fairly low cardinality string columns.

It's possible that changes could be made to Parquet so that the encoding/decoding of Objects to Binary is handled on the Parquet side of the fence. Parquet could deal with Objects (Strings) as the client understands them and only use encoding/decoding to store/read from the underlying storage medium. Doing this I think Parquet could ensure that the encoding/decoding of each Object occurs only once.
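For illustration, a minimal sketch of the "cache the last Binary->String conversion per column" hack described above. This is not Spark's or Parquet's actual converter code; the class and method names are made up:

// One instance per string column. Dictionary-encoded columns tend to hand back
// the same underlying byte array instance repeatedly, so a reference-equality
// check on the raw bytes is enough to reuse the previously decoded String.
class CachedStringDecoder {
  private var lastBytes: Array[Byte] = null
  private var lastString: String = null

  def decode(bytes: Array[Byte]): String = {
    if (!(bytes eq lastBytes)) {   // eq = reference equality
      lastBytes = bytes
      lastString = new String(bytes, "UTF-8")
    }
    lastString
  }
}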
Re: MissingRequirementError with spark
Created JIRA issue https://issues.apache.org/jira/browse/SPARK-5281
Re: Executor vs Mapper in Hadoop
Thanks a lot for clarifying it. Then I have some follow-up questions:

1. If normally we have 1 executor per machine, and we have a cluster with different hardware capacity, for example one 8-core worker and one 4-core worker (ignoring the driver machine), then if we set executor-cores=4, the 8-core worker will run 2 executors, am I right? If we set executor-cores=8, then the worker with 4 cores can't start any executors. A similar idea also applies to the memory allocation, so should we always use the same hardware configuration for the spark cluster worker machines?

2. If I run spark on Yarn (actually EMR), where can I check/configure the default executor-cores?

Regards, Shuai

On Thu, Jan 15, 2015 at 11:13 PM, Sean Owen so...@cloudera.com wrote: An executor is specific to a Spark application, just as a mapper is specific to a MapReduce job. So a machine will usually be running many executors, and each is a JVM. A Mapper is single-threaded; an executor can run many tasks (possibly from different jobs within the application) at once. Yes, 5 executors with 4 cores should be able to process 20 tasks in parallel. In any normal case, you have 1 executor per machine per application. There are cases where you would make more than 1, but these are unusual.

On Thu, Jan 15, 2015 at 8:16 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I am trying to clarify some executor behavior in spark. Because I come from a Hadoop background, I try to compare executors to the Mappers (or reducers) in hadoop. 1. Each node can have multiple executors, each running in its own process? This is the same as the mapper process. 2. I thought the spark executor will use multi-thread mode when there is more than 1 core allocated to it (for example: set executor-cores to 5). In this way, how many partitions can it process? For example, if the input is 20 partitions (similar to 20 splits as mapper input) and we have 5 executors, each with 4 cores, will all these partitions be processed at the same time (so each core processes one partition), or can one executor actually only run one partition at a time? I don't know whether my understanding is correct, please suggest. BTW: In general practice, should we always try to set executor-cores to a higher number? So we would favor 10 cores * 2 executors over 2 cores * 10 executors? Any suggestion here? Thanks! Regards, Shuai
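As context for question 2, a hedged example of where executor cores are usually set when submitting to YARN; the numbers, class name, and jar below are purely illustrative, not a recommendation for the cluster described above:

spark-submit \
  --master yarn-client \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 4g \
  --class com.example.MyApp \
  my-app.jar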
RE: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
I think you might need to set spark.sql.hive.convertMetastoreParquet to false if I understand that flag correctly.
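A minimal sketch of where that flag would be set, assuming an existing HiveContext named hiveContext (it can equally be passed as --conf spark.sql.hive.convertMetastoreParquet=false at submit time):

hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")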
Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?
Thanks yana! I will try it!

On 16 January 2015, at 20:51, yana yana.kadiy...@gmail.com wrote: I think you might need to set spark.sql.hive.convertMetastoreParquet to false if I understand that flag correctly.
RE: How to force parallel processing of RDD using multiple thread
Does parallel processing mean it is executed on multiple workers, or executed on one worker but in multiple threads? For example, if I have only one worker but my RDD has 4 partitions, will it be executed in parallel in 4 threads? The reason I am asking is to decide whether I need to configure spark to have multiple workers. By default, it just starts with one worker. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, January 15, 2015 11:04 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Check the number of partitions in your input. It may be much less than the available parallelism of your small cluster. For example, input that lives in just 1 partition will spawn just 1 task. Beyond that, parallelism just happens. You can see the parallelism of each operation in the Spark UI.

On Thu, Jan 15, 2015 at 10:53 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Spark Standalone cluster. My program is running very slowly, and I suspect it is not doing parallel processing of the rdd. How can I force it to run in parallel? Is there any way to check whether it is processed in parallel? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, January 15, 2015 4:29 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

What is your cluster manager? For example on YARN you would specify --executor-cores. Read: http://spark.apache.org/docs/latest/running-on-yarn.html

On Thu, Jan 15, 2015 at 8:54 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have a standalone spark cluster with only one node which has 4 CPU cores. How can I force spark to do parallel processing of my RDD using multiple threads? For example I can do the following: Spark-submit --master local[4] However I really want to use the cluster as follows: Spark-submit --master spark://10.125.21.15:7070 In that case, how can I make sure the RDD is processed with multiple threads/cores? Thanks Ningjun
Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
I have asked this question before but got no answer. Asking again. Can I save an RDD to the local file system and then read it back on a spark cluster with multiple nodes?

rdd.saveAsObjectFile("file:///home/data/rdd1")
val rdd2 = sc.objectFile("file:///home/data/rdd1")

This works if the cluster has only one node. But my cluster has 3 nodes and each node has a local dir called /home/data. Is the rdd saved to the local dir across the 3 nodes? If so, is sc.objectFile(...) smart enough to read the local dir on all 3 nodes and merge them into a single rdd? Ningjun
RE: How to force parallel processing of RDD using multiple thread
So one worker is enough and it will use all 4 cores? In what situation should I configure more workers in my single-node cluster? Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541

From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Friday, January 16, 2015 9:44 AM
To: Wang, Ningjun (LNG-NPV)
Cc: Sean Owen; user@spark.apache.org
Subject: Re: How to force parallel processing of RDD using multiple thread

Spark will use the number of cores available in the cluster. If your cluster is 1 node with 4 cores, Spark will execute up to 4 tasks in parallel. Setting your # of partitions to 4 will ensure an even load across cores. Note that this is different from saying threads - internally Spark uses many threads (data block sender/receiver, listeners, notifications, scheduler, ...). -kr, Gerard.
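A tiny sketch of the advice above, assuming an existing RDD named rdd on the 4-core single-node cluster being discussed: check how many partitions it has and repartition to match the available cores so all 4 can be kept busy.

println(rdd.partitions.size)         // current number of partitions
val rebalanced = rdd.repartition(4)  // one partition per available core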
How to 'Pipe' Binary Data in Apache Spark
I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program. This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark? Maybe it is getting corrupted when I read it in initially with 'textFile'?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it. Thanks! -- Nick Allen n...@nickallen.org
Re: MLib: How to set preferences for ALS implicit feedback in Collaborative Filtering?
On Fri, Jan 16, 2015 at 9:58 AM, Zork Sail zorks...@gmail.com wrote: And then train ALSL: val model = ALS.trainImplicit(ratings, rank, numIter) I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1 value: This is likely the problem. RMSE is not an appropriate evaluation metric when you have trained a model on implicit data. The factorization is not minimizing the same squared error loss that RMSE evaluates. Use metrics like AUC instead, for example. Rating value can be 1 if you have no information at all about the interaction other than that it exists. It should be thought of as a weight. 10 means it's 10 times more important to predict an interaction than one with weight 1. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
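For reference, a minimal sketch of evaluating an implicit-feedback model with AUC instead of RMSE, along the lines suggested above. It assumes the trained model, a held-out set of positive (user, product) interactions, and some sampled negative pairs; the variable names are illustrative only:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def areaUnderROC(model: MatrixFactorizationModel,
                 positives: RDD[(Int, Int)],
                 negatives: RDD[(Int, Int)]): Double = {
  // Score both sets with the model and label them 1.0 / 0.0.
  val scored = model.predict(positives).map(r => (r.rating, 1.0))
    .union(model.predict(negatives).map(r => (r.rating, 0.0)))
  new BinaryClassificationMetrics(scored).areaUnderROC()
}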
Re: How to 'Pipe' Binary Data in Apache Spark
Well it looks like you're reading some kind of binary file as text. That isn't going to work, in Spark or elsewhere, as binary data is not even necessarily the valid encoding of a string. There are no line breaks to delimit lines and thus elements of the RDD. Your input has some record structure (or else it's not really useful to put it into an RDD). You can encode this as a SequenceFile and read it with objectFile. You could also write a custom InputFormat that knows how to parse pcap records directly.
Re: Scala vs Python performance differences
I was interested in this as I had some Spark code in Python that was too slow and wanted to know whether Scala would fix it for me. So I re-wrote my code in Scala. In my particular case the Scala version was 10 times faster. But I think that is because I did an awful lot of computation in my own code rather than in a library like numpy. (I put a bit more detail here http://tttv-engineering.tumblr.com/post/108260351966/spark-python-vs-scala in case you are interested) So there's one data point, if only an obvious one comparing computations in Scala to computations in pure Python.
Re: How to 'Pipe' Binary Data in Apache Spark
Per your last comment, it appears I need something like this: https://github.com/RIPE-NCC/hadoop-pcap Thanks a ton. That gets me oriented in the right direction. -- Nick Allen n...@nickallen.org
kinesis multiple records adding into stream
Hi Experts! I am using the kinesis dependency as follows:

groupId = org.apache.spark
artifactId = spark-streaming-kinesis-asl_2.10
version = 1.2.0

In this, aws sdk version 1.8.3 is being used, and in this sdk multiple records can not be put in a single request. Is it possible to put multiple records in a single request? thanks
Re: kinesis multiple records adding into stream
Sorry. I couldn't understand the issue. Are you trying to send data to kinesis from a spark batch/real time job? - Aniket
Re: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
I'm not positive, but I think this is very unlikely to work. First, when you call sc.objectFile(...), I think the *driver* will need to know something about the file, eg to know how many tasks to create. But it won't even be able to see the file, since it only lives on the local filesystem of the cluster nodes. If you really wanted to, you could probably write out some small metadata about the files and write your own version of objectFile that uses it. But I think there is a bigger conceptual issue. You might not in general be sure that you are running on the same nodes when you save the file, as when you read it back in. So the file might not be present on the local filesystem for the active executors. You might be able to guarantee it for the specific cluster setup you have now, but it might limit you down the road. What are you trying to achieve? There might be a better way. I believe writing to hdfs will usually write one local copy, so you'd still be doing a local read when you reload the data. Imran On Jan 16, 2015 6:19 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have asked this question before but get no answer. Asking again. Can I save RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile(“file:///home/data/rdd1”) val rdd2 = sc.objectFile(“file:///home/data/rdd1”) This will works if the cluster has only one node. But my cluster has 3 nodes and each node has a local dir called /home/data. Is rdd saved to the local dir across 3 nodes? If so, does sc.objectFile(…) smart enough to read the local dir in all 3 nodes to merge them into a single rdd? Ningjun
remote Akka client disassociated - some timeout?
Hi, I believe this is some kind of timeout problem but can't figure out how to increase it. I am running spark 1.2.0 on yarn (all from cdh 5.3.0). I submit a python task which first loads a big RDD from hbase - I can see in the screen output all executors fire up, then no more logging output for the next two minutes, after which I get plenty of:

15/01/16 17:35:16 ERROR cluster.YarnClientClusterScheduler: Lost executor 7 on node01: remote Akka client disassociated
15/01/16 17:35:16 INFO scheduler.TaskSetManager: Re-queueing tasks for 7 from TaskSet 1.0
15/01/16 17:35:16 WARN scheduler.TaskSetManager: Lost task 32.0 in stage 1.0 (TID 17, node01): ExecutorLostFailure (executor 7 lost)
15/01/16 17:35:16 WARN scheduler.TaskSetManager: Lost task 34.0 in stage 1.0 (TID 25, node01): ExecutorLostFailure (executor 7 lost)

this points to some timeout ~120secs while the nodes are loading the big RDD? any ideas how to get around it? FYI I already use the following options without any success:

spark.core.connection.ack.wait.timeout: 600
spark.akka.timeout: 1000

thanks, Antony.
RE: Determine number of running executors
Hi Tobias, Can you share more information about how you do that? I also have a similar question about this. Thanks a lot, Regards, Shuai

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Wednesday, November 26, 2014 12:25 AM
To: Sandy Ryza
Cc: Yanbo Liang; user
Subject: Re: Determine number of running executors

Hi, Thanks for your help! Sandy, I had a bit of trouble finding the spark.executor.cores property. (It wasn't there although its value should have been 2.) I ended up throwing regular expressions on scala.util.Properties.propOrElse("sun.java.command", ""), which worked surprisingly well ;-) Thanks Tobias
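A rough sketch of the trick Tobias describes, purely to illustrate the idea; it is fragile, depends on how the application was launched, and is not a supported API:

// Inspect the JVM command line and pull out the executor-cores setting, if present.
val cmd = scala.util.Properties.propOrElse("sun.java.command", "")
val executorCores: Option[Int] =
  "--executor-cores[ =](\\d+)".r.findFirstMatchIn(cmd).map(_.group(1).toInt)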
Re: kinesis multiple records adding into stream
Are you referring to the PutRecords method, which was added in 1.9.9? (See http://aws.amazon.com/releasenotes/1369906126177804) If so, can't you just depend upon this later version of the SDK in your app even though spark-streaming-kinesis-asl is depending upon this earlier 1.9.3 version that does not yet have the PutRecords call? Also, could you please explain your use case more fully? spark-streaming-kinesis-asl is for *reading* data from Kinesis in Spark, not for writing data, so I would expect that the part of your code that would be writing to Kinesis would be a totally separate app anyway (unless you are reading from Kinesis using spark-streaming-kinesis-asl, transforming it somehow, then writing it back out to Kinesis). ~ Jonathan
RE: Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?
I need to save an RDD to the file system and then restore my RDD from the file system in the future. I don't have any hdfs file system and don't want to go through the hassle of setting up an hdfs system. So how can I achieve this? The application needs to be run on a cluster with multiple nodes. Regards, Ningjun
Re: Failing jobs runs twice
FYI, I just confirmed with the latest Spark 1.3 snapshot that the spark.yarn.maxAppAttempts setting that SPARK-2165 refers to works perfectly. Great to finally get rid of this problem. Also caused an issue when the eventLogs were enabled since the spark-events/appXXX folder already exists the second time the app gets launched. On Thu, Jan 15, 2015 at 3:01 PM, Anders Arpteg arp...@spotify.com wrote: Found a setting that seems to fix this problem, but it does not seems to be available until Spark 1.3. See https://issues.apache.org/jira/browse/SPARK-2165 However, glad to see a work is being done with the issue. On Tue, Jan 13, 2015 at 8:00 PM, Anders Arpteg arp...@spotify.com wrote: Yes Andrew, I am. Tried setting spark.yarn.applicationMaster.waitTries to 1 (thanks Sean), but with no luck. Any ideas? On Tue, Jan 13, 2015 at 7:58 PM, Andrew Or and...@databricks.com wrote: Hi Anders, are you using YARN by any chance? 2015-01-13 0:32 GMT-08:00 Anders Arpteg arp...@spotify.com: Since starting using Spark 1.2, I've experienced an annoying issue with failing apps that gets executed twice. I'm not talking about tasks inside a job, that should be executed multiple times before failing the whole app. I'm talking about the whole app, that seems to close the previous Spark context, start a new, and rerun the app again. This is annoying since it overwrite the log files as well and it becomes hard to troubleshoot the failing app. Does anyone know how to turn this feature off? Thanks, Anders
If an RDD appeared twice in a DAG, of which calculation is triggered by a single action, will this RDD be calculated twice?
I'm talking about RDD1 (not persisted or checkpointed) in this situation:

...(somewhere) -> RDD1 -> RDD2
                   |        |
                   V        V
                  RDD3  -> RDD4 -> Action!

In my experience the chance that RDD1 gets recalculated is volatile: sometimes it is computed once, sometimes twice. When calculation of this RDD is expensive (e.g. it involves using a RESTful service that charges me money), this compels me to persist RDD1, which takes extra memory, and in case the Action! doesn't always happen, I don't know when to unpersist it to free that memory. A related problem might be in SQLContext.jsonRDD(), since the source jsonRDD is used twice (once for schema inference, once for the data read). It almost guarantees that the source jsonRDD is calculated twice. Has this problem been addressed so far?
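For what it's worth, a small sketch of the persist/unpersist workaround mentioned above, with placeholder RDDs and functions (sourceRdd, callRestService, f2, f3, f4 are all hypothetical stand-ins for the real pipeline):

import org.apache.spark.storage.StorageLevel

val rdd1 = sourceRdd.map(callRestService)   // the expensive step to avoid recomputing
rdd1.persist(StorageLevel.MEMORY_AND_DISK)

val rdd2 = rdd1.map(f2)
val rdd3 = rdd1.map(f3)
val rdd4 = rdd2.zip(rdd3).map { case (a, b) => f4(a, b) }  // stand-in for how RDD3/RDD4 combine

rdd4.count()       // the single action
rdd1.unpersist()   // release the cached blocks once the result is materialized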
Re: How to 'Pipe' Binary Data in Apache Spark
I just wanted to reiterate the solution for the benefit of the community. The problem is not from my use of 'pipe', but that 'textFile' cannot be used to read in binary data. (Doh) There are a couple of options to move forward.

1. Implement a custom 'InputFormat' that understands the binary input data. (Per Sean Owen)
2. Use 'SparkContext.binaryFiles' to read in the entire binary file as a single record. This will impact performance as it prevents the use of more than one mapper on the file's data.

In my specific case, for #1 I can only find one project from RIPE-NCC (https://github.com/RIPE-NCC/hadoop-pcap) that does this. Unfortunately, it appears to only support a limited set of network protocols. -- Nick Allen n...@nickallen.org
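A minimal sketch of option 2 above: read whole pcap files as byte arrays and turn each into CSV lines. parsePcapToCsv and the paths are hypothetical placeholders, not real library calls:

import org.apache.spark.input.PortableDataStream

def parsePcapToCsv(bytes: Array[Byte]): Iterator[String] = ???   // hypothetical decoder

// Each element is (file path, stream over the whole file's bytes).
val files = sc.binaryFiles("hdfs:///data/pcap/")
val csv = files.flatMap { case (path, stream: PortableDataStream) => parsePcapToCsv(stream.toArray()) }
csv.saveAsTextFile("hdfs:///data/pcap-csv/")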
Re: Scala vs Python performance differences
Hey Phil, Thank you for sharing this. The result didn't surprise me much; it's normal to prototype in Python, and once it gets stable and you really need the performance, rewriting part of it in C, or all of it in another language, does make sense and won't cost you much time. Davies On Fri, Jan 16, 2015 at 7:38 AM, philpearl p...@tanktop.tv wrote: I was interested in this as I had some Spark code in Python that was too slow and wanted to know whether Scala would fix it for me. So I re-wrote my code in Scala. In my particular case the Scala version was 10 times faster. But I think that is because I did an awful lot of computation in my own code rather than in a library like numpy. (I put a bit more detail here http://tttv-engineering.tumblr.com/post/108260351966/spark-python-vs-scala in case you are interested) So there's one data point, if only for the obvious case of comparing computations in Scala to computations in pure Python. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scala-vs-Python-performance-differences-tp4247p21190.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark java options
I want to add some Java options when submitting my application: --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder But it looks like they don't get set. Where can I add them to make this work? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark java options
Hi Kane, What's the complete command line you're using to submit the app? Where do you expect these options to appear? On Fri, Jan 16, 2015 at 11:12 AM, Kane Kim kane.ist...@gmail.com wrote: I want to add some java options when submitting application: --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder But looks like it doesn't get set. Where I can add it to make it working? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark java options
Hi Kane, Here's the command line you sent me privately: ./spark-1.2.0-bin-hadoop2.4/bin/spark-submit --class SimpleApp --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder --master local simpleapp.jar ./test.log You're running the app in local mode. In that mode, the executor is run as a thread inside the same JVM as the driver, so all executor-related options are ignored. Try using local-cluster[1,1,512] as your master, that will run a single executor in a separate process, and then you should see your options take effect. On Fri, Jan 16, 2015 at 11:12 AM, Kane Kim kane.ist...@gmail.com wrote: I want to add some java options when submitting application: --conf spark.executor.extraJavaOptions=-XX:+UnlockCommercialFeatures -XX:+FlightRecorder But looks like it doesn't get set. Where I can add it to make it working? Thanks. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
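For completeness, a sketch of the same idea done programmatically, assuming a local-cluster master as suggested above (local-cluster mode is mainly meant for testing against a built Spark distribution, and the FlightRecorder flags need an Oracle JDK). The point is simply that the executor options must be in the SparkConf before the context is created, and the executor must live in its own JVM for them to matter:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("jfr-example")
  .setMaster("local-cluster[1,1,512]")  // 1 worker, 1 core, 512 MB: the executor gets its own JVM
  .set("spark.executor.extraJavaOptions",
    "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder")

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())  // any small job, just to launch an executor
sc.stop()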
Creating Apache Spark-powered “As Service” applications
The question is about ways to create a Windows desktop-based and/or web-based application client that can connect and talk, at run time, to a server running a Spark application (either a local or an on-premise cloud distribution). Any language/architecture may work. So far, I've seen two things that may help, but I'm not yet sure whether they are the best alternatives or how they work: Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1) Any advice would be appreciated. A simple toy example program (or steps) that shows, e.g., how to build such a client that simply creates a Spark Context on a local machine, reads a text file, and returns basic stats would be an ideal answer! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Creating Apache Spark-powered “As Service” applications
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote: The question is about the ways to create a Windows desktop-based and/or web-based application client that is able to connect and talk to the server containing Spark application (either local or on-premise cloud distributions) in the run-time. Any language/architecture may work. So far, I've seen two things that may be a help in that, but I'm not so sure if they would be the best alternative and how they work yet: Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1) Any advice would be appreciated. Simple toy example program (or steps) that shows, e.g. how to build such client for simply creating Spark Context on a local machine and say reading text file and returning basic stats would be ideal answer! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Subscribe
Re: Subscribe
Send email to user-subscr...@spark.apache.org Cheers On Fri, Jan 16, 2015 at 11:51 AM, Andrew Musselman andrew.mussel...@gmail.com wrote:
Maven out of memory error
Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile [INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes... [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported `DataReadMethod' is permanently hidden by definition of object DataReadMethod in package executor [WARNING] import org.apache.spark.executor.DataReadMethod [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41: match may not be exhaustive. It would fail on the following input: TASK_ERROR [WARNING] def fromMesos(mesosState: MesosTaskState): TaskState = mesosState match { [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information. [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) { [WARNING] ^ [ERROR] PermGen space - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
Re: Maven out of memory error
Can you try doing this before running mvn ? export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m What OS are you using ? Cheers On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile [INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes... [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported `DataReadMethod' is permanently hidden by definition of object DataReadMethod in package executor [WARNING] import org.apache.spark.executor.DataReadMethod [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41: match may not be exhaustive. It would fail on the following input: TASK_ERROR [WARNING] def fromMesos(mesosState: MesosTaskState): TaskState = mesosState match { [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information. [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) { [WARNING] ^ [ERROR] PermGen space - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
Re: Creating Apache Spark-powered “As Service” applications
Hi, You can take a look at the Spark Kernel project: https://github.com/ibm-et/spark-kernel The Spark Kernel's goal is to serve as the foundation for interactive applications. The project provides a client library in Scala that abstracts connecting to the kernel (containing a Spark Context), which can be embedded into a web application. We demonstrated this at StataConf when we embedded the Spark Kernel client into a Play application to provide an interactive web application that communicates to Spark via the Spark Kernel (hosting a Spark Context). A getting started section can be found here: https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel If you have any other questions, feel free to email me or communicate over our mailing list: spark-ker...@googlegroups.com https://groups.google.com/forum/#!forum/spark-kernel Signed, Chip Senkbeil IBM Emerging Technology Software Engineer From: olegshirokikh o...@solver.com To: user@spark.apache.org Date: 01/16/2015 01:32 PM Subject:Creating Apache Spark-powered “As Service” applications The question is about the ways to create a Windows desktop-based and/or web-based application client that is able to connect and talk to the server containing Spark application (either local or on-premise cloud distributions) in the run-time. Any language/architecture may work. So far, I've seen two things that may be a help in that, but I'm not so sure if they would be the best alternative and how they work yet: Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1) Any advice would be appreciated. Simple toy example program (or steps) that shows, e.g. how to build such client for simply creating Spark Context on a local machine and say reading text file and returning basic stats would be ideal answer! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Maven out of memory error
Thanks Ted, got farther along but now have a failing test; is this a known issue? --- T E S T S --- Running org.apache.spark.JavaAPISuite Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec FAILURE! - in org.apache.spark.JavaAPISuite testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 106.5 sec ERROR! org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Running org.apache.spark.JavaJdbcRDDSuite Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec - in org.apache.spark.JavaJdbcRDDSuite Results : Tests in error: JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure: Maste... On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote: Can you try doing this before running mvn ? export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m What OS are you using ? Cheers On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile [INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes... 
[WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported `DataReadMethod' is permanently hidden by definition of object DataReadMethod in package executor [WARNING] import org.apache.spark.executor.DataReadMethod [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41: match may not be exhaustive. It would fail on the following input: TASK_ERROR [WARNING] def fromMesos(mesosState: MesosTaskState): TaskState = mesosState match { [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information. [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) { [WARNING] ^ [ERROR] PermGen space - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
Re: Maven out of memory error
Thanks Sean On Fri, Jan 16, 2015 at 12:06 PM, Sean Owen so...@cloudera.com wrote: Hey Andrew, you'll want to have a look at the Spark docs on building: http://spark.apache.org/docs/latest/building-spark.html It's the first thing covered there. The warnings are normal as you are probably building with newer Hadoop profiles and so old-Hadoop support code shows deprecation warnings on its use of old APIs. On Fri, Jan 16, 2015 at 8:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile [INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes... [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported `DataReadMethod' is permanently hidden by definition of object DataReadMethod in package executor [WARNING] import org.apache.spark.executor.DataReadMethod [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41: match may not be exhaustive. It would fail on the following input: TASK_ERROR [WARNING] def fromMesos(mesosState: MesosTaskState): TaskState = mesosState match { [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala:89: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information. [WARNING] if (!fileSystem.isDirectory(new Path(logBaseDir))) { [WARNING] ^ [ERROR] PermGen space - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
Performance issue
Hi, I observed some weird performance issue using Spark in combination with Theano, and I have no real explanation for that. To exemplify the issue I am using the pi.py example of spark that computes pi: When I modify the function from the example: #unmodified code def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 < 1 else 0 count = sc.parallelize(xrange(1, n + 1), partitions).map(f).reduce(add) # by adding a very simple dummy function that just computes the product of two floats, the execution slows down massively (about 100x slower). Here is the slow code: # define simple function in theano that computes the product x = T.dscalar() y = T.dscalar() dummyFun = theano.function([x,y],y * x) broadcast_dummyFun = sc.broadcast(dummyFun) def f(_): x = random() * 2 - 1 y = random() * 2 - 1 # compute product tmp = broadcast_dummyFun.value(x,y) return 1 if x ** 2 + y ** 2 < 1 else 0 Any idea why it slows down so much? Using a python function that computes the product (or lambda function) again gives full-speed. I would appreciate some help on that. -Tassilo -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-tp21194.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?
You can get the internal RDD using: schemaRDD.queryExecution.toRDD. This is used internally and does not copy. This is an unstable developer API. On Thu, Jan 15, 2015 at 11:26 PM, Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Thanks Cheng! Is there any API I can get access too (e.g. ParquetTableScan) which would allow me to load up the low level/baseRDD of just RDD[Row] so I could avoid the defensive copy (maybe lose our on columnar storage etc.). We have parts of our pipeline using SparkSQL/SchemaRDDs and others using the core RDD api (mapPartitions etc.). Any tips? Out of curiosity, a lot of SparkSQL functions seem to run in a mapPartiton (e.g. Distinct). Does a defensive copy happen there too? Keen to get the best performance and the best blend of SparkSQL and functional Spark. Cheers, Nathan From: Cheng Lian lian.cs@gmail.com Date: Monday, 12 January 2015 1:21 am To: Nathan nathan.mccar...@quantium.com.au, Michael Armbrust mich...@databricks.com Cc: user@spark.apache.org user@spark.apache.org Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats? On 1/11/15 1:40 PM, Nathan McCarthy wrote: Thanks Cheng Michael! Makes sense. Appreciate the tips! Idiomatic scala isn't performant. I’ll definitely start using while loops or tail recursive methods. I have noticed this in the spark code base. I might try turning off columnar compression (via *spark.sql.inMemoryColumnarStorage.compressed=false *correct?) and see how performance compares to the primitive objects. Would you expect to see similar runtimes vs the primitive objects? We do have the luxury of lots of memory at the moment so this might give us an additional performance boost. Turning off compression should be faster, but still slower than directly using primitive objects. Because Spark SQL also serializes all objects within a column into byte buffers in a compact format. However, this radically reduces number of Java objects in the heap and is more GC friendly. When running large queries, cost introduced by GC can be significant. Regarding the defensive copying of row objects. Can we switch this off and just be aware of the risks? Is MapPartitions on SchemaRDDs and operating on the Row object the most performant way to be flipping between SQL Scala user code? Is there anything else I could be doing? This can be very dangerous and error prone. Whenever an operator tries to cache row objects, turning off defensive copying can introduce wrong query result. For example, sort-based shuffle caches rows to do sorting. In some cases, sample operator may also cache row objects. This is very implementation specific and may change between versions. Cheers, ~N From: Michael Armbrust mich...@databricks.com Date: Saturday, 10 January 2015 3:41 am To: Cheng Lian lian.cs@gmail.com Cc: Nathan nathan.mccar...@quantium.com.au, user@spark.apache.org user@spark.apache.org Subject: Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats? The other thing to note here is that Spark SQL defensively copies rows when we switch into user code. This probably explains the difference between 1 2. The difference between 1 3 is likely the cost of decompressing the column buffers vs. accessing a bunch of uncompressed primitive objects. On Fri, Jan 9, 2015 at 6:59 AM, Cheng Lian lian.cs@gmail.com wrote: Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. 
Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple example; load up some sample data from parquet on HDFS (about 380m rows, 10 columns) on a 7 node cluster. val t = sqlC.parquetFile(/user/n/sales-tran12m.parquet”) t.registerTempTable(test1”) sqlC.cacheTable(test1”) Now lets do some operations on it; I want the total sales quantities sold for each hour in the day so I choose 3 out of the 10 possible columns... sqlC.sql(select Hour, sum(ItemQty), sum(Sales) from test1 group by Hour).collect().foreach(println) After the table has been 100% cached in memory, this takes around 11 seconds. Lets do the same thing but via a MapPartitions call (this isn’t production ready code but gets the job done). val try2 = sqlC.sql(select Hour, ItemQty, Sales from test1”) rddPC.mapPartitions { case hrs = val qtySum = new Array[Double](24) val salesSum = new Array[Double](24) for(r - hrs) { val hr = r.getInt(0) qtySum(hr) += r.getDouble(1) salesSum(hr) += r.getDouble(2) } (salesSum zip qtySum).zipWithIndex.map(_.swap).iterator }.reduceByKey((a,b) = (a._1 + b._1, a._2 + b._2)).collect().foreach(println) I believe the evil thing that makes this snippet much slower is the
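A sketch of the suggestion above against Spark 1.2, reusing the Hour/ItemQty/Sales query from the quoted thread (the table name and parquet path are taken from that message); the field is spelled queryExecution.toRdd in the 1.2 source, returns the internal RDD[Row] without defensive copies, and is an unstable developer API, so rows must not be cached or held onto in user code:

import org.apache.spark.SparkContext._   // pair RDD functions in pre-1.3 Spark
import org.apache.spark.sql.SQLContext

val sqlC = new SQLContext(sc)            // assumes an existing SparkContext `sc`
val t = sqlC.parquetFile("/user/n/sales-tran12m.parquet")
t.registerTempTable("test1")

val query = sqlC.sql("select Hour, ItemQty, Sales from test1")
val raw = query.queryExecution.toRdd     // internal rows, no per-row copy

val perHour = raw.mapPartitions { rows =>
  val qty = new Array[Double](24)
  val sales = new Array[Double](24)
  rows.foreach { r =>
    val hr = r.getInt(0)
    qty(hr) += r.getDouble(1)
    sales(hr) += r.getDouble(2)
  }
  Iterator.tabulate(24)(h => (h, (qty(h), sales(h))))
}.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

perHour.collect().foreach(println)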
Re: Maven out of memory error
I got the same error: testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 261.111 sec ERROR! org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) Looking under ore/target/surefire-reports/ , I don't see test output. Trying to figure out how test output can be generated. Cheers On Fri, Jan 16, 2015 at 12:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Thanks Ted, got farther along but now have a failing test; is this a known issue? --- T E S T S --- Running org.apache.spark.JavaAPISuite Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec FAILURE! - in org.apache.spark.JavaAPISuite testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 106.5 sec ERROR! org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Running org.apache.spark.JavaJdbcRDDSuite Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec - in org.apache.spark.JavaJdbcRDDSuite Results : Tests in error: JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure: Maste... On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote: Can you try doing this before running mvn ? export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m What OS are you using ? 
Cheers On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks! [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 --- [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile [INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes... [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported `DataReadMethod' is permanently hidden by definition of object DataReadMethod in package executor [WARNING] import org.apache.spark.executor.DataReadMethod [WARNING] ^ [WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/TaskState.scala:41: match may not be exhaustive. It would fail on the following input: TASK_ERROR [WARNING] def fromMesos(mesosState: MesosTaskState): TaskState = mesosState match { [WARNING] ^ [WARNING]
Cluster Aware Custom RDD
Hello all, I have a custom RDD for fast loading of data from a non-partitioned source. The partitioning happens in the RDD implementation by pushing data from the source into queues picked up by the currently active partitions in worker threads. This works great on a multi-threaded single host (say, with the master set to local[x]), but I'd like to run it distributed. However, I need to know not only which slice my partition is, but also which host (by sequence) it's on, so I can divide up the source by worker (host) and then run it multi-threaded. In other words, I need what effectively amounts to a 2-tier slice identifier. I know this is probably unorthodox, but is there some way to get this information in the compute method or the deserialized Partition objects? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cluster-Aware-Custom-RDD-tp21196.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark random forest - string data
Hi, I have been playing around with the new version of the Spark MLlib Random Forest implementation and, in the process, tried it with a file with string features. While training, it fails with: java.lang.NumberFormatException: For input string. Is MLlib's Random Forest able to run only on numeric data? Thanks
RE: Creating Apache Spark-powered “As Service” applications
Thanks a lot, Robert – I’ll definitely investigate this and probably would come back with questions. P.S. I’m new to this Spark forum. I’m getting responses through emails but they are not appearing as “replies” in the thread – it’s kind of inconvenient. Is it something that I should tweak? Thanks, Oleg From: Robert C Senkbeil [mailto:rcsen...@us.ibm.com] Sent: Friday, January 16, 2015 12:21 PM To: Oleg Shirokikh Cc: user@spark.apache.org Subject: Re: Creating Apache Spark-powered “As Service” applications Hi, You can take a look at the Spark Kernel project: https://github.com/ibm-et/spark-kernel The Spark Kernel's goal is to serve as the foundation for interactive applications. The project provides a client library in Scala that abstracts connecting to the kernel (containing a Spark Context), which can be embedded into a web application. We demonstrated this at StataConf when we embedded the Spark Kernel client into a Play application to provide an interactive web application that communicates to Spark via the Spark Kernel (hosting a Spark Context). A getting started section can be found here: https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel If you have any other questions, feel free to email me or communicate over our mailing list: spark-ker...@googlegroups.commailto:spark-ker...@googlegroups.com https://groups.google.com/forum/#!forum/spark-kernel Signed, Chip Senkbeil IBM Emerging Technology Software Engineer [Inactive hide details for olegshirokikh ---01/16/2015 01:32:43 PM---The question is about the ways to create a Windows desktop-]olegshirokikh ---01/16/2015 01:32:43 PM---The question is about the ways to create a Windows desktop-based and/or web-based application client From: olegshirokikh o...@solver.commailto:o...@solver.com To: user@spark.apache.orgmailto:user@spark.apache.org Date: 01/16/2015 01:32 PM Subject: Creating Apache Spark-powered “As Service” applications The question is about the ways to create a Windows desktop-based and/or web-based application client that is able to connect and talk to the server containing Spark application (either local or on-premise cloud distributions) in the run-time. Any language/architecture may work. So far, I've seen two things that may be a help in that, but I'm not so sure if they would be the best alternative and how they work yet: Spark Job Server - https://github.com/spark-jobserver/spark-jobserver - defines a REST API for Spark Hue - http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ - uses item 1) Any advice would be appreciated. Simple toy example program (or steps) that shows, e.g. how to build such client for simply creating Spark Context on a local machine and say reading text file and returning basic stats would be ideal answer! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Creating-Apache-Spark-powered-As-Service-applications-tp21193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org
Re: Spark random forest - string data
The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode categorical values as 0, 1, 2 ... and tell the implementation which of your features are categorical. On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav asaf.la...@gmail.com wrote: Hi, I have been playing around with the new version of Spark MLlib Random forest implementation, and while in the process, tried it with a file with String Features. While training, it fails with: java.lang.NumberFormatException: For input string. Is MBLib Random forest adapted to run on top of numeric data only? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
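A sketch of that encoding with made-up data: each string category is mapped to an index (0, 1, 2, ...) and the feature is declared in categoricalFeaturesInfo so the trees treat it as unordered categories rather than ordered numbers (an existing SparkContext `sc` is assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

val colorIndex = Map("red" -> 0.0, "green" -> 1.0, "blue" -> 2.0)

// (label, color, numeric feature) -> LabeledPoint with the color encoded as an index
val data = sc.parallelize(Seq(
  (1.0, "red", 3.5), (0.0, "green", 1.2), (1.0, "blue", 7.8), (0.0, "red", 0.4)))
  .map { case (label, color, x) =>
    LabeledPoint(label, Vectors.dense(colorIndex(color), x))
  }

// Feature 0 is categorical with 3 distinct values; feature 1 stays continuous.
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map(0 -> 3),
  numTrees = 10,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 4,
  maxBins = 32,
  seed = 42)

println(model.predict(Vectors.dense(colorIndex("green"), 2.0)))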
Re: Using Spark SQL with multiple (avro) files
I'd open an issue on the github to ask us to allow you to use hadoops glob file format for the path. On Thu, Jan 15, 2015 at 4:57 AM, David Jones letsnumsperi...@gmail.com wrote: I've tried this now. Spark can load multiple avro files from the same directory by passing a path to a directory. However, passing multiple paths separated with commas didn't work. Is there any way to load all avro files in multiple directories using sqlContext.avroFile? On Wed, Jan 14, 2015 at 3:53 PM, David Jones letsnumsperi...@gmail.com wrote: Should I be able to pass multiple paths separated by commas? I haven't tried but didn't think it'd work. I'd expected a function that accepted a list of strings. On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: If the wildcard path you have doesn't work you should probably open a bug -- I had a similar problem with Parquet and it was a bug which recently got closed. Not sure if sqlContext.avroFile shares a codepath with .parquetFile...you can try running with bits that have the fix for .parquetFile or look at the source... Here was my question for reference: http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E On Wed, Jan 14, 2015 at 4:34 AM, David Jones letsnumsperi...@gmail.com wrote: Hi, I have a program that loads a single avro file using spark SQL, queries it, transforms it and then outputs the data. The file is loaded with: val records = sqlContext.avroFile(filePath) val data = records.registerTempTable(data) ... Now I want to run it over tens of thousands of Avro files (all with schemas that contain the fields I'm interested in). Is it possible to load multiple avro files recursively from a top-level directory using wildcards? All my avro files are stored under s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of these on EMR. If that's not possible, is there some way to load multiple avro files into the same table/RDD so the whole dataset can be processed (and in that case I'd supply paths to each file concretely, but I *really* don't want to have to do that). Thanks David
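Until avroFile accepts globs or multiple paths, one workaround is to load each directory separately and union the results. A sketch with made-up S3 paths, assuming the spark-avro package that provides sqlContext.avroFile and compatible schemas across all directories:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._   // adds avroFile to SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

val paths = Seq(
  "s3://my-bucket/avros/a/2015-01-16",
  "s3://my-bucket/avros/b/2015-01-16")

// Union the per-directory SchemaRDDs into one table.
val all = paths.map(p => sqlContext.avroFile(p)).reduce(_ unionAll _)
all.registerTempTable("data")
sqlContext.sql("SELECT count(*) FROM data").collect().foreach(println)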
Re: Spark random forest - string data
An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values. Auto-magically creating dummy variables from categorical data definitely comes in handy. I assume this is what SPARK-1216 is referring to, but I am not sure from the description. https://issues.apache.org/jira/browse/SPARK-1216 Auto-magically doing the scheme that Sean mentioned is referenced in SPARK-4081, I believe. https://issues.apache.org/jira/browse/SPARK-4081 On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen so...@cloudera.com wrote: The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode categorical values as 0, 1, 2 ... and tell the implementation about your categorical features to use categorical features. On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav asaf.la...@gmail.com wrote: Hi, I have been playing around with the new version of Spark MLlib Random forest implementation, and while in the process, tried it with a file with String Features. While training, it fails with: java.lang.NumberFormatException: For input string. Is MBLib Random forest adapted to run on top of numeric data only? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Nick Allen n...@nickallen.org
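A sketch of that dummy-variable encoding in Scala, with made-up categories: one level is the baseline (all zeros) and the remaining N-1 levels become 0/1 indicator features appended to the numeric ones (an existing SparkContext `sc` is assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val levels = Seq("red", "green", "blue")          // N = 3 categories
val dummyIndex = levels.tail.zipWithIndex.toMap   // "red" is the baseline

def dummies(category: String): Array[Double] = {
  val v = Array.fill(levels.size - 1)(0.0)
  dummyIndex.get(category).foreach(i => v(i) = 1.0)
  v
}

// (label, category, numeric feature) -> LabeledPoint with N-1 indicator columns
val data = sc.parallelize(Seq((1.0, "green", 2.5), (0.0, "red", 1.0), (1.0, "blue", 4.2)))
  .map { case (label, category, x) =>
    LabeledPoint(label, Vectors.dense(dummies(category) :+ x))
  }

data.collect().foreach(println)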
Re: Spark random forest - string data
Hi Asaf, featurestream [1] is an internal project I'm playing with that includes support for some of this, in particular: * 1-pass random forest construction * schema inference * native support for text fields Would this be of interest? It's not open source, but if there's sufficient demand I can get access to it. [1] https://github.com/featurestream On 16 January 2015 at 13:59, Nick Allen n...@nickallen.org wrote: An alternative approach would be to translate your categorical variables into dummy variables. If your strings represent N classes/categories you would generate N-1 dummy variables containing 0/1 values. Auto-magically creating dummy variables from categorical data definitely comes in handy. I assume this is what SPARK-1216 is referring to, but I am not sure from the description. https://issues.apache.org/jira/browse/SPARK-1216 Auto-magically doing the scheme that Sean mentioned is referenced in SPARK-4081, I believe. https://issues.apache.org/jira/browse/SPARK-4081 On Fri, Jan 16, 2015 at 4:45 PM, Sean Owen so...@cloudera.com wrote: The implementation accepts an RDD of LabeledPoint only, so you couldn't feed in strings from a text file directly. LabeledPoint is a wrapper around double values rather than strings. How were you trying to create the input then? No, it only accepts numeric values, although you can encode categorical values as 0, 1, 2 ... and tell the implementation about your categorical features to use categorical features. On Fri, Jan 16, 2015 at 9:25 PM, Asaf Lahav asaf.la...@gmail.com wrote: Hi, I have been playing around with the new version of Spark MLlib Random forest implementation, and while in the process, tried it with a file with String Features. While training, it fails with: java.lang.NumberFormatException: For input string. Is MBLib Random forest adapted to run on top of numeric data only? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Nick Allen n...@nickallen.org
spark 1.2 compatibility
Is Spark 1.2 compatible with HDP 2.1? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Processing .wav files in PySpark
I think you can not use textFile() or binaryFile() or pickleFile() here, it's different format than wav. You could get a list of paths for all the files, then sc.parallelize(), and foreach(): def process(path): # use subprocess to launch a process to do the job, read the stdout as result files = [] # a list of path of wav files sc.parallelize(files, len(files)).foreach(process) On Fri, Jan 16, 2015 at 2:11 PM, Venkat, Ankam ankam.ven...@centurylink.com wrote: I need to process .wav files in Pyspark. If the files are in local file system, I am able to process them. Once I store them on HDFS, I am facing issues. For example, I run a sox program on a wav file like this. sox ext2187854_03_27_2014.wav -n stats -- works fine sox hdfs://xxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats -- Does not work as sox cannot read HDFS file. So, I do like this. hadoop fs -cat hdfs://xxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats -- This works fine But, I am not able to do this in PySpark. wavfile = sc.textFile('hdfs://xxx:8020/user/ab00855/ext2187854_03_27_2014.wav') wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats'])) I tried different options like sc.binaryFiles and sc.pickleFile. Any thoughts? Regards, Venkat Ankam This communication is the property of CenturyLink and may contain confidential or privileged information. Unauthorized use of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy all copies of the communication and any attachments. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Row similarities
What's a good way to calculate similarities between all vector-rows in a matrix or RDD[Vector]? I'm seeing RowMatrix has a columnSimilarities method but I'm not sure I'm going down a good path to transpose a matrix in order to run that.
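If the number of rows is modest, a brute-force pairwise comparison sidesteps the transpose question entirely. A sketch with made-up vectors and an existing SparkContext `sc` (for large matrices, transposing and using columnSimilarities scales much better):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.5, 1.0, 0.0),
  Vectors.dense(1.0, 1.0, 1.0)))

val indexed = rows.zipWithIndex().map(_.swap)   // (rowId, vector)

def cosine(a: Vector, b: Vector): Double = {
  val (x, y) = (a.toArray, b.toArray)
  val dot = x.zip(y).map { case (u, v) => u * v }.sum
  val norm = math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum)
  if (norm == 0) 0.0 else dot / norm
}

val sims = indexed.cartesian(indexed)
  .filter { case ((i, _), (j, _)) => i < j }      // upper triangle only
  .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }

sims.collect().foreach(println)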
RE: spark 1.2 compatibility
Yes. It's compatible with HDP 2.1 -Original Message- From: bhavyateja [mailto:bhavyateja.potin...@gmail.com] Sent: Friday, January 16, 2015 3:17 PM To: user@spark.apache.org Subject: spark 1.2 compatibility Is spark 1.2 is compatibly with HDP 2.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: spark 1.2 compatibility
Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a problem. However officially HDP 2.1 + Spark 1.2 is not a supported scenario. -Original Message- From: Judy Nash Sent: Friday, January 16, 2015 5:35 PM To: 'bhavyateja'; user@spark.apache.org Subject: RE: spark 1.2 compatibility Yes. It's compatible with HDP 2.1 -Original Message- From: bhavyateja [mailto:bhavyateja.potin...@gmail.com] Sent: Friday, January 16, 2015 3:17 PM To: user@spark.apache.org Subject: spark 1.2 compatibility Is spark 1.2 is compatibly with HDP 2.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark 1.2 compatibility
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have not seen a problem. However officially HDP 2.1 + Spark 1.2 is not a supported scenario. -Original Message- From: Judy Nash Sent: Friday, January 16, 2015 5:35 PM To: 'bhavyateja'; user@spark.apache.org Subject: RE: spark 1.2 compatibility Yes. It's compatible with HDP 2.1 -Original Message- From: bhavyateja [mailto:bhavyateja.potin...@gmail.com] Sent: Friday, January 16, 2015 3:17 PM To: user@spark.apache.org Subject: spark 1.2 compatibility Is spark 1.2 is compatibly with HDP 2.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
HDFS Namenode in safemode when I turn off my EC2 instance
Hello Everyone, I am encountering trouble running Spark applications when I shut down my EC2 instances. Everything else seems to work except Spark. When I try running a simple Spark application, like sc.parallelize() I get the message that hdfs name node is in safemode. Has anyone else had this issue? Is there a proper protocol I should be following to turn off my spark nodes? Thank you!
Re: Spark SQL Custom Predicate Pushdown
Hao, Thanks so much for the links! This is exactly what I'm looking for. If I understand correctly, I can extend PrunedFilteredScan, PrunedScan, and TableScan and I should be able to support all the sql semantics? I'm a little confused about the Array[Filter] that is used with the Filtered scan. I have the ability to perform pretty robust seeks in the underlying data sets in Accumulo. I have an inverted index and I'm able to do intersections as well as unions- and rich predicates which form a tree of alternating intersections and unions. If I understand correctly- the Array[Filter] is to be treated as an AND operator? Do OR operators get propagated through the API at all? I'm trying to do as much pairing down of the dataset as possible on the individual tablet servers so that the data loaded into the spark layer is minimal- really used to perform joins, groupBys, sortBys and other computations that would require the relations to be combined in various ways. Thanks again for pointing me to this. On Fri, Jan 16, 2015 at 2:07 AM, Cheng, Hao hao.ch...@intel.com wrote: The Data Source API probably work for this purpose. It support the column pruning and the Predicate Push Down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit test: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources *From:* Corey Nolet [mailto:cjno...@gmail.com] *Sent:* Friday, January 16, 2015 1:51 PM *To:* user *Subject:* Spark SQL Custom Predicate Pushdown I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates down to the tablet servers. Where wouldI begin my implementation? Currently I have an input format which accepts a query object that gets pushed down. How would I extract this information from the HiveContext/SQLContext to be able to push this down?
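A sketch of what such a relation could look like against the Spark 1.2 data sources API; the Accumulo-specific parts are stubs and all names are made up. The Array[Filter] handed to buildScan is a conjunction (AND) of simple predicates, and as far as I can tell OR trees are not translated into sources.Filter objects in 1.2, so that kind of pruning would stay inside your own query-object pushdown. Spark SQL re-evaluates the original predicates on top of whatever the relation returns, so it is safe to ignore filters you cannot serve natively:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources._

case class AccumuloRelation(table: String)(@transient val sqlContext: SQLContext)
  extends PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("docId", StringType, nullable = false),
    StructField("score", IntegerType, nullable = true)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Translate the predicates you can serve into Accumulo ranges / iterator settings here.
    filters.foreach {
      case EqualTo(attribute, value)     => () // seek the exact range for attribute = value
      case GreaterThan(attribute, value) => () // open-ended range starting after value
      case _                             => () // leave anything else for Spark to re-check
    }
    // Stub: a real implementation would return an RDD backed by the Accumulo scan,
    // projecting only requiredColumns, in that order.
    val sample = Map("docId" -> "doc-1", "score" -> 42)
    sqlContext.sparkContext.parallelize(Seq(Row(requiredColumns.map(sample): _*)))
  }
}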
Re: Low Level Kafka Consumer for Spark
Hi Dib, For our usecase I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from spark job supported in your code ? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: There was a simple example https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45 which you can run after changing few lines of configurations. Thanks Best Regards On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Kidong, Just now I tested the Low Level Consumer with Spark 1.2 and I did not see any issue with Receiver.Store method . It is able to fetch messages form Kafka. Can you cross check other configurations in your setup like Kafka broker IP , topic name, zk host details, consumer id etc. Dib On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Kidong, No , I have not tried yet with Spark 1.2 yet. I will try this out and let you know how this goes. By the way, is there any change in Receiver Store method happened in Spark 1.2 ? Regards, Dibyendu On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote: Hi Dibyendu, I am using kafka 0.8.1.1 and spark 1.2.0. After modifying these version of your pom, I have rebuilt your codes. But I have not got any messages from ssc.receiverStream(new KafkaReceiver(_props, i)). I have found, in your codes, all the messages are retrieved correctly, but _receiver.store(_dataBuffer.iterator()) which is spark streaming abstract class's method does not seem to work correctly. Have you tried running your spark streaming kafka consumer with kafka 0.8.1.1 and spark 1.2.0 ? - Kidong. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p21180.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: ALS.trainImplicit running out of mem when using higher rank
although this helped to improve it significantly I still run into this problem despite increasing the spark.yarn.executor.memoryOverhead vastly: export SPARK_EXECUTOR_MEMORY=24Gspark.yarn.executor.memoryOverhead=6144 yet getting this:2015-01-17 04:47:40,389 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=30211,containerID=container_1421451766649_0002_01_115969] is running beyond physical memory limits. Current usage: 30.1 GB of 30 GB physical memory used; 33.0 GB of 63.0 GB virtual memory used. Killing container. is there anything more I can do? thanks,Antony. On Monday, 12 January 2015, 8:21, Antony Mayi antonym...@yahoo.com wrote: this seems to have sorted it, awesome, thanks for great help.Antony. On Sunday, 11 January 2015, 13:02, Sean Owen so...@cloudera.com wrote: I would expect the size of the user/item feature RDDs to grow linearly with the rank, of course. They are cached, so that would drive cache memory usage on the cluster. This wouldn't cause executors to fail for running out of memory though. In fact, your error does not show the task failing for lack of memory. What it shows is that YARN thinks the task is using a little bit more memory than it said it would, and killed it. This happens sometimes with JVM-based YARN jobs since a JVM configured to use X heap ends up using a bit more than X physical memory if the heap reaches max size. So there's a bit of headroom built in and controlled by spark.yarn.executor.memoryOverhead (http://spark.apache.org/docs/latest/running-on-yarn.html) You can try increasing it to a couple GB. On Sun, Jan 11, 2015 at 9:43 AM, Antony Mayi antonym...@yahoo.com.invalid wrote: the question really is whether this is expected that the memory requirements grow rapidly with the rank... as I would expect memory is rather O(1) problem with dependency only on the size of input data. if this is expected is there any rough formula to determine the required memory based on ALS input and parameters? thanks, Antony. On Saturday, 10 January 2015, 10:47, Antony Mayi antonym...@yahoo.com wrote: the actual case looks like this: * spark 1.1.0 on yarn (cdh 5.2.1) * ~8-10 executors, 36GB phys RAM per host * input RDD is roughly 3GB containing ~150-200M items (and this RDD is made persistent using .cache()) * using pyspark yarn is configured with the limit yarn.nodemanager.resource.memory-mb of 33792 (33GB), spark is set to be: SPARK_EXECUTOR_CORES=6 SPARK_EXECUTOR_INSTANCES=9 SPARK_EXECUTOR_MEMORY=30G when using higher rank (above 20) for ALS.trainImplicit the executor runs after some time (~hour) of execution out of the yarn limit and gets killed: 2015-01-09 17:51:27,130 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=27125,containerID=container_1420871936411_0002_01_23] is running beyond physical memory limits. Current usage: 31.2 GB of 31 GB physical memory used; 34.7 GB of 65.1 GB virtual memory used. Killing container. thanks for any ideas, Antony. On Saturday, 10 January 2015, 10:11, Antony Mayi antonym...@yahoo.com wrote: the memory requirements seem to be rapidly growing hen using higher rank... I am unable to get over 20 without running out of memory. is this expected? thanks, Antony. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Low Level Kafka Consumer for Spark
My code handles the Kafka Consumer part. But writing to Kafka may not be a big challenge which you can easily do in your driver code. dibyendu On Sat, Jan 17, 2015 at 9:43 AM, Debasish Das debasish.da...@gmail.com wrote: Hi Dib, For our usecase I want my spark job1 to read from hdfs/cache and write to kafka queues. Similarly spark job2 should read from kafka queues and write to kafka queues. Is writing to kafka queues from spark job supported in your code ? Thanks Deb On Jan 15, 2015 11:21 PM, Akhil Das ak...@sigmoidanalytics.com wrote: There was a simple example https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45 which you can run after changing few lines of configurations. Thanks Best Regards On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Kidong, Just now I tested the Low Level Consumer with Spark 1.2 and I did not see any issue with Receiver.Store method . It is able to fetch messages form Kafka. Can you cross check other configurations in your setup like Kafka broker IP , topic name, zk host details, consumer id etc. Dib On Fri, Jan 16, 2015 at 11:50 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Kidong, No , I have not tried yet with Spark 1.2 yet. I will try this out and let you know how this goes. By the way, is there any change in Receiver Store method happened in Spark 1.2 ? Regards, Dibyendu On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote: Hi Dibyendu, I am using kafka 0.8.1.1 and spark 1.2.0. After modifying these version of your pom, I have rebuilt your codes. But I have not got any messages from ssc.receiverStream(new KafkaReceiver(_props, i)). I have found, in your codes, all the messages are retrieved correctly, but _receiver.store(_dataBuffer.iterator()) which is spark streaming abstract class's method does not seem to work correctly. Have you tried running your spark streaming kafka consumer with kafka 0.8.1.1 and spark 1.2.0 ? - Kidong. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p21180.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
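For the writing side, a sketch of the usual pattern with the Kafka 0.8 producer API (broker list, topic, and input path are placeholders): create the producer inside foreachPartition so it never has to be serialized into the closure, send each record, and close it when the partition is done:

import java.util.Properties
import kafka.javaapi.producer.Producer
import kafka.producer.{KeyedMessage, ProducerConfig}

val lines = sc.textFile("hdfs:///data/input")   // hypothetical HDFS source, `sc` assumed to exist

lines.foreachPartition { records =>
  val props = new Properties()
  props.put("metadata.broker.list", "broker1:9092,broker2:9092")
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  val producer = new Producer[String, String](new ProducerConfig(props))
  try {
    records.foreach(r => producer.send(new KeyedMessage[String, String]("output-topic", r)))
  } finally {
    producer.close()
  }
}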
Re: Maven out of memory error
I tried the following but still didn't see test output :-(

diff --git a/pom.xml b/pom.xml
index f4466e5..dae2ae8 100644
--- a/pom.xml
+++ b/pom.xml
@@ -1131,6 +1131,7 @@
         <spark.driver.allowMultipleContexts>true</spark.driver.allowMultipleContexts>
       </systemProperties>
       <failIfNoTests>false</failIfNoTests>
+      <redirectTestOutputToFile>true</redirectTestOutputToFile>
     </configuration>
   </plugin>
   <!-- Scalatest runs all Scala tests -->

On Fri, Jan 16, 2015 at 12:41 PM, Ted Yu yuzhih...@gmail.com wrote:
I got the same error:

testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 261.111 sec ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)

Looking under core/target/surefire-reports/ I don't see test output. Trying to figure out how test output can be generated. Cheers

On Fri, Jan 16, 2015 at 12:26 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:
Thanks Ted, got farther along but now have a failing test; is this a known issue?

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.spark.JavaAPISuite
Tests run: 72, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 123.462 sec FAILURE! - in org.apache.spark.JavaAPISuite
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 106.5 sec ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1199)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1188)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1187)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1187)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1399)
  at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1360)
  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
  at akka.actor.ActorCell.invoke(ActorCell.scala:487)
  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
  at akka.dispatch.Mailbox.run(Mailbox.scala:220)
  at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.846 sec - in org.apache.spark.JavaJdbcRDDSuite
Results :
Tests in error:
  JavaAPISuite.testGuavaOptional » Spark Job aborted due to stage failure: Maste...

On Fri, Jan 16, 2015 at 12:06 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you try doing this before running mvn?
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
What OS are you using? Cheers

On Fri, Jan 16, 2015 at 12:03 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:
Just got the latest from Github and tried running `mvn test`; is this error common and do you have any advice on fixing it? Thanks!

[INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 400 Scala sources and 34 Java sources to /home/akm/spark/core/target/scala-2.10/classes...
[WARNING] /home/akm/spark/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala:22: imported
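Putting Ted's two suggestions together, one plausible way to rerun just the failing suite with the larger memory settings is sketched below. The flags follow the Spark build documentation of that era; selecting only org.apache.spark.JavaAPISuite here is an assumption for illustration, not something confirmed in the thread.

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
# Skip the ScalaTest suites and run only the failing Java suite via surefire.
mvn test -DwildcardSuites=none -Dtest=org.apache.spark.JavaAPISuite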
Re: Row similarities
You can use K-means https://spark.apache.org/docs/latest/mllib-clustering.html with a suitably large k. Each cluster should correspond to rows that are similar to one another.

On Fri, Jan 16, 2015 at 5:18 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:
What's a good way to calculate similarities between all vector-rows in a matrix or RDD[Vector]? I see that RowMatrix has a columnSimilarities method, but I'm not sure transposing the matrix in order to run that is a good path.
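To make the K-means suggestion concrete, here is a minimal sketch against the MLlib API of that era, runnable in the Spark shell where sc already exists. The toy vectors and the choice of k are made up for illustration; in practice k has to be large enough that each cluster really does group only near-duplicate rows.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// rows: the RDD[Vector] whose rows we want to group by similarity
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(1.1, 0.1, 1.9),
  Vectors.dense(9.0, 8.0, 7.0)
)).cache()

val k = 2                                  // number of clusters; tune to the data
val model = KMeans.train(rows, k, 20)      // 20 iterations
val assignments = model.predict(rows)      // cluster id per row; similar rows share an id
assignments.collect().foreach(println)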
Futures timed out during unpersist
Hi experts, I got an error while unpersisting an RDD. Any ideas?

java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
  at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
  at scala.concurrent.Await$.result(package.scala:107)
  at org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:103)
  at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:951)
  at org.apache.spark.rdd.RDD.unpersist(RDD.scala:168)
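No reply is recorded here, but two workarounds commonly suggested for this symptom are sketched below, runnable in the Spark shell where sc already exists. Treat them as assumptions rather than a confirmed fix, since the thread does not say which Spark version or how loaded the cluster was.

// Toy cached RDD standing in for whatever the poster was unpersisting.
val rdd = sc.parallelize(1 to 1000000).cache()
rdd.count()                        // materialize the cached blocks

// Non-blocking unpersist: fires the remove request and returns immediately,
// instead of awaiting the BlockManagerMaster acknowledgement that timed out above.
rdd.unpersist(blocking = false)

// If blocking removal is needed, the 30 s in the stack trace matches the default
// Akka ask timeout in Spark 1.x; it can be raised (value in seconds) when the
// context is created, e.g. with --conf spark.akka.askTimeout=120 on spark-submit.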