For performance, does Spark prefer Oracle JDK or OpenJDK?

2014-05-19 Thread Hao Wang
Hi, between Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao (王灏), CloudTeam | School of Software Engineering, Shanghai Jiao Tong University. Address: 800 Dongchuan Road, Minhang District, Shanghai, 200240. Email: wh.s...@gmail.com

Re: For performance, does Spark prefer Oracle JDK or OpenJDK?

2014-05-19 Thread Gordon Wang
I would like to say that Oracle JDK may be the better choice. A lot of Hadoop distribution vendors use Oracle JDK instead of OpenJDK for enterprise use. On Mon, May 19, 2014 at 2:50 PM, Hao Wang wh.s...@gmail.com wrote: Hi, between Oracle JDK and OpenJDK, which one is better or preferred for Spark?

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
BTW, is there a command or script to update the slaves from the master? Thanks, Daniel. On Mon, May 19, 2014 at 1:48 AM, Andrew Ash and...@andrewash.com wrote: If the codebase for Spark's broadcast is pretty self-contained, you could consider creating a small bootstrap sent out via the doubling

persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Hi all, when I set the persist level to DISK_ONLY, Spark still tries to use memory and caches. Any reason? Do I need to override some parameter elsewhere? Thanks!
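
For reference, the call in question looks roughly like this (a minimal sketch; the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    // Ask Spark to keep the RDD's blocks on disk only, never in memory.
    val rdd = sc.textFile("hdfs:///data/input.txt").persist(StorageLevel.DISK_ONLY)
    rdd.count() // force computation so the blocks actually get persisted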

Re: Packaging a spark job using maven

2014-05-19 Thread Laurent T
Hi Eugen, thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Also, do I have to specify all the artifacts and sub-artifacts in the artifactSet? Or can I just use a *:* wildcard and let the Maven scopes do their work? I have a

Re: Packaging a spark job using maven

2014-05-19 Thread Eugen Cepoi
2014-05-19 10:35 GMT+02:00 Laurent T laurent.thou...@ldmobile.net: Hi Eugen, thanks for your help. I'm not familiar with the shade plugin and I was wondering: does it replace the assembly plugin? Nope, it doesn't replace it. It allows you to make fat jars and other nice things such as

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Daniel Mahler
On Mon, May 19, 2014 at 2:04 AM, Daniel Mahler dmah...@gmail.com wrote: I agree that rsync is probably preferable for updating, and it seems like it would also parallelize well for that purpose, since most of the time is spent computing checksums, so the process is not constrained by the total

Re: File present but file not found exception

2014-05-19 Thread Koert Kuipers
Why does it need to be a local file? Why not do some filter ops on the HDFS file and save to HDFS, from where you can create an RDD? You can read a small file in on the driver program and use sc.parallelize to turn it into an RDD. On May 16, 2014 7:01 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: I found that
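
A minimal sketch of the approach Koert describes, assuming the small file lives on the driver's local filesystem (the path is hypothetical):

    import scala.io.Source

    // Read the small file on the driver...
    val lines = Source.fromFile("/tmp/small-file.txt").getLines().toSeq
    // ...then turn it into an RDD the cluster can work with.
    val rdd = sc.parallelize(lines)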

specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Hi, I am working with a Cloudera 5 cluster with 192 nodes and can’t work out how to get the Spark REPL to use more than 2 nodes in an interactive session. So, this works, but is non-interactive (using yarn-client as MASTER)

Re: Yarn configuration file doesn't work when run with yarn-client mode

2014-05-19 Thread Arun Ahuja
I am encountering the same thing. Basic YARN apps work, as does the SparkPi example, but my custom application gives this result. I am using compute-classpath to create the proper classpath for my application, same as with SparkPi - was there a resolution to this issue? Thanks, Arun On Wed, Feb

Re: persist @ disk-only failing

2014-05-19 Thread Matei Zaharia
This is the patch for it: https://github.com/apache/spark/pull/50/. It might be possible to backport it to 0.8. Matei On May 19, 2014, at 2:04 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Matei, I am using 0.8.1! But is there a way to bypass the cache without moving to 0.9.1? On

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Sandy Ryza
Hi Eric, have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html -Sandy On Mon, May 19, 2014 at 8:08 AM, Eric Friedman e...@spottedsnake.net wrote: Hi, I am working with a Cloudera 5 cluster with 192

Spark Streaming and Shark | Streaming Taking All CPUs

2014-05-19 Thread anishs...@yahoo.co.in
Hi all, I am new to Spark. I was trying to use Spark Streaming and Shark at the same time. I was receiving messages from Kafka and pushing them to HDFS after minor processing. It was working fine, but it was taking all the CPUs, and at the same time on another terminal I tried to access Shark
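
One common remedy here, assuming a standalone cluster, is to cap the cores the streaming application may claim so other applications such as Shark can still get executors; the app name and core count below are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Limit this application to 4 cores cluster-wide (standalone mode).
    val conf = new SparkConf()
      .setAppName("kafka-to-hdfs") // hypothetical app name
      .set("spark.cores.max", "4")
    val sc = new SparkContext(conf)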

Re: unsubscribe

2014-05-19 Thread Marcelo Vanzin
Hey Andrew, since we're seeing so many of these e-mails, I think it's worth pointing out that it's not really obvious where to find unsubscription information for the lists. The community link on the Spark site (http://spark.apache.org/community.html) does not have instructions for unsubscribing; it

spark-ec2 command-line tool error: VPC security groups may not be used for a non-VPC launch

2014-05-19 Thread Matt Work Coarr
Hi, I'm attempting to run spark-ec2 launch on AWS. My AWS instances would be in our EC2 VPC (which seems to be causing a problem). The two security groups MyClusterName-master and MyClusterName-slaves have already been set up with the same ports open as the security group that spark-ec2 tries to

Re: unsubscribe

2014-05-19 Thread Andrew Ash
Agreed that the links to the archives should probably point to the Apache archives rather than Nabble's, so the unsubscribe documentation is clearer. Also, an (unsubscribe) link right next to the subscribe one, with the email already generated, could help too. I'd be one of those highly against a footer on

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Aaron Davidson
On the EC2 machines, you can update the slaves from the master using something like ~/spark-ec2/copy-dir ~/spark. Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract. On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler dmah...@gmail.com

Re: sync master with slaves with bittorrent?

2014-05-19 Thread Mosharaf Chowdhury
Good catch. In that case, using BitTornado/murder would be better. -- Mosharaf Chowdhury http://www.mosharaf.com/ On Mon, May 19, 2014 at 11:17 AM, Aaron Davidson ilike...@gmail.com wrote: On the ec2 machines, you can update the slaves from the master using something like

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui. But I did not find the directory examples/src/main/scala/org/apache/spark/examples/mllib. Could you give me more detail or show me an example? Thanks a lot.

Which component(s) of Spark do not support IPv6?

2014-05-19 Thread queniesun
Besides Hadoop, are there any other components of Spark that do not support IPv6?

Re: specifying worker nodes when using the repl?

2014-05-19 Thread Eric Friedman
Sandy, thank you so much — that was indeed my omission! Eric On May 19, 2014, at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Eric, Have you tried setting the SPARK_WORKER_INSTANCES env variable before running spark-shell? http://spark.apache.org/docs/0.9.0/running-on-yarn.html

Re: How to compile the examples directory?

2014-05-19 Thread Matei Zaharia
If you’d like to work on just this code for your own changes, it might be best to copy it to a separate project. Look at http://spark.apache.org/docs/latest/quick-start.html for how to set up a standalone job. Matei On May 19, 2014, at 4:53 AM, Hao Wang wh.s...@gmail.com wrote: Hi, I am

combineByKey throws ClassCastException

2014-05-19 Thread xiemeilong
I am using CDH5 on a three-machine cluster. I map data from HBase as (String, V) pairs, then call combineByKey like this:

    .combineByKey[C](
      (v: V) => new C(v), // this line throws java.lang.ClassCastException: C cannot be cast to V
      (c: C, v: V) => C,
      (c1: C, c2: C) => C)

I am very
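
For comparison, here is a self-contained combineByKey call with concrete types (String keys, Int values combined into a List[Int]); the thread's V and C are stand-ins, so this only sketches the expected signatures:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val combined = pairs.combineByKey[List[Int]](
      (v: Int) => List(v),                         // createCombiner: V => C
      (c: List[Int], v: Int) => v :: c,            // mergeValue: (C, V) => C
      (c1: List[Int], c2: List[Int]) => c1 ::: c2  // mergeCombiners: (C, C) => C
    )
    // combined.collect() yields ("a", List(2, 1)) and ("b", List(3)), in some order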

Re: advice on maintaining a production spark cluster?

2014-05-19 Thread Matei Zaharia
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com wrote: Hey folks, I'm wondering what strategies other folks

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Xiangrui. Sorry, I am new to Spark; could you give me more detail about master or branch-1.0? I do not know what master or branch-1.0 is. Thanks again. On Mon, May 19, 2014 at 10:17 PM, Xiangrui Meng [via Apache Spark User List] ml-node+s1001560n6064...@n3.nabble.com wrote:

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread Andrew Ash
Hi yxzhao, those are branches in the source code git repository. You can get to them with git checkout branch-1.0 once you've cloned the git repository. Cheers, Andrew On Mon, May 19, 2014 at 8:30 PM, yxzhao yxz...@ualr.edu wrote: Thanks Xiangrui. Sorry, I am new to Spark; could

Spark stalling during shuffle (maybe a memory issue)

2014-05-19 Thread jonathan.keebler
Has anyone observed Spark worker threads stalling during a shuffle phase with the following message (one per worker host) being echoed to the terminal on the driver thread? INFO spark.MapOutputTrackerActor: Asked to send map output locations for shuffle 0 to [worker host]... At this point

Re: Reading from .bz2 files with Spark

2014-05-19 Thread Andrew Ash
Hi Xiangrui, many thanks to you and Sandy for fixing this issue! On Fri, May 16, 2014 at 10:23 PM, Xiangrui Meng men...@gmail.com wrote: Hi Andrew, I submitted a patch and verified it solves the problem. You can download the patch from https://issues.apache.org/jira/browse/HADOOP-10614 .

life of an executor

2014-05-19 Thread Koert Kuipers
From looking at the source code I see executors run in their own JVM subprocesses. How long do they live for? As long as the worker/slave? Or are they tied to the SparkContext and live/die with it? Thx

Re: How to run the SVM and LogisticRegression

2014-05-19 Thread yxzhao
Thanks Andrew. Yes, I have downloaded the master code. But actually I just want to know how to run the classification algorithms SVM and LogisticRegression implemented under /spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification. Thanks. On Mon,
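
A minimal sketch of calling those classifiers from the 0.9-era MLlib API; the input path and the loadLabeledData format (one label plus space-separated features per line) are assumptions:

    import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
    import org.apache.spark.mllib.util.MLUtils

    // Load labeled training data from a hypothetical HDFS path.
    val data = MLUtils.loadLabeledData(sc, "hdfs:///data/training.txt")
    // Train each classifier for 100 iterations of SGD.
    val lrModel = LogisticRegressionWithSGD.train(data, 100)
    val svmModel = SVMWithSGD.train(data, 100)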

Re: Problem when sorting big file

2014-05-19 Thread Andrew Ash
Is your RDD of Strings? If so, you should make sure to use the Kryo serializer instead of the default Java one. It stores strings as UTF-8 rather than Java's default UTF-16 representation, which can save you half the memory usage in the right situation. Try setting the persistence level on the
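
A minimal sketch of enabling Kryo, per Andrew's suggestion (the app name is hypothetical; the property and class names are the standard Spark ones):

    import org.apache.spark.{SparkConf, SparkContext}

    // Use Kryo instead of Java serialization for shuffled and cached data.
    val conf = new SparkConf()
      .setAppName("big-sort") // hypothetical app name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)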

facebook data mining with Spark

2014-05-19 Thread Joe L
Is there any way to get Facebook data into Spark and filter its content?

Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi, how does one submit a Spark job to YARN and specify a queue? The code that successfully submits to YARN is:

    val conf = new SparkConf()
    val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this...
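
One common way to do this, assuming a Spark version whose YARN support reads the spark.yarn.queue property, is to set it on the SparkConf; the queue name below is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Simple App")
      .set("spark.yarn.queue", "analytics") // hypothetical queue name
    val sc = new SparkContext(conf)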

Re: life of an executor

2014-05-19 Thread Matei Zaharia
They’re tied to the SparkContext (application) that launched them. Matei On May 19, 2014, at 8:44 PM, Koert Kuipers ko...@tresata.com wrote: From looking at the source code I see executors run in their own JVM subprocesses. How long do they live for? As long as the worker/slave? Or are

Re: filling missing values in a sequence

2014-05-19 Thread Mohit Jaggi
Thanks Sean. Yes, your solution works :-) I did oversimplify my real problem, which has other parameters that go along with the sequence. On Fri, May 16, 2014 at 3:03 AM, Sean Owen so...@cloudera.com wrote: Not sure if this is feasible, but this literally does what I think you are describing:

Re: life of an executor

2014-05-19 Thread Mohit Jaggi
I guess it needs to be this way to benefit from caching of RDDs in memory. It would be nice, however, if the RDD cache could be dissociated from the JVM heap, so that in cases where garbage collection is difficult to tune, one could choose to discard the JVM and run the next operation in a fresh one.

Re: life of an executor

2014-05-19 Thread Tathagata Das
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com
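
A minimal sketch of what TD describes, assuming a Spark build in which StorageLevel.OFF_HEAP is backed by a running Tachyon instance (the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    // Cache the RDD off-heap in Tachyon so an executor JVM restart does not lose it.
    val cached = sc.textFile("hdfs:///data/input.txt").persist(StorageLevel.OFF_HEAP)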