Re: Error when run Spark on mesos

2014-04-03 Thread panfei
After upgrading to 0.9.1, everything goes well now. Thanks for the reply. 2014-04-03 13:47 GMT+08:00 andy petrella andy.petre...@gmail.com: Hello, It's indeed due to a known bug, but using another IP for the driver won't be enough (other problems will pop up). An easy solution would be to

How to stop system info output in spark shell

2014-04-03 Thread weida xu
Hi, all. When I start Spark in the shell, it automatically outputs some system info every minute; see below. Can I stop or block the output of this info? I tried the :silent command, but the automatic output remains. 14/04/03 19:34:30 INFO MetadataCleaner: Ran metadata cleaner for
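For reference, the :silent command only suppresses the REPL's result printing, not Spark's logging. A common way to quiet these periodic INFO messages in this era of Spark is to raise the console log level in conf/log4j.properties — a sketch (the appender layout shown is the stock Spark one, not something the thread confirms):

```
# conf/log4j.properties — raise the console log level from INFO to WARN
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Copying conf/log4j.properties.template to conf/log4j.properties and editing the rootCategory line is usually enough.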

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is
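The explicit-persistence half of the question maps to RDD.persist with a disk-backed StorageLevel. A minimal sketch, assuming a local context and a placeholder key/value dataset (names are illustrative, not from the thread):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local", "disk-usage-demo")
val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

// MEMORY_AND_DISK spills partitions to disk when they do not fit in memory;
// DISK_ONLY would keep them on disk exclusively.
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)
grouped.count()
```

The shuffle side of groupByKey writes intermediate map output to local disk regardless of the persist call; persist only governs where the resulting RDD's partitions live.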

Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
You can find here a gist that illustrates this issue https://gist.github.com/jrabary/9953562 I got this with spark from master branch. On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash and...@andrewash.com wrote: Is this spark 0.9.0? Try setting spark.shuffle.spill=false There was a hash collision
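For reference, the workaround Andrew suggests is set as a system property before the SparkContext is created (the 0.9.x configuration style) — a sketch with an illustrative app name:

```scala
import org.apache.spark.SparkContext

// 0.9.x-style configuration: disable shuffle spilling to work around
// the hash-collision bug mentioned above.
System.setProperty("spark.shuffle.spill", "false")
val sc = new SparkContext("local", "cartesian-repro")
```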

Re: Avro serialization

2014-04-03 Thread FRANK AUSTIN NOTHAFT
We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala Also, Matt Massie has a good blog post
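The linked ADAM registrator follows the standard Spark Kryo pattern. A simplified sketch of that pattern, assuming an Avro-generated SpecificRecord class here called MyRecord (the class names are illustrative, not ADAM's actual code):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register Avro-generated classes with Kryo; a custom Serializer
// (as ADAM does for SpecificRecords) would be passed as a second
// argument to kryo.register.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])
  }
}

// Enable it (0.9.x-style properties, set before creating the SparkContext):
System.setProperty("spark.serializer",
  "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "MyKryoRegistrator")
```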

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
This is great news thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git rather than trying to pull it out of JobLogger or creating my own SparkListener. On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along and can be a real pain when trying to maintain the level of backwards compatibility that

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
Indeed, that's how Mesos works. So the tarball just has to be somewhere accessible by the Mesos slaves; that's why it is often put in HDFS. On 3 Apr 2014 18:46, felix cnwe...@gmail.com wrote: So, if I set this parameter, there is no need to copy the spark tarball to every mesos
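In conf/spark-env.sh this looks roughly like the following (the HDFS path and library location are illustrative placeholders, not from the thread):

```shell
# conf/spark-env.sh — point Mesos slaves at a tarball they can all reach,
# e.g. one uploaded to HDFS:
export SPARK_EXECUTOR_URI=hdfs://namenode:9000/spark/spark-0.9.1.tar.gz
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
```

Each Mesos slave downloads and unpacks the tarball from that URI when it launches an executor, so no per-node Spark install is needed.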

Spark 1.0.0 release plan

2014-04-03 Thread Bhaskar Dutta
Hi, Is there any change in the release plan for Spark 1.0.0-rc1 release date from what is listed in the Proposal for Spark Release Strategy thread? == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 Thanks,

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-03 Thread Kevin Markey
We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. Yarn but "client" vs. "standalone" and where the client is located on the network relative to the cluster. The "client" options

Re: Spark 1.0.0 release plan

2014-04-03 Thread Matei Zaharia
Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-03 Thread Vipul Pandey
Any word on this one? On Apr 2, 2014, at 12:26 AM, Vipul Pandey vipan...@gmail.com wrote: I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus generated also has both shaded and real versions of the protobuf classes Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv

Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Hey, Does somebody know the kinds of dependencies that the new SQL operators produce? I’m specifically interested in the relational join operation as it seems substantially more optimized. The old join was narrow on two RDDs with the same partitioner. Is the relational join narrow as well?

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Michael Armbrust
I'm sorry, but I don't really understand what you mean when you say wide in this context. For a HashJoin, the only dependencies of the produced RDD are the two input RDDs. For BroadcastNestedLoopJoin The only dependence will be on the streamed RDD. The other RDD will be distributed to all

Re: Optimal Server Design for Spark

2014-04-03 Thread Matei Zaharia
To run multiple workers with Spark’s standalone mode, set SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES in conf/spark-env.sh. For example, if you have 16 cores and want 2 workers, you could add export SPARK_WORKER_INSTANCES=2 export SPARK_WORKER_CORES=8 Matei On Apr 3, 2014, at 12:38 PM,
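Matei's example, as it would appear in conf/spark-env.sh (splitting a 16-core machine across two workers):

```shell
# conf/spark-env.sh — two standalone workers per node, 8 cores each
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=8
```

With multiple workers per node it is also worth bounding SPARK_WORKER_MEMORY so the instances do not oversubscribe the machine's RAM.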

Re: Optimal Server Design for Spark

2014-04-03 Thread Debasish Das
@Mayur... I am hitting ulimits on the cluster if I go beyond 4 cores per worker, and I don't think I can change the ulimit due to sudo issues etc. If I have more workers, in ALS I can go for 20 blocks (right now I am running 10 blocks on 10 nodes with 4 cores each, and now I can go up to 20 blocks