Joining files

2013-11-17 Thread Something Something
I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig for quite some time. We have quite a few ETL processes running in production that use Pig, and now we're evaluating Spark to see if they would indeed run faster. A very common use case in our Pig scripts is joining a file con
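For reference, Pig's JOIN maps fairly directly onto Spark's pair-RDD join(). A minimal Scala sketch of that kind of use case, with hypothetical paths and field layouts:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicit conversions that add join() to pair RDDs

    // Hypothetical inputs: tab-separated files keyed on their first column.
    val sc = new SparkContext("spark://master:7077", "join-example")
    val users  = sc.textFile("hdfs:///data/users.tsv")
      .map { line => val f = line.split("\t"); (f(0), f(1)) }   // (userId, name)
    val visits = sc.textFile("hdfs:///data/visits.tsv")
      .map { line => val f = line.split("\t"); (f(0), f(1)) }   // (userId, url)

    // Roughly the equivalent of Pig's: JOIN users BY userId, visits BY userId
    val joined = users.join(visits)                             // (userId, (name, url))
    joined.saveAsTextFile("hdfs:///out/joined")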

configuring final partition length

2013-11-17 Thread Umar Javed
I'm using pyspark. I was wondering how to modify the number of partitions for the result task (reduce in my case). I'm running Spark on a cluster of two machines (each with 16 cores). Here's the relevant log output for my result stage: 13/11/17 23:16:47 INFO SparkContext: time: 18851958895218046 *
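The number of partitions in the result stage is the number of reduce tasks, and Spark's shuffle operations take it as an optional argument. A minimal sketch against the Scala API (pyspark's reduceByKey accepts the same optional numPartitions argument), assuming an existing SparkContext sc and a hypothetical input:

    // Ask for 32 reduce-side partitions instead of the default.
    val pairs  = sc.textFile("hdfs:///data/words.txt").flatMap(_.split(" ")).map(w => (w, 1))
    val counts = pairs.reduceByKey(_ + _, 32)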

Re: failure notice

2013-11-17 Thread Aaron Davidson
Great, thanks for the update! Your ant version matches mine, though I can't reproduce the error. Weird. I have created SPARK-959 to track this issue and PR #183 to hopefully solve the issue in

Re: failure notice

2013-11-17 Thread Egon Kidmose
I applied the first approach in $SPARK_HOME/project/SparkBuild.scala (.scala, not .sbt, as the latter wasn't present). This solved the issue. Thanks for your help! ~/Downloads/spark-0.8.0-incubating $ ant -v Apache Ant(TM) version 1.8.2 compiled on May 18 2012 Mvh/BR Egon Kidmose On Sun, Nov 17,

interesting question on quora

2013-11-17 Thread jamal sasha
I found this interesting question on Quora and thought of sharing it here. https://www.quora.com/Apache-Hadoop/Will-spark-ever-overtake-hadoop So.. is Spark missing any capability?

Re: failure notice

2013-11-17 Thread Aaron Davidson
Could you report your ant/Ivy version? Just run ant -version. The fundamental problem is that Ivy is stupidly thinking ".orbit" is the file extension when it should be ".jar". There are two possible fixes you can try; please let us know if one or the other works. In $SPARK_HOME/project/SparkBui
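A hedged sketch of the kind of SparkBuild.scala override being discussed (reconstructed, not quoted from the truncated message; the version string is illustrative): pin the artifact's type and extension so Ivy resolves it as a plain jar.

    // In $SPARK_HOME/project/SparkBuild.scala, replace the plain dependency on
    // the jetty-orbit servlet artifact with one that forces the artifact
    // type/extension to "jar", so Ivy stops treating ".orbit" as the extension:
    "org.eclipse.jetty.orbit" % "javax.servlet" % "2.5.0.v201103041518" artifacts (
      Artifact("javax.servlet", "jar", "jar")
    )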

Re: How to add more worker node to spark cluster on EC2

2013-11-17 Thread Wisc Forum
Thank you, I will try that :) On Nov 17, 2013, at 7:06 PM, Aaron Davidson wrote: > Hi Xiaobing, > > At its heart, this is a very easy thing to do. Instead of the master reaching > out to the workers, the worker just needs to find the master. In standalone > mode, this can be accomplished simp

Re: number of splits for standalone cluster mode

2013-11-17 Thread Aaron Davidson
The number of splits can be configured when reading the file, as an argument to textFile(), sequenceFile(), etc (see docs). Note that this is a minimum, however, as cert
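Concretely (assuming an existing SparkContext sc; the path and count are hypothetical):

    // Request at least 64 input splits; Spark may create more, since files
    // are split along Hadoop input-split (e.g. HDFS block) boundaries.
    val rdd = sc.textFile("hdfs:///data/input", 64)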

Re: failure notice

2013-11-17 Thread Egon Kidmose
Hi All, I'm trying to get started with Spark, but following the quick start guide I encounter an error. As per the documentation I run "sbt/sbt assembly" from the root folder, but I get an error downloading javax.servlet.orbit (see below). It's the same result for 0.8.0 and what's on GitHub.

Re: foreachPartition in Java

2013-11-17 Thread Aaron Davidson
Also, in general, you can work around shortcomings in the Java API by converting to a Scala RDD (using JavaRDD's rdd() method). The API tends to be much clunkier, since you have to jump through some hoops to talk to a Scala API in Java, though. In this case, JavaRDD's mapPartitions() method will likel
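A small Scala-side sketch of that workaround (the connection class is hypothetical): once you have the Scala RDD from rdd(), its foreachPartition is available directly.

    // One expensive, non-serializable connection per partition, created on the worker.
    rdd.foreachPartition { partition =>
      val conn = new ExpensiveDbConnection()  // hypothetical class
      partition.foreach(conn.write)
      conn.close()
    }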

Re: How to add more worker node to spark cluster on EC2

2013-11-17 Thread Aaron Davidson
Hi Xiaobing, At its heart, this is a very easy thing to do. Instead of the master reaching out to the workers, the worker just needs to find the master. In standalone mode, this can be accomplished simply by setting the SPARK_MASTER_IP/_PORT variables in spark-env.sh. In order to make the other s
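A rough sketch of that setup (hostname and port are placeholders): on each new worker, point conf/spark-env.sh at the existing master, then start the worker daemon on that machine so it registers itself.

    # conf/spark-env.sh on the new worker node
    export SPARK_MASTER_IP=ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    export SPARK_MASTER_PORT=7077

    # then start a worker that connects to that master:
    ./spark-class org.apache.spark.deploy.worker.Worker spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT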

RE: SPARK + YARN the general case

2013-11-17 Thread Liu, Raymond
Well, with the #101 pull request merged, Shark on YARN and Spark Streaming on YARN should both work. I verified both of these working modes with simple test cases when I submitted the #101 request, though the code has changed a lot since then. There might be a few little things left to fix. Best Regards, Raymond

Re: foreachPartition in Java

2013-11-17 Thread Patrick Wendell
Can you just call mapPartitions and ignore the result? - Patrick On Sun, Nov 17, 2013 at 4:45 PM, Yadid Ayzenberg wrote: > Hi, > > According to the API, foreachPartition() is not yet implemented in Java. > Are there any workarounds to get the same functionality ? > I have a non serializable DB c
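Patrick's variant, sketched against the Scala API (Java's JavaRDD.mapPartitions is analogous; rdd and doWork are placeholders):

    // Do the per-partition work, return an empty iterator, and run an action
    // such as count() to force the otherwise-lazy mapPartitions to execute.
    rdd.mapPartitions { partition =>
      partition.foreach(doWork)
      Iterator[Unit]()   // nothing to return; the result is ignored
    }.count()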

foreachPartition in Java

2013-11-17 Thread Yadid Ayzenberg
Hi, According to the API, foreachPartition() is not yet implemented in Java. Are there any workarounds to get the same functionality? I have a non-serializable DB connection and instantiating it is pretty expensive, so I prefer to do it on a per-partition basis. thanks, Yadid

number of splits for standalone cluster mode

2013-11-17 Thread Umar Javed
Hi, When running Spark in standalone cluster mode, is there a way to configure the number of splits for the input file(s)? It seems to be approximately 32 MB per core by default. Is that correct? For example, in my cluster there are two workers, each running on a machine with two cor

How to add more worker node to spark cluster on EC2

2013-11-17 Thread Wisc Forum
Hi, I have a job that runs on Spark on EC2. The cluster currently contains 1 master node and 2 worker nodes. I am planning to add several more worker nodes to the cluster. How should I do that so that the master node knows about the new worker nodes? I couldn't find documentation on it on Spark's site