Re: PySpark script works itself, but fails when called from other script

2013-11-18 Thread Andrei
I've tried adding task.py to pyFiles during SparkContext creation and it worked perfectly. Thanks for your help! If you need some more information for further investigation, here's what I've noticed. Without explicitly adding the file to the SparkContext, only functions that are defined in the main module

How to efficiently manage resources across a cluster and avoid GC overhead exceeded errors?

2013-11-18 Thread ioannis.deligiannis
Hi, I have a cluster of 20 servers, each having 24 cores and 30GB of RAM allocated to Spark. Spark runs in a STANDALONE mode. I am trying to load some 200+GB files and cache the rows using .cache(). What I would like to do is the following: (ATM from the scala console) -Evenly load the files
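The partition-count arithmetic behind a question like this can be sketched in a few lines. This is a back-of-envelope estimate only; the numbers come from the post, and the 2-4x-cores multiplier is a common rule of thumb, not a guaranteed fix for GC overhead errors:

```python
# Rough sizing for the cluster described above: 20 nodes, 24 cores each,
# 30 GB of Spark memory per node, and ~200 GB of input to cache.
nodes = 20
cores_per_node = 24
data_gb = 200

total_cores = nodes * cores_per_node            # 480 tasks can run at once
partitions = total_cores * 3                    # rule of thumb: 2-4x total cores
mb_per_partition = data_gb * 1024 / partitions  # keep cached blocks small

print(total_cores, partitions, round(mb_per_partition))  # 480 1440 142
```

Smaller cached partitions mean shorter-lived allocations per task, which is one common way to keep "GC overhead limit exceeded" at bay.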

Re: App master failed to find application jar in the master branch on YARN

2013-11-18 Thread Tom Graves
Hey Jiacheng Guo, do you have SPARK_EXAMPLES_JAR env variable set?  If you do, you have to add the --addJars parameter to the yarn client and point to the spark examples jar.  Or just unset SPARK_EXAMPLES_JAR env variable. You should only have to set SPARK_JAR env variable.   If that isn't

TeraSort on Spark

2013-11-18 Thread Rivera, Dario
Hello spark community. I wanted to ask if any work has been done on porting TeraSort (Tera Gen/Sort/Validate) from Hadoop to Spark on EC2/EMR. I am looking for some guidance on lessons learned from this or similar efforts, as we are trying to do some benchmarking on some of the newer EC2 instances

Re: foreachPartition in Java

2013-11-18 Thread Yadid Ayzenberg
Great, I will use mapPartitions instead. Thanks for the advice, Yadid On 11/17/13 8:13 PM, Aaron Davidson wrote: Also, in general, you can workaround shortcomings in the Java API by converting to a Scala RDD (using JavaRDD's rdd() method). The API tends to be much clunkier since you have to
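The mapPartitions workaround discussed here can be sketched with plain Python iterators (no Spark needed; the names are illustrative): the point is to pay any per-partition setup cost once, not once per record.

```python
def process_partition(records):
    # Imagine an expensive setup here (e.g. opening a DB connection),
    # performed once per partition rather than once per element.
    conn = "connection"
    for r in records:
        yield f"{conn}:{r}"

# Two fake "partitions" standing in for an RDD's partitions:
partitions = [iter([1, 2]), iter([3, 4, 5])]
out = [x for part in partitions for x in process_partition(part)]
print(out)  # ['connection:1', 'connection:2', 'connection:3', 'connection:4', 'connection:5']
```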

Fast Data Processing with Spark

2013-11-18 Thread R. Revert
Hello. Has anyone read the Fast Data Processing with Spark book ( http://www.amazon.com/Fast-Processing-Spark-Holden-Karau/dp/1782167064/ref=sr_1_1?ie=UTF8&qid=1384791032&sr=8-1&keywords=fast+spark+data+processing )? Any reviews or opinions about the material? I'm thinking of buying the book

Re: App master failed to find application jar in the master branch on YARN

2013-11-18 Thread guojc
Hi Tom, I'm on Hadoop 2.0.5. I can launch a Spark 0.8 release application normally. However, when I switch to the git master branch, with the application built against it, I get the jar-not-found exception, and the same happens with the example application. I have tried both the file:// protocol and the hdfs:// protocol

Re: code review - splitting columns

2013-11-18 Thread Tom Vacek
This is in response to your question about something in the API that already does this. You might want to keep your eye on MLI ( http://www.mlbase.org), which is columnar table written for machine learning but applicable to a lot of problems. It's not perfect right now. On Fri, Nov 15, 2013 at

Re: Spark Avro in Scala

2013-11-18 Thread Matt Massie
Agree with Eugen that you should use Kryo. But even better is to embed your Avro objects inside of Kryo. This allows you to have the benefits of both Avro and Kryo. Here's example code for using Avro with Kryo.
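The example code referenced in this message is truncated in the digest. As a stand-in, here is the general shape of the idea: an outer serializer that delegates the actual byte encoding of records to an inner framework. Pickle plays the inner role purely so the sketch is self-contained; the real version registers a Kryo Serializer whose write/read methods call Avro's SpecificDatumWriter/SpecificDatumReader.

```python
import pickle

class DelegatingSerializer:
    """Outer serializer (Kryo's role) that hands the byte encoding of
    records to an inner framework (Avro's role; pickle as a stand-in)."""
    def write(self, obj):
        return pickle.dumps(obj)    # real version: Avro binary encoding
    def read(self, data):
        return pickle.loads(data)   # real version: Avro binary decoding

ser = DelegatingSerializer()
record = {"name": "alice", "age": 42}
print(ser.read(ser.write(record)) == record)  # True
```

The benefit of the layering is that Kryo handles object-graph plumbing while Avro provides a compact, schema'd encoding for the record payloads themselves.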

debugging a Spark error

2013-11-18 Thread Chris Grier
Hi, I'm trying to figure out what the problem is with a job that we are running on Spark 0.7.3. When we write out via saveAsTextFile we get an exception that doesn't reveal much: 13/11/18 15:06:19 INFO cluster.TaskSetManager: Loss was due to java.io.IOException java.io.IOException: Map failed

Re: DataFrame RDDs

2013-11-18 Thread andy petrella
Maybe I'm wrong, but this use case could be a good fit for Shapeless' records ( https://github.com/milessabin/shapeless ). Shapeless' records are like, so to say, Lisp's records, but typed! In that sense, they're closer to Haskell's record notation, but imho less powerful, since the access will be

EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Matt Cheah
Hi, I'm working with an infrastructure that already has its own web server set up on EC2. I would like to set up a separate spark cluster on EC2 with the scripts and have the web server submit jobs to this spark cluster. Is it possible to do this? I'm getting some errors running the spark

Can not get the expected output when running the BroadcastTest example program.

2013-11-18 Thread 杨强
Hi, all. I'm using spark-0.8.0-incubating. I tried the example BroadcastTest in local mode. ./run-example org.apache.spark.examples.BroadcastTest local 1 2>/dev/null This works fine and gets the result: Iteration 0 === 100 100 100 100 100 100 100 100 100

Re: debugging a Spark error

2013-11-18 Thread Aaron Davidson
Have you looked at the Spark executor logs? They're usually located in the $SPARK_HOME/work/ directory. If you're running in a cluster, they'll be on the individual slave nodes. These should hopefully reveal more information. On Mon, Nov 18, 2013 at 3:42 PM, Chris Grier

Re: EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Aaron Davidson
The main issue with running a spark-shell locally is that it orchestrates the actual computation, so you want it to be close to the actual Worker nodes for latency reasons. Running a spark-shell on EC2 in the same region as the Spark cluster avoids this problem. The error you're seeing seems to

Re: Joining files

2013-11-18 Thread Something Something
Was my question so dumb? Or, is this not a good use case for Spark? On Sun, Nov 17, 2013 at 11:41 PM, Something Something mailinglist...@gmail.com wrote: I am a newbie to both Spark and Scala, but I've been working with Hadoop/Pig for quite some time. We have quite a few ETL processes running

Re: DataFrame RDDs

2013-11-18 Thread Matei Zaharia
Interesting idea — in Scala you can also use the Dynamic type (http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to allow dynamic properties. It has the same potential pitfalls as string names, but with nicer syntax. Matei On Nov 18, 2013, at 3:45 PM, andy petrella
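The Scala Dynamic idea Matei describes has a close Python analogue, sketched below with illustrative names: attribute access is routed through `__getattr__`, so columns read as `row.name`, with the same pitfall he notes: a mistyped name fails only at runtime.

```python
class Row:
    """Column access by attribute, backed by string keys underneath."""
    def __init__(self, fields):
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)  # typo in a column name -> runtime error

row = Row({"name": "alice", "age": 42})
print(row.name, row.age)  # alice 42
```

In Scala, the equivalent is a class extending `scala.Dynamic` with a `selectDynamic(name: String)` method, which the compiler rewrites `row.name` into.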