Re: PySpark script works by itself, but fails when called from another script

2013-11-18 Thread Andrei
I've tried adding task.py to pyFiles during SparkContext creation and it worked perfectly. Thanks for your help! If you need some more information for further investigation, here's what I've noticed: without explicitly adding the file to the SparkContext, only functions that are defined in the main module run
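For reference, the Scala-side analogue of pyFiles is the jars argument of the SparkContext constructor, which ships code to every worker at context creation. A minimal sketch, with all paths and URLs hypothetical:

    import org.apache.spark.SparkContext

    // A sketch: ship task.jar (hypothetical) to every worker at context
    // creation, just as pyFiles ships task.py in PySpark.
    val sc = new SparkContext(
      "spark://master:7077",      // hypothetical master URL
      "MyApp",
      "/path/to/spark",           // SPARK_HOME on the workers
      Seq("/path/to/task.jar"))   // code shipped to each worker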

How to efficiently manage resources across a cluster and avoid "GC overhead limit exceeded" errors?

2013-11-18 Thread ioannis.deligiannis
Hi, I have a cluster of 20 servers, each having 24 cores and 30GB of RAM allocated to Spark. Spark runs in STANDALONE mode. I am trying to load some 200+GB files and cache the rows using ".cache()". What I would like to do is the following (at the moment from the Scala console): -Evenly load the files a
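One common mitigation for the GC overhead error is to cache in serialized form, so the JVM traces one byte buffer per block instead of millions of small row objects, at the cost of deserialization on access. A minimal sketch, with a hypothetical path:

    import org.apache.spark.storage.StorageLevel

    // A sketch: persist serialized instead of as live objects to cut
    // GC pressure; count() forces the load so the cache fills up front.
    val rows = sc.textFile("hdfs:///data/big/*")   // hypothetical path
    rows.persist(StorageLevel.MEMORY_ONLY_SER)
    rows.count()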

Re: interesting question on quora

2013-11-18 Thread Koert Kuipers
The core of Hadoop is currently HDFS + MapReduce. The more appropriate question is whether it will become HDFS + Spark. So will Spark overtake MapReduce as the dominant computational engine? It's a very serious candidate for that, I think. It can do many things MapReduce cannot do, and has an awesome API.

Re: App master failed to find application jar in the master branch on YARN

2013-11-18 Thread Tom Graves
Hey Jiacheng Guo, do you have the SPARK_EXAMPLES_JAR env variable set? If you do, you have to add the --addJars parameter to the YARN client and point it to the Spark examples jar. Or just unset the SPARK_EXAMPLES_JAR env variable. You should only have to set the SPARK_JAR env variable. If that isn't the

TeraSort on Spark

2013-11-18 Thread Rivera, Dario
Hello Spark community. I wanted to ask if any work has been done on porting TeraSort (TeraGen/TeraSort/TeraValidate) from Hadoop to Spark on EC2/EMR. I am looking for some guidance on lessons learned from this or similar efforts, as we are trying to do some benchmarking on some of the newer EC2 instances

Re: foreachPartition in Java

2013-11-18 Thread Yadid Ayzenberg
Great, I will use mapPartitions instead. Thanks for the advice, Yadid On 11/17/13 8:13 PM, Aaron Davidson wrote: Also, in general, you can work around shortcomings in the Java API by converting to a Scala RDD (using JavaRDD's rdd() method). The API tends to be much clunkier since you have to j
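For readers landing here, a minimal Scala sketch of the mapPartitions pattern being suggested (from Java, first convert with JavaRDD's rdd() method):

    // A sketch: process each partition's iterator in one pass, emitting
    // one result per partition instead of one per element.
    val rdd = sc.parallelize(1 to 100, 4)
    val partitionSums = rdd.mapPartitions(iter => Iterator(iter.sum))
    partitionSums.collect()   // Array with the sum of each of the 4 partitions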

Fast Data Processing with Spark

2013-11-18 Thread R. Revert
Hello, has anyone read the Fast Data Processing with Spark book ( http://www.amazon.com/Fast-Processing-Spark-Holden-Karau/dp/1782167064/ref=sr_1_1?ie=UTF8&qid=1384791032&sr=8-1&keywords=fast+spark+data+processing )? Any reviews or opinions on the material? I'm thinking of buying the book

Re: App master failed to find application jar in the master branch on YARN

2013-11-18 Thread guojc
Hi Tom, I'm on Hadoop 2.05. I can launch applications built with the Spark 0.8 release normally. However, when I switch to the git master branch version, with the application built against it, I get the jar-not-found exception, and the same happens with the example application. I have tried both the file:// protocol and the hdfs:// protocol w

Re: code review - splitting columns

2013-11-18 Thread Tom Vacek
This is in response to your question about something in the API that already does this. You might want to keep your eye on MLI ( http://www.mlbase.org), which is a columnar table written for machine learning but applicable to a lot of problems. It's not perfect right now. On Fri, Nov 15, 2013 at

Re: Spark & Avro in Scala

2013-11-18 Thread Eugen Cepoi
Hi Robert, The problem is that Spark uses Java serialization, which requires serialized objects to implement Serializable, and AvroKey doesn't. As a workaround you can try using Kryo for the serialization. Eugen 2013/11/11 Rober
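For completeness, a sketch of how Kryo is switched on in the 0.8 line, where configuration goes through system properties set before the SparkContext is created:

    // A sketch (Spark 0.8-era config): select Kryo as the serializer
    // before constructing the SparkContext.
    System.setProperty("spark.serializer",
      "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext("local", "AvroApp")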

Re: Spark & Avro in Scala

2013-11-18 Thread Matt Massie
Agree with Eugen that you should use Kryo. But even better is to embed your Avro objects inside of Kryo. This allows you to have the benefits of both Avro and Kryo. Here's example code for using Avro with Kryo. https://github.com/massie/adam/blob/master/adam-commands/src/main/scala/edu/berkeley
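Since the linked file may move, here is a rough sketch of the idea: wrap Avro's binary encoding inside a Kryo Serializer and register it through a KryoRegistrator. MyRecord stands in for a hypothetical Avro-generated SpecificRecord class:

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter}
    import org.apache.spark.serializer.KryoRegistrator

    // MyRecord is a hypothetical Avro-generated class.
    class MyRecordSerializer extends Serializer[MyRecord] {
      val writer = new SpecificDatumWriter[MyRecord](classOf[MyRecord])
      val reader = new SpecificDatumReader[MyRecord](classOf[MyRecord])

      // Write the record in Avro's compact binary form into Kryo's stream.
      override def write(kryo: Kryo, output: Output, record: MyRecord) {
        val encoder = EncoderFactory.get.binaryEncoder(output, null)
        writer.write(record, encoder)
        encoder.flush()
      }

      override def read(kryo: Kryo, input: Input, cls: Class[MyRecord]): MyRecord = {
        val decoder = DecoderFactory.get.binaryDecoder(input, null)
        reader.read(null, decoder)
      }
    }

    // Registered by pointing the spark.kryo.registrator property here, e.g.
    // System.setProperty("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)
    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord], new MyRecordSerializer)
      }
    }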

debugging a Spark error

2013-11-18 Thread Chris Grier
Hi, I'm trying to figure out what the problem is with a job that we are running on Spark 0.7.3. When we write out via saveAsTextFile we get an exception that doesn't reveal much: 13/11/18 15:06:19 INFO cluster.TaskSetManager: Loss was due to java.io.IOException java.io.IOException: Map failed

Re: DataFrame RDDs

2013-11-18 Thread andy petrella
Maybe I'm wrong, but this use case could be a good fit for Shapeless' records. Shapeless' records are like, so to say, Lisp's records, but typed! In that sense, they're closer to Haskell's record notation, but IMHO less powerful, since the access will be
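A minimal sketch of what such a record looks like (shapeless 2.x-style syntax; details vary by version):

    import shapeless._
    import syntax.singleton._
    import record._

    // A sketch: each field keeps its own static type.
    val row = ("name" ->> "Alice") :: ("age" ->> 29) :: HNil
    val n: String = row("name")   // typed access by field name
    val a: Int    = row("age")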

EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Matt Cheah
Hi, I'm working with an infrastructure that already has its own web server set up on EC2. I would like to set up a separate Spark cluster on EC2 with the scripts and have the web server submit jobs to this Spark cluster. Is it possible to do this? I'm getting some errors running the spark shell

Can not get the expected output when running the BroadcastTest example program.

2013-11-18 Thread 杨强
Hi, all. I'm using spark-0.8.0-incubating. I tried the BroadcastTest example in local mode: ./run-example org.apache.spark.examples.BroadcastTest local 1 2>/dev/null This works fine and gives the result: Iteration 0 === 100 100 100 100 100 100 100 100 100

Re: App master failed to find application jar in the master branch on YARN

2013-11-18 Thread Tom Graves
Sorry for the delay. What is the default filesystem on your HDFS setup? It looks like it's set to file: rather than hdfs://. That is the only reason I can think of that it's listing the directory as file:/home/work/.sparkStaging/application_1384588058297_0056. It's basically just copying it local rath

Re: debugging a Spark error

2013-11-18 Thread Aaron Davidson
Have you looked at the Spark executor logs? They're usually located in the $SPARK_HOME/work/ directory. If you're running in a cluster, they'll be on the individual slave nodes. These should hopefully reveal more information. On Mon, Nov 18, 2013 at 3:42 PM, Chris Grier wrote: > Hi, > > I'm tryin

Re: EC2 node submit jobs to separate Spark Cluster

2013-11-18 Thread Aaron Davidson
The main issue with running a spark-shell locally is that it orchestrates the actual computation, so you want it to be "close" to the actual Worker nodes for latency reasons. Running a spark-shell on EC2 in the same region as the Spark cluster avoids this problem. The error you're seeing seems to

Re: Can not get the expected output when running the BroadcastTest example program.

2013-11-18 Thread Aaron Davidson
Assuming your cluster is actually working (e.g., other examples like SparkPi work), the problem is probably that println() doesn't actually write output back to the driver; instead, it may just be outputting locally to each slave. You can test this by replacing lines 43 through 45 with: sc.
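The suggested replacement is cut off above; a hedged reconstruction of the idea, reusing the barr1 broadcast variable and slices count from the BroadcastTest example, would be:

    // A sketch: bring the per-task values back to the driver with
    // collect(), then print them locally instead of on the slaves.
    sc.parallelize(1 to 10, slices)
      .map(_ => barr1.value.size)
      .collect()
      .foreach(println)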

[Advice works] Re: Can not get the expected output when running the BroadcastTest example program.

2013-11-18 Thread 杨强
Thanks, Aaron. Your advice really works. Does this mean that the collect() method pulls all related data from the slave nodes to the master node? Sincerely, Yang Qiang From: Aaron Davidson Sent: Tuesday, November 19, 2013, 12:38 PM To: user; yangqiang Subject: Re: Can not get the expected output when running th

Re: Joining files

2013-11-18 Thread Something Something
Was my question so dumb? Or is this not a good use case for Spark? On Sun, Nov 17, 2013 at 11:41 PM, Something Something < mailinglist...@gmail.com> wrote: > I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig > for quite some time. > > We've got quite a few ETL processes ru

Re: Joining files

2013-11-18 Thread Alex Boisvert
Yes, it would work and fit Spark nicely... pretty typical, I think. On Nov 18, 2013 10:34 PM, "Something Something" wrote: > Was my question so dumb? Or, is this not a good use case for Spark? > > > On Sun, Nov 17, 2013 at 11:41 PM, Something Something < > mailinglist...@gmail.com> wrote: > >> I a

Re: Joining files

2013-11-18 Thread Horia
It seems to me that what you want is the following procedure: parse each file line by line, generate (key, value) pairs, and join by key. I think the following should accomplish what you're looking for (see the completed sketch below): val students = sc.textFile("./students.txt") // mapping over this RDD already maps over lines va
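A completed sketch, under the (hypothetical) assumption that both files are tab-separated with the join key in the first column:

    // A sketch: parse each file into (key, value) pairs, then join by key.
    val students = sc.textFile("./students.txt")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))                 // (id, name)
    val grades = sc.textFile("./grades.txt")  // hypothetical second file
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))                 // (id, grade)
    val joined = students.join(grades)        // (id, (name, grade))
    joined.saveAsTextFile("./joined")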

Re: DataFrame RDDs

2013-11-18 Thread Matei Zaharia
Interesting idea — in Scala you can also use the Dynamic type (http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to allow dynamic properties. It has the same potential pitfalls as string names, but with nicer syntax. Matei On Nov 18, 2013, at 3:45 PM, andy petrella wrote
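A minimal sketch of the Dynamic approach (the Row class and schema here are hypothetical):

    import scala.language.dynamics

    // A sketch: field-style column access resolved at runtime via
    // selectDynamic, so a typo fails at runtime rather than compile time.
    class Row(schema: Map[String, Int], values: Array[String]) extends Dynamic {
      def selectDynamic(field: String): String = values(schema(field))
    }

    val row = new Row(Map("name" -> 0, "age" -> 1), Array("Alice", "29"))
    row.name   // "Alice" -- reads like a statically defined field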

Re: DataFrame RDDs

2013-11-18 Thread Anwar Rizal
I had that in mind too when Miles Sabin presented Shapeless at Scala.IO Paris last month. If anybody would like to experiment with shapeless in Spark to create something like an R data frame or an Incanter dataset, I would be happy to see it and eventually help. My feeling, however, is that shapel