I've tried adding task.py to pyFiles during SparkContext creation and it
worked perfectly. Thanks for your help!
If you need more information for further investigation, here's what
I've noticed: without explicitly adding the file to the SparkContext, only
functions that are defined in the main module
Hi,
I have a cluster of 20 servers, each with 24 cores and 30 GB of RAM allocated
to Spark. Spark runs in standalone mode.
I am trying to load some 200+ GB of files and cache the rows using .cache().
What I would like to do is the following (at the moment from the Scala console; see the sketch below):
- Evenly load the files
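A minimal sketch of the loading step, assuming the Spark 0.8 Scala shell; the path and split count are placeholders, chosen so the 480 cores get roughly one partition each:

// Ask for enough splits that the data spreads evenly over 20 workers x 24 cores.
// The path and split count are illustrative, not from the original post.
val lines = sc.textFile("hdfs:///data/big-files/*", 480)
lines.cache()   // mark the RDD for in-memory storage
lines.count()   // force a full pass so every partition is materialized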
Hey Jiacheng Guo,
do you have the SPARK_EXAMPLES_JAR env variable set? If you do, you have to add
the --addJars parameter to the yarn client and point it to the spark examples jar.
Or just unset the SPARK_EXAMPLES_JAR env variable.
You should only have to set the SPARK_JAR env variable.
If that isn't
Hello spark community.
I wanted to ask if any work has been done on porting TeraSort
(TeraGen/TeraSort/TeraValidate) from Hadoop to Spark on EC2/EMR.
I am looking for some guidance on lessons learned from this or similar efforts,
as we are trying to do some benchmarking on some of the newer EC2 instances
Great, I will use mapPartitions instead.
Thanks for the advice,
Yadid
On 11/17/13 8:13 PM, Aaron Davidson wrote:
Also, in general, you can work around shortcomings in the Java API by
converting to a Scala RDD (using JavaRDD's rdd() method). The API
tends to be much clunkier since you have to
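A minimal sketch of the conversion Aaron describes, assuming Spark 0.8 package names; the per-partition work shown is a made-up example:

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.rdd.RDD

// Drop from the Java wrapper down to the underlying Scala RDD.
def process(javaRdd: JavaRDD[String]): RDD[Int] = {
  val rdd: RDD[String] = javaRdd.rdd
  // mapPartitions runs once per partition, so per-partition setup
  // (connections, parsers, ...) is paid once instead of once per record.
  rdd.mapPartitions { iter =>
    iter.map(_.length)
  }
}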
Hello
Has anyone read the Fast Data Processing with Spark book (
http://www.amazon.com/Fast-Processing-Spark-Holden-Karau/dp/1782167064/ref=sr_1_1?ie=UTF8&qid=1384791032&sr=8-1&keywords=fast+spark+data+processing
)?
Any reviews or opinions about the material? I'm thinking of buying the book.
Hi Tom,
I'm on Hadoop 2.0.5. I can launch applications built against the Spark 0.8
release normally. However, when I switch to the git master branch and build the
application against it, I get a jar-not-found exception, and the same happens
with the example applications. I have tried both the file:// and hdfs://
protocols
This is in response to your question about something in the API that
already does this. You might want to keep your eye on MLI (
http://www.mlbase.org), which is a columnar table written for machine
learning but applicable to a lot of problems. It's not perfect right now.
On Fri, Nov 15, 2013 at
Agree with Eugen that you should use Kryo.
But even better is to embed your Avro objects inside of Kryo. This allows
you to have the benefits of both Avro and Kryo.
Here's example code for using Avro with Kryo.
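A minimal sketch of the pattern, assuming a hypothetical Avro-generated SpecificRecord class MyRecord; the serializer and registrator names are illustrative, and the hooks are Spark 0.8's Kryo integration:

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter}
import org.apache.spark.serializer.KryoRegistrator

// MyRecord is assumed to be generated by the Avro compiler.
class MyRecordSerializer extends Serializer[MyRecord] {
  val writer = new SpecificDatumWriter[MyRecord](classOf[MyRecord])
  val reader = new SpecificDatumReader[MyRecord](classOf[MyRecord])

  // Kryo's Output is an OutputStream, so Avro can encode straight into it.
  override def write(kryo: Kryo, output: Output, record: MyRecord) {
    val encoder = EncoderFactory.get().binaryEncoder(output, null)
    writer.write(record, encoder)
    encoder.flush()
  }

  // directBinaryDecoder avoids read-ahead buffering that would desync Kryo.
  override def read(kryo: Kryo, input: Input, klass: Class[MyRecord]): MyRecord =
    reader.read(null.asInstanceOf[MyRecord],
      DecoderFactory.get().directBinaryDecoder(input, null))
}

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord], new MyRecordSerializer)
  }
}

// Then point Spark at the registrator, e.g.:
// System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// System.setProperty("spark.kryo.registrator", "MyKryoRegistrator")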
Hi,
I'm trying to figure out what the problem is with a job that we are running
on Spark 0.7.3. When we write out via saveAsTextFile we get an exception
that doesn't reveal much:
13/11/18 15:06:19 INFO cluster.TaskSetManager: Loss was due to
java.io.IOException
java.io.IOException: Map failed
Maybe I'm wrong, but this use case could be a good fit for Shapeless'
(https://github.com/milessabin/shapeless) records.
Shapeless' records are, so to speak, like Lisp's records but typed! In that
sense, they're closer to Haskell's record notation, but IMHO less
powerful, since the access will be
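A minimal sketch of a shapeless record (syntax as in shapeless 2.x; the field names are made up):

import shapeless._
import syntax.singleton._
import record._

// Each field is keyed by a singleton string type, so lookups are
// resolved and typed at compile time.
val book =
  ("author" ->> "Miles Sabin") ::
  ("title"  ->> "shapeless")   ::
  ("pages"  ->> 320)           ::
  HNil

val author: String = book("author") // typed access; book("isbn") won't compile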
Hi,
I'm working with an infrastructure that already has its own web server set up
on EC2. I would like to set up a separate spark cluster on EC2 with the scripts
and have the web server submit jobs to this spark cluster.
Is it possible to do this? I'm getting some errors running the spark
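A minimal sketch of what the web-server side might look like, assuming Spark 0.8's SparkContext constructor; the master hostname, paths, and jar name are placeholders:

import org.apache.spark.SparkContext

// Master URL, SPARK_HOME, and jar path are illustrative, not from the post.
val sc = new SparkContext(
  "spark://ec2-master-host:7077",     // standalone master started by the EC2 scripts
  "WebServerJobs",                    // application name
  "/root/spark",                      // Spark installation on the workers
  Seq("/opt/app/jobs-assembly.jar"))  // application jar shipped to the workers

println(sc.textFile("hdfs:///logs/*").count())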
Hi, all.
I'm using spark-0.8.0-incubating.
I tried the example BroadcastTest in local mode.
./run-example org.apache.spark.examples.BroadcastTest local 1 2>/dev/null
This works fine and gets the result:
Iteration 0
===
100
100
100
100
100
100
100
100
100
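For reference, a minimal sketch of the pattern BroadcastTest exercises, with made-up data; the real example's array size and slice count differ:

// Ship one read-only array to every executor once, rather than with every task.
val arr = (1 to 100).toArray
val bc = sc.broadcast(arr)
sc.parallelize(1 to 10).map(_ => bc.value.length).collect().foreach(println)
// prints the broadcast array's size once per element, much like the output above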
Have you looked at the Spark executor logs? They're usually located in the
$SPARK_HOME/work/ directory. If you're running in a cluster, they'll be on
the individual slave nodes. These should hopefully reveal more information.
On Mon, Nov 18, 2013 at 3:42 PM, Chris Grier
The main issue with running a spark-shell locally is that it orchestrates
the actual computation, so you want it to be close to the actual Worker
nodes for latency reasons. Running a spark-shell on EC2 in the same region
as the Spark cluster avoids this problem.
The error you're seeing seems to
Was my question so dumb? Or is this not a good use case for Spark?
On Sun, Nov 17, 2013 at 11:41 PM, Something Something
mailinglist...@gmail.com wrote:
I am a newbie to both Spark and Scala, but I've been working with Hadoop/Pig
for quite some time.
We have quite a few ETL processes running
Interesting idea — in Scala you can also use the Dynamic type
(http://hacking-scala.org/post/49051516694/introduction-to-type-dynamic) to
allow dynamic properties. It has the same potential pitfalls as string names,
but with nicer syntax.
Matei
On Nov 18, 2013, at 3:45 PM, andy petrella
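A minimal sketch of the Dynamic approach Matei mentions (Scala 2.10+; the Row class and its fields are made up):

import scala.language.dynamics

// Field access is rewritten at compile time: row.age becomes
// row.selectDynamic("age"), so misspelled names fail at runtime,
// just as string keys would, but the call-site syntax is nicer.
class Row(fields: Map[String, Any]) extends Dynamic {
  def selectDynamic(name: String): Any = fields(name)
}

val row = new Row(Map("name" -> "Ada", "age" -> 37))
println(row.name) // Ada
println(row.age)  // 37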