Re: groupByKey() and keys with many values

2015-09-08 Thread Sean Owen
I think groupByKey is intended for cases where you do want the values in memory; for one-pass use cases, it's more efficient to use reduceByKey, or aggregateByKey if lower-level operations are needed. For your case, you probably want to do your reduceByKey, then perform the expensive per-key
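
A minimal sketch of the contrast Sean describes, with a toy dataset (the names and numbers here are illustrative, not from the thread):

```
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("reduce-vs-group"))
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

    // groupByKey materializes every value of a key in memory first:
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines map-side, so per-key values never pile up:
    val viaReduce = pairs.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)  // (a,4), (b,2)
    sc.stop()
  }
}
```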

adding jars to the classpath with the relative path to spark home

2015-09-08 Thread Niranda Perera
Hi, is it possible to add jars to the Spark executor/driver classpath with a path relative to the Spark home? I need to set the following settings in the Spark conf: spark.driver.extraClassPath and spark.executor.extraClassPath. The reason why I need to use the relative
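
The thread does not include a resolution, but one possible workaround, sketched here under the assumption that SPARK_HOME is set on the submitting machine, is to resolve the relative path to an absolute one before setting the two options (the jar path below is a placeholder):

```
import org.apache.spark.SparkConf

// Assumes SPARK_HOME is set in the environment; "lib/extra.jar" is a
// hypothetical jar location relative to the Spark home.
val sparkHome = sys.env("SPARK_HOME")
val jar = s"$sparkHome/lib/extra.jar"

val conf = new SparkConf()
  .set("spark.driver.extraClassPath", jar)
  // Note: executors resolve their classpath locally, so this assumes
  // the same directory layout on every node.
  .set("spark.executor.extraClassPath", jar)
```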

Re: groupByKey() and keys with many values

2015-09-08 Thread kaklakariada
Hi Antonio! Thank you very much for your answer! You are right that in my case the computation could be replaced by a reduceByKey. The thing is that my computation also involves database queries: 1. Fetch key-specific data from the database into memory. This is expensive and I only want to do
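
A sketch of the pattern under discussion, with hypothetical fetchFromDb and compute helpers standing in for the poster's actual queries: reduce first, then do the expensive database work once per surviving key, sharing one connection per partition:

```
import java.sql.{Connection, DriverManager}
import org.apache.spark.rdd.RDD

// Hypothetical helpers for the poster's DB query and computation.
def fetchFromDb(conn: Connection, key: String): String = ???
def compute(total: Int, dbData: String): String = ???

val pairs: RDD[(String, Int)] = ???  // the keyed input data

val results = pairs.reduceByKey(_ + _).mapPartitions { iter =>
  val conn = DriverManager.getConnection("jdbc:postgresql://...")  // placeholder URL
  try {
    // Materialize before closing the connection; mapPartitions is lazy.
    iter.map { case (key, total) =>
      (key, compute(total, fetchFromDb(conn, key)))
    }.toList.iterator
  } finally {
    conn.close()
  }
}
```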

Re: Detecting configuration problems

2015-09-08 Thread Akhil Das
I found an old JIRA referring to the same issue: https://issues.apache.org/jira/browse/SPARK-5421 Thanks Best Regards On Sun, Sep 6, 2015 at 8:53 PM, Madhu wrote: > I'm not sure if this has been discussed already; if so, please point me to > the thread and/or related JIRA. > > I have

Re: Code generation for GPU

2015-09-08 Thread Steve Loughran
On 7 Sep 2015, at 20:44, lonikar wrote: 2. If the vectorization is difficult or a major effort, I am not sure how I am going to implement even a glimpse of the changes I would like to. I think I will have to be satisfied with only a partial effort.

Re: Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Marcelo Vanzin
Hi Kevin, How did you try to use the Scala module? Spark has this code when setting up the ObjectMapper used to generate the output: mapper.registerModule(com.fasterxml.jackson.module.scala.DefaultScalaModule) As for supporting direct serialization to Java objects, I don't think that was the
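
For reference, registering the Scala module that Marcelo mentions looks roughly like the sketch below; ApplicationSummary is a hypothetical simplified stand-in for the case classes in org/apache/spark/status/api/v1/api.scala:

```
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical stand-in for one of the v1 API classes.
case class ApplicationSummary(id: String, name: String)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

val app = mapper.readValue("""{"id":"app-1","name":"demo"}""",
                           classOf[ApplicationSummary])
```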

Re: Fast Iteration while developing

2015-09-08 Thread Michael Armbrust
+1 to Reynold's suggestion. This is probably the fastest way to iterate. Another option for more ad-hoc debugging is `sbt/sbt sparkShell`, which is similar to bin/spark-shell but doesn't require you to rebuild the assembly jar. On Mon, Sep 7, 2015 at 9:03 PM, Reynold Xin

Deserializing JSON into Scala objects in Java code

2015-09-08 Thread Kevin Chen
Hello Spark Devs, I am trying to use the new Spark API JSON endpoints at /api/v1/[path] (added in SPARK-3454). In order to minimize maintenance on our end, I would like to use Retrofit/Jackson to parse the JSON directly into the Scala classes in org/apache/spark/status/api/v1/api.scala

Re: Pyspark DataFrame TypeError

2015-09-08 Thread Davies Liu
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3; they all work as expected:
```
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])),
...                                  (0.0, Vectors.sparse(1, [], []))],
...                                 ["label", "featuers"])
>>> df.show()
+-----+---------+
|label|
```

Re: groupByKey() and keys with many values

2015-09-08 Thread Reynold Xin
On Tue, Sep 8, 2015 at 6:51 AM, Antonio Piccolboni wrote: > As far as the DB writes, remember Spark can retry a computation, so your > writes have to be idempotent (see this thread, in which Reynold
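
One common way to make such writes idempotent, sketched here with JDBC and a Postgres-style upsert (the table, column names, and URL are placeholders, and the upsert syntax varies by database):

```
import java.sql.DriverManager
import org.apache.spark.rdd.RDD

val results: RDD[(String, Double)] = ???  // the computed output

results.foreachPartition { iter =>
  val conn = DriverManager.getConnection("jdbc:postgresql://...")  // placeholder
  try {
    // A retried task overwrites the same rows instead of duplicating them.
    val stmt = conn.prepareStatement(
      "INSERT INTO results (key, value) VALUES (?, ?) " +
        "ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value")
    iter.foreach { case (key, value) =>
      stmt.setString(1, key)
      stmt.setDouble(2, value)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}
```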

Re: Detecting configuration problems

2015-09-08 Thread Madhu
Thanks Akhil! I suspect the root cause of the shuffle OOM I was seeing (and probably one that many users might see) is that individual partitions on the reduce side do not fit in memory. As a guideline, I was thinking of something like "be sure that your largest partitions occupy no more than 1%
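
In practice, that guideline translates to raising the shuffle parallelism so each reduce-side partition stays small; a sketch, with a made-up partition count:

```
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = ???  // the keyed input data

// With more partitions, each reduce-side partition holds less data.
// 2000 is illustrative; size it so your largest partition fits comfortably.
val reduced = pairs.reduceByKey(_ + _, 2000)

// For DataFrame/SQL shuffles, the analogous knob is the
// spark.sql.shuffle.partitions config setting.
```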

Re: groupByKey() and keys with many values

2015-09-08 Thread Antonio Piccolboni
You may also consider selecting the distinct keys and fetching from the database first, then joining on key with the values. This is in case Sean's approach is not viable -- in case you need to have the DB data before the first reduce call. By not revealing your problem, you are forcing us to make guesses, which
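
A sketch of Antonio's suggestion, with a hypothetical fetchRowsForKeys standing in for the actual database access:

```
import org.apache.spark.rdd.RDD

// Hypothetical DB access: returns (key, row) pairs for the given keys.
def fetchRowsForKeys(keys: Iterator[String]): Iterator[(String, String)] = ???

val values: RDD[(String, Int)] = ???  // the poster's keyed data

// Fetch DB data once per distinct key, then join it back onto the values.
val dbData = values.keys.distinct().mapPartitions(fetchRowsForKeys)
val joined = values.join(dbData)  // RDD[(String, (Int, String))]
```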