Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, I am trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (I do not see the sysout output on
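For reference, a minimal sketch of a stateful word count along these lines (the socket source, port, and checkpoint path are placeholders; Spark 1.1-era streaming API assumed):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations

    val ssc = new StreamingContext(new SparkConf().setAppName("StatefulWordCount"), Seconds(10))
    ssc.checkpoint("checkpoint") // updateStateByKey requires a checkpoint directory

    // Merge this batch's counts into the running total; returning None drops the key.
    val updateCounts = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0)): Option[Int]

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val totals = words.map(w => (w, 1)).updateStateByKey(updateCounts)
    totals.print() // an output operation is required, or nothing runs

    ssc.start()
    ssc.awaitTermination()

A missing checkpoint directory or a missing output operation are two common reasons the update function never fires.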

repartition combined with zipWithIndex get stuck

2014-11-15 Thread lev
Hi, I'm having trouble using both zipWithIndex and repartition. When I use them together, the following action gets stuck and won't return. I'm using Spark 1.1.0. These two lines work as expected:

scala> sc.parallelize(1 to 10).repartition(10).count()
res0: Long = 10

scala> sc.parallelize(1 to
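Based on the workaround Xiangrui posts later in this thread, the combination that hangs appears to be zipWithIndex followed by repartition, roughly:

    val a = sc.parallelize(1 to 10).zipWithIndex()
    a.repartition(10).count() // reportedly hangs on Spark 1.1.0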

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian
Hi Sadhan, Could you please provide the stack trace of the ArrayIndexOutOfBoundsException (if any)? The reason the first query succeeds is that Spark SQL doesn't bother reading all data from the table to answer COUNT(*). In the second case, however, the whole table is asked to be
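To illustrate the distinction Cheng is drawing, a minimal sketch (the path and table name are hypothetical; Spark 1.1 API assumed):

    // COUNT(*) needs very little of the actual column data, whereas a query
    // that returns rows must decode the table, which is where a bad file
    // or schema mismatch would surface.
    sqlContext.parquetFile("/path/to/table").registerTempTable("t")
    sqlContext.sql("CACHE TABLE t")
    sqlContext.sql("SELECT COUNT(*) FROM t").collect()   // may succeed
    sqlContext.sql("SELECT * FROM t LIMIT 10").collect() // may hit the exception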

Re: saveAsTextFile error

2014-11-15 Thread Prannoy
Hi Niko, Have you tried running it while keeping the wordCounts.print()? Possibly the import of the package *org.apache.spark.streaming._* is missing, so during sbt package it is unable to locate the saveAsTextFile API. Go to
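As a point of reference, on a DStream the save method is actually saveAsTextFiles (taking a prefix and suffix), and the pair-DStream operations come in with the streaming imports. A minimal sketch, assuming wordCounts is a DStream of (word, count) pairs and the output path is a placeholder:

    import org.apache.spark.streaming.StreamingContext._ // pair-DStream operations

    wordCounts.print()
    wordCounts.saveAsTextFiles("hdfs:///tmp/wordcounts", "txt")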

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ben, I haven't tried it with Python, but the instructions are the same as for compiled Scala (jar) apps. What it's saying is that it's not possible to offload the entire job to the master (a la Hadoop) in a fire-and-forget (or rather submit-and-forget) manner when running standalone.
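A sketch of the client-mode submission this implies (host, port, and file name are placeholders):

    spark-submit --master spark://master-host:7077 app.py

Here the driver stays on the submitting machine for the lifetime of the job, which is why the submission cannot simply be fired and forgotten.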

Re: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ognen Duzlevski
Ashic, Thanks for your email. Two things: 1. I think a whole lot of data scientists and other people would love it if they could just fire off jobs from their laptops; it is, in my opinion, a commonly desired use case. 2. Did anyone actually get the Ooyala job server to work? I asked that

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ognen, Currently the docs say: "Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications." So it seems like YARN + Scala is the only option for fire-and-forget. It shouldn't be too hard to create a proxy submitter, but yes, that does involve another
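In other words, the fire-and-forget shape available at this point looks roughly like the following (class and jar names are placeholders; 1.1-era YARN syntax):

    spark-submit --master yarn-cluster --class com.example.Main app.jar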

using zip gets EOFError error

2014-11-15 Thread chocjy
I was trying to zip one RDD with another RDD. I store my matrix in HDFS and load it as

Ab_rdd = sc.textFile('data/Ab.txt', 100)

If I do

idx = sc.parallelize(range(m), 100)  # m is the number of records in Ab.txt
print matrix_Ab.matrix.zip(idx).first()

I got the following error: If I store my
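For what it's worth, zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, which textFile plus a separately parallelized range cannot in general guarantee. A hedged alternative (shown in Scala; the PySpark equivalent is analogous, and the path is a placeholder) is to derive the indices from the data itself:

    // zipWithIndex assigns indices consistent with the RDD's own partitioning,
    // so no second RDD has to line up with it element by element.
    val ab = sc.textFile("data/Ab.txt", 100)
    val indexed = ab.zipWithIndex().map { case (line, i) => (i, line) }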

Re: Using data in RDD to specify HDFS directory to write to

2014-11-15 Thread jschindler
UPDATE: I have systematically removed and added things to the job and have figured out that the construction of the SparkContext object is what is causing it to fail. The last run contained the code below. I keep losing executors, apparently, and I'm not sure why. Some of the
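One common way this particular failure arises is constructing a SparkContext somewhere other than once on the driver, e.g. inside a transformation or a streaming action. A minimal sketch of the safe shape (the app name and stand-in data are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Create exactly one SparkContext, on the driver, and reuse it; building
    // another inside code shipped to executors is a common source of failures.
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-writer"))
    val dirs = sc.parallelize(Seq("2014/11/14", "2014/11/15")) // stand-in for the real RDD
    dirs.collect().foreach(println)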

Pagerank implementation

2014-11-15 Thread tom85
Hi, I wonder whether the PageRank implementation is correct. More specifically, I am looking at the following function from PageRank.scala (https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala), which is given to Pregel: def vertexProgram(id:
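For context, the update that vertex program is expected to implement is the standard PageRank iteration. A hedged reconstruction from that era of the file (resetProb defaults to 0.15 and msgSum is the rank mass received from in-neighbors):

    // resetProb is captured from the enclosing PageRank run, 0.15 by default.
    val resetProb = 0.15
    def vertexProgram(id: Long, attr: Double, msgSum: Double): Double =
      resetProb + (1.0 - resetProb) * msgSum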

How to incrementally compile spark examples using mvn

2014-11-15 Thread Yiming (John) Zhang
Hi, I have already successfully compiled and run the Spark examples. My problem is that if I make some modifications (e.g., to SparkPi.scala or LogQuery.scala) I have to use mvn -DskipTests package to rebuild the whole Spark project and wait a relatively long time. I also tried mvn scala:cc as

Re: How to incrementally compile spark examples using mvn

2014-11-15 Thread Marcelo Vanzin
I haven't tried scala:cc, but you can ask Maven to just build a particular sub-project. For example: mvn -pl :spark-examples_2.10 compile On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang sdi...@gmail.com wrote: Hi, I have already successfully compiled and run the Spark examples. My problem
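If the examples depend on other modules that have also changed, adding Maven's -am (--also-make) flag, e.g. mvn -pl :spark-examples_2.10 -am compile, rebuilds those upstream modules as well.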

Re: Client application that calls Spark and receives an MLlib model Scala Object and then predicts without Spark installed on hadoop

2014-11-15 Thread Xiangrui Meng
If Spark is not installed on the client side, you won't be able to deserialize the model. Instead of serializing the model object, you may serialize the model weights array and implement predict on the client side. -Xiangrui On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu xiaoyan...@gmail.com wrote:
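A sketch of what this suggestion amounts to for a linear model (the file name is a placeholder; the weights array and intercept would come from the trained MLlib model, e.g. model.weights.toArray and model.intercept):

    import java.io.{FileOutputStream, ObjectOutputStream}

    // Spark side: persist just the raw parameters, not the model object,
    // so the client needs no Spark classes to read them back.
    def saveParams(weights: Array[Double], intercept: Double, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      out.writeObject((weights, intercept))
      out.close()
    }

    // Client side, with no Spark dependency: prediction is a plain dot product.
    def predict(weights: Array[Double], intercept: Double, x: Array[Double]): Double =
      weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept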

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27 AM, lev kat...@gmail.com wrote: Hi, I'm having trouble using both zipWithIndex and repartition. When I use them both, the following action will get stuck and won't return. I'm using spark 1.1.0. Those 2 lines

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
I think I understand where the bug is now. I created a JIRA (https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR soon. -Xiangrui On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng men...@gmail.com wrote: This is a bug. Could you make a JIRA? -Xiangrui On Sat, Nov 15, 2014 at 3:27

Re: repartition combined with zipWithIndex get stuck

2014-11-15 Thread Xiangrui Meng
PR: https://github.com/apache/spark/pull/3291. For now, here is a workaround:

val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()

Thanks for reporting the bug! -Xiangrui On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread sadhan
Hi Cheng, Thanks for your response. Here is the stack trace from the YARN logs: -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-exception-on-cached-parquet-table-tp18978p19020.html Sent from the Apache Spark User List mailing list archive at

SparkSQL exception on spark.sql.codegen

2014-11-15 Thread Eric Zhen
Hi all, We ran Spark SQL on TPC-DS benchmark query Q19 with spark.sql.codegen=true and got the exceptions below. Has anyone else seen these before?

java.lang.ExceptionInInitializerError
    at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
    at
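For anyone trying to reproduce, codegen is toggled through the SQL configuration; a minimal sketch (the query is a placeholder; Spark 1.1 API assumed, where codegen was still experimental):

    // spark.sql.codegen compiles expression evaluation to bytecode at runtime.
    sqlContext.setConf("spark.sql.codegen", "true")
    sqlContext.sql("SELECT COUNT(*) FROM some_table").collect()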