Spark shuffling OutOfMemoryError Java heap space

2016-05-15 Thread Renyi Xiong
Hi, I am consistently observing a driver OutOfMemoryError (Java heap space) during the shuffle operation, indicated by this log line: 16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 36060250 bytes → the shuffle metadata size is big and the full metadata will be sent
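
The output-status blob the driver serializes grows roughly with map partitions × reduce partitions, so the usual first mitigations are a larger driver heap and fewer shuffle partitions. A minimal sketch of those knobs, with all values hypothetical and not a confirmed fix:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical mitigation: fewer partitions shrink the MapOutputTracker status
// table; a bigger heap gives the driver room to hold and serialize it.
val conf = new SparkConf()
  .setAppName("shuffle-metadata-sketch")
  .set("spark.driver.memory", "8g")          // only effective if set before the driver JVM starts
  .set("spark.default.parallelism", "2000")  // fewer map/reduce partitions -> smaller output statuses
val sc = new SparkContext(conf)
```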

MetadataFetchFailedException if executorLost when spark.speculation enabled ?

2016-05-01 Thread Renyi Xiong
Hi, we observed a MetadataFetchFailedException during executorLost when spark.speculation is enabled. It looks like something is out of sync between the original task on the failed executor and its speculative counterpart. Is it a known issue? Please let me know if you need more details. Thanks, Renyi.

persist versus checkpoint

2016-04-30 Thread Renyi Xiong
Hi, is RDD.persist equivalent to RDD.checkpoint if they save the same number of copies (say 3) to disk? (I assume persist saves the copies on different machines?) Thanks, Renyi.
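
For reference, a short Scala sketch contrasting the two (assuming Spark 1.x semantics; sc is an existing SparkContext and paths are hypothetical). They are not equivalent even at the same copy count: persist keeps the lineage and stores replicated blocks through the block manager, while checkpoint writes to reliable storage and truncates the lineage.

```scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("hdfs:///tmp/ckpt")                    // hypothetical path
val rdd = sc.textFile("hdfs:///data/input").map(_.length)  // hypothetical input

// Two in-cluster replicas on disk; lineage is kept, so lost replicas can be recomputed.
rdd.persist(StorageLevel.DISK_ONLY_2)

// One logical copy in the (typically HDFS-replicated) checkpoint dir; lineage is discarded.
rdd.checkpoint()

rdd.count()  // first action materializes both the persisted blocks and the checkpoint
```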

Spark streaming concurrent job scheduling question

2016-04-28 Thread Renyi Xiong
Hi, I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD within one application, through a window like below: var ssc = new StreamingContext(sc, 1min); var ds1 = ... var ds2 = ds1.Window(2min).ForeachRDD(...) ds1.ForeachRDD(...) I want ds1 to start its job at the 1 min interval
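
A Scala sketch of one way to get that overlap (source and host names are hypothetical; this leans on spark.streaming.concurrentJobs, a setting that exists but is undocumented): by default the streaming scheduler runs one output job at a time, so the windowed job and the per-batch job would otherwise serialize.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf()
  .setAppName("concurrent-jobs-sketch")
  .set("spark.streaming.concurrentJobs", "2")  // default is 1: one output job at a time
val ssc = new StreamingContext(conf, Minutes(1))

val ds1 = ssc.socketTextStream("host", 9999)   // hypothetical source
// CPU-intensive job every 1-min batch (placeholder body)
ds1.foreachRDD(rdd => println(s"cpu job over ${rdd.count()} records"))
// I/O-intensive job every 2 min over a 2-min window (placeholder body)
ds1.window(Minutes(2), Minutes(2)).foreachRDD(rdd => println(s"io job over ${rdd.count()} records"))

ssc.start()
ssc.awaitTermination()
```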

Spark streaming Kafka receiver WriteAheadLog question

2016-04-22 Thread Renyi Xiong
Hi, is it possible for the Kafka-receiver-generated WriteAheadLogBackedBlockRDD to hold the corresponding Kafka offset range, so that during recovery the RDD can refer back to the Kafka queue instead of paying the cost of the write-ahead log? I guess there must be a reason here. Could anyone please help me
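
For comparison, the receiver-less "direct" Kafka stream added in Spark 1.3 does exactly this: each RDD partition carries a Kafka offset range and recovery replays from Kafka, with no write-ahead log. A minimal sketch against the 0.8 integration API (ssc is an existing StreamingContext; broker and topic names are hypothetical):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))                               // hypothetical topic

stream.foreachRDD { rdd =>
  // Each partition knows the exact offset range it covers, so recovery can
  // re-read from Kafka instead of from a WAL.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset}..${r.untilOffset}"))
}
```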

Spark Streaming UI reporting a different task duration

2016-04-05 Thread Renyi Xiong
Hi TD, we noticed that the Spark Streaming UI reports a different task duration from time to time. E.g., here's the standard output of the application, which reports the duration of the longest task as about 3.3 minutes: 16/04/01 16:07:19 INFO TaskSetManager: Finished task 1077.0 in stage 0.0

RE: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong
Thanks a lot, Sean, really appreciate your comments. Sent from my Windows 10 phone From: Sean Owen Sent: Friday, April 1, 2016 12:55 PM To: Renyi Xiong Cc: Tathagata Das; dev Subject: Re: Declare rest of @Experimental items non-experimental if they've existed since 1.2.0 The change

Declare rest of @Experimental items non-experimental if they've existed since 1.2.0

2016-04-01 Thread Renyi Xiong
Hi Sean, we're upgrading Mobius (the C# binding for Spark) at Microsoft to align with Spark 1.6.2 and noticed some API changes you made in https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519, mostly on APIs with the Approx postfix (still marked as experimental in pyspark

DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-10 Thread Renyi Xiong
Hi TD, thanks a lot for offering to look at our PR (if we file one) at the conference in NYC. As we briefly discussed the issues of unbalanced and under-distributed Kafka partitions when developing Spark Streaming applications in Mobius (C# for Spark), we're trying the option of repartitioning
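
A Scala sketch of what that repartitioning option looks like (the kafkaStream variable and target parallelism are hypothetical): each Kafka partition normally maps 1:1 to an RDD partition, so a skewed or under-partitioned topic yields skewed tasks, and an explicit shuffle rebalances the records at the cost of that 1:1 mapping.

```scala
// kafkaStream: a DStream read from an unevenly partitioned topic (assumed to exist).
val rebalanced = kafkaStream.repartition(64)  // hypothetical target parallelism
rebalanced.foreachRDD(rdd => println(s"partitions after shuffle: ${rdd.partitions.length}"))
```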

Re: pyspark worker concurrency

2016-02-08 Thread Renyi Xiong
Never mind, I think pyspark is already doing async socket read/write, but on the Scala side, in PythonRDD.scala. On Sat, Feb 6, 2016 at 6:27 PM, Renyi Xiong <renyixio...@gmail.com> wrote: > Hi, > > is it a good idea to have 2 threads in pyspark worker? - main thread > respo

pyspark worker concurrency

2016-02-06 Thread Renyi Xiong
Hi, is it a good idea to have 2 threads in the pyspark worker? - a main thread responsible for receiving and sending data over the socket, while the other thread calls user functions to process the data? Since the CPU is idle (?) during network I/O, this should improve concurrency quite a bit. Can an expert answer

Spark Streaming KafkaUtils missing Save API?

2016-01-15 Thread Renyi Xiong
Hi, we noticed there's no Save method in KafkaUtils. We do have scenarios where we want to save an RDD back to a Kafka queue to be consumed by downstream streaming applications. I wonder if this is a common scenario; if yes, is there any plan to add it? Thanks, Renyi.
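
Per the question there was no such API in KafkaUtils at the time; the common workaround is one producer per partition inside foreachPartition. A hedged Scala sketch using the Kafka new-producer client (broker address and the saveToKafka helper name are hypothetical):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

def saveToKafka(rdd: RDD[String], topic: String): Unit = {
  rdd.foreachPartition { records =>
    // One producer per partition: producers are not serializable, so they
    // must be created on the executor, not the driver.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // hypothetical broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
    producer.close()  // blocks until pending sends are flushed
  }
}
```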

pyspark streaming 1.6 mapWithState?

2015-12-21 Thread Renyi Xiong
Hi TD, I noticed mapWithState became available in Spark 1.6. Is there any plan to enable it in pyspark as well? Thanks, Renyi.
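
For reference, the Scala 1.6 API being asked about - a minimal running-count sketch (the pairs DStream of (String, Int) is assumed to exist):

```scala
import org.apache.spark.streaming.{State, StateSpec}

// pairs: DStream[(String, Int)] (assumed)
val runningCount = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)  // persist the new running total for this key
  (key, sum)         // emitted downstream
}
val counts = pairs.mapWithState(StateSpec.function(runningCount))
```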

DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Hi, I hit the following exception when the driver program tried to recover from a checkpoint. It looks like the logic relies on zeroTime being set, which doesn't seem to happen here. Am I missing anything, or is it a bug in 1.4.1? org.apache.spark.SparkException:

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
Never mind, one of my peers corrected the driver program for me - all DStream operations need to be within the scope of the getOrCreate API. On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong <renyixio...@gmail.com> wrote: > following scala program throws same exception, I know people are running
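
Sketched out, the fix described above looks like this (checkpoint path, source, and operations are hypothetical): everything that defines the DStream graph must live inside the factory passed to getOrCreate, so that recovery from the checkpoint rebuilds the same graph and zeroTime gets set.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/ckpt"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recovery-sketch")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // All DStream operations are defined inside this factory.
  val lines = ssc.socketTextStream("host", 9999)  // hypothetical source
  lines.count().print()
  ssc
}

// Cold start calls createContext; restart rebuilds the graph from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```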

Re: DStream not initialized SparkException

2015-12-09 Thread Renyi Xiong
bmit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Wed, Dec 9, 2015 at 12:45 PM, Renyi Xiong <renyixio...@gmail.com> wrote: > hi, > > I met following exceptio

let spark streaming sample come to stop

2015-11-13 Thread Renyi Xiong
Hi, I try to run the following 1.4.1 sample by putting a words.txt under localdir: bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir. Two questions: 1. it does not pick up words.txt - because it's 'old', I guess; is there any option to let it be picked up? 2. I managed to put a 'new'
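
On question 1, a Scala sketch of the relevant knob (assuming Spark 1.4's FileInputDStream semantics; ssc is an existing StreamingContext): textFileStream only picks up files whose modification time falls inside the remembered window, while the lower-level fileStream exposes a newFilesOnly flag that, set to false, also processes files already present at start.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "localdir",
  (path: Path) => true,   // accept every file name
  newFilesOnly = false    // also pick up files that predate the context start
).map(_._2.toString)
```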

Re: failure notice

2015-10-06 Thread Renyi Xiong
are > disposable, and there should not be any node-specific long term state that > you rely on unless you can recover that state on a different node. > > On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong <renyixio...@gmail.com> wrote: > >> if RDDs from same DStream not g

Re: failure notice

2015-10-05 Thread Renyi Xiong
If RDDs from the same DStream are not guaranteed to run on the same worker, then the question becomes: is it possible to specify an unlimited duration in ssc to have a continuous stream (as opposed to a discretized one)? Say we have a per-node streaming engine (with built-in checkpoint and recovery) we'd like to

SparkR dataframe UDF

2015-10-02 Thread Renyi Xiong
Hi Shiva, is the DataFrame UDF implemented in SparkR yet? I could not find it at the URL below: https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R Thanks, Renyi.

Re: failed to run spark sample on windows

2015-09-30 Thread Renyi Xiong
Thanks a lot, it works now after I set %HADOOP_HOME%. On Tue, Sep 29, 2015 at 1:22 PM, saurfang wrote: > See > http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode >
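
For anyone who can't set the environment variable, the same fix can be applied programmatically - Hadoop's Shell utility checks the hadoop.home.dir system property before falling back to %HADOOP_HOME%. A one-line sketch (the path is hypothetical, and winutils.exe must sit under its bin\ directory):

```scala
// Hypothetical path; requires C:\hadoop\bin\winutils.exe to exist.
System.setProperty("hadoop.home.dir", "C:\\hadoop")
```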

Re: failed to run spark sample on windows

2015-09-29 Thread Renyi Xiong
version consistent with the one which was used to build Spark > 1.4.0 ? > > Cheers > > On Mon, Sep 28, 2015 at 4:36 PM, Renyi Xiong <renyixio...@gmail.com> > wrote: > >> I tried to run HdfsTest sample on windows spark-1.4.0 >> >> bin\run-example org.apa

failed to run spark sample on windows

2015-09-28 Thread Renyi Xiong
I tried to run the HdfsTest sample on Windows with spark-1.4.0: bin\run-example org.apache.spark.examples.HdfsTest but got the exception below - anybody have any idea what went wrong here? 15/09/28 16:33:56.565 ERROR SparkContext: Error initializing SparkContext. java.lang.NullPointerException at

Spark streaming DStream state on worker

2015-09-16 Thread Renyi Xiong
Hi, I want to do a temporal join operation on a DStream across RDDs. My question is: are RDDs from the same DStream always computed on the same worker (except on failover)? Thanks, Renyi.

Re: SparkR streaming source code

2015-09-16 Thread Renyi Xiong
AM, Reynold Xin <r...@databricks.com> wrote: > > You should reach out to the speakers directly. > > > > > > On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong <renyixio...@gmail.com> > wrote: > >> > >> SparkR streaming is mentioned at about page 17 in

pyspark streaming DStream compute

2015-09-15 Thread Renyi Xiong
Can anybody help me understand why pyspark streaming uses a py4j callback to execute Python code, while pyspark batch uses worker.py? Regarding pyspark streaming, is the py4j callback only used for DStreams, with worker.py still used for RDDs? Thanks, Renyi.

Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
> Yeah in addition to the downside of having 2 JVMs the command line > arguments and SparkConf etc. will be set by spark-submit in the first > JVM which won't be available in the second JVM. > > Shivaram > > On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong <renyixio...@gmail.com>

Re: SparkR driver side JNI

2015-09-11 Thread Renyi Xiong
ll be set by spark-submit in the first > >> JVM which won't be available in the second JVM. > >> > >> Shivaram > >> > >> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong <renyixio...@gmail.com> > >> wrote: > >> > for 2nd case wh

SparkR driver side JNI

2015-08-06 Thread Renyi Xiong
Why did SparkR eventually choose an inter-process socket solution on the driver side instead of the in-process JNI shown in one of its docs below (around page 20)? https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf