Hi
I am consistently observing a driver OutOfMemoryError (Java heap space)
during shuffle operations, indicated by this log line:
16/05/14 21:57:03 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 2 is 36060250 bytes
-> i.e. the shuffle metadata size is big and the full metadata will be sent
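For context, the serialized map-output statuses grow with (number of map
partitions) x (number of reduce partitions), so one mitigation we're
sketching is simply lowering partition counts (wideRdd and all counts below
are placeholders):
import org.apache.spark.rdd.RDD

// Sketch: fewer map/reduce partitions => a smaller output-status payload for
// the driver to hold and serialize. Partition counts are placeholders.
def shrinkShuffle(wideRdd: RDD[(String, Int)]): RDD[(String, Int)] =
  wideRdd
    .coalesce(2000)                            // fewer map-side partitions
    .reduceByKey(_ + _, numPartitions = 2000)  // fewer reduce-side partitions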
Hi,
We observed MetadataFetchFailedException during executorLost if
spark.speculation is enabled.
It looks like something is out of sync between the original task on the
failed executor and its speculative counterpart task?
Is this a known issue?
Please let me know if you need more details.
Thanks,
Renyi.
Hi,
Is RDD.persist equivalent to RDD.checkpoint if they save the same number of
copies (say 3) to disk?
(I assume persist saves the copies on different machines?)
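To make the question concrete, a minimal sketch of the two calls as I
understand them (storage level and checkpoint path are just examples):
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("words.txt").map(_.length)  // example RDD
data.persist(StorageLevel.DISK_ONLY_2)      // 2 disk copies on executors; lineage kept
sc.setCheckpointDir("hdfs:///checkpoints")  // placeholder path
data.checkpoint()                           // written to reliable storage; lineage truncated
data.count()                                // first action materializes both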
thanks,
Renyi.
Hi,
I am trying to run an I/O-intensive RDD in parallel with a CPU-intensive RDD
within one application, through a window like below:
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(1))  // 1-minute batch interval
val ds1 = ...                                   // I/O-intensive source stream
val ds2 = ds1.window(Minutes(2))                // 2-minute window over ds1
ds2.foreachRDD(...)                             // CPU-intensive work
ds1.foreachRDD(...)                             // I/O-intensive work
I expect ds1 to start its job at a 1-minute interval
Hi,
Is it possible for the Kafka-receiver-generated WriteAheadLogBackedBlockRDD
to hold the corresponding Kafka offset range, so that during recovery the RDD
can refer back to the Kafka queue instead of paying the cost of the
write-ahead log?
I guess there must be a reason here. Could anyone please help me understand?
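For comparison, a sketch of the direct (receiver-less) API, which does carry
offset ranges in each batch's RDD so recovery re-reads from Kafka (ssc is
assumed; broker and topic are placeholders):
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("myTopic"))
stream.foreachRDD { rdd =>
  // the offsets travel with the RDD itself, no write-ahead log needed
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}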
Hi TD,
We noticed that the Spark Streaming UI reports a different task duration
from time to time.
E.g., here's the standard output of the application, which reports that the
duration of the longest task is about 3.3 minutes:
16/04/01 16:07:19 INFO TaskSetManager: Finished task 1077.0 in stage 0.0
Thanks a lot, Sean, really appreciate your comments.
Sent from my Windows 10 phone
From: Sean Owen
Sent: Friday, April 1, 2016 12:55 PM
To: Renyi Xiong
Cc: Tathagata Das; dev
Subject: Re: Declare rest of @Experimental items non-experimental if
they've existed since 1.2.0
The change
Hi Sean,
We're upgrading Mobius (the C# binding for Spark) at Microsoft to align with
Spark 1.6.2 and noticed some API changes you made in
https://github.com/apache/spark/commit/6f81eae24f83df51a99d4bb2629dd7daadc01519
mostly on the APIs with the Approx postfix (still marked as experimental in
pyspark
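For instance, one of the Approx APIs in question, sketched against a
placeholder rdd:
// countApprox returns a PartialResult wrapping a BoundedDouble (an estimate
// with confidence bounds) instead of an exact count.
val partial = rdd.countApprox(timeout = 1000L, confidence = 0.95)
println(partial.initialValue)  // may still be partial if the timeout fired first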
Hi TD,
Thanks a lot for offering to look at our PR (if we file one) at the
conference in NYC.
As we briefly discussed the issues of unbalanced and
under-distributed Kafka partitions when developing Spark Streaming
applications in Mobius (C# for Spark), we're trying the option of
repartitioning.
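Roughly like this (kafkaStream and the partition count are placeholders):
// Sketch: redistribute the unbalanced/under-distributed Kafka input across
// more partitions before the expensive transformations run.
val balanced = kafkaStream.repartition(64)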
Never mind, I think pyspark is already doing async socket read/write, but on
the Scala side, in PythonRDD.scala.
On Sat, Feb 6, 2016 at 6:27 PM, Renyi Xiong <renyixio...@gmail.com> wrote:
> Hi,
>
> is it a good idea to have 2 threads in pyspark worker? - main thread
> respo
Hi,
Is it a good idea to have 2 threads in the pyspark worker? The main thread
would be responsible for receiving and sending data over the socket, while
the other thread calls user functions to process the data.
Since the CPU is idle (?) during network I/O, this should improve concurrency
quite a bit.
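To make the idea concrete, a toy two-thread sketch (plain JVM threads and a
queue standing in for the real socket/worker loop):
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// A producer thread (standing in for the socket reader) fills a queue while
// the main thread (standing in for user-function execution) drains it, so
// network I/O and CPU work overlap.
val queue = new ArrayBlockingQueue[Int](1024)
val producer = new Thread(() => (1 to 1000).foreach(i => queue.put(i)))
producer.start()
var processed = 0
while (processed < 1000) {
  queue.poll(1, TimeUnit.SECONDS)  // would be: receive data over the socket
  processed += 1                   // would be: call the user function
}
producer.join()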
Can an expert answer?
Hi,
We noticed there's no Save method in KafkaUtils. We do have scenarios where
we want to save an RDD back to a Kafka queue, to be consumed by downstream
streaming applications.
I wonder if this is a common scenario; if yes, is there any plan to add it?
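For reference, a sketch of the kind of save we mean (broker address and topic
are placeholders):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

// One producer per partition, created executor-side, then closed.
def saveToKafka(rdd: RDD[String], topic: String): Unit =
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try records.foreach(v => producer.send(new ProducerRecord[String, String](topic, v)))
    finally producer.close()
  }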
Thanks,
Renyi.
Hi TD,
I noticed mapWithState became available in Spark 1.6. Is there any plan to
enable it in pyspark as well?
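For reference, the Scala-side usage we'd like mirrored, as a sketch
(pairDStream stands for a DStream[(String, Int)]):
import org.apache.spark.streaming.{State, StateSpec}

// A running sum per key, kept in managed state.
val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}
val sums = pairDStream.mapWithState(spec)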
thanks,
Renyi.
Hi,
I met the following exception when the driver program tried to recover from
a checkpoint. It looks like the logic relies on zeroTime being set, which
doesn't seem to happen here. Am I missing anything, or is it a bug in 1.4.1?
org.apache.spark.SparkException:
Never mind, one of my peers corrected the driver program for me: all DStream
operations need to be within the scope of the getOrCreate API.
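For anyone hitting the same thing, a minimal sketch of the fixed shape
(paths, host, and intervals are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Every DStream operation is declared inside the factory, so on recovery the
// restored streams get zeroTime set properly.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/app")
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.foreachRDD(rdd => println(rdd.count()))
  ssc
}
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)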
On Wed, Dec 9, 2015 at 3:32 PM, Renyi Xiong <renyixio...@gmail.com> wrote:
> following scala program throws same exception, I know people are running
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
On Wed, Dec 9, 2015 at 12:45 PM, Renyi Xiong <renyixio...@gmail.com> wrote:
> hi,
>
> I met following exceptio
Hi,
I tried to run the following 1.4.1 sample by putting a words.txt under
localdir:
bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
2 questions:
1. It does not pick up words.txt, because it's 'old' I guess. Is there any
option to let it be picked up? (see the sketch below)
2. I managed to put a 'new'
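Re question 1, something like this is what I'm hoping for (ssc assumed; I
believe textFileStream, which HdfsWordCount uses, wraps a more general
fileStream that takes a newFilesOnly flag):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newFilesOnly = false should make pre-existing files in localdir eligible.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "localdir", (path: Path) => true, newFilesOnly = false).map(_._2.toString)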
> are disposable, and there should not be any node-specific long-term state
> that you rely on unless you can recover that state on a different node.
>
> On Mon, Oct 5, 2015 at 3:03 PM, Renyi Xiong <renyixio...@gmail.com> wrote:
>
>> if RDDs from same DStream not g
If RDDs from the same DStream are not guaranteed to run on the same worker,
then the question becomes:
is it possible to specify an unlimited duration in ssc to have a continuous
stream (as opposed to a discretized one)?
Say we have a per-node streaming engine (with built-in checkpoint and recovery)
we'd like to
Hi Shiva,
Is DataFrame UDF implemented in SparkR yet? I could not find it at the URL
below:
https://github.com/hlin09/spark/tree/SparkR-streaming/R/pkg/R
Thanks,
Renyi.
thanks a lot, it works now after I set %HADOOP_HOME%
On Tue, Sep 29, 2015 at 1:22 PM, saurfang wrote:
> See
>
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode
> ,
>
version consistent with the one which was used to build Spark
> 1.4.0 ?
>
> Cheers
>
> On Mon, Sep 28, 2015 at 4:36 PM, Renyi Xiong <renyixio...@gmail.com>
> wrote:
>
>> I tried to run HdfsTest sample on windows spark-1.4.0
>>
>> bin\run-sample org.apa
I tried to run the HdfsTest sample on Windows with spark-1.4.0:
bin\run-example org.apache.spark.examples.HdfsTest
but got the exception below. Anybody have any idea what went wrong here?
15/09/28 16:33:56.565 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException
at
Hi,
I want to do a temporal join operation on a DStream across RDDs. My question
is: are RDDs from the same DStream always computed on the same worker (except
during failover)?
thanks,
Renyi.
AM, Reynold Xin <r...@databricks.com> wrote:
> > You should reach out to the speakers directly.
> >
> >
> > On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong <renyixio...@gmail.com>
> wrote:
> >>
> >> SparkR streaming is mentioned at about page 17 in
Can anybody help me understand why pyspark streaming uses a py4j callback to
execute Python code, while pyspark batch uses worker.py?
Regarding pyspark streaming: is the py4j callback only used for DStream, with
worker.py still used for RDD?
thanks,
Renyi.
> Yeah in addition to the downside of having 2 JVMs the command line
> arguments and SparkConf etc. will be set by spark-submit in the first
> JVM which won't be available in the second JVM.
>
> Shivaram
>
> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong <renyixio...@gmail.com>
> wrote:
> >> will be set by spark-submit in the first
> >> JVM which won't be available in the second JVM.
> >>
> >> Shivaram
> >>
> >> On Thu, Sep 10, 2015 at 5:18 PM, Renyi Xiong <renyixio...@gmail.com>
> >> wrote:
> >> > for 2nd case wh
Why did SparkR eventually choose an inter-process socket solution on the
driver side, instead of the in-process JNI approach shown in one of its docs
below (about page 20)?
https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf