Fellows,
I have a simple piece of code:
sc.parallelize(Array(1, 4, 3, 2), 2).sortBy(i => i).foreach(println)
This results in 2 jobs (sortBy, foreach) in Spark's application master ui.
I thought there was a one-to-one relationship between RDD actions and jobs. Here,
the only action is foreach, so there should be only one job.
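(For context, my understanding of where the second job comes from: sortBy builds a RangePartitioner, and computing its range boundaries requires sampling the input, which runs as its own job before the action. A minimal sketch:)
val rdd = sc.parallelize(Array(1, 4, 3, 2), 2)
val sorted = rdd.sortBy(i => i)  // job 1: samples the input to compute RangePartitioner boundaries
sorted.foreach(println)          // job 2: the foreach action itself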
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My patch
has been accepted and, this enhancement is scheduled for 1.3.0.
This lets you specify initialRDD for updateStateByKey operation. Let me
know if you need any information.
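A minimal sketch of the 1.3.0 API (ssc and words are hypothetical names; the overload takes the update function, a partitioner, and the initial RDD):
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._

// Hypothetical prior state, e.g. counts saved by an earlier run.
val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 1), ("world", 1)))
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
val stateDStream = words.map(w => (w, 1)).updateStateByKey(
  updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), initialRDD)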
On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer
It is a Streaming application, so how/when do you plan to access the
accumulator on the driver?
On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
thanks for your mail!
On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
That seems
Hi Adam,
I have the following Scala actor-based code to do a graceful shutdown:
import scala.actors.{Actor, TIMEOUT}

// SHUTDOWN is an application-defined message (e.g. a case object).
class TimerActor(val timeout: Long, val who: Actor) extends Actor {
  def act {
    reactWithin(timeout) {
      case TIMEOUT => who ! SHUTDOWN
    }
  }
}
class SSCReactor (val ssc :
Entire file in a window.
On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
In my application I am doing something like new
StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I
get some unknown exceptions when I copy a file of about 800 MB
Great, it worked.
I don't have an answer as to what is special about SPARK_CLASSPATH vs --jars; I
just found the working setting through trial and error.
- Original Message -
From: Fengyun RAO raofeng...@gmail.com
To: Soumitra Kumar kumar.soumi...@gmail.com
Cc: user@spark.apache.org, u
Hello,
I am debugging my code to find out what else to cache.
Following is a line in log:
14/10/16 12:00:01 INFO TransformedDStream: Persisting RDD 6 for time
141348600 ms to StorageLevel(true, true, false, false, 1) at time
141348600 ms
Is there a way to name a DStream? RDD has a setName method.
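(As far as I know there is no DStream-level name, but you can set a name on the RDDs a DStream produces, which shows up in the web UI's storage tab. A sketch, with a hypothetical stream and name:)
val named = stream.transform { rdd =>
  rdd.setName("parsed-events")  // hypothetical name; visible in the storage UI
  rdd
}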
Hello,
Is there a way to print the dependency graph of complete program or RDD/DStream
as a DOT file? It would be very helpful to have such a thing.
Thanks,
-Soumitra.
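(I am not aware of a built-in DOT export; the closest thing I know of is RDD.toDebugString, which prints the lineage as indented text. A sketch:)
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)  // prints the RDD dependency (lineage) tree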
I am writing to HBase. Following are my options:
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar
spark-submit \
--jars
to new schema.
- Original Message -
From: Buntu Dev buntu...@gmail.com
To: Soumitra Kumar kumar.soumi...@gmail.com
Cc: u...@spark.incubator.apache.org
Sent: Tuesday, October 7, 2014 10:18:16 AM
Subject: Re: Kafka-HDFS to store as Parquet format
Thanks for the info Soumitra... it's a good
I thought I did a good job ;-)
OK, so what is the best way to initialize the updateStateByKey operation? I have
counts from a previous spark-submit run, and want to load them in the next
spark-submit job.
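(One approach, sketched under my own assumptions with hypothetical names and paths: save the counts as object files at the end of a run, then reload them in the next run, e.g. as the initialRDD for the SPARK-3660 overload mentioned above.)
// In the current run: persist the latest counts per batch time.
wordCounts.foreachRDD { (rdd, time) =>
  rdd.saveAsObjectFile(s"hdfs:///counts/${time.milliseconds}")
}
// In the next run: reload the last saved batch as the starting state.
val previousCounts =
  ssc.sparkContext.objectFile[(String, Int)]("hdfs:///counts/<last-batch-time>")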
- Original Message -
From: Soumitra Kumar kumar.soumi...@gmail.com
To: spark users user
onStart should be non-blocking. You could create a thread in onStart
instead.
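(A minimal sketch of that pattern; the class name and receive loop are placeholders:)
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MongoReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() {
    // Return immediately; run the blocking receive loop on its own thread.
    new Thread("mongo-receiver") { override def run() { receive() } }.start()
  }
  def onStop() { }  // signal the loop to stop if needed
  private def receive() {
    while (!isStopped) {
      // Blocking reads would go here; hand each record to Spark:
      store("record")  // placeholder record
    }
  }
}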
- Original Message -
From: t1ny wbr...@gmail.com
To: u...@spark.incubator.apache.org
Sent: Friday, September 19, 2014 1:26:42 AM
Subject: Re: Spark Streaming and ReactiveMongo
Here's what we've tried so
I successfully did this once.
Map the RDD to an RDD[(ImmutableBytesWritable, KeyValue)], then:
val conf = HBaseConfiguration.create()
val job = new Job(conf, "CEF2HFile")
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
val table = new
Hmm, no response to this thread!
Adding to it: please share experiences of building an enterprise-grade product
based on Spark Streaming.
I am exploring Spark Streaming for enterprise software and am cautiously
optimistic about it. I see huge potential to improve the debuggability of Spark.
https://issues.apache.org/jira/browse/SPARK-2316
As best I understand and have been told, this does not affect data
integrity but may cause unnecessary recomputes.
Hope this helps,
Tim
On Wed, Sep 17, 2014 at 8:30 AM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
Hmm, no response to this thread!
Adding
I had looked at that.
If I have a set of saved word counts from a previous run, and want to load
them in the next run, what is the best way to do it?
I am thinking of hacking the Spark code to take an initial RDD in
StateDStream, and use that for the first batch.
On Fri, Sep 12, 2014 at 11:04
Thanks for the pointers. I meant a previous run of spark-submit.
For 1: This would add a bit more computation to every batch.
For 2: It's a good idea, but it may be inefficient to retrieve each value.
In general, for a generic state machine, the initialization and input
sequence is critical for
I have the following code:
stream.foreachRDD { rdd =>
  if (rdd.take(1).size == 1) {  // skip empty RDDs
    rdd.foreachPartition { iterator =>
      val conn = initDbConnection()  // one connection per partition
      iterator.foreach { record =>
        // write record to db
      }
      conn.close()
    }
  }
}
Yes, that is an option.
I started with a function of batch time and index to generate the id as a Long. This
may be faster than generating a UUID, with the added benefit of sorting based on time.
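(A sketch of that scheme as I understand it; the shift assumes fewer than 2^20 records per batch, and keeps ids sortable by batch time:)
val withIds = stream.transform { (rdd, time) =>
  rdd.zipWithIndex.map { case (record, idx) =>
    // High bits: batch time in ms; low 20 bits: index within the batch.
    ((time.milliseconds << 20) | idx, record)
  }
}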
- Original Message -
From: Tathagata Das tathagata.das1...@gmail.com
To: Soumitra Kumar kumar.soumi
Hello,
If I do:
dstream.transform { rdd =>
  rdd.zipWithIndex.map { ... }
}
Is the index guaranteed to be unique across all RDDs here?
Thanks,
-Soumitra.
So, I guess zipWithUniqueId will be similar.
Is there a way to get a unique index?
On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote:
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote
of the RDDs would contain more than 1 billion
records. Then you can use
rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid)
Just a hack ..
On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
So, I guess zipWithUniqueId will be similar.
Is there a way
Hello,
I want to count the number of elements in a DStream, like RDD.count().
Since there is no such method in DStream, I thought of counting each batch
and adding the counts to an accumulator.
How do I count the number of elements in a DStream?
How do I create a shared variable in
variable will reside in the driver and will keep being
updated after every batch.
TD
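(A sketch of that, with hypothetical names: the accumulator is created once on the driver and bumped by each batch.)
val totalCount = ssc.sparkContext.accumulator(0L, "total elements")
stream.foreachRDD { rdd =>
  totalCount += rdd.count()
  println(s"elements so far: ${totalCount.value}")  // driver-side read
}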
On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
Hello,
I want to count the number of elements in the DStream, like RDD.count() .
Since there is no such method in DStream, I
running a
count every time on the full dataset) then caching is not going to help you.
On Tue, Feb 25, 2014 at 4:59 PM, Soumitra Kumar
kumar.soumi...@gmail.comwrote:
Thanks Nick.
How do I figure out if the RDD fits in memory?
On Tue, Feb 25, 2014 at 1:04 AM, Nick Pentreath nick.pentre
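(On the question of whether an RDD fits in memory: the direct check is to persist it and look at the web UI's Storage tab. A rough programmatic sketch, using Spark's SizeEstimator on a sample and extrapolating; boxing can make this an overestimate:)
import org.apache.spark.util.SizeEstimator

val sampleSize = 1000
val sample = rdd.take(sampleSize)
val bytesPerRecord = SizeEstimator.estimate(sample) / sampleSize.toDouble
println(s"estimated total: ${bytesPerRecord * rdd.count() / 1e9} GB")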