Re: Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
To add more info, this project is on an older version of Spark, 1.5.0, and on an older version of Kafka which is 0.8.2.1 (2.10-0.8.2.1). On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote: > Hi, > > I've got 3 questions/issues regarding checkpointing, was hoping someone > cou

Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
Hi, I've got 3 questions/issues regarding checkpointing, was hoping someone could help shed some light on this. We've got a Spark Streaming consumer consuming data from a Kafka topic; works fine generally until I switch it to the checkpointing mode by calling the 'checkpoint' method on the
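
Much of the confusion in this thread comes from there being two distinct 'checkpoint' calls. A minimal sketch, assuming the Spark 1.5.x / Kafka 0.8 direct-stream Java API and existing sparkConf, kafkaParams, topics and checkpointDir values (all placeholders):

    JavaStreamingContext jssc =
        new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis));
    // context-level: enables metadata checkpointing for driver recovery
    jssc.checkpoint(checkpointDir);

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);
    // DStream-level: sets how often this stream's RDDs are persisted to the checkpoint dir
    stream.checkpoint(Durations.milliseconds(checkpointIntervalMillis));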

Losing system properties on executor side, if context is checkpointed

2019-02-19 Thread Dmitry Goldenberg
Hi all, I'm seeing an odd behavior where if I switch the context from regular to checkpointed, the system properties are no longer automatically carried over into the worker / executors and turn out to be null there. This is in Java, using spark-streaming_2.10, version 1.5.0. I'm placing
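
A hedged sketch of the usual workarounds when System properties set on the driver do not show up in executor tasks: forward them as executor JVM options, or carry them in the SparkConf itself. The property names below are hypothetical:

    SparkConf conf = new SparkConf()
        .setAppName("checkpointed-consumer")
        // forwarded to each executor JVM as a -D system property
        .set("spark.executor.extraJavaOptions",
             "-Dmy.property=" + System.getProperty("my.property"))
        // "spark."-prefixed keys travel with the conf and can be read back on the executors
        .set("spark.myapp.myProperty", System.getProperty("my.property"));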

How to fix error "Failed to get records for..." after polling for 120000

2017-04-18 Thread Dmitry Goldenberg
Hi, I was wondering if folks have some ideas, recommendation for how to fix this error (full stack trace included below). We're on Kafka 0.10.0.0 and spark_streaming_2.11 v. 2.0.0. We've tried a few things as suggested in these sources: -
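
For reference, these are the knobs most commonly tuned for this timeout with the Kafka 0.10 direct stream; the broker list and all values below are illustrative examples, not recommendations:

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092");   // placeholder broker list
    kafkaParams.put("max.poll.records", "500");             // fetch fewer records per poll()
    kafkaParams.put("session.timeout.ms", "30000");
    kafkaParams.put("heartbeat.interval.ms", "10000");
    kafkaParams.put("request.timeout.ms", "40000");         // keep > session.timeout.ms

    SparkConf conf = new SparkConf()
        // how long the cached consumer in spark-streaming-kafka-0-10 waits in poll()
        // (the 120000 in the error is the default, inherited from spark.network.timeout)
        .set("spark.streaming.kafka.consumer.poll.ms", "180000")
        // cap the per-batch load so each poll has less to catch up on
        .set("spark.streaming.kafka.maxRatePerPartition", "1000");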

NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-03 Thread Dmitry Goldenberg
Hi, Any reason why we might be getting this error? The code seems to work fine in the non-distributed mode but the same code when run from a Spark job is not able to get to Elastic. Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11 Elastic version: 2.3.1 I've verified the Elastic hosts and
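
A sketch of the client setup that is usually behind this error when it only appears on executors, assuming the Elasticsearch 2.x TransportClient is being used; the host, port and cluster name are placeholders. Common culprits are a wrong cluster.name, pointing at the HTTP port 9200 instead of the transport port 9300, or executors that cannot reach the ES nodes at all:

    Settings settings = Settings.settingsBuilder()
        .put("cluster.name", "my-es-cluster")    // must match the target cluster exactly
        .build();
    TransportClient client = TransportClient.builder().settings(settings).build()
        .addTransportAddress(
            new InetSocketTransportAddress(InetAddress.getByName("es-host-1"), 9300));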

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
er to run with auto.create set to false (because it makes sure the > topic is actually set up the way you want, reduces the likelihood of > spurious topics being created, etc). > > > > On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg > <dgoldenberg...@gmail.com> wrote: &

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
loaded into the cluster memory as needed; paged in/out on-demand in an LRU fashion. From this perspective, it's not yet clear to me what the best option(s) would be. Any thoughts / recommendations would be appreciated. On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.c

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache Ignite. Their In-Memory File System ( https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com > wrote: > OK so it l

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tach

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke <jornfra...@gmail.com> wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
nthan.com> wrote: > > One option could be to store them as blobs in a cache like Redis and then > read + broadcast them from the driver. Or you could store them in HDFS and > read + broadcast from the driver. > > Regards > Sab > >> On Tue, Jan 12, 2016 at 1

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
. Spark is also able to > interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read > input files from Tachyon and write output files to Tachyon. > > I hope that helps, > Gene > > On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.c

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such as e.g. a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices
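
As a point of comparison for the options discussed in this thread, the simplest in-cluster approach is a broadcast variable: load the resource once on the driver and let Spark ship it read-only to each executor. A minimal sketch, where loadDictionary, the path, jsc and input are placeholders:

    Map<String, String> dictionary = loadDictionary("/path/to/dictionary.txt");  // driver side
    final Broadcast<Map<String, String>> dictBc = jsc.broadcast(dictionary);

    JavaRDD<String> enriched = input.map(new Function<String, String>() {
        @Override
        public String call(String term) {
            Map<String, String> dict = dictBc.value();   // materialized once per executor
            String hit = dict.get(term);
            return hit != null ? hit : term;
        }
    });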

Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution. On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com > wrote: > We're seeing a bunch of .snapshot files being created under > /home/spark/Snapshots, such as the following

What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under /home/spark/Snapshots, such as the following for example: CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot SparkSubmit-2015-08-31-shutdown-1.snapshot

Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out. On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote: > See the tail of this: > https://bugzilla.redhat.com/show_bug.cgi?id=1005811 > > FYI > > > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>

How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory? We've got spark.local.dir specified in the spark-defaults.conf file to point at another directory. But we're seeing many of these snappy-unknown-***-libsnappyjava.so files being written to /tmp still. Is there a config setting or
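
A hedged sketch of one common workaround: snappy-java generally extracts its native library into java.io.tmpdir regardless of spark.local.dir, so those JVM properties usually have to be redirected as well, for both the driver and the executors. Paths are placeholders:

    SparkConf conf = new SparkConf()
        .set("spark.local.dir", "/data/spark-local")
        .set("spark.executor.extraJavaOptions",
             "-Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp")
        .set("spark.driver.extraJavaOptions",
             "-Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp");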

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
f protobuf jar is loaded ahead of hbase-protocol.jar, things start to get > interesting ... > > On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Ted, I think I have tried these settings with the hbase protocol jar, to

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
ith fewer partitions/replicas, see if it works. > > -adrian > > From: Dmitry Goldenberg > Date: Tuesday, September 29, 2015 at 3:37 PM > To: Adrian Tanase > Cc: "user@spark.apache.org" > Subject: Re: Kafka error "partitions don't have a leader" / > LeaderNo

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
> -adrian > > From: Dmitry Goldenberg > Date: Tuesday, September 29, 2015 at 3:26 PM > To: "user@spark.apache.org" > Subject: Kafka error "partitions don't have a leader" / > LeaderNotAvailableException > > I apologize for posting this Kafka related issue

Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have gotten no responses on the Kafka list and was hoping someone on this list could shed some light on the below. --- We're running into

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
, you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers" --

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
you > may want to look through the kafka jira, e.g. > > https://issues.apache.org/jira/browse/KAFKA-899 > > > On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> "more partitions and replicas than available brokers"

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
release of Spark > command line for running Spark job > > Cheers > > On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> We're seeing this occasionally. Granted, this was caused by a wrinkle in >> the Solr schema but

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in the Solr schema but this bubbled up all the way in Spark and caused job failures. I just checked and SolrException class is actually in the consumer job jar we use. Is there any reason why Spark cannot find the
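
For reference, the settings Ted Yu suggests later in this thread, expressed as a SparkConf sketch; they make the classes in the job jar win over Spark's own copies on both the driver and the executors:

    SparkConf conf = new SparkConf()
        .set("spark.driver.userClassPathFirst", "true")
        .set("spark.executor.userClassPathFirst", "true");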

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these would possibly cause Spark to find SolrException. Whether the driver or executor class path is first, should it not matter, if the class is in the consumer job jar? On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg <dgoldenberg...@gmail.

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
wrote: > Have you tried the following ? > --conf spark.driver.userClassPathFirst=true --conf spark.executor. > userClassPathFirst=true > > On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Release of Spar

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
correct numbers before killing any >> job) >> >> Thanks >> Best Regards >> >> On Mon, Sep 14, 2015 at 10:40 PM, Dmitry Goldenberg < >> dgoldenberg...@gmail.com> wrote: >> >>> Is there a way in Spark to automatically terminate

A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones that appear to be hanging? In other words, is there a timeout for processing of a given RDD? In the Spark GUI, I see the "kill" function for a given Stage under "Details for Job <...>". Is there something in Spark that
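
Spark has no built-in per-stage timeout in this version; a hedged sketch of the usual approximation is to tag the work with a job group and cancel it from a watchdog thread on the driver. The group id, description and timeout are placeholders:

    import java.util.concurrent.*;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class LaggardWatchdog {
        public static <T> T runWithTimeout(final JavaSparkContext jsc, final String groupId,
                                           long timeoutMs, Callable<T> sparkWork) throws Exception {
            jsc.setJobGroup(groupId, "timed Spark work", true);  // true => interrupt tasks on cancel
            ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
            watchdog.schedule(new Runnable() {
                @Override
                public void run() {
                    jsc.cancelJobGroup(groupId);                 // cancels all jobs in the group
                }
            }, timeoutMs, TimeUnit.MILLISECONDS);
            try {
                return sparkWork.call();
            } finally {
                watchdog.shutdownNow();
                jsc.clearJobGroup();
            }
        }
    }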

A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly, is there a way to do it programmatically based on a configurable timeout value? Thanks.

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts Is that true? If so, why? From my testing, checkpoints appear to be working fine, we get the data we've missed between the time the consumer went down and the time we brought it back up. >> If I cannot make checkpoints between code

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
erval at the time of > recovery? Trying to understand your usecase. > > > On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> >> when you use getOrCreate, and there exists a valid checkpoint, it will >> always return the

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
inting to override its checkpoint duration millis, is there? Is the default there max(batchdurationmillis, 10seconds)? Is there a way to override this? Thanks. On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com> wrote: > > > See inline. > > On Tue, Sep

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
M, Tathagata Das <t...@databricks.com> wrote: > Why are you checkpointing the direct kafka stream? It serves no purpose. > > TD > > On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I just disabled checkpointing in

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
> > > On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> That is good to know. However, that doesn't change the problem I'm >> seeing. Which is that, even with that piece of code commented out >> (stream.checkpoint()), th

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
; in each batch > > (which checkpoint is enabled using > streamingContext.checkpoint(checkpointDir)) and can recover from failure by > reading the exact same data back from Kafka. > > > TD > > On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg < > dgoldenberg...@gma

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
eninger <c...@koeninger.org> wrote: > Well, I'm not sure why you're checkpointing messages. > > I'd also put in some logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenber

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
logging to see what values are actually being read > out of your params object for the various settings. > > > On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I've stopped the jobs, the workers, and the master. Deleted the con

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
deleting or moving the contents of the checkpoint directory > and restarting the job? > > On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Sorry, more relevant code below: >> >> SparkConf sparkConf = createSparkConf(a

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
15 at 10:45 PM, Tathagata Das <t...@databricks.com> wrote: > Are you accidentally recovering from checkpoint files which has 10 second > as the batch interval? > > > On Thu, Sep 3, 2015 at 7:34 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> I'm s

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
> On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg < > dgoldenberg...@gmail.com> wrote: > >> Tathagata, >> >> Checkpointing is turned on but we were not recovering. I'm looking at the >> logs now, feeding fresh content hours after the restart. Here's a snippet: >&

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
; >> On Fri, Sep 4, 2015 at 3:50 PM, Tathagata Das <t...@databricks.com> >> wrote: >> >>> Could you see what the streaming tab in the Spark UI says? It should >>> show the underlying batch duration of the StreamingContext, the details of >>> when the b

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
>() { @Override public Void call(JavaRDD rdd) throws Exception { ProcessPartitionFunction func = new ProcessPartitionFunction(params); rdd.foreachPartition(func); return null; } }); return jssc; } On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg &

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
I'm seeing an oddity where I initially set the batchdurationmillis to 1 second and it works fine: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.milliseconds(batchDurationMillis)); Then I tried changing the value to 10 seconds. The change didn't seem to take. I've
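
A minimal sketch of what Tathagata points out later in this thread: with getOrCreate, an existing checkpoint restores the old StreamingContext, including its original batch duration, and the factory (with the new Durations value) only runs on a fresh start, i.e. after the checkpoint directory has been cleared. checkpointDir and sparkConf are placeholders:

    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir,
        new JavaStreamingContextFactory() {
            @Override
            public JavaStreamingContext create() {
                // reached only when no valid checkpoint exists under checkpointDir
                JavaStreamingContext ctx =
                    new JavaStreamingContext(sparkConf, Durations.milliseconds(10000));
                ctx.checkpoint(checkpointDir);
                return ctx;
            }
        });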

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
. It sort of seems wrong though since https://spark.apache.org/docs/latest/streaming-programming-guide.html suggests it was intended to be a multiple of the batch interval. The slide duration wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
could terminate as the last batch is being processed... On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger c...@koeninger.org wrote: You'll resume and re-process the rdd that didnt finish On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Our additional question

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
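
One clarification worth sketching here: spark.executor.memory only sizes the executors, while an OutOfMemoryError thrown in the "main" thread during checkpoint recovery is a driver-heap problem, and the driver heap has to be set before its JVM starts. Values are examples only:

    SparkConf conf = new SparkConf()
        .set("spark.executor.memory", "2g");   // executors only
    // Driver heap: pass --driver-memory 2g to spark-submit (or set spark.driver.memory in
    // spark-defaults.conf); setting it programmatically here is too late for the driver JVM
    // in client mode.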

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Cody

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote: You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
: I wonder during recovery from a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime(). freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
https://spark.apache.org/docs/latest/streaming-programming-guide.html suggests it was intended to be a multiple of the batch interval. The slide duration wouldn't always be relevant anyway. On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented

Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil. We're trying the conf.setExecutorEnv() approach since we've already got environment variables set. For system properties we'd go the conf.set(spark.) route. We were concerned that doing the below type of thing did not work, which this blog post seems to confirm (
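
A minimal sketch of the two approaches mentioned above; the environment variable and property names are hypothetical:

    SparkConf conf = new SparkConf()
        .setAppName("my-app")
        // environment variable made visible to executor processes
        .setExecutorEnv("MY_SERVICE_URL", System.getenv("MY_SERVICE_URL"))
        // "spark."-prefixed settings travel with the conf to the executors
        .set("spark.myapp.cacheSize", System.getProperty("myapp.cacheSize", "1000"));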

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard, That's exactly the strategy I've been trying, which is a wrapper singleton class. But I was seeing the inner object being created multiple times. I wonder if the problem has to do with the way I'm processing the RDD's. I'm using JavaDStream to stream data (from Kafka). Then I'm

Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like. So with 4 workers running on the box, we're creating one singleton per worker process/jvm, which seems OK. Still curious about foreachPartition vs. foreachRDD though... On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher
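
A sketch of the wrapper-singleton pattern being described: the instance is created lazily, once per executor JVM, which is why several worker processes on one box yield several instances. ExpensiveClient and its construction are placeholders:

    import java.util.Map;

    public final class ClientHolder {
        private static volatile ExpensiveClient instance;

        private ClientHolder() {}

        public static ExpensiveClient get(Map<String, String> params) {
            if (instance == null) {
                synchronized (ClientHolder.class) {
                    if (instance == null) {
                        // runs once per executor JVM, not once per task or per RDD
                        instance = new ExpensiveClient(params);
                    }
                }
            }
            return instance;
        }
    }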

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. Sean, different operations as they are, they can certainly be used on the same data set. In that sense, they are alternatives. Code can be written using one
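
A minimal sketch of how the two calls compose in practice with the Spark 1.x Java API; stream is a JavaDStream<String>, and ClientHolder/ExpensiveClient/params are the hypothetical per-JVM singleton pieces from the related singleton thread:

    stream.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) {
            // runs on the driver, once per batch
            rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
                @Override
                public void call(Iterator<String> partition) {
                    // runs on an executor, once per partition of this batch's RDD
                    ExpensiveClient client = ClientHolder.get(params);
                    while (partition.hasNext()) {
                        client.send(partition.next());
                    }
                }
            });
            return null;
        }
    });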

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
the processing. On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: These are quite different operations. One operates on RDDs in DStream and one operates on partitions of an RDD. They are not alternatives. Sean, different operations as they are, they can certainly

Re: foreachRDD vs. foreachPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The good boy comment wasn't from me :) I was the one asking for help. On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote: Sean already answered your question. foreachRDD and foreachPartition are completely different, there's nothing fuzzy or insufficient

Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument a callback into Coda Hale or the Spark consumers from the metrics sink? If the sink performs some steps once it has received the metrics, I'd like to be able to make the consumers aware of that via some sort of a

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger; Andrew Or; Gerard

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
? On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger c...@koeninger.org wrote: Depends on what you're reusing multiple times (if anything). Read http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg dgoldenberg

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
unless you want it to. If you want it, just call cache. On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: set the storage policy for the DStream RDDs to MEMORY AND DISK - it appears the storage level can be specified in the createStream methods

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Dmitry's initial question – you can load large data sets as a Batch (Static) RDD from any Spark Streaming App and then join DStream RDDs against them to emulate "lookups"; you can also try the "Lookup RDD" – there is a GitHub project *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang! On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com wrote: Hi there, I would recommend checking out https://github.com/spark-jobserver/spark-jobserver which I think gives the functionality you are looking for. I haven't tested it

Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
Shixiong, Thanks, interesting point. So if we want to only process one batch then terminate the consumer, what's the best way to achieve that? Presumably the listener could set a flag on the driver notifying it that it can terminate. But the driver is not in a loop, it's basically blocked in

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
Date:2015/05/28 13:22 (GMT+00:00) To: Dmitry Goldenberg Cc: Gerard Maas ,spark users Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? You can always spin new boxes in the background and bring them into the cluster fold when fully

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
” resources / nodes which allows you to double, triple etc in the background/parallel the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
So Evo, option b is to singleton the Param, as in your modified snippet, i.e. instantiate it once per RDD. But if I understand correctly, the a) option is broadcast, meaning instantiation happens in the Driver once, before any transformations and actions, correct? That's where my serialization costs

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger; Andrew

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
, spark streaming (spark) will NOT resort to disk – and of course resorting to disk from time to time (ie when there is no free RAM ) and taking a performance hit from that, BUT only until there is no free RAM *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Thursday, May 28

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
the resources of the currently running cluster I was thinking more about the scenario where you have e.g. 100 boxes and want to / can add e.g. 20 more *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Wednesday, June 3, 2015 4:46 PM *To:* Evo Eftimov *Cc:* Cody Koeninger

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis. - Dmitry On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: I think you can use SPM - http://sematext.com/spm - it will give you all Spark and all Kafka metrics, including offsets broken down by topic, etc. out of the box.

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
to time (ie when there is no free RAM ) and taking a performance hit from that, BUT only until there is no free RAM *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com] *Sent:* Thursday, May 28, 2015 2:34 PM *To:* Evo Eftimov *Cc:* Gerard Maas; spark users *Subject:* Re: FW: Re

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thank you, Gerard. We're looking at the receiver-less setup with Kafka Spark streaming so I'm not sure how to apply your comments to that case (not that we have to use receiver-less but it seems to offer some advantages over the receiver-based). As far as the number of Kafka receivers is fixed

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo. Per the last part of your comment, it sounds like we will need to implement a job manager which will be in control of starting the jobs, monitoring the status of the Kafka topic(s), shutting jobs down and marking them as ones to relaunch, scaling the cluster up/down by

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Eftimov Date:2015/05/28 13:22 (GMT+00:00) To: Dmitry Goldenberg Cc: Gerard Maas ,spark users Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics? You can always spin new boxes in the background and bring them into the cluster fold when

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted. Could you comment on my other question http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html as well? Basically, I'm trying to get a handle on a good

Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario when your

Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're doing this: https://www.rabbitmq.com/mqtt.html. On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I found two examples Java version

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Sean, How does this model actually work? Let's say we want to run one job as N threads executing one particular task, e.g. streaming data out of Kafka into a search engine. How do we configure our Spark job execution? Right now, I'm seeing this job running as a single thread. And it's quite a

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
RDD is determined by the block interval and the batch interval. If you have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded
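
A sketch tying the two knobs from Sean's explanation together (receiver-based streaming; the values are examples):

    SparkConf conf = new SparkConf()
        .setAppName("local-parallelism")
        .setMaster("local[4]")                         // 4 worker threads in the local JVM
        .set("spark.streaming.blockInterval", "1s");   // applies to receiver-based streams

    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
    // 10s batch / 1s block interval => ~10 partitions per batch RDD, so up to
    // min(10, 4) tasks run concurrently here; a direct Kafka stream instead gets
    // one partition per Kafka partition.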

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
then. If you run code on your driver, it's not distributed. If you run Spark over an RDD with 1 partition, only one task works on it. On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Sean, How does this model actually work? Let's say we want to run one job as N

Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
have a batch interval of 10s and block interval of 1s you'll get 10 partitions of data in the RDD. On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Understood. We'll use the multi-threaded code we already have.. How are these execution slots filled up? I

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just KafkaRDD with starting offset being 0 and finish offset being a very large number... Sent from my iPhone On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote: I guess what you mean is not streaming.

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is, when you read messages in a topic, the messages are peeked, not polled, so there'll be no "when the queue is empty" condition, as I understand it. So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRanges. Each OffsetRange is characterized by topic,
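
A hedged sketch of that KafkaUtils.createRDD approach (Spark 1.x / Kafka 0.8 direct API, assuming the usual spark-streaming-kafka imports and an existing JavaSparkContext jsc). The topic, broker list, partition count and offsets are placeholders; in practice the until-offsets should be the real log-end offsets fetched from Kafka rather than an arbitrary large number, since the resulting KafkaRDD expects to reach the exact ending offset of each range:

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");

    long latestP0 = 123456L;   // placeholder: look up the real log-end offset of partition 0
    long latestP1 = 123456L;   // placeholder: look up the real log-end offset of partition 1
    OffsetRange[] ranges = new OffsetRange[] {
        OffsetRange.create("my-topic", 0, 0L, latestP0),
        OffsetRange.create("my-topic", 1, 0L, latestP1)
    };

    JavaPairRDD<String, String> wholeTopic = KafkaUtils.createRDD(
        jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, ranges);
    // process wholeTopic once; this is a plain batch job, so it terminates on its own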

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
. You can then kill the job after the first batch. It's possible you may be able to kill the job from a StreamingListener.onBatchCompleted, but I've never tried and don't know what the consequences may be. On Wed, Apr 29, 2015 at 8:52 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Ted. Well, so far even there I'm only seeing my post and not, for example, your response. On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com wrote: It seems that those archives are not necessarily easy to find stuff in. Is there a search engine on top of them? so as to find e.g. your own posts easily? On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas nicholas.cham

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
, ConnectionFactory is an interface defined inside JdbcRDD, not scala Function0 On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: That's exactly what I was doing. However, I ran into runtime issues with doing that. For instance, I had a public class DbConnection

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
. Of course, a numeric primary key is going to be the most efficient way to do that. On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Yup, I did see that. Good point though, Cody. The mismatch was happening for me when I was trying to get the 'new JdbcRDD

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a compile error if I try and use that approach: constructor JdbcRDD in class JdbcRDD<T> cannot be applied to given types. Not to mention that JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last argument).

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
/apache/spark/JavaJdbcRDDSuite.java#L90 is calling a static method JdbcRDD.create, not new JdbcRDD. Is that what you tried doing? On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thanks, Cody. Yes, I originally started off by looking at that but I get
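
For reference, a minimal sketch of the JdbcRDD.create route discussed here, which sidesteps the Scala ClassTag/Function0 machinery by going through the Java API (java.sql and org.apache.spark.rdd.JdbcRDD imports assumed; the JDBC URL, query and bounds are placeholders, and the query needs two '?' placeholders for the partition bounds on a numeric key):

    JavaRDD<String> names = JdbcRDD.create(
        jsc,                                        // a JavaSparkContext, not a SparkContext
        new JdbcRDD.ConnectionFactory() {
            @Override
            public Connection getConnection() throws Exception {
                return DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
            }
        },
        "SELECT NAME FROM ITEMS WHERE ID >= ? AND ID <= ?",
        1L, 1000000L,                               // lower/upper bound of the key range
        10,                                         // number of partitions
        new Function<ResultSet, String>() {
            @Override
            public String call(ResultSet rs) throws Exception {
                return rs.getString(1);
            }
        });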

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means, but I've tried setting spark.files.userClassPathFirst to true, in $SPARK_HOME/conf/spark-defaults.conf and also in the SparkConf programmatically; it appears to be ignored. The solution was to follow Emre's recommendation and downgrade the selected Solrj

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
...@koeninger.org wrote: Is sc there a SparkContext or a JavaSparkContext? The compilation error seems to indicate the former, but JdbcRDD.create expects the latter On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I have tried that as well, I get a compile

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest? Solrj has to stay with its selected version. I could try and rebuild Spark with the latest httpclient but I've no idea what effects that may

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
that version 4.0.0 of SolrJ had all the functionality I needed. -- Emre Sevinç http://www.bigindustries.be/ On Wed, Feb 18, 2015 at 4:39 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this. On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote: On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would that not collide

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version set to 4.3.1 which looks to be the version chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might work. On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi Did you

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
the org.apache.spark.api.java.function.Function interface and pass an instance of that to JdbcRDD.create ? On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Cody, you were right, I had a copy and paste snag where I ended up with a vanilla SparkContext rather than a Java one. I also had
