Re: Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
To add more info: this project is on an older version of Spark, 1.5.0, and
an older version of Kafka, 0.8.2.1 (2.10-0.8.2.1).

On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg wrote:

> Hi,
>
> I've got 3 questions/issues regarding checkpointing, was hoping someone
> could help shed some light on this.
>
> We've got a Spark Streaming consumer consuming data from a Kafka topic;
> works fine generally until I switch it to the checkpointing mode by calling
> the 'checkpoint' method on the context and pointing the checkpointing at a
> directory in HDFS.
>
> I can see that files get written to that directory however I don't see new
> Kafka content being processed.
>
> *Question 1.* Is it possible that the checkpointed consumer is off base
> in its understanding of where the offsets are on the topic and how could I
> troubleshoot that?  Is it possible that some "confusion" happens if a
> consumer is switched back and forth between checkpointed and not? How could
> we tell?
>
> *Question 2.* About spark.streaming.receiver.writeAheadLog.enable. By
> default this is false. "All the input data received through receivers
> will be saved to write ahead logs that will allow it to be recovered after
> driver failures."  So if we don't set this to true, what *will* get saved
> into checkpointing and what data *will* be recovered upon the driver
> restarting?
>
> *Question 3.* We want the RDD's to be treated as successfully processed
> only once we have done all the necessary transformations and actions on the
> data.  By default, will the Spark Streaming checkpointing simply mark the
> topic offsets as having been processed once the data has been received by
> Spark?  Or, once the data has been processed by the driver + the workers
> successfully?  If the former, how can we configure checkpointing to do the
> latter?
>
> Thanks,
> - Dmitry
>


Issues with Spark Streaming checkpointing of Kafka topic content

2019-04-02 Thread Dmitry Goldenberg
Hi,

I've got 3 questions/issues regarding checkpointing and was hoping someone
could help shed some light on this.

We've got a Spark Streaming consumer consuming data from a Kafka topic; it
generally works fine until I switch it to checkpointing mode by calling
the 'checkpoint' method on the context and pointing it at a directory in
HDFS.

I can see that files get written to that directory; however, I don't see
new Kafka content being processed.

*Question 1.* Is it possible that the checkpointed consumer is off base in
its understanding of where the offsets are on the topic, and how could I
troubleshoot that?  Is it possible that some "confusion" arises if a
consumer is switched back and forth between checkpointed and
non-checkpointed modes? How could we tell?

*Question 2.* About spark.streaming.receiver.writeAheadLog.enable. By
default this is false. "All the input data received through receivers will
be saved to write ahead logs that will allow it to be recovered after
driver failures."  So if we don't set this to true, what *will* get saved
into checkpointing and what data *will* be recovered upon the driver
restarting?

*Question 3.* We want the RDD's to be treated as successfully processed
only once we have done all the necessary transformations and actions on the
data.  By default, will the Spark Streaming checkpointing simply mark the
topic offsets as having been processed once the data has been received by
Spark?  Or, once the data has been processed by the driver + the workers
successfully?  If the former, how can we configure checkpointing to do the
latter?
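
For what it's worth, the pattern we're considering for this (assuming the
direct, receiver-less stream from spark-streaming-kafka) is to capture the
offset ranges of each batch RDD and only record them once all of our
processing has completed. This is a rough sketch; 'processPartitionFunction'
and 'offsetStore' are hypothetical pieces of our own code, not Spark APIs:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

// 'directStream' is the JavaPairInputDStream<String, String> returned by
// KafkaUtils.createDirectStream(...); 'processPartitionFunction' and
// 'offsetStore' are our own (hypothetical) components.
directStream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
  @Override
  public Void call(JavaPairRDD<String, String> rdd) throws Exception {
    // Capture the Kafka offset ranges backing this batch.
    OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

    // Run all of our transformations/actions first...
    rdd.foreachPartition(processPartitionFunction);

    // ...and only then mark the offsets as processed.
    for (OffsetRange range : offsetRanges) {
      offsetStore.save(range.topic(), range.partition(), range.untilOffset());
    }
    return null;
  }
});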

Thanks,
- Dmitry


Losing system properties on executor side, if context is checkpointed

2019-02-19 Thread Dmitry Goldenberg
Hi all,

I'm seeing an odd behavior where if I switch the context from regular to
checkpointed, the system properties are no longer automatically carried
over into the worker / executors and turn out to be null there.

This is in Java, using spark-streaming_2.10, version 1.5.0.

As a workaround, I'm placing the properties into a Properties object and
passing it over to the worker logic.  I would think I shouldn't have to do
this; with a regular (non-checkpointed) context it works if I just set the
properties, and they get automatically carried over to the worker side.

Is this something fixed or changed in the later versions of Spark?

This is what I ended up doing in the driver program:

  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
      Parameters params) {
    JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        return createContext(sparkConf, params);
      }
    };
    return JavaStreamingContext.getOrCreate(params.getCheckpointDir(), factory);
  }

  private JavaStreamingContext createContext(SparkConf sparkConf,
      Parameters params) {
    // Create the context with the specified batch interval, in milliseconds.
    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
        Durations.milliseconds(params.getBatchDurationMillis()));

    // Set the checkpoint directory, if we're checkpointing.
    if (params.isCheckpointed()) {
      jssc.checkpoint(params.getCheckpointDir());
    }

    // Copy the executor-env entries from the SparkConf into a plain
    // Properties object that can be handed to the worker-side code explicitly.
    Properties props = new Properties();
    for (Tuple2<String, String> entry :
        JavaConverters.seqAsJavaListConverter(sparkConf.getExecutorEnv()).asJava()) {
      props.setProperty(entry._1(), entry._2());
    }

    // ... Create the Direct Stream from Kafka ...

    messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) throws Exception {
        ProcessPartitionFunction func = new ProcessPartitionFunction(
            props, // <-- Had to pass that through, so this works in a
                   //     checkpointed scenario
            params.getAppName(),
            params.getTopic(),
            ...);  // remaining constructor args elided
        rdd.foreachPartition(func);
        return null;
      }
    });
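
One alternative we may try (a sketch, not yet verified on 1.5.0) is to hand
the values to the executors as JVM options, so that they show up as plain
System properties in each executor process instead of relying on the
driver's system properties being carried over:

import org.apache.spark.SparkConf;

// 'com.myco.someProperty' and params.getSomeProperty() are hypothetical;
// -D options passed via spark.executor.extraJavaOptions become System
// properties in every executor JVM.
SparkConf sparkConf = new SparkConf()
    .setAppName(params.getAppName())
    .set("spark.executor.extraJavaOptions",
        "-Dcom.myco.someProperty=" + params.getSomeProperty());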

Would appreciate any recommendations/clues,
Thanks,
- Dmitry


How to fix error "Failed to get records for..." after polling for 120000

2017-04-18 Thread Dmitry Goldenberg
Hi,

I was wondering if folks have some ideas, recommendation for how to fix
this error (full stack trace included below).

We're on Kafka 0.10.0.0 and spark_streaming_2.11 v. 2.0.0.

We've tried a few things as suggested in these sources:

   -
   
http://stackoverflow.com/questions/42264669/spark-streaming-assertion-failed-failed-to-get-records-for-spark-executor-a-gro
   - https://issues.apache.org/jira/browse/SPARK-19275
   - https://issues.apache.org/jira/browse/SPARK-17147

but still seeing the error.
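
For reference, the knobs we're currently experimenting with look roughly
like the following. This is just a sketch; the specific values are guesses
on our part, not recommendations:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    // Give the cached Kafka consumer longer to return records before the
    // "Failed to get records ... after polling" assertion fires.
    .set("spark.streaming.kafka.consumer.poll.ms", "30000")
    // Throttle how many records each partition is asked for per batch.
    .set("spark.streaming.kafka.maxRatePerPartition", "1000");

// Kafka 0.10 consumer settings passed into KafkaUtils.createDirectStream(...).
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("request.timeout.ms", "60000");
kafkaParams.put("session.timeout.ms", "30000");
kafkaParams.put("heartbeat.interval.ms", "10000");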

We'd appreciate any clues or recommendations.
Thanks,
- Dmitry



Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 3 in stage 4227.0 failed 1 times, most recent
failure: Lost task 3.0 in stage 4227.0 (TID 33819, localhost):
java.lang.AssertionError: assertion failed: Failed to get records for
spark-executor-Group-Consumer-Group-1 Topic1 0 476289 after polling for
120000

at scala.Predef$.assert(Predef.scala:170)

at
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)

at
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)

at
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

at
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)

at
com.myco.ProcessPartitionFunction.call(ProcessPartitionFunction.java:70)

at
com.myco.ProcessPartitionFunction.call(ProcessPartitionFunction.java:24)

at
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)

at
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:218)

at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)

at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)

at org.apache.spark.scheduler.Task.run(Task.scala:86)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)



Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org

$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at scala.Option.foreach(Option.scala:257)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)

at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:902)

at
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:900)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at 

NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-03 Thread Dmitry Goldenberg
Hi,

Any reason why we might be getting this error?  The code seems to work fine
in the non-distributed mode but the same code when run from a Spark job is
not able to get to Elastic.

Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11
Elastic version: 2.3.1

I've verified the Elastic hosts and the cluster name.

The spot in the code where this happens is:

  ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
      .prepareHealth()
      .setWaitForGreenStatus()
      .setTimeout(TimeValue.timeValueSeconds(10))
      .get();
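
In case it helps, the things we're double-checking are the cluster.name
setting and whether the worker nodes can actually reach the Elasticsearch
transport port (9300), since with a Spark job the client presumably gets
built on the executors rather than on a single box. For reference, the
client is constructed roughly like this (host and cluster names are
placeholders; the exact builder calls may differ slightly across 2.x
releases):

import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Placeholders for our actual cluster name and hosts.
Settings settings = Settings.settingsBuilder()
    .put("cluster.name", "my-es-cluster")
    .build();
TransportClient client = TransportClient.builder().settings(settings).build();
client.addTransportAddress(
    new InetSocketTransportAddress(InetAddress.getByName("es-host-1"), 9300));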


Stack trace:


Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$
scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$
abortStage$1.apply(DAGScheduler.scala:1442)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$
abortStage$1.apply(DAGScheduler.scala:1441)

at scala.collection.mutable.ResizableArray$class.foreach(
ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(
ArrayBuffer.scala:48)

at org.apache.spark.scheduler.DAGScheduler.abortStage(
DAGScheduler.scala:1441)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at scala.Option.foreach(Option.scala:257)

at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
DAGScheduler.scala:811)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
doOnReceive(DAGScheduler.scala:1667)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1622)

at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1611)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.runJob(
DAGScheduler.scala:632)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)

at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.
apply(RDD.scala:902)

at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.
apply(RDD.scala:900)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:151)

at org.apache.spark.rdd.RDDOperationScope$.withScope(
RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)

at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:900)

at org.apache.spark.api.java.JavaRDDLike$class.
foreachPartition(JavaRDDLike.scala:218)

at org.apache.spark.api.java.AbstractJavaRDDLike.
foreachPartition(JavaRDDLike.scala:45)

at com.myco.MyDriver$3.call(com.myco.MyDriver.java:214)

at com.myco.MyDriver$3.call(KafkaSparkStreamingDriver.java:201)

at org.apache.spark.streaming.api.java.JavaDStreamLike$$
anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)

at org.apache.spark.streaming.api.java.JavaDStreamLike$$
anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)

at org.apache.spark.streaming.dstream.DStream$$anonfun$
foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)

at org.apache.spark.streaming.dstream.DStream$$anonfun$
foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)

at org.apache.spark.streaming.dstream.DStream.
createRDDWithLocalProperties(DStream.scala:415)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1.apply(ForEachDStream.scala:50)

at org.apache.spark.streaming.dstream.ForEachDStream$$
anonfun$1.apply(ForEachDStream.scala:50)

at scala.util.Try$.apply(Try.scala:192)

at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)

at org.apache.spark.streaming.scheduler.JobScheduler$
JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:247)

at org.apache.spark.streaming.scheduler.JobScheduler$
JobHandler$$anonfun$run$1.apply(JobScheduler.scala:247)

at org.apache.spark.streaming.scheduler.JobScheduler$

Re: Reason for Kafka topic existence check / "Does the topic exist?" error

2016-10-29 Thread Dmitry Goldenberg
Cody,

Thanks for your comments.

The way I'm reading the Kafka documentation (
https://kafka.apache.org/documentation) is that auto.create.topics.enable
is set to true by default. Right now it's not set in our server.properties
on the Kafka broker side so I would imagine that the first request to
publish a document into topic X would cause X to be created, as
auto.create.topics.enable is presumably defaulted to true.

Basically, I used to be able to start a streaming Kafka job first, without
the topic X already existing, then let the producer publish the first (and
all subsequent) documents and the consumer would get the documents from
that point.

This mode is not working anymore. Despite auto.create.topics.enable
presumably defaulting to true (?), I'm getting the "Does the topic exist"
exception.

Not a big problem, but it raises the question: when would the topic be
"auto-created" if not on the first document being published to it?

It was nice when it was working because we didn't have to operationalize
topic creation. Not a big deal but now we'll have to make sure we execute
the 'create-topics' type of task or shell script at install time.
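
For that install-time task, what we have in mind is something along these
lines (a sketch against the 0.8.x admin API, where AdminUtils still takes a
ZkClient; newer broker versions moved to ZkUtils, and the host/topic/partition
values here are placeholders):

import java.util.Properties;
import kafka.admin.AdminUtils;
import kafka.utils.ZKStringSerializer$;
import org.I0Itec.zkclient.ZkClient;

// Connect to the ZooKeeper ensemble backing the Kafka cluster.
ZkClient zkClient = new ZkClient("zkhost:2181", 10000, 10000, ZKStringSerializer$.MODULE$);
try {
  // Pre-create the topic with an explicit partition count and replication
  // factor instead of relying on auto.create.topics.enable.
  AdminUtils.createTopic(zkClient, "acme.topic1", 12, 1, new Properties());
} finally {
  zkClient.close();
}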

This seems like a potential Kafka documentation issue: it would help to
explain exactly what one can expect from the auto.create.topics.enable flag.

-Dmitry


On Sat, Oct 8, 2016 at 1:26 PM, Cody Koeninger <c...@koeninger.org> wrote:

> So I just now retested this with 1.5.2, and 2.0.0, and the behavior is
> exactly the same across spark versions.
>
> If the topic hasn't been created, you will get that error on startup,
> because the topic doesn't exist and thus doesn't have metadata.
>
> If you have auto.create.topics.enable set to true on the broker
> config, the request will fairly quickly lead to the topic being
> created after the fact.
>
> All you have to do is hit up-arrow-enter and re-submit the spark job,
> the second time around the topic will exist.  That seems pretty low
> effort.
>
> I'd rather stick with having an early error for those of us that
> prefer to run with auto.create set to false (because it makes sure the
> topic is actually set up the way you want, reduces the likelihood of
> spurious topics being created, etc).
>
>
>
> On Sat, Oct 8, 2016 at 11:44 AM, Dmitry Goldenberg
> <dgoldenberg...@gmail.com> wrote:
> > Hi,
> >
> > I am trying to start up a simple consumer that streams from a Kafka
> topic,
> > using Spark 2.0.0:
> >
> > spark-streaming_2.11
> > spark-streaming-kafka-0-8_2.11
> >
> > I was getting an error as below until I created the topic in Kafka. From
> > integrating Spark 1.5, I never used to hit this check; we were able to
> start
> > all of our Spark Kafka consumers, then start the producers, and have
> Kafka
> > automatically create the topics once the first message for a given topic
> was
> > published.
> >
> > Is there something I might be doing to cause this topic existence check
> in
> > KafkaCluster.scala to kick in? I'd much rather be able to not have to
> > pre-create the topics before I start the consumers.  Any
> thoughts/comments
> > would be appreciated.
> >
> > Thanks.
> > - Dmitry
> >
> > 
> >
> > Exception in thread "main" org.apache.spark.SparkException:
> > java.nio.channels.ClosedChannelException
> >
> > java.nio.channels.ClosedChannelException
> >
> > org.apache.spark.SparkException: Error getting partition metadata for
> > ''. Does the topic exist?
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$
> checkErrors$1.apply(KafkaCluster.scala:373)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$
> checkErrors$1.apply(KafkaCluster.scala:373)
> >
> > at scala.util.Either.fold(Either.scala:98)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.
> scala:372)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.
> scala:222)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaUtils$.
> createDirectStream(KafkaUtils.scala:484)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaUtils$.
> createDirectStream(KafkaUtils.scala:607)
> >
> > at
> > org.apache.spark.streaming.kafka.KafkaUtils.
> createDirectStream(KafkaUtils.scala)
> >
> > at
> > com.citi.totalconduct.consumer.kafka.spark.KafkaSparkStreamingDriver.
> createContext(KafkaSparkStreamingDriver.java:253)
> >
> > at

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
OK so it looks like Tachyon is a cluster memory plugin marked as
"experimental" in Spark.

In any case, we've got a few requirements for the system we're working on
which may drive the decision for how to implement large resource file
management.

The system is a framework of N data analyzers which take incoming documents
as input and transform them or extract some data out of those documents.
These analyzers can be chained together which makes it a great case for
processing with RDD's and a set of map/filter types of Spark functions.
There's already an established framework API which we want to preserve.
This means that most likely, we'll create a relatively thin "binding" layer
for exposing these analyzers as well-documented functions to the end-users
who want to use them in a Spark based distributed computing environment.

We also want to, ideally, hide the complexity of how these resources are
loaded from the end-users who will be writing the actual Spark jobs that
utilize the Spark "binding" functions that we provide.

So, for managing large numbers of small, medium, or large resource files,
we're considering the below options, with a variety of pros and cons
attached, from the following perspectives:

a) persistence - where do the resources reside initially;
b) loading - what are the mechanics for loading of these resources;
c) caching and sharing across worker nodes.

Possible options:

1. Load each resource into a broadcast variable. Considering that we have
scores if not hundreds of these resource files, maintaining that many
broadcast variables seems like a complexity that's going to be hard to
manage. We'd also need a translation layer between the broadcast variables
and the internal API that would want to "speak" InputStream's rather than
broadcast variables.

2. Load resources into RDD's and perform join's against them from our
incoming document data RDD's, thus achieving the effect of a value lookup
from the resources.  While this seems like a very Spark'y way of doing
things, the lookup mechanics seem quite non-trivial, especially because
some of the resources aren't going to be pure dictionaries; they may be
statistical models.  Additionally, this forces us to utilize Spark's
semantics for handling of these resources which means a potential rewrite
of our internal product API. That would be a hard option to go with.

3. Pre-install all the needed resources on each of the worker nodes;
retrieve the needed resources from the file system and load them into
memory as needed. Ideally, the resources would only be installed once, on
the Spark driver side; we'd want to avoid having to pre-install all these
files on each node. However, we've done this as an exercise and this
approach works OK.

4. Pre-load all the resources into HDFS or S3 i.e. into some distributed
persistent store; load them into cluster memory from there, as necessary.
Presumably this could be a pluggable store with a common API exposed.
Since our framework is an OEM'able product, we could plug and play with a
variety of such persistent stores via Java's FileSystem/URL scheme handler
API's.

5. Implement a Resource management server, with a RESTful interface on top.
Under the covers, this could be a wrapper on top of #4.  Potentially
unnecessary if we have a solid persistent store API as per #4.

6. Beyond persistence, caching also has to be considered for these
resources. We've considered Tachyon (especially since it's pluggable into
Spark), Redis, and the like. Ideally, I would think we'd want resources to
be loaded into the cluster memory as needed; paged in/out on-demand in an
LRU fashion.  From this perspective, it's not yet clear to me what the best
option(s) would be. Any thoughts / recommendations would be appreciated.
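
To make option 1 above a bit more concrete, the shape of what we'd be doing
per resource is roughly the following. This is only a sketch; the dictionary
type and the loading call stand in for our real resource API, and 'jsc' /
'documents' are assumed to exist:

import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

// Load one resource on the driver (loadDictionary() is a stand-in for our loader).
Map<String, String> dictionary = loadDictionary();

// Ship it to the executors once; 'jsc' is the JavaSparkContext.
final Broadcast<Map<String, String>> dictBroadcast = jsc.broadcast(dictionary);

// Look values up inside the worker-side functions.
JavaRDD<String> enriched = documents.map(new Function<String, String>() {
  @Override
  public String call(String doc) {
    String replacement = dictBroadcast.value().get(doc);
    return replacement != null ? replacement : doc;
  }
});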





On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> Thanks, Gene.
>
> Does Spark use Tachyon under the covers anyway for implementing its
> "cluster memory" support?
>
> It seems that the practice I hear the most about is the idea of loading
> resources as RDD's and then doing join's against them to achieve the lookup
> effect.
>
> The other approach would be to load the resources into broadcast variables
> but I've heard concerns about memory.  Could we run out of memory if we
> load too much into broadcast vars?  Is there any memory_to_disk/spill to
> disk capability for broadcast variables in Spark?
>
>
> On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:
>
>> Hi Dmitry,
>>
>> Yes, Tachyon can help with your use case. You can read and write to
>> Tachyon via the filesystem api (
>> http://tachyon-project.org/documentation/File-System-API.html). There is
>> a native Java API as well as a Hadoop-compatible API. Spark is also able to
>> interact with Tachyon via the Hadoop-compatible API, s

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing from some folks' recommendations on this list was Apache
Ignite.  Their In-Memory File System (
https://ignite.apache.org/features/igfs.html) looks quite interesting.

On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> OK so it looks like Tachyon is a cluster memory plugin marked as
> "experimental" in Spark.
>
> In any case, we've got a few requirements for the system we're working on
> which may drive the decision for how to implement large resource file
> management.
>
> The system is a framework of N data analyzers which take incoming
> documents as input and transform them or extract some data out of those
> documents.  These analyzers can be chained together which makes it a great
> case for processing with RDD's and a set of map/filter types of Spark
> functions. There's already an established framework API which we want to
> preserve.  This means that most likely, we'll create a relatively thin
> "binding" layer for exposing these analyzers as well-documented functions
> to the end-users who want to use them in a Spark based distributed
> computing environment.
>
> We also want to, ideally, hide the complexity of how these resources are
> loaded from the end-users who will be writing the actual Spark jobs that
> utilize the Spark "binding" functions that we provide.
>
> So, for managing large numbers of small, medium, or large resource files,
> we're considering the below options, with a variety of pros and cons
> attached, from the following perspectives:
>
> a) persistence - where do the resources reside initially;
> b) loading - what are the mechanics for loading of these resources;
> c) caching and sharing across worker nodes.
>
> Possible options:
>
> 1. Load each resource into a broadcast variable. Considering that we have
> scores if not hundreds of these resource files, maintaining that many
> broadcast variables seems like a complexity that's going to be hard to
> manage. We'd also need a translation layer between the broadcast variables
> and the internal API that would want to "speak" InputStream's rather than
> broadcast variables.
>
> 2. Load resources into RDD's and perform join's against them from our
> incoming document data RDD's, thus achieving the effect of a value lookup
> from the resources.  While this seems like a very Spark'y way of doing
> things, the lookup mechanics seem quite non-trivial, especially because
> some of the resources aren't going to be pure dictionaries; they may be
> statistical models.  Additionally, this forces us to utilize Spark's
> semantics for handling of these resources which means a potential rewrite
> of our internal product API. That would be a hard option to go with.
>
> 3. Pre-install all the needed resources on each of the worker nodes;
> retrieve the needed resources from the file system and load them into
> memory as needed. Ideally, the resources would only be installed once, on
> the Spark driver side; we'd want to avoid having to pre-install all these
> files on each node. However, we've done this as an exercise and this
> approach works OK.
>
> 4. Pre-load all the resources into HDFS or S3 i.e. into some distributed
> persistent store; load them into cluster memory from there, as necessary.
> Presumably this could be a pluggable store with a common API exposed.
> Since our framework is an OEM'able product, we could plug and play with a
> variety of such persistent stores via Java's FileSystem/URL scheme handler
> API's.
>
> 5. Implement a Resource management server, with a RESTful interface on
> top. Under the covers, this could be a wrapper on top of #4.  Potentially
> unnecessary if we have a solid persistent store API as per #4.
>
> 6. Beyond persistence, caching also has to be considered for these
> resources. We've considered Tachyon (especially since it's pluggable into
> Spark), Redis, and the like. Ideally, I would think we'd want resources to
> be loaded into the cluster memory as needed; paged in/out on-demand in an
> LRU fashion.  From this perspective, it's not yet clear to me what the best
> option(s) would be. Any thoughts / recommendations would be appreciated.
>
>
>
>
>
> On Tue, Jan 12, 2016 at 3:04 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Thanks, Gene.
>>
>> Does Spark use Tachyon under the covers anyway for implementing its
>> "cluster memory" support?
>>
>> It seems that the practice I hear the most about is the idea of loading
>> resources as RDD's and then doing join's against them to achieve the lookup
>> effect.
>>
>> The other approach would be to load the resources into broadcas

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into 
Tachyon...

> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com> 
> wrote:
> 
> Would it make sense to load them into Tachyon and read and broadcast them 
> from there since Tachyon is already a part of the Spark stack?
> 
> If so I wonder if I could do that Tachyon read/write via a Spark API?
> 
> 
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan 
>> <sabarish.sasidha...@manthan.com> wrote:
>> 
>> One option could be to store them as blobs in a cache like Redis and then 
>> read + broadcast them from the driver. Or you could store them in HDFS and 
>> read + broadcast from the driver.
>> 
>> Regards
>> Sab
>> 
>>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg 
>>> <dgoldenberg...@gmail.com> wrote:
>>> We have a bunch of Spark jobs deployed and a few large resource files such 
>>> as e.g. a dictionary for lookups or a statistical model.
>>> 
>>> Right now, these are deployed as part of the Spark jobs which will 
>>> eventually make the mongo-jars too bloated for deployments.
>>> 
>>> What are some of the best practices to consider for maintaining and sharing 
>>> large resource files like these?
>>> 
>>> Thanks.
>> 
>> 
>> 
>> -- 
>> 
>> Architect - Big Data
>> Ph: +91 99805 99458
>> 
>> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
>> India ICT)
>> +++


Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? 
It seems that got omitted.

> On Jan 12, 2016, at 2:44 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> You can look at ignite as a HDFS cache or for  storing rdds. 
> 
>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg <dgoldenberg...@gmail.com> wrote:
>> 
>> We have a bunch of Spark jobs deployed and a few large resource files such 
>> as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will 
>> eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing 
>> large resource files like these?
>> 
>> Thanks.




Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Would it make sense to load them into Tachyon and read and broadcast them from 
there since Tachyon is already a part of the Spark stack?

If so I wonder if I could do that Tachyon read/write via a Spark API?


> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan 
> <sabarish.sasidha...@manthan.com> wrote:
> 
> One option could be to store them as blobs in a cache like Redis and then 
> read + broadcast them from the driver. Or you could store them in HDFS and 
> read + broadcast from the driver.
> 
> Regards
> Sab
> 
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg 
>> <dgoldenberg...@gmail.com> wrote:
>> We have a bunch of Spark jobs deployed and a few large resource files such 
>> as e.g. a dictionary for lookups or a statistical model.
>> 
>> Right now, these are deployed as part of the Spark jobs which will 
>> eventually make the mongo-jars too bloated for deployments.
>> 
>> What are some of the best practices to consider for maintaining and sharing 
>> large resource files like these?
>> 
>> Thanks.
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
> India ICT)
> +++


Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Thanks, Gene.

Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?

It seems that the practice I hear the most about is the idea of loading
resources as RDD's and then doing join's against them to achieve the lookup
effect.

The other approach would be to load the resources into broadcast variables
but I've heard concerns about memory.  Could we run out of memory if we
load too much into broadcast vars?  Is there any memory_to_disk/spill to
disk capability for broadcast variables in Spark?


On Tue, Jan 12, 2016 at 11:19 AM, Gene Pang <gene.p...@gmail.com> wrote:

> Hi Dmitry,
>
> Yes, Tachyon can help with your use case. You can read and write to
> Tachyon via the filesystem api (
> http://tachyon-project.org/documentation/File-System-API.html). There is
> a native Java API as well as a Hadoop-compatible API. Spark is also able to
> interact with Tachyon via the Hadoop-compatible API, so Spark jobs can read
> input files from Tachyon and write output files to Tachyon.
>
> I hope that helps,
> Gene
>
> On Tue, Jan 12, 2016 at 4:26 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I'd guess that if the resources are broadcast Spark would put them into
>> Tachyon...
>>
>> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>> Would it make sense to load them into Tachyon and read and broadcast them
>> from there since Tachyon is already a part of the Spark stack?
>>
>> If so I wonder if I could do that Tachyon read/write via a Spark API?
>>
>>
>> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> One option could be to store them as blobs in a cache like Redis and then
>> read + broadcast them from the driver. Or you could store them in HDFS and
>> read + broadcast from the driver.
>>
>> Regards
>> Sab
>>
>> On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> We have a bunch of Spark jobs deployed and a few large resource files
>>> such as e.g. a dictionary for lookups or a statistical model.
>>>
>>> Right now, these are deployed as part of the Spark jobs which will
>>> eventually make the mongo-jars too bloated for deployments.
>>>
>>> What are some of the best practices to consider for maintaining and
>>> sharing large resource files like these?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>>
>> Architect - Big Data
>> Ph: +91 99805 99458
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>> +++
>>
>>
>


Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files such
as e.g. a dictionary for lookups or a statistical model.

Right now, these are deployed as part of the Spark jobs which will
eventually make the mongo-jars too bloated for deployments.

What are some of the best practices to consider for maintaining and sharing
large resource files like these?

Thanks.


Re: What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
N/m, these are just profiling snapshots :) Sorry for the wide distribution.

On Tue, Nov 10, 2015 at 9:46 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> We're seeing a bunch of .snapshot files being created under
> /home/spark/Snapshots, such as the following for example:
>
> CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
> CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
> SparkSubmit-2015-08-31-shutdown-1.snapshot
> Worker-2015-08-27-shutdown.snapshot
>
> These files are large and they blow out our disk space in some
> environments.
>
> What are these, when are they created and for what purpose?  Is there a
> way to control how they're generated and most importantly where they're
> stored?
>
> Thanks.
>


What are the .snapshot files in /home/spark/Snapshots?

2015-11-10 Thread Dmitry Goldenberg
We're seeing a bunch of .snapshot files being created under
/home/spark/Snapshots, such as the following for example:

CoarseGrainedExecutorBackend-2015-08-27-shutdown.snapshot
CoarseGrainedExecutorBackend-2015-08-31-shutdown-1.snapshot
SparkSubmit-2015-08-31-shutdown-1.snapshot
Worker-2015-08-27-shutdown.snapshot

These files are large and they blow out our disk space in some environments.

What are these, when are they created and for what purpose?  Is there a way
to control how they're generated and most importantly where they're stored?

Thanks.


Re: How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Thanks, Ted, will try it out.

On Wed, Sep 30, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> See the tail of this:
> https://bugzilla.redhat.com/show_bug.cgi?id=1005811
>
> FYI
>
> > On Sep 30, 2015, at 5:54 AM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
> wrote:
> >
> > Is there a way to ensure Spark doesn't write to /tmp directory?
> >
> > We've got spark.local.dir specified in the spark-defaults.conf file to
> point at another directory.  But we're seeing many of these
> snappy-unknown-***-libsnappyjava.so files being written to /tmp still.
> >
> > Is there a config setting or something that would cause Spark to use
> another directory of our choosing?
> >
> > Thanks.
>


How to tell Spark not to use /tmp for snappy-unknown-***-libsnappyjava.so

2015-09-30 Thread Dmitry Goldenberg
Is there a way to ensure Spark doesn't write to /tmp directory?

We've got spark.local.dir specified in the spark-defaults.conf file to
point at another directory.  But we're seeing many of
these snappy-unknown-***-libsnappyjava.so files being written to /tmp still.

Is there a config setting or something that would cause Spark to use
another directory of our choosing?
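
One thing we're going to try (a sketch; the paths are placeholders) is to
point both the JVM temp dir and snappy-java's own temp dir setting at a
different location through the extra Java options:

import org.apache.spark.SparkConf;

// snappy-java extracts libsnappyjava.so into java.io.tmpdir (or into the
// directory named by org.xerial.snappy.tempdir), so redirect both for the
// executors. The driver-side equivalent (spark.driver.extraJavaOptions) has
// to go into spark-defaults.conf or onto the spark-submit command line,
// since the driver JVM is already running by the time SparkConf is built.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
        "-Djava.io.tmpdir=/data/tmp -Dorg.xerial.snappy.tempdir=/data/tmp");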

Thanks.


Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-30 Thread Dmitry Goldenberg
I believe I've had trouble with --conf spark.driver.userClassPathFirst=true
--conf spark.executor.userClassPathFirst=true before, so these might not
work...

I was thinking of trying to add the SolrJ jar to
spark.executor.extraClassPath...
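
Concretely, that would mean appending the SolrJ jar to the same executor
classpath we already pass for hbase-protocol, along these lines (the jar
name/path and the acmeIngestHome variable are hypothetical):

import org.apache.spark.SparkConf;

// acmeIngestHome and the solr-solrj jar version are placeholders.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraClassPath",
        acmeIngestHome + "/conf"
            + ":" + acmeIngestHome + "/lib/hbase-protocol-0.98.9-hadoop2.jar"
            + ":" + acmeIngestHome + "/lib/solr-solrj-x.y.z.jar");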

On Wed, Sep 30, 2015 at 12:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. have tried these settings with the hbase protocol jar, to no avail
>
> In that case, HBaseZeroCopyByteString is contained in hbase-protocol.jar.
> In HBaseZeroCopyByteString , you can see:
>
> package com.google.protobuf;  // This is a lie.
>
> If protobuf jar is loaded ahead of hbase-protocol.jar, things start to get
> interesting ...
>
> On Tue, Sep 29, 2015 at 6:12 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Ted, I think I have tried these settings with the hbase protocol jar, to
>> no avail.
>>
>> I'm going to see if I can try and use these with this SolrException issue
>> though it now may be harder to reproduce it. Thanks for the suggestion.
>>
>> On Tue, Sep 29, 2015 at 8:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Have you tried the following ?
>>> --conf spark.driver.userClassPathFirst=true --conf spark.executor.
>>> userClassPathFirst=true
>>>
>>> On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> Release of Spark: 1.5.0.
>>>>
>>>> Command line invokation:
>>>>
>>>> ACME_INGEST_HOME=/mnt/acme/acme-ingest
>>>> ACME_INGEST_VERSION=0.0.1-SNAPSHOT
>>>> ACME_BATCH_DURATION_MILLIS=5000
>>>> SPARK_MASTER_URL=spark://data1:7077
>>>> JAVA_OPTIONS="-Dspark.streaming.kafka.maxRatePerPartition=1000"
>>>> JAVA_OPTIONS="$JAVA_OPTIONS -Dspark.executor.memory=2g"
>>>>
>>>> $SPARK_HOME/bin/spark-submit \
>>>> --driver-class-path  $ACME_INGEST_HOME \
>>>> --driver-java-options "$JAVA_OPTIONS" \
>>>> --class
>>>> "com.acme.consumer.kafka.spark.KafkaSparkStreamingDriver" \
>>>> --master $SPARK_MASTER_URL  \
>>>> --conf
>>>> "spark.executor.extraClassPath=$ACME_INGEST_HOME/conf:$ACME_INGEST_HOME/lib/hbase-protocol-0.98.9-hadoop2.jar"
>>>> \
>>>>
>>>> $ACME_INGEST_HOME/lib/acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar \
>>>> -brokerlist $METADATA_BROKER_LIST \
>>>> -topic acme.topic1 \
>>>> -autooffsetreset largest \
>>>> -batchdurationmillis $ACME_BATCH_DURATION_MILLIS \
>>>> -appname Acme.App1 \
>>>> -checkpointdir file://$SPARK_HOME/acme/checkpoint-acme-app1
>>>> Note that SolrException is definitely in our consumer jar
>>>> acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar which gets deployed to
>>>> $ACME_INGEST_HOME.
>>>>
>>>> For the extraClassPath on the executors, we've got additionally
>>>> hbase-protocol-0.98.9-hadoop2.jar: we're using Apache Phoenix from the
>>>> Spark jobs to communicate with HBase.  The only way to force Phoenix to
>>>> successfully communicate with HBase was to have that JAR explicitly added
>>>> to the executor classpath regardless of the fact that the contents of the
>>>> hbase-protocol hadoop jar get rolled up into the consumer jar at build 
>>>> time.
>>>>
>>>> I'm starting to wonder whether there's some class loading pattern here
>>>> where some classes may not get loaded out of the consumer jar and therefore
>>>> have to have their respective jars added to the executor extraClassPath?
>>>>
>>>> Or is this a serialization problem for SolrException as Divya
>>>> Ravichandran suggested?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 29, 2015 at 6:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Mind providing a bit more information:
>>>>>
>>>>> release of Spark
>>>>> command line for running Spark job
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
>>>>> dgoldenberg...@gmail.com> wrote:
>>>>>
>>>>>> We're seeing this occasionally. Granted, this was caused by a wrinkle
>>>>>> in the Solr schema but this bubbled up all the way in Spar

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
"more partitions and replicas than available brokers" -- what would be a
good ratio?

We've been trying to set up 3 topics with 64 partitions.  I'm including the
output of "bin/kafka-topics.sh --zookeeper localhost:2181 --describe
topic1" below.

I think it's symptomatic and confirms your theory, Adrian, that we've got
too many partitions. In fact, for topic 2, only 12 partitions appear to
have been created despite the requested 64.  Does Kafka have the limit of
140 partitions total within a cluster?

The doc doesn't appear to have any prescriptions as to how you go about
calculating an optimal number of partitions.

We'll definitely try with fewer, I'm just looking for a good formula to
calculate how many. And no, Adrian, this hasn't worked yet, so we'll start
with something like 12 partitions.  It'd be good to know how high we can go
with that...

Topic:topic1 PartitionCount:64 ReplicationFactor:1 Configs:

Topic: topic1 Partition: 0 Leader: 1 Replicas: 1 Isr: 1

Topic: topic2 Partition: 1 Leader: 2 Replicas: 2 Isr: 2




Topic: topic3 Partition: 63 Leader: 2 Replicas: 2 Isr: 2

---

Topic:topic2 PartitionCount:12 ReplicationFactor:1 Configs:

Topic: topic2 Partition: 0 Leader: 2 Replicas: 2 Isr: 2

Topic: topic2 Partition: 1 Leader: 1 Replicas: 1 Isr: 1




Topic: topic2 Partition: 11 Leader: 1 Replicas: 1 Isr: 1

---

Topic:topic3 PartitionCount:64 ReplicationFactor:1 Configs:

Topic: topic3 Partition: 0 Leader: 2 Replicas: 2 Isr: 2

Topic: topic3 Partition: 1 Leader: 1 Replicas: 1 Isr: 1




Topic: topic3 Partition: 63 Leader: 1 Replicas: 1 Isr: 1


On Tue, Sep 29, 2015 at 8:47 AM, Adrian Tanase <atan...@adobe.com> wrote:

> The error message is very explicit (partition is under replicated), I
> don’t think it’s related to networking issues.
>
> Try to run /home/kafka/bin/kafka-topics.sh —zookeeper localhost/kafka
> —describe topic_name and see which brokers are missing from the replica
> assignment.
> *(replace home, zk-quorum etc with your own set-up)*
>
> Lastly, has this ever worked? Maybe you’ve accidentally created the topic
> with more partitions and replicas than available brokers… try to recreate
> with fewer partitions/replicas, see if it works.
>
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:37 PM
> To: Adrian Tanase
> Cc: "user@spark.apache.org"
> Subject: Re: Kafka error "partitions don't have a leader" /
> LeaderNotAvailableException
>
> Adrian,
>
> Thanks for your response. I just looked at both machines we're testing on
> and on both the Kafka server process looks OK. Anything specific I can
> check otherwise?
>
> From googling around, I see some posts where folks suggest to check the
> DNS settings (those appear fine) and to set the advertised.host.name in
> Kafka's server.properties. Yay/nay?
>
> Thanks again.
>
> On Tue, Sep 29, 2015 at 8:31 AM, Adrian Tanase <atan...@adobe.com> wrote:
>
>> I believe some of the brokers in your cluster died and there are a number
>> of partitions that nobody is currently managing.
>>
>> -adrian
>>
>> From: Dmitry Goldenberg
>> Date: Tuesday, September 29, 2015 at 3:26 PM
>> To: "user@spark.apache.org"
>> Subject: Kafka error "partitions don't have a leader" /
>> LeaderNotAvailableException
>>
>> I apologize for posting this Kafka related issue into the Spark list.
>> Have gotten no responses on the Kafka list and was hoping someone on this
>> list could shed some light on the below.
>>
>> 
>> ---
>>
>> We're running into this issue in a clustered environment where we're
>> trying to send messages to Kafka and are getting the below error.
>>
>> Can someone explain what might be causing it and what the error message
>> means (Failed to send data since partitions [,8] don't have a
>> leader) ?
>>
>>
>> ---
>>
>> WARN kafka.producer.BrokerPartitionInfo: Error while fetching
>> metadata partition 10 leader: none replicas: isr: isUnderReplicated: false
>> for topic partition [,10]: [class
>> kafka.common.

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
Adrian,

Thanks for your response. I just looked at both machines we're testing on
and on both the Kafka server process looks OK. Anything specific I can
check otherwise?

From googling around, I see some posts where folks suggest to check the DNS
settings (those appear fine) and to set the advertised.host.name in Kafka's
server.properties. Yay/nay?

Thanks again.

On Tue, Sep 29, 2015 at 8:31 AM, Adrian Tanase <atan...@adobe.com> wrote:

> I believe some of the brokers in your cluster died and there are a number
> of partitions that nobody is currently managing.
>
> -adrian
>
> From: Dmitry Goldenberg
> Date: Tuesday, September 29, 2015 at 3:26 PM
> To: "user@spark.apache.org"
> Subject: Kafka error "partitions don't have a leader" /
> LeaderNotAvailableException
>
> I apologize for posting this Kafka related issue into the Spark list. Have
> gotten no responses on the Kafka list and was hoping someone on this list
> could shed some light on the below.
>
> 
> ---
>
> We're running into this issue in a clustered environment where we're
> trying to send messages to Kafka and are getting the below error.
>
> Can someone explain what might be causing it and what the error message
> means (Failed to send data since partitions [,8] don't have a
> leader) ?
>
>
> ---
>
> WARN kafka.producer.BrokerPartitionInfo: Error while fetching
> metadata partition 10 leader: none replicas: isr: isUnderReplicated: false
> for topic partition [,10]: [class
> kafka.common.LeaderNotAvailableException]
>
> ERROR kafka.producer.async.DefaultEventHandler: Failed to send requests
> for topics  with correlation ids in [2398792,2398801]
>
> ERROR com.acme.core.messaging.kafka.KafkaMessageProducer: Error while
> sending a message to the message
> store. kafka.common.FailedToSendMessageException: Failed to send messages
> after 3 tries.
> at
> kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
> ~[kafka_2.10-0.8.2.0.jar:?]
> at kafka.producer.Producer.send(Producer.scala:77)
> ~[kafka_2.10-0.8.2.0.jar:?]
> at kafka.javaapi.producer.Producer.send(Producer.scala:33)
> ~[kafka_2.10-0.8.2.0.jar:?]
>
> WARN kafka.producer.async.DefaultEventHandler: Failed to send data since
> partitions [,8] don't have a leader
>
> What do these errors and warnings mean and how do we get around them?
>
>
> ---
>
> The code for sending messages is basically as follows:
>
> public class KafkaMessageProducer {
> private Producer<String, String> producer;
>
> .
>
> public void sendMessage(String topic, String key,
> String message) throws IOException, MessagingException {
> KeyedMessage<String, String> data = new KeyedMessage<String,
> String>(topic, key, message);
> try {
>   producer.send(data);
> } catch (Exception ex) {
>   throw new MessagingException("Error while sending a message to the
> message store.", ex);
> }
> }
>
> Is it possible that the producer gets "stale" and needs to be
> re-initialized?  Do we want to re-create the producer on every message (??)
> or is it OK to hold on to one indefinitely?
>
>
> ---
>
> The following are the producer properties that are being set into the
> producer
>
> batch.num.messages => 200
> client.id => Acme
> compression.codec => none
> key.serializer.class => kafka.serializer.StringEncoder
> message.send.max.retries => 3
> metadata.broker.list => data2.acme.com:9092,data3.acme.com:9092
> partitioner.class => kafka.producer.DefaultPartitioner
> producer.type => sync
> queue.buffering.max.messages => 1
> queue.buffering.max.ms => 5000
> queue.enqueue.timeout.ms => -1
> request.required.acks => 1
> request.timeout.ms => 1
> retry.backoff.ms => 1000
> send.buffer.bytes => 102400
> serializer.class => kafka.serializer.StringEncoder
> topic.metadata.refresh.interval.ms => 60
>
>
> Thanks.
>


Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
I apologize for posting this Kafka related issue into the Spark list. Have
gotten no responses on the Kafka list and was hoping someone on this list
could shed some light on the below.


---

We're running into this issue in a clustered environment where we're trying
to send messages to Kafka and are getting the below error.

Can someone explain what might be causing it and what the error message
means (Failed to send data since partitions [,8] don't have a
leader) ?

---

WARN kafka.producer.BrokerPartitionInfo: Error while fetching
metadata partition 10 leader: none replicas: isr: isUnderReplicated: false
for topic partition [,10]: [class
kafka.common.LeaderNotAvailableException]

ERROR kafka.producer.async.DefaultEventHandler: Failed to send requests for
topics  with correlation ids in [2398792,2398801]

ERROR com.acme.core.messaging.kafka.KafkaMessageProducer: Error while
sending a message to the message
store. kafka.common.FailedToSendMessageException: Failed to send messages
after 3 tries.
at
kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
~[kafka_2.10-0.8.2.0.jar:?]
at kafka.producer.Producer.send(Producer.scala:77)
~[kafka_2.10-0.8.2.0.jar:?]
at kafka.javaapi.producer.Producer.send(Producer.scala:33)
~[kafka_2.10-0.8.2.0.jar:?]

WARN kafka.producer.async.DefaultEventHandler: Failed to send data since
partitions [,8] don't have a leader

What do these errors and warnings mean and how do we get around them?

---

The code for sending messages is basically as follows:

public class KafkaMessageProducer {

  private Producer<String, String> producer;

  ...

  public void sendMessage(String topic, String key, String message)
      throws IOException, MessagingException {
    KeyedMessage<String, String> data =
        new KeyedMessage<String, String>(topic, key, message);
    try {
      producer.send(data);
    } catch (Exception ex) {
      throw new MessagingException(
          "Error while sending a message to the message store.", ex);
    }
  }

Is it possible that the producer gets "stale" and needs to be
re-initialized?  Do we want to re-create the producer on every message (??)
or is it OK to hold on to one indefinitely?
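
Our current approach is to hold on to a single lazily-created producer per
JVM and reuse it for all sends, roughly as sketched below;
buildProducerProperties() is a hypothetical helper standing in for however
the properties listed further down get loaded:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.ProducerConfig;

public class KafkaMessageProducer {

  private Producer<String, String> producer;

  // Create the producer once and reuse it for every message rather than
  // re-creating it per send.
  private synchronized Producer<String, String> getProducer() {
    if (producer == null) {
      Properties props = buildProducerProperties(); // hypothetical helper
      producer = new Producer<String, String>(new ProducerConfig(props));
    }
    return producer;
  }
}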

---

The following are the producer properties that are being set into the
producer

batch.num.messages => 200
client.id => Acme
compression.codec => none
key.serializer.class => kafka.serializer.StringEncoder
message.send.max.retries => 3
metadata.broker.list => data2.acme.com:9092,data3.acme.com:9092
partitioner.class => kafka.producer.DefaultPartitioner
producer.type => sync
queue.buffering.max.messages => 1
queue.buffering.max.ms => 5000
queue.enqueue.timeout.ms => -1
request.required.acks => 1
request.timeout.ms => 1
retry.backoff.ms => 1000
send.buffer.bytes => 102400
serializer.class => kafka.serializer.StringEncoder
topic.metadata.refresh.interval.ms => 60


Thanks.


Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
Thanks, Cody.

Yes we did see that writeup from Jay, it seems to just refer to his test 6
partitions.  I've been looking for more of a recipe of what the possible
max is vs. what the optimal value may be; haven't found such.

KAFKA-899 appears related but it was fixed in Kafka 0.8.2.0 - we're running
0.8.2.1.

I'm more curious about another error message from the logs which is this:

*fetching topic metadata for topics [Set(my-topic-1)] from broker
[ArrayBuffer(id:0,host:data2.acme.com,port:9092,
id:1,host:data3.acme.com,port:9092)] failed*

I know that data2 should have broker ID of 1 and data3 should have broker
ID of 2.  So there's some disconnect somewhere as to what these ID's are.
In Zookeeper, ls /brokers/ids lists: [1, 2].  So where could the [0, 1] be
stuck?



On Tue, Sep 29, 2015 at 9:39 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Try writing and reading to the topics in question using the kafka command
> line tools, to eliminate your code as a variable.
>
>
> That number of partitions is probably more than sufficient:
>
>
> https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
>
> Obviously if you ask for more replicas than you have brokers you're going
> to have a problem, but that doesn't seem to be the case.
>
>
>
> Also, depending on what version of kafka you're using on the broker, you
> may want to look through the kafka jira, e.g.
>
> https://issues.apache.org/jira/browse/KAFKA-899
>
>
> On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> "more partitions and replicas than available brokers" -- what would be a
>> good ratio?
>>
>> We've been trying to set up 3 topics with 64 partitions.  I'm including
>> the output of "bin/kafka-topics.sh --zookeeper localhost:2181 --describe
>> topic1" below.
>>
>> I think it's symptomatic and confirms your theory, Adrian, that we've got
>> too many partitions. In fact, for topic 2, only 12 partitions appear to
>> have been created despite the requested 64.  Does Kafka have the limit of
>> 140 partitions total within a cluster?
>>
>> The doc doesn't appear to have any prescriptions as to how you go about
>> calculating an optimal number of partitions.
>>
>> We'll definitely try with fewer, I'm just looking for a good formula to
>> calculate how many. And no, Adrian, this hasn't worked yet, so we'll start
>> with something like 12 partitions.  It'd be good to know how high we can go
>> with that...
>>
>> Topic:topic1 PartitionCount:64 ReplicationFactor:1 Configs:
>>
>> Topic: topic1 Partition: 0 Leader: 1 Replicas: 1 Isr: 1
>>
>> Topic: topic2 Partition: 1 Leader: 2 Replicas: 2 Isr: 2
>>
>>
>> 
>>
>> Topic: topic3 Partition: 63 Leader: 2 Replicas: 2 Isr: 2
>>
>>
>> ---
>>
>> Topic:topic2 PartitionCount:12 ReplicationFactor:1 Configs:
>>
>> Topic: topic2 Partition: 0 Leader: 2 Replicas: 2 Isr: 2
>>
>> Topic: topic2 Partition: 1 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> 
>>
>> Topic: topic2 Partition: 11 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> ---
>>
>> Topic:topic3 PartitionCount:64 ReplicationFactor:1 Configs:
>>
>> Topic: topic3 Partition: 0 Leader: 2 Replicas: 2 Isr: 2
>>
>> Topic: topic3 Partition: 1 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> 
>>
>> Topic: topic3 Partition: 63 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> On Tue, Sep 29, 2015 at 8:47 AM, Adrian Tanase <atan...@adobe.com> wrote:
>>
>>> The error message is very explicit (partition is under replicated), I
>>> don’t think it’s related to networking issues.
>>>
>>> Try to run /home/kafka/bin/kafka-topics.sh —zookeeper localhost/kafka
>>> —describe topic_name and see which brokers are missing from the replica
>>> assignment.
>>> *(replace home, zk-quorum etc with your own set-up)*
>>>
>>> Lastly, has this ever worked? Maybe you’ve accidentally created the
>>> topic with more partitions and replicas than available brok

Re: Kafka error "partitions don't have a leader" / LeaderNotAvailableException

2015-09-29 Thread Dmitry Goldenberg
We've got Kafka working generally. Can definitely write to it now.

There was a snag where num.partitions was set to 12 on one broker but to 64
on the other.  We fixed this by setting num.partitions to 42 on both brokers,
and things are working on that side.
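
For anyone hitting the same thing, the fix amounted to making the broker-side
default identical on every broker (a sketch of the relevant server.properties
line; 42 is just the value we settled on):

# config/server.properties -- keep this consistent across all brokers
num.partitions=42

This would also explain why topic2 earlier came up with only 12 partitions,
presumably having been auto-created with that broker's default.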






On Tue, Sep 29, 2015 at 9:39 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Try writing and reading to the topics in question using the kafka command
> line tools, to eliminate your code as a variable.
>
>
> That number of partitions is probably more than sufficient:
>
>
> https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
>
> Obviously if you ask for more replicas than you have brokers you're going
> to have a problem, but that doesn't seem to be the case.
>
>
>
> Also, depending on what version of kafka you're using on the broker, you
> may want to look through the kafka jira, e.g.
>
> https://issues.apache.org/jira/browse/KAFKA-899
>
>
> On Tue, Sep 29, 2015 at 8:05 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> "more partitions and replicas than available brokers" -- what would be a
>> good ratio?
>>
>> We've been trying to set up 3 topics with 64 partitions.  I'm including
>> the output of "bin/kafka-topics.sh --zookeeper localhost:2181 --describe
>> topic1" below.
>>
>> I think it's symptomatic and confirms your theory, Adrian, that we've got
>> too many partitions. In fact, for topic 2, only 12 partitions appear to
>> have been created despite the requested 64.  Does Kafka have the limit of
>> 140 partitions total within a cluster?
>>
>> The doc doesn't appear to have any prescriptions as to how you go about
>> calculating an optimal number of partitions.
>>
>> We'll definitely try with fewer, I'm just looking for a good formula to
>> calculate how many. And no, Adrian, this hasn't worked yet, so we'll start
>> with something like 12 partitions.  It'd be good to know how high we can go
>> with that...
>>
>> Topic:topic1 PartitionCount:64 ReplicationFactor:1 Configs:
>>
>> Topic: topic1 Partition: 0 Leader: 1 Replicas: 1 Isr: 1
>>
>> Topic: topic2 Partition: 1 Leader: 2 Replicas: 2 Isr: 2
>>
>>
>> 
>>
>> Topic: topic3 Partition: 63 Leader: 2 Replicas: 2 Isr: 2
>>
>>
>> ---
>>
>> Topic:topic2 PartitionCount:12 ReplicationFactor:1 Configs:
>>
>> Topic: topic2 Partition: 0 Leader: 2 Replicas: 2 Isr: 2
>>
>> Topic: topic2 Partition: 1 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> 
>>
>> Topic: topic2 Partition: 11 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> ---
>>
>> Topic:topic3 PartitionCount:64 ReplicationFactor:1 Configs:
>>
>> Topic: topic3 Partition: 0 Leader: 2 Replicas: 2 Isr: 2
>>
>> Topic: topic3 Partition: 1 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> 
>>
>> Topic: topic3 Partition: 63 Leader: 1 Replicas: 1 Isr: 1
>>
>>
>> On Tue, Sep 29, 2015 at 8:47 AM, Adrian Tanase <atan...@adobe.com> wrote:
>>
>>> The error message is very explicit (partition is under replicated), I
>>> don’t think it’s related to networking issues.
>>>
>>> Try to run /home/kafka/bin/kafka-topics.sh —zookeeper localhost/kafka
>>> —describe topic_name and see which brokers are missing from the replica
>>> assignment.
>>> *(replace home, zk-quorum etc with your own set-up)*
>>>
>>> Lastly, has this ever worked? Maybe you’ve accidentally created the
>>> topic with more partitions and replicas than available brokers… try to
>>> recreate with fewer partitions/replicas, see if it works.
>>>
>>> -adrian
>>>
>>> From: Dmitry Goldenberg
>>> Date: Tuesday, September 29, 2015 at 3:37 PM
>>> To: Adrian Tanase
>>> Cc: "user@spark.apache.org"
>>> Subject: Re: Kafka error "partitions don't have a leader" /
>>> LeaderNotAvailableException
>>>
>>> Adrian,
>>>
>>> Thanks for your response. I just looked at both machines we're testing
>>> on a

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
Release of Spark: 1.5.0.

Command line invocation:

ACME_INGEST_HOME=/mnt/acme/acme-ingest
ACME_INGEST_VERSION=0.0.1-SNAPSHOT
ACME_BATCH_DURATION_MILLIS=5000
SPARK_MASTER_URL=spark://data1:7077
JAVA_OPTIONS="-Dspark.streaming.kafka.maxRatePerPartition=1000"
JAVA_OPTIONS="$JAVA_OPTIONS -Dspark.executor.memory=2g"

$SPARK_HOME/bin/spark-submit \
--driver-class-path  $ACME_INGEST_HOME \
--driver-java-options "$JAVA_OPTIONS" \
--class "com.acme.consumer.kafka.spark.KafkaSparkStreamingDriver" \
--master $SPARK_MASTER_URL  \
--conf
"spark.executor.extraClassPath=$ACME_INGEST_HOME/conf:$ACME_INGEST_HOME/lib/hbase-protocol-0.98.9-hadoop2.jar"
\

$ACME_INGEST_HOME/lib/acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar \
-brokerlist $METADATA_BROKER_LIST \
-topic acme.topic1 \
-autooffsetreset largest \
-batchdurationmillis $ACME_BATCH_DURATION_MILLIS \
-appname Acme.App1 \
-checkpointdir file://$SPARK_HOME/acme/checkpoint-acme-app1
Note that SolrException is definitely in our consumer jar
acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar which gets deployed to
$ACME_INGEST_HOME.

For the extraClassPath on the executors, we've additionally got
hbase-protocol-0.98.9-hadoop2.jar: we're using Apache Phoenix from the
Spark jobs to communicate with HBase.  The only way to force Phoenix to
communicate with HBase successfully was to add that JAR explicitly to the
executor classpath, even though the contents of the hbase-protocol jar get
rolled up into the consumer jar at build time.

I'm starting to wonder whether there's some class loading pattern here
where some classes may not get loaded out of the consumer jar and therefore
have to have their respective jars added to the executor extraClassPath?

Or is this a serialization problem for SolrException as Divya Ravichandran
suggested?
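
One more thing we may try (a sketch; the solrj jar name and path are
placeholders) is shipping the jar that contains SolrException explicitly via
--jars, so it lands on the executor classpaths independently of the consumer
uber-jar:

$SPARK_HOME/bin/spark-submit \
    --jars $ACME_INGEST_HOME/lib/solr-solrj-4.x.jar \
    --driver-class-path $ACME_INGEST_HOME \
    ...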




On Tue, Sep 29, 2015 at 6:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Mind providing a bit more information:
>
> release of Spark
> command line for running Spark job
>
> Cheers
>
> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> We're seeing this occasionally. Granted, this was caused by a wrinkle in
>> the Solr schema but this bubbled up all the way in Spark and caused job
>> failures.
>>
>> I just checked and SolrException class is actually in the consumer job
>> jar we use.  Is there any reason why Spark cannot find the SolrException
>> class?
>>
>> 15/09/29 15:41:58 WARN ThrowableSerializationWrapper: Task exception
>> could not be deserialized
>> java.lang.ClassNotFoundException: org.apache.solr.common.SolrException
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>> at
>> org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> java.i

ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
We're seeing this occasionally. Granted, this was caused by a wrinkle in
the Solr schema but this bubbled up all the way in Spark and caused job
failures.

I just checked and SolrException class is actually in the consumer job jar
we use.  Is there any reason why Spark cannot find the SolrException class?

15/09/29 15:41:58 WARN ThrowableSerializationWrapper: Task exception could
not be deserialized
java.lang.ClassNotFoundException: org.apache.solr.common.SolrException
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.ThrowableSerializationWrapper.readObject(TaskEndReason.scala:163)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:108)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$2.apply(TaskResultGetter.scala:105)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:105)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
I'm actually not sure how either one of these settings would help Spark
find SolrException. Whether the driver or the executor classpath is searched
first, shouldn't the class be found either way, as long as it's in the
consumer job jar?




On Tue, Sep 29, 2015 at 9:12 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com
> wrote:

> Ted, I think I have tried these settings with the hbase protocol jar, to
> no avail.
>
> I'm going to see if I can try and use these with this SolrException issue
> though it now may be harder to reproduce it. Thanks for the suggestion.
>
> On Tue, Sep 29, 2015 at 8:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you tried the following ?
>> --conf spark.driver.userClassPathFirst=true --conf spark.executor.
>> userClassPathFirst=true
>>
>> On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> Release of Spark: 1.5.0.
>>>
>>> Command line invocation:
>>>
>>> ACME_INGEST_HOME=/mnt/acme/acme-ingest
>>> ACME_INGEST_VERSION=0.0.1-SNAPSHOT
>>> ACME_BATCH_DURATION_MILLIS=5000
>>> SPARK_MASTER_URL=spark://data1:7077
>>> JAVA_OPTIONS="-Dspark.streaming.kafka.maxRatePerPartition=1000"
>>> JAVA_OPTIONS="$JAVA_OPTIONS -Dspark.executor.memory=2g"
>>>
>>> $SPARK_HOME/bin/spark-submit \
>>> --driver-class-path  $ACME_INGEST_HOME \
>>> --driver-java-options "$JAVA_OPTIONS" \
>>> --class
>>> "com.acme.consumer.kafka.spark.KafkaSparkStreamingDriver" \
>>> --master $SPARK_MASTER_URL  \
>>> --conf
>>> "spark.executor.extraClassPath=$ACME_INGEST_HOME/conf:$ACME_INGEST_HOME/lib/hbase-protocol-0.98.9-hadoop2.jar"
>>> \
>>>
>>> $ACME_INGEST_HOME/lib/acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar \
>>> -brokerlist $METADATA_BROKER_LIST \
>>> -topic acme.topic1 \
>>> -autooffsetreset largest \
>>> -batchdurationmillis $ACME_BATCH_DURATION_MILLIS \
>>> -appname Acme.App1 \
>>> -checkpointdir file://$SPARK_HOME/acme/checkpoint-acme-app1
>>> Note that SolrException is definitely in our consumer jar
>>> acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar which gets deployed to
>>> $ACME_INGEST_HOME.
>>>
>>> For the extraClassPath on the executors, we've got additionally
>>> hbase-protocol-0.98.9-hadoop2.jar: we're using Apache Phoenix from the
>>> Spark jobs to communicate with HBase.  The only way to force Phoenix to
>>> successfully communicate with HBase was to have that JAR explicitly added
>>> to the executor classpath regardless of the fact that the contents of the
>>> hbase-protocol hadoop jar get rolled up into the consumer jar at build time.
>>>
>>> I'm starting to wonder whether there's some class loading pattern here
>>> where some classes may not get loaded out of the consumer jar and therefore
>>> have to have their respective jars added to the executor extraClassPath?
>>>
>>> Or is this a serialization problem for SolrException as Divya
>>> Ravichandran suggested?
>>>
>>>
>>>
>>>
>>> On Tue, Sep 29, 2015 at 6:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Mind providing a bit more information:
>>>>
>>>> release of Spark
>>>> command line for running Spark job
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
>>>> dgoldenberg...@gmail.com> wrote:
>>>>
>>>>> We're seeing this occasionally. Granted, this was caused by a wrinkle
>>>>> in the Solr schema but this bubbled up all the way in Spark and caused job
>>>>> failures.
>>>>>
>>>>> I just checked and SolrException class is actually in the consumer job
>>>>> jar we use.  Is there any reason why Spark cannot find the SolrException
>>>>> class?
>>>>>
>>>>> 15/09/29 15:41:58 WARN ThrowableSerializationWrapper: Task exception
>>>>> could not be deserialized
>>>>> java.lang.ClassNotFoundException: org.apache.solr.common.SolrException
>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>

Re: ThrowableSerializationWrapper: Task exception could not be deserialized / ClassNotFoundException: org.apache.solr.common.SolrException

2015-09-29 Thread Dmitry Goldenberg
Ted, I think I have tried these settings with the hbase protocol jar, to no
avail.

I'm going to see if I can try and use these with this SolrException issue
though it now may be harder to reproduce it. Thanks for the suggestion.

On Tue, Sep 29, 2015 at 8:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Have you tried the following ?
> --conf spark.driver.userClassPathFirst=true --conf spark.executor.
> userClassPathFirst=true
>
> On Tue, Sep 29, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Release of Spark: 1.5.0.
>>
>> Command line invocation:
>>
>> ACME_INGEST_HOME=/mnt/acme/acme-ingest
>> ACME_INGEST_VERSION=0.0.1-SNAPSHOT
>> ACME_BATCH_DURATION_MILLIS=5000
>> SPARK_MASTER_URL=spark://data1:7077
>> JAVA_OPTIONS="-Dspark.streaming.kafka.maxRatePerPartition=1000"
>> JAVA_OPTIONS="$JAVA_OPTIONS -Dspark.executor.memory=2g"
>>
>> $SPARK_HOME/bin/spark-submit \
>> --driver-class-path  $ACME_INGEST_HOME \
>> --driver-java-options "$JAVA_OPTIONS" \
>> --class "com.acme.consumer.kafka.spark.KafkaSparkStreamingDriver"
>> \
>> --master $SPARK_MASTER_URL  \
>> --conf
>> "spark.executor.extraClassPath=$ACME_INGEST_HOME/conf:$ACME_INGEST_HOME/lib/hbase-protocol-0.98.9-hadoop2.jar"
>> \
>>
>> $ACME_INGEST_HOME/lib/acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar \
>> -brokerlist $METADATA_BROKER_LIST \
>> -topic acme.topic1 \
>> -autooffsetreset largest \
>> -batchdurationmillis $ACME_BATCH_DURATION_MILLIS \
>> -appname Acme.App1 \
>> -checkpointdir file://$SPARK_HOME/acme/checkpoint-acme-app1
>> Note that SolrException is definitely in our consumer jar
>> acme-ingest-kafka-spark-$ACME_INGEST_VERSION.jar which gets deployed to
>> $ACME_INGEST_HOME.
>>
>> For the extraClassPath on the executors, we've got additionally
>> hbase-protocol-0.98.9-hadoop2.jar: we're using Apache Phoenix from the
>> Spark jobs to communicate with HBase.  The only way to force Phoenix to
>> successfully communicate with HBase was to have that JAR explicitly added
>> to the executor classpath regardless of the fact that the contents of the
>> hbase-protocol hadoop jar get rolled up into the consumer jar at build time.
>>
>> I'm starting to wonder whether there's some class loading pattern here
>> where some classes may not get loaded out of the consumer jar and therefore
>> have to have their respective jars added to the executor extraClassPath?
>>
>> Or is this a serialization problem for SolrException as Divya
>> Ravichandran suggested?
>>
>>
>>
>>
>> On Tue, Sep 29, 2015 at 6:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Mind providing a bit more information:
>>>
>>> release of Spark
>>> command line for running Spark job
>>>
>>> Cheers
>>>
>>> On Tue, Sep 29, 2015 at 1:37 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> We're seeing this occasionally. Granted, this was caused by a wrinkle
>>>> in the Solr schema but this bubbled up all the way in Spark and caused job
>>>> failures.
>>>>
>>>> I just checked and SolrException class is actually in the consumer job
>>>> jar we use.  Is there any reason why Spark cannot find the SolrException
>>>> class?
>>>>
>>>> 15/09/29 15:41:58 WARN ThrowableSerializationWrapper: Task exception
>>>> could not be deserialized
>>>> java.lang.ClassNotFoundException: org.apache.solr.common.SolrException
>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>> at java.lang.Class.forName0(Native Method)
>>>> at java.lang.Class.forName(Class.java:348)
>>>> at
>>>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>>>> at
>>>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>>>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>>>> at
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-15 Thread Dmitry Goldenberg
Thanks, Mark, will look into that...
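
For the archive, here's roughly what I understand the suggestion to look like
from the Java side, using the async actions (rdd here is any JavaRDD,
timeoutSeconds is a long we'd pull from config, and the choice of countAsync
is just illustrative; error handling is trimmed):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.spark.api.java.JavaFutureAction;
import org.apache.spark.api.java.JavaRDD;

...
// Kick the action off asynchronously and cancel it if it overruns the timeout.
JavaFutureAction<Long> action = rdd.countAsync();
try {
  action.get(timeoutSeconds, TimeUnit.SECONDS);
} catch (TimeoutException e) {
  // Cancels the underlying Spark jobs for this action.
  action.cancel(true);
}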

On Tue, Sep 15, 2015 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> There is the Async API (
> https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/AsyncRDDActions.scala),
> which makes use of FutureAction (
> https://github.com/clearstorydata/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala).
> You could also wrap up your Jobs in Futures on your own.
>
> On Mon, Sep 14, 2015 at 11:37 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
> As of now I think it's a no. Not sure if it's a naive approach, but yes you
>> can have a separate program to keep an eye in the webui (possibly parsing
>> the content) and make it trigger the kill task/job once it detects a lag.
>> (Again you will have to figure out the correct numbers before killing any
>> job)
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Sep 14, 2015 at 10:40 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> Is there a way in Spark to automatically terminate laggard "stage's",
>>> ones that appear to be hanging?   In other words, is there a timeout for
>>> processing of a given RDD?
>>>
>>> In the Spark GUI, I see the "kill" function for a given Stage under
>>> 'Details for Job <...>".
>>>
>>> Is there something in Spark that would identify and kill laggards
>>> proactively?
>>>
>>> Thanks.
>>>
>>
>>
>


A way to timeout and terminate a laggard 'Stage' ?

2015-09-14 Thread Dmitry Goldenberg
Is there a way in Spark to automatically terminate laggard "stages", ones
that appear to be hanging?   In other words, is there a timeout for
processing of a given RDD?

In the Spark GUI, I see the "kill" function for a given Stage under
'Details for Job <...>".

Is there something in Spark that would identify and kill laggards
proactively?

Thanks.


A way to kill laggard jobs?

2015-09-11 Thread Dmitry Goldenberg
Is there a way to kill a laggard Spark job manually, and more importantly,
is there a way to do it programmatically based on a configurable timeout
value?
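
One programmatic angle we've been considering (a sketch; the group id and the
watchdog wiring are our own) is to tag the work with a job group and cancel
the whole group if it overruns a deadline:

// In the code that submits the work (jsc is the JavaSparkContext):
jsc.setJobGroup("acme-batch", "Acme ingest batch", true); // interruptOnCancel
// ... run the actions for this batch ...

// From a separate watchdog thread, once the configured timeout has elapsed:
jsc.cancelJobGroup("acme-batch");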

Thanks.


Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Dmitry Goldenberg
>> checkpoints can't be used between controlled restarts

Is that true? If so, why? From my testing, checkpoints appear to be working
fine; we get the data we've missed between the time the consumer went down
and the time we brought it back up.

>> If I cannot make checkpoints between code upgrades, does it mean that
Spark does not help me at all with keeping my Kafka offsets? Does it mean,
that I have to implement my own storing to/initialization of offsets from
Zookeeper?

By code upgrades, do you mean code changes to the consumer program?

If that is the case, one idea we've been entertaining is that, if the
consumer changes, especially if its configuration parameters change, it
means that some older configuration may still be stuck in the
checkpointing.  What we'd do in this case is, prior to starting the
consumer, blow away the checkpointing directory and re-consume from Kafka
from the smallest offsets.  In our case, it's OK to re-process; I realize
that in many cases that may not be an option.  If that's the case then it
would seem to follow that you have to manage offsets in Zk...
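
If we do end up managing offsets ourselves, my understanding is that the
direct stream exposes the per-batch offset ranges, which could then be written
to ZooKeeper or some other store once the batch has been processed (a sketch;
messages is the JavaPairInputDStream returned by createDirectStream as in the
code quoted later in this thread, and the actual persistence and error
handling are omitted):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

...
messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
  @Override
  public Void call(JavaPairRDD<String, String> rdd) throws Exception {
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // Process the batch first, then record topic/partition/untilOffset
    // in ZooKeeper (or another store) only after processing succeeds.
    for (OffsetRange range : ranges) {
      System.out.println(range.topic() + " " + range.partition()
          + " " + range.fromOffset() + " -> " + range.untilOffset());
    }
    return null;
  }
});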

Another thing to consider would be to treat upgrades operationally. That is,
if an upgrade is to happen, consume the data up to a certain point, then
bring the system down for the upgrade. Remove the checkpointing directory.
Restart everything; the system would then rebuild the checkpointing using
your upgraded consumers.  (Again, this may not be possible in some systems
where the data influx is constant and/or the data is mission critical.)

Perhaps this discussion implies that there may be a new feature in Spark
where it intelligently drops the checkpointing or allows you to selectively
pluck out and drop some items prior to restarting...




On Thu, Sep 10, 2015 at 6:22 AM, Akhil Das 
wrote:

> This consumer pretty much covers all those scenarios you listed
> github.com/dibbhatt/kafka-spark-consumer Give it a try.
>
> Thanks
> Best Regards
>
> On Thu, Sep 10, 2015 at 3:32 PM, Krzysztof Zarzycki 
> wrote:
>
>> Hi there,
>> I have a problem with fulfilling all my needs when using Spark Streaming
>> on Kafka. Let me enumerate my requirements:
>> 1. I want to have at-least-once/exactly-once processing.
>> 2. I want to have my application fault & simple stop tolerant. The Kafka
>> offsets need to be tracked between restarts.
>> 3. I want to be able to upgrade code of my application without losing
>> Kafka offsets.
>>
>> Now what my requirements imply according to my knowledge:
>> 1. implies using new Kafka DirectStream.
>> 2. implies  using checkpointing. kafka DirectStream will write offsets to
>> the checkpoint as well.
>> 3. implies that checkpoints can't be used between controlled restarts. So
>> I need to install shutdownHook with ssc.stop(stopGracefully=true) (here is
>> a description how:
>> https://metabroadcast.com/blog/stop-your-spark-streaming-application-gracefully
>> )
>>
>> Now my problems are:
>> 1. If I cannot make checkpoints between code upgrades, does it mean that
>> Spark does not help me at all with keeping my Kafka offsets? Does it mean,
>> that I have to implement my own storing to/initialization of offsets from
>> Zookeeper?
>> 2. When I set up shutdownHook and my any executor throws an exception, it
>> seems that application does not fail, but stuck in running state. Is that
>> because stopGracefully deadlocks on exceptions? How to overcome this
>> problem? Maybe I can avoid setting shutdownHook and there is other way to
>> stop gracefully your app?
>>
>> 3. If I somehow overcome 2., is it enough to just stop gracefully my app
>> to be able to upgrade code & not lose Kafka offsets?
>>
>>
>> Thank you a lot for your answers,
>> Krzysztof Zarzycki
>>
>>
>>
>>
>


Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-10 Thread Dmitry Goldenberg
>> The whole point of checkpointing is to recover the *exact* computation
where it left off.

That makes sense. We were looking at metadata checkpointing and data
checkpointing, and with data checkpointing you can specify a checkpoint
duration value. With metadata checkpointing there doesn't seem to be a way;
that may be the intent, but it wasn't clear why one duration (for data) can
be overridden while the other (for metadata) cannot.

The basic idea was that we'd want to minimize the number of times Spark
Streaming does the checkpointing I/O. In other words, we're after some sort
of sweet-spot value where we checkpoint frequently enough without performing
I/O too often. Finding that sweet spot would mean experimenting with the
checkpoint duration millis, but that parameter doesn't appear to be exposed
for metadata checkpointing.
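
For the data (RDD) checkpointing side, the interval can at least be set
explicitly per DStream, which is what the code quoted later in this thread
does (60 seconds here is just an example value):

// Overrides the RDD checkpoint interval for this stream only.
messages.checkpoint(Durations.seconds(60));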



On Wed, Sep 9, 2015 at 10:39 PM, Tathagata Das <t...@databricks.com> wrote:

> The whole point of checkpointing is to recover the *exact* computation
> where it left off.
> If you want any change in the specification of the computation (which
> includes any intervals), then you cannot recover from checkpoint as it can
> be an arbitrarily complex issue to deal with changes in the specs,
> especially because a lot of specs are tied to each other (e.g. checkpoint
> interval dictates other things like clean up intervals, etc.)
>
> Why do you need to change the checkpointing interval at the time of
> recovery? Trying to understand your usecase.
>
>
> On Wed, Sep 9, 2015 at 12:03 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> >> when you use getOrCreate, and there exists a valid checkpoint, it will
>> always return the context from the checkpoint and not call the factory.
>> Simple way to see whats going on is to print something in the factory to
>> verify whether it is ever called.
>>
>> This is probably OK. Seems to explain why we were getting a sticky batch
>> duration millis value. Once I blew away all the checkpointing directories
>> and unplugged the data checkpointing (while keeping the metadata
>> checkpointing) the batch duration millis was no longer sticky.
>>
>> So, there doesn't seem to be a way for metadata checkpointing to override
>> its checkpoint duration millis, is there?  Is the default there
>> max(batchdurationmillis, 10seconds)?  Is there a way to override this?
>> Thanks.
>>
>>
>>
>>
>>
>> On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>>
>>>
>>> See inline.
>>>
>>> On Tue, Sep 8, 2015 at 9:02 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> What's wrong with creating a checkpointed context??  We WANT
>>>> checkpointing, first of all.  We therefore WANT the checkpointed context.
>>>>
>>>> Second of all, it's not true that we're loading the checkpointed
>>>> context independent of whether params.isCheckpointed() is true.  I'm
>>>> quoting the code again:
>>>>
>>>> // This is NOT loading a checkpointed context if isCheckpointed() is
>>>> false.
>>>> JavaStreamingContext jssc = params.isCheckpointed() ?
>>>> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
>>>> params);
>>>>
>>>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>>>> sparkConf, Parameters params) {
>>>> JavaStreamingContextFactory factory = new
>>>> JavaStreamingContextFactory() {
>>>>   @Override
>>>>   public JavaStreamingContext create() {
>>>> return createContext(sparkConf, params);
>>>>   }
>>>> };
>>>> return *JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>>>> factory);*
>>>>
>>> ^   when you use getOrCreate, and there exists a valid checkpoint,
>>> it will always return the context from the checkpoint and not call the
>>> factory. Simple way to see whats going on is to print something in the
>>> factory to verify whether it is ever called.
>>>
>>>
>>>
>>>
>>>
>>>>   }
>>>>
>>>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>>>> Parameters params) {
>>>> // Create context with the specified batch interval, in
>>>> milliseconds.
>>>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>>>> Durations.milliseconds(params.getBatchDurationMillis()));
>>>> // Set the checkpoint directory, if we're checkpointing
>>>> if (params.isCheckpointed()) {
>>>>   jssc.checkpoint(params.getCheckpointDir());
>>>>
>>>> }
>>>> ...
>>>> Again, this is *only* calling context.checkpoint() if isCheckpointed()
>>>> is true.  And we WANT it to be true.
>>>>
>>>> What am I missing here?
>>>>
>>>>
>>>>
>>
>


Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-09 Thread Dmitry Goldenberg
>> when you use getOrCreate, and there exists a valid checkpoint, it will
always return the context from the checkpoint and not call the factory.
Simple way to see whats going on is to print something in the factory to
verify whether it is ever called.

This is probably OK. Seems to explain why we were getting a sticky batch
duration millis value. Once I blew away all the checkpointing directories
and unplugged the data checkpointing (while keeping the metadata
checkpointing) the batch duration millis was no longer sticky.

So, there doesn't seem to be a way for metadata checkpointing to override
its checkpoint duration millis, is there?  Is the default there
max(batchdurationmillis, 10seconds)?  Is there a way to override this?
Thanks.





On Wed, Sep 9, 2015 at 2:44 PM, Tathagata Das <t...@databricks.com> wrote:

>
>
> See inline.
>
> On Tue, Sep 8, 2015 at 9:02 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> What's wrong with creating a checkpointed context??  We WANT
>> checkpointing, first of all.  We therefore WANT the checkpointed context.
>>
>> Second of all, it's not true that we're loading the checkpointed context
>> independent of whether params.isCheckpointed() is true.  I'm quoting the
>> code again:
>>
>> // This is NOT loading a checkpointed context if isCheckpointed() is
>> false.
>> JavaStreamingContext jssc = params.isCheckpointed() ?
>> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
>> params);
>>
>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>> sparkConf, Parameters params) {
>> JavaStreamingContextFactory factory = new
>> JavaStreamingContextFactory() {
>>   @Override
>>   public JavaStreamingContext create() {
>> return createContext(sparkConf, params);
>>   }
>> };
>> return *JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>> factory);*
>>
> ^   when you use getOrCreate, and there exists a valid checkpoint, it
> will always return the context from the checkpoint and not call the
> factory. Simple way to see whats going on is to print something in the
> factory to verify whether it is ever called.
>
>
>
>
>
>>   }
>>
>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>> Parameters params) {
>> // Create context with the specified batch interval, in milliseconds.
>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>> Durations.milliseconds(params.getBatchDurationMillis()));
>> // Set the checkpoint directory, if we're checkpointing
>> if (params.isCheckpointed()) {
>>   jssc.checkpoint(params.getCheckpointDir());
>>
>> }
>> ...
>> Again, this is *only* calling context.checkpoint() if isCheckpointed() is
>> true.  And we WANT it to be true.
>>
>> What am I missing here?
>>
>>
>>


Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
>> Why are you checkpointing the direct kafka stream? It serves no purpose.

Could you elaborate on what you mean?

Our goal is fault tolerance.  If a consumer is killed or stopped midstream,
we want to resume where we left off next time the consumer is restarted.

How would that be "not serving a purpose"?  This is already working for
us.  From our testing, we indeed resume where we left off, which is not
possible without checkpointing.  If checkpointing is turned off, we'd
resume from the latest Kafka topic entries, which would mean skipping
over some entries.

Please elaborate.

We need both checkpointing and the ability to set batch duration millis.
The Spark API provides both capabilities but somehow if checkpointing is
turned on, our batch duration millis are always set to 10 seconds
internally by Spark.  What is the resolution?


On Tue, Sep 8, 2015 at 7:23 PM, Tathagata Das <t...@databricks.com> wrote:

> Why are you checkpointing the direct kafka stream? It serves no purpose.
>
> TD
>
> On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I just disabled checkpointing in our consumers and I can see that the
>> batch duration millis set to 20 seconds is now being honored.
>>
>> Why would that be the case?
>>
>> And how can we "untie" batch duration millis from checkpointing?
>>
>> Thanks.
>>
>> On Tue, Sep 8, 2015 at 11:48 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Well, I'm not sure why you're checkpointing messages.
>>>
>>> I'd also put in some logging to see what values are actually being read
>>> out of your params object for the various settings.
>>>
>>>
>>> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> I've stopped the jobs, the workers, and the master. Deleted the
>>>> contents of the checkpointing dir. Then restarted master, workers, and
>>>> consumers.
>>>>
>>>> I'm seeing the job in question still firing every 10 seconds.  I'm
>>>> seeing the 10 seconds in the Spark Jobs GUI page as well as our logs.
>>>> Seems quite strange given that the jobs used to fire every 1 second, we've
>>>> switched to 10, now trying to switch to 20 and batch duration millis is not
>>>> changing.
>>>>
>>>> Does anything stand out in the code perhaps?
>>>>
>>>> On Tue, Sep 8, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Have you tried deleting or moving the contents of the checkpoint
>>>>> directory and restarting the job?
>>>>>
>>>>> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
>>>>> dgoldenberg...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, more relevant code below:
>>>>>>
>>>>>> SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
>>>>>> JavaStreamingContext jssc = params.isCheckpointed() ?
>>>>>> createCheckpointedContext(sparkConf, params) : createContext(
>>>>>> sparkConf, params);
>>>>>> jssc.start();
>>>>>> jssc.awaitTermination();
>>>>>> jssc.close();
>>>>>> ………..
>>>>>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>>>>>> sparkConf, Parameters params) {
>>>>>> JavaStreamingContextFactory factory = new
>>>>>> JavaStreamingContextFactory() {
>>>>>>   @Override
>>>>>>   public JavaStreamingContext create() {
>>>>>> return createContext(sparkConf, params);
>>>>>>   }
>>>>>> };
>>>>>> return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>>>>>> factory);
>>>>>>   }
>>>>>>
>>>>>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>>>>>> Parameters params) {
>>>>>> // Create context with the specified batch interval, in
>>>>>> milliseconds.
>>>>>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>>>>>> Durations.milliseconds(params.getBatchDurationMillis()));
>>>>>> // Set the checkpoint directory, if we're checkpointing
>>>>>> if (params.isCheckpointed()) {
>>>>>>   jssc.

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
What's wrong with creating a checkpointed context??  We WANT checkpointing,
first of all.  We therefore WANT the checkpointed context.

Second of all, it's not true that we're loading the checkpointed context
independent of whether params.isCheckpointed() is true.  I'm quoting the
code again:

// This is NOT loading a checkpointed context if isCheckpointed() is false.
JavaStreamingContext jssc = params.isCheckpointed() ?
createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
params);

  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
Parameters params) {
JavaStreamingContextFactory factory = new JavaStreamingContextFactory()
{
  @Override
  public JavaStreamingContext create() {
return createContext(sparkConf, params);
  }
};
return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
factory);
  }

  private JavaStreamingContext createContext(SparkConf sparkConf,
Parameters params) {
// Create context with the specified batch interval, in milliseconds.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
// Set the checkpoint directory, if we're checkpointing
if (params.isCheckpointed()) {
  jssc.checkpoint(params.getCheckpointDir());

}
...
Again, this is *only* calling context.checkpoint() if isCheckpointed() is
true.  And we WANT it to be true.

What am I missing here?



On Tue, Sep 8, 2015 at 11:42 PM, Tathagata Das <t...@databricks.com> wrote:

> Well, you are returning JavaStreamingContext.getOrCreate(params.
> getCheckpointDir(), factory);
> That is loading the checkpointed context, independent of whether params
> .isCheckpointed() is true.
>
>
>
>
> On Tue, Sep 8, 2015 at 8:28 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> That is good to know. However, that doesn't change the problem I'm
>> seeing. Which is that, even with that piece of code commented out
>> (stream.checkpoint()), the batch duration millis aren't getting changed
>> unless I take checkpointing completely out.
>>
>> In other words, this commented out:
>>
>> //if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
>> //
>> messages.checkpoint(Durations.milliseconds(params.getCheckpointMillis()));
>> //}
>>
>> doesn't change the "stickiness" of the batch duration millis value.
>> However, if the context is not checkpointed, the problem goes away and the
>> batch duration millis setting is properly updated.
>>
>> I'll take the commented out piece out however I still need to figure out
>> how to fix the batch duration millis so I can also keep the checkpointed
>> context.
>>
>> The checkpointed context is basically created as I stated before, note
>> the invocation of the checkpoint() method:
>>
>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>> sparkConf, Parameters params) {
>> JavaStreamingContextFactory factory = new
>> JavaStreamingContextFactory() {
>>   @Override
>>   public JavaStreamingContext create() {
>> return createContext(sparkConf, params);
>>   }
>> };
>> return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>> factory);
>>   }
>>
>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>> Parameters params) {
>> // Create context with the specified batch interval, in milliseconds.
>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>> Durations.milliseconds(params.getBatchDurationMillis()));
>> // Set the checkpoint directory, if we're checkpointing
>> if (params.isCheckpointed()) {
>>   jssc.checkpoint(params.getCheckpointDir());
>> }
>>
>> ...
>>
>> On Tue, Sep 8, 2015 at 11:14 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> Calling directKafkaStream.checkpoint() will make the system write the
>>> raw kafka data into HDFS files (that is, RDD checkpointing). This is
>>> completely unnecessary with Direct Kafka because it already tracks the
>>> offset of data in each batch
>>>
>>
>>
>>> (which checkpoint is enabled using
>>> streamingContext.checkpoint(checkpointDir)) and can recover from failure by
>>> reading the exact same data back from Kafka.
>>>
>>>
>>> TD
>>>
>>> On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> >> Why are you checkpointing the direct kafka stream? It serv

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
That is good to know. However, it doesn't change the problem I'm seeing,
which is that even with that piece of code commented out
(stream.checkpoint()), the batch duration millis aren't getting changed
unless I take checkpointing out completely.

In other words, this commented out:

//if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
//
messages.checkpoint(Durations.milliseconds(params.getCheckpointMillis()));
//}

doesn't change the "stickiness" of the batch duration millis value.
However, if the context is not checkpointed, the problem goes away and the
batch duration millis setting is properly updated.

I'll take the commented out piece out however I still need to figure out
how to fix the batch duration millis so I can also keep the checkpointed
context.

The checkpointed context is basically created as I stated before; note the
invocation of the checkpoint() method:

  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
Parameters params) {
JavaStreamingContextFactory factory = new JavaStreamingContextFactory()
{
  @Override
  public JavaStreamingContext create() {
return createContext(sparkConf, params);
  }
};
return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
factory);
  }

  private JavaStreamingContext createContext(SparkConf sparkConf,
Parameters params) {
// Create context with the specified batch interval, in milliseconds.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
// Set the checkpoint directory, if we're checkpointing
if (params.isCheckpointed()) {
  jssc.checkpoint(params.getCheckpointDir());
}

...

On Tue, Sep 8, 2015 at 11:14 PM, Tathagata Das <t...@databricks.com> wrote:

> Calling directKafkaStream.checkpoint() will make the system write the raw
> kafka data into HDFS files (that is, RDD checkpointing). This is completely
> unnecessary with Direct Kafka because it already tracks the offset of data
> in each batch
>


> (which checkpoint is enabled using
> streamingContext.checkpoint(checkpointDir)) and can recover from failure by
> reading the exact same data back from Kafka.
>
>
> TD
>
> On Tue, Sep 8, 2015 at 4:38 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> >> Why are you checkpointing the direct kafka stream? It serves no
>> purpose.
>>
>> Could you elaborate on what you mean?
>>
>> Our goal is fault tolerance.  If a consumer is killed or stopped
>> midstream, we want to resume where we left off next time the consumer is
>> restarted.
>>
>> How would that be "not serving a purpose"?  This is already working for
>> us.  From our testing, we indeed resume where we left off, which is not
>> possible without checkpointing.  If checkpointing is turned off, we'd
>> resume from the latest Kafka topic entries, which would mean skipping
>> over some entries.
>>
>> Please elaborate.
>>
>> We need both checkpointing and the ability to set batch duration millis.
>> The Spark API provides both capabilities but somehow if checkpointing is
>> turned on, our batch duration millis are always set to 10 seconds
>> internally by Spark.  What is the resolution?
>>
>>
>> On Tue, Sep 8, 2015 at 7:23 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> Why are you checkpointing the direct kafka stream? It serves no purpose.
>>>
>>> TD
>>>
>>> On Tue, Sep 8, 2015 at 9:35 AM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> I just disabled checkpointing in our consumers and I can see that the
>>>> batch duration millis set to 20 seconds is now being honored.
>>>>
>>>> Why would that be the case?
>>>>
>>>> And how can we "untie" batch duration millis from checkpointing?
>>>>
>>>> Thanks.
>>>>
>>>> On Tue, Sep 8, 2015 at 11:48 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> Well, I'm not sure why you're checkpointing messages.
>>>>>
>>>>> I'd also put in some logging to see what values are actually being
>>>>> read out of your params object for the various settings.
>>>>>
>>>>>
>>>>> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
>>>>> dgoldenberg...@gmail.com> wrote:
>>>>>
>>>>>> I've stopped the jobs, the workers, and the master. Deleted the
>>>>>> contents of the checkpointing dir. 

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
I just disabled checkpointing in our consumers and I can see that the batch
duration millis set to 20 seconds is now being honored.

Why would that be the case?

And how can we "untie" batch duration millis from checkpointing?

Thanks.

On Tue, Sep 8, 2015 at 11:48 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Well, I'm not sure why you're checkpointing messages.
>
> I'd also put in some logging to see what values are actually being read
> out of your params object for the various settings.
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I've stopped the jobs, the workers, and the master. Deleted the contents
>> of the checkpointing dir. Then restarted master, workers, and consumers.
>>
>> I'm seeing the job in question still firing every 10 seconds.  I'm seeing
>> the 10 seconds in the Spark Jobs GUI page as well as our logs.  Seems quite
>> strange given that the jobs used to fire every 1 second, we've switched to
>> 10, now trying to switch to 20 and batch duration millis is not changing.
>>
>> Does anything stand out in the code perhaps?
>>
>> On Tue, Sep 8, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Have you tried deleting or moving the contents of the checkpoint
>>> directory and restarting the job?
>>>
>>> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> Sorry, more relevant code below:
>>>>
>>>> SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
>>>> JavaStreamingContext jssc = params.isCheckpointed() ?
>>>> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
>>>> params);
>>>> jssc.start();
>>>> jssc.awaitTermination();
>>>> jssc.close();
>>>> ………..
>>>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>>>> sparkConf, Parameters params) {
>>>> JavaStreamingContextFactory factory = new
>>>> JavaStreamingContextFactory() {
>>>>   @Override
>>>>   public JavaStreamingContext create() {
>>>> return createContext(sparkConf, params);
>>>>   }
>>>> };
>>>> return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>>>> factory);
>>>>   }
>>>>
>>>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>>>> Parameters params) {
>>>> // Create context with the specified batch interval, in
>>>> milliseconds.
>>>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>>>> Durations.milliseconds(params.getBatchDurationMillis()));
>>>> // Set the checkpoint directory, if we're checkpointing
>>>> if (params.isCheckpointed()) {
>>>>   jssc.checkpoint(params.getCheckpointDir());
>>>> }
>>>>
>>>> Set topicsSet = new HashSet(Arrays.asList(params
>>>> .getTopic()));
>>>>
>>>> // Set the Kafka parameters.
>>>> Map<String, String> kafkaParams = new HashMap<String, String>();
>>>> kafkaParams.put(KafkaProducerProperties.METADATA_BROKER_LIST,
>>>> params.getBrokerList());
>>>> if (StringUtils.isNotBlank(params.getAutoOffsetReset())) {
>>>>   kafkaParams.put(KafkaConsumerProperties.AUTO_OFFSET_RESET, params
>>>> .getAutoOffsetReset());
>>>> }
>>>>
>>>> // Create direct Kafka stream with the brokers and the topic.
>>>> JavaPairInputDStream<String, String> messages =
>>>> KafkaUtils.createDirectStream(
>>>>   jssc,
>>>>   String.class,
>>>>   String.class,
>>>>   StringDecoder.class,
>>>>   StringDecoder.class,
>>>>   kafkaParams,
>>>>   topicsSet);
>>>>
>>>> // See if there's an override of the default checkpoint duration.
>>>> if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
>>>>   messages.checkpoint(Durations.milliseconds(params
>>>> .getCheckpointMillis()));
>>>> }
>>>>
>>>> JavaDStream messageBodies = messages.map(new
>>>> Function<Tuple2<String, String>, String>() {
>>>>   @Override
>>>>   public String call(Tuple2<Strin

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
Just verified the logic for passing the batch duration millis in; it looks OK.
I see the value of 20 seconds reflected in the logs, but not in the Spark UI.

Also, just commented out this piece and the consumer is still stuck at
using 10 seconds for batch duration millis.

//if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
//
messages.checkpoint(Durations.milliseconds(params.getCheckpointMillis()));
//}

The reason this is in the code is so that we can control the checkpointing
interval.  Doing this through the checkpoint() method seems to be the only way
to override the default value, which is max(batch duration millis, 10 seconds).
Is there a better way of doing this?

Thanks.


On Tue, Sep 8, 2015 at 11:48 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Well, I'm not sure why you're checkpointing messages.
>
> I'd also put in some logging to see what values are actually being read
> out of your params object for the various settings.
>
>
> On Tue, Sep 8, 2015 at 10:24 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I've stopped the jobs, the workers, and the master. Deleted the contents
>> of the checkpointing dir. Then restarted master, workers, and consumers.
>>
>> I'm seeing the job in question still firing every 10 seconds.  I'm seeing
>> the 10 seconds in the Spark Jobs GUI page as well as our logs.  Seems quite
>> strange given that the jobs used to fire every 1 second, we've switched to
>> 10, now trying to switch to 20 and batch duration millis is not changing.
>>
>> Does anything stand out in the code perhaps?
>>
>> On Tue, Sep 8, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> Have you tried deleting or moving the contents of the checkpoint
>>> directory and restarting the job?
>>>
>>> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> Sorry, more relevant code below:
>>>>
>>>> SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
>>>> JavaStreamingContext jssc = params.isCheckpointed() ?
>>>> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
>>>> params);
>>>> jssc.start();
>>>> jssc.awaitTermination();
>>>> jssc.close();
>>>> ………..
>>>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>>>> sparkConf, Parameters params) {
>>>> JavaStreamingContextFactory factory = new
>>>> JavaStreamingContextFactory() {
>>>>   @Override
>>>>   public JavaStreamingContext create() {
>>>> return createContext(sparkConf, params);
>>>>   }
>>>> };
>>>> return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>>>> factory);
>>>>   }
>>>>
>>>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>>>> Parameters params) {
>>>> // Create context with the specified batch interval, in
>>>> milliseconds.
>>>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>>>> Durations.milliseconds(params.getBatchDurationMillis()));
>>>> // Set the checkpoint directory, if we're checkpointing
>>>> if (params.isCheckpointed()) {
>>>>   jssc.checkpoint(params.getCheckpointDir());
>>>> }
>>>>
>>>> Set topicsSet = new HashSet(Arrays.asList(params
>>>> .getTopic()));
>>>>
>>>> // Set the Kafka parameters.
>>>> Map<String, String> kafkaParams = new HashMap<String, String>();
>>>> kafkaParams.put(KafkaProducerProperties.METADATA_BROKER_LIST,
>>>> params.getBrokerList());
>>>> if (StringUtils.isNotBlank(params.getAutoOffsetReset())) {
>>>>   kafkaParams.put(KafkaConsumerProperties.AUTO_OFFSET_RESET, params
>>>> .getAutoOffsetReset());
>>>> }
>>>>
>>>> // Create direct Kafka stream with the brokers and the topic.
>>>> JavaPairInputDStream<String, String> messages =
>>>> KafkaUtils.createDirectStream(
>>>>   jssc,
>>>>   String.class,
>>>>   String.class,
>>>>   StringDecoder.class,
>>>>   StringDecoder.class,
>>>>   kafkaParams,
>>>>   topicsSet);
>>>>
>>>> // See if there's an override of the default checkpoint duration.
>>>

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-08 Thread Dmitry Goldenberg
I've stopped the jobs, the workers, and the master. Deleted the contents of
the checkpointing dir. Then restarted master, workers, and consumers.

I'm seeing the job in question still firing every 10 seconds, both on the
Spark Jobs GUI page and in our logs.  It seems quite strange given that the
jobs used to fire every 1 second; we switched to 10, and now, trying to
switch to 20, the batch duration millis is not changing.

Does anything stand out in the code perhaps?

On Tue, Sep 8, 2015 at 9:53 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Have you tried deleting or moving the contents of the checkpoint directory
> and restarting the job?
>
> On Fri, Sep 4, 2015 at 8:02 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Sorry, more relevant code below:
>>
>> SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
>> JavaStreamingContext jssc = params.isCheckpointed() ?
>> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
>> params);
>> jssc.start();
>> jssc.awaitTermination();
>> jssc.close();
>> ………..
>>   private JavaStreamingContext createCheckpointedContext(SparkConf
>> sparkConf, Parameters params) {
>> JavaStreamingContextFactory factory = new
>> JavaStreamingContextFactory() {
>>   @Override
>>   public JavaStreamingContext create() {
>> return createContext(sparkConf, params);
>>   }
>> };
>> return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
>> factory);
>>   }
>>
>>   private JavaStreamingContext createContext(SparkConf sparkConf,
>> Parameters params) {
>> // Create context with the specified batch interval, in milliseconds.
>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>> Durations.milliseconds(params.getBatchDurationMillis()));
>> // Set the checkpoint directory, if we're checkpointing
>> if (params.isCheckpointed()) {
>>   jssc.checkpoint(params.getCheckpointDir());
>> }
>>
>> Set<String> topicsSet = new HashSet<String>(Arrays.asList(params
>> .getTopic()));
>>
>> // Set the Kafka parameters.
>> Map<String, String> kafkaParams = new HashMap<String, String>();
>> kafkaParams.put(KafkaProducerProperties.METADATA_BROKER_LIST, params
>> .getBrokerList());
>> if (StringUtils.isNotBlank(params.getAutoOffsetReset())) {
>>   kafkaParams.put(KafkaConsumerProperties.AUTO_OFFSET_RESET, params
>> .getAutoOffsetReset());
>> }
>>
>> // Create direct Kafka stream with the brokers and the topic.
>> JavaPairInputDStream<String, String> messages =
>> KafkaUtils.createDirectStream(
>>   jssc,
>>   String.class,
>>   String.class,
>>   StringDecoder.class,
>>   StringDecoder.class,
>>   kafkaParams,
>>   topicsSet);
>>
>> // See if there's an override of the default checkpoint duration.
>> if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
>>   messages.checkpoint(Durations.milliseconds(params
>> .getCheckpointMillis()));
>> }
>>
>> JavaDStream<String> messageBodies = messages.map(new
>> Function<Tuple2<String, String>, String>() {
>>   @Override
>>   public String call(Tuple2<String, String> tuple2) {
>> return tuple2._2();
>>   }
>> });
>>
>> messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>   @Override
>>   public Void call(JavaRDD<String> rdd) throws Exception {
>> ProcessPartitionFunction func = new
>> ProcessPartitionFunction(params);
>> rdd.foreachPartition(func);
>> return null;
>>   }
>> });
>>
>> return jssc;
>> }
>>
>> On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> I'd think that we wouldn't be "accidentally recovering from checkpoint"
>>> hours or even days after consumers have been restarted, plus the content is
>>> the fresh content that I'm feeding, not some content that had been fed
>>> before the last restart.
>>>
>>> The code is basically as follows:
>>>
>>> SparkConf sparkConf = createSparkConf(...);
>>> // We'd be 'checkpointed' because we specify a checkpoint directory
>>> which makes isCheckpointed true
>>> JavaStreamingContext jssc = params.isCheckpointed() ?
>>> createCheckpointedContext(sparkConf, params)

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Tathagata,

Checkpointing is turned on but we were not recovering. I'm looking at the
logs now, feeding fresh content hours after the restart. Here's a snippet:

2015-09-04 06:11:20,013 ... Documents processed: 0.
2015-09-04 06:11:30,014 ... Documents processed: 0.
2015-09-04 06:11:40,011 ... Documents processed: 0.
2015-09-04 06:11:50,012 ... Documents processed: 0.
2015-09-04 06:12:00,010 ... Documents processed: 0.
2015-09-04 06:12:10,047 ... Documents processed: 0.
2015-09-04 06:12:20,012 ... Documents processed: 0.
2015-09-04 06:12:30,011 ... Documents processed: 0.
2015-09-04 06:12:40,012 ... Documents processed: 0.
*2015-09-04 06:12:55,629 ... Documents processed: 4.*
2015-09-04 06:13:00,018 ... Documents processed: 0.
2015-09-04 06:13:10,012 ... Documents processed: 0.
2015-09-04 06:13:20,019 ... Documents processed: 0.
2015-09-04 06:13:30,014 ... Documents processed: 0.
2015-09-04 06:13:40,041 ... Documents processed: 0.
2015-09-04 06:13:50,009 ... Documents processed: 0.
...
2015-09-04 06:17:30,019 ... Documents processed: 0.
*2015-09-04 06:17:46,832 ... Documents processed: 40.*

Interestingly, the fresh content (4 documents) is fed about 5.6 seconds
after the previous batch, not 10 seconds. The second fresh batch comes in
6.8 seconds after the previous empty one.

Granted, the log message is printed after iterating over the messages which
may account for some time differences. But generally, looking at the log
messages being printed before we iterate, it's still 10 seconds each time,
not 20 which is what batchdurationmillis is currently set to.

Code:

JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream();
messages.checkpoint(Durations.milliseconds(checkpointMillis));


  JavaDStream<String> messageBodies = messages.map(new Function<Tuple2<String,
String>, String>() {
  @Override
  public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
  }
});

messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
ProcessPartitionFunction func = new ProcessPartitionFunction(...);
rdd.foreachPartition(func);
return null;
  }
});

The log message comes from ProcessPartitionFunction:

public void call(Iterator<String> messageIterator) throws Exception {
log.info("Starting data partition processing. AppName={}, topic={}.)...",
appName, topic);
// ... iterate ...
log.info("Finished data partition processing (appName={}, topic={}).
Documents processed: {}.", appName, topic, docCount);
}

Any ideas? Thanks.

- Dmitry

On Thu, Sep 3, 2015 at 10:45 PM, Tathagata Das <t...@databricks.com> wrote:

> Are you accidentally recovering from checkpoint files which has 10 second
> as the batch interval?
>
>
> On Thu, Sep 3, 2015 at 7:34 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> I'm seeing an oddity where I initially set the batchdurationmillis to 1
>> second and it works fine:
>>
>> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
>> Durations.milliseconds(batchDurationMillis));
>>
>> Then I tried changing the value to 10 seconds. The change didn't seem to
>> take. I've bounced the Spark workers and the consumers and now I'm seeing
>> RDD's coming in once around 10 seconds (not always 10 seconds according to
>> the logs).
>>
>> However, now I'm trying to change the value to 20 seconds and it's just
>> not taking. I've bounced Spark master, workers, and consumers and the value
>> seems "stuck" at 10 seconds.
>>
>> Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4.
>>
>> Thanks.
>>
>> - Dmitry
>>
>>
>


Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Tathagata,

In our logs I see the batch duration millis being set first to 10 then to
20 seconds. I don't see the 20 being reflected later during ingestion.

In the Spark UI under Streaming I see the below output, notice the *10
second* Batch interval.  Can you think of a reason why it's stuck at 10?
It used to be 1 second by the way, then somehow over the course of a few
restarts we managed to get it to be 10 seconds.  Now it won't get reset to
20 seconds.  Any ideas?

Streaming

   - *Started at: *Thu Sep 03 10:59:03 EDT 2015
   - *Time since start: *1 day 8 hours 44 minutes
   - *Network receivers: *0
   - *Batch interval: *10 seconds
   - *Processed batches: *11790
   - *Waiting batches: *0
   - *Received records: *0
   - *Processed records: *0



Statistics over last 100 processed batches

Receiver Statistics
   No receivers

Batch Processing Statistics

   Metric             Last batch   Minimum   25th percentile   Median   75th percentile   Maximum
   Processing Time    23 ms        7 ms      10 ms             11 ms    14 ms             172 ms
   Scheduling Delay   1 ms         0 ms      0 ms              0 ms     1 ms              2 ms
   Total Delay        24 ms        8 ms      10 ms             12 ms    14 ms             173 ms





On Fri, Sep 4, 2015 at 3:50 PM, Tathagata Das <t...@databricks.com> wrote:

> Could you see what the streaming tab in the Spark UI says? It should show
> the underlying batch duration of the StreamingContext, the details of when
> the batch starts, etc.
>
> BTW, it seems that the 5.6 or 6.8 seconds delay is present only when data
> is present (that is, * Documents processed: > 0)*
>
> On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Tathagata,
>>
>> Checkpointing is turned on but we were not recovering. I'm looking at the
>> logs now, feeding fresh content hours after the restart. Here's a snippet:
>>
>> 2015-09-04 06:11:20,013 ... Documents processed: 0.
>> 2015-09-04 06:11:30,014 ... Documents processed: 0.
>> 2015-09-04 06:11:40,011 ... Documents processed: 0.
>> 2015-09-04 06:11:50,012 ... Documents processed: 0.
>> 2015-09-04 06:12:00,010 ... Documents processed: 0.
>> 2015-09-04 06:12:10,047 ... Documents processed: 0.
>> 2015-09-04 06:12:20,012 ... Documents processed: 0.
>> 2015-09-04 06:12:30,011 ... Documents processed: 0.
>> 2015-09-04 06:12:40,012 ... Documents processed: 0.
>> *2015-09-04 06:12:55,629 ... Documents processed: 4.*
>> 2015-09-04 06:13:00,018 ... Documents processed: 0.
>> 2015-09-04 06:13:10,012 ... Documents processed: 0.
>> 2015-09-04 06:13:20,019 ... Documents processed: 0.
>> 2015-09-04 06:13:30,014 ... Documents processed: 0.
>> 2015-09-04 06:13:40,041 ... Documents processed: 0.
>> 2015-09-04 06:13:50,009 ... Documents processed: 0.
>> ...
>> 2015-09-04 06:17:30,019 ... Documents processed: 0.
>> *2015-09-04 06:17:46,832 ... Documents processed: 40.*
>>
>> Interestingly, the fresh content (4 documents) is fed about 5.6 seconds
>> after the previous batch, not 10 seconds. The second fresh batch comes in
>> 6.8 seconds after the previous empty one.
>>
>> Granted, the log message is printed after iterating over the messages
>> which may account for some time differences. But generally, looking at the
>> log messages being printed before we iterate, it's still 10 seconds each
>> time, not 20 which is what batchdurationmillis is currently set to.
>>
>> Code:
>>
>> JavaPairInputDStream<String, String> messages =
>> KafkaUtils.createDirectStream();
>> messages.checkpoint(Durations.milliseconds(checkpointMillis));
>>
>>
>>   JavaDStream<String> messageBodies = messages.map(new 
>> Function<Tuple2<String,
>> String>, String>() {
>>   @Override
>>   public String call(Tuple2<String, String> tuple2) {
>> return tuple2._2();
>>   }
>> });
>>
>> messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>   @Override
>>   public Void call(JavaRDD<String> rdd) throws Exception {
>> ProcessPartitionFunction func = new ProcessPartitionFunction(...);
>> rdd.foreachPartition(func);
>> return null;
>>   }
>> });
>>
>> The log message comes from ProcessPartitionFunction:
>>
>> public void call(Iterator<String> messageIterator) throws Exception {
>> log.info("Starting data partition processing. AppName={},
>> topic={}.)...", appName, topic);
>> // ... iterate ...
>> log.info("Finished data partition processing (appName={}, topic={}).
>> Documents processed: {}.", appName, topic, docCount);
>> }
>>
>> Any ideas? Thanks.
>>
>> - Dmitry
>>
>> On Thu, Sep 3, 2015 at 10:45 PM, Tathagata Das <t...@databrick

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
I'd think that we wouldn't be "accidentally recovering from checkpoint"
hours or even days after consumers have been restarted, plus the content is
the fresh content that I'm feeding, not some content that had been fed
before the last restart.

The code is basically as follows:

SparkConf sparkConf = createSparkConf(...);
// We'd be 'checkpointed' because we specify a checkpoint directory
which makes isCheckpointed true
JavaStreamingContext jssc = params.isCheckpointed() ?
createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
params);jssc.start();

jssc.awaitTermination();

jssc.close();



On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das <t...@databricks.com> wrote:

> Are you sure you are not accidentally recovering from checkpoint? How are
> you using StreamingContext.getOrCreate() in your code?
>
> TD
>
> On Fri, Sep 4, 2015 at 4:53 PM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> Tathagata,
>>
>> In our logs I see the batch duration millis being set first to 10 then to
>> 20 seconds. I don't see the 20 being reflected later during ingestion.
>>
>> In the Spark UI under Streaming I see the below output, notice the *10
>> second* Batch interval.  Can you think of a reason why it's stuck at
>> 10?  It used to be 1 second by the way, then somehow over the course of a
>> few restarts we managed to get it to be 10 seconds.  Now it won't get reset
>> to 20 seconds.  Any ideas?
>>
>> Streaming
>>
>>- *Started at: *Thu Sep 03 10:59:03 EDT 2015
>>- *Time since start: *1 day 8 hours 44 minutes
>>- *Network receivers: *0
>>- *Batch interval: *10 seconds
>>- *Processed batches: *11790
>>- *Waiting batches: *0
>>- *Received records: *0
>>- *Processed records: *0
>>
>>
>>
>> Statistics over last 100 processed batches
>>
>> Receiver Statistics
>>    No receivers
>>
>> Batch Processing Statistics
>>
>>    Metric             Last batch   Minimum   25th percentile   Median   75th percentile   Maximum
>>    Processing Time    23 ms        7 ms      10 ms             11 ms    14 ms             172 ms
>>    Scheduling Delay   1 ms         0 ms      0 ms              0 ms     1 ms              2 ms
>>    Total Delay        24 ms        8 ms      10 ms             12 ms    14 ms             173 ms
>>msTotal Delay24 ms8 ms10 ms12 ms14 ms173 ms
>>
>>
>>
>>
>>
>> On Fri, Sep 4, 2015 at 3:50 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> Could you see what the streaming tab in the Spark UI says? It should
>>> show the underlying batch duration of the StreamingContext, the details of
>>> when the batch starts, etc.
>>>
>>> BTW, it seems that the 5.6 or 6.8 seconds delay is present only when
>>> data is present (that is, * Documents processed: > 0)*
>>>
>>> On Fri, Sep 4, 2015 at 3:38 AM, Dmitry Goldenberg <
>>> dgoldenberg...@gmail.com> wrote:
>>>
>>>> Tathagata,
>>>>
>>>> Checkpointing is turned on but we were not recovering. I'm looking at
>>>> the logs now, feeding fresh content hours after the restart. Here's a
>>>> snippet:
>>>>
>>>> 2015-09-04 06:11:20,013 ... Documents processed: 0.
>>>> 2015-09-04 06:11:30,014 ... Documents processed: 0.
>>>> 2015-09-04 06:11:40,011 ... Documents processed: 0.
>>>> 2015-09-04 06:11:50,012 ... Documents processed: 0.
>>>> 2015-09-04 06:12:00,010 ... Documents processed: 0.
>>>> 2015-09-04 06:12:10,047 ... Documents processed: 0.
>>>> 2015-09-04 06:12:20,012 ... Documents processed: 0.
>>>> 2015-09-04 06:12:30,011 ... Documents processed: 0.
>>>> 2015-09-04 06:12:40,012 ... Documents processed: 0.
>>>> *2015-09-04 06:12:55,629 ... Documents processed: 4.*
>>>> 2015-09-04 06:13:00,018 ... Documents processed: 0.
>>>> 2015-09-04 06:13:10,012 ... Documents processed: 0.
>>>> 2015-09-04 06:13:20,019 ... Documents processed: 0.
>>>> 2015-09-04 06:13:30,014 ... Documents processed: 0.
>>>> 2015-09-04 06:13:40,041 ... Documents processed: 0.
>>>> 2015-09-04 06:13:50,009 ... Documents processed: 0.
>>>> ...
>>>> 2015-09-04 06:17:30,019 ... Documents processed: 0.
>>>> *2015-09-04 06:17:46,832 ... Documents processed: 40.*
>>>>
>>>> Interestingly, the fresh content (4 documents) is fed about 5.6 seconds
>>>> after the previous batch, not 10 seconds. The second fresh batch comes in
>>>> 6.8 seconds after the previous empty one.
>>>>
>>>> Granted, the log message is printed after iterating over the messages
>>>> which may account for some time differences. 

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Sorry, more relevant code below:

SparkConf sparkConf = createSparkConf(appName, kahunaEnv);
JavaStreamingContext jssc = params.isCheckpointed() ?
createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
params);
jssc.start();
jssc.awaitTermination();
jssc.close();
………..
  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
Parameters params) {
JavaStreamingContextFactory factory = new JavaStreamingContextFactory()
{
  @Override
  public JavaStreamingContext create() {
return createContext(sparkConf, params);
  }
};
return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
factory);
  }

  private JavaStreamingContext createContext(SparkConf sparkConf,
Parameters params) {
// Create context with the specified batch interval, in milliseconds.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));
// Set the checkpoint directory, if we're checkpointing
if (params.isCheckpointed()) {
  jssc.checkpoint(params.getCheckpointDir());
}

Set<String> topicsSet = new HashSet<String>(Arrays.asList(params
.getTopic()));

// Set the Kafka parameters.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put(KafkaProducerProperties.METADATA_BROKER_LIST, params
.getBrokerList());
if (StringUtils.isNotBlank(params.getAutoOffsetReset())) {
  kafkaParams.put(KafkaConsumerProperties.AUTO_OFFSET_RESET, params
.getAutoOffsetReset());
}

// Create direct Kafka stream with the brokers and the topic.
JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(
  jssc,
  String.class,
  String.class,
  StringDecoder.class,
  StringDecoder.class,
  kafkaParams,
  topicsSet);

// See if there's an override of the default checkpoint duration.
if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {
  messages.checkpoint(Durations.milliseconds(params
.getCheckpointMillis()));
}

JavaDStream<String> messageBodies = messages.map(new
Function<Tuple2<String, String>, String>() {
  @Override
  public String call(Tuple2<String, String> tuple2) {
return tuple2._2();
  }
});

messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
ProcessPartitionFunction func = new
ProcessPartitionFunction(params);
rdd.foreachPartition(func);
return null;
  }
});

return jssc;
}

On Fri, Sep 4, 2015 at 8:57 PM, Dmitry Goldenberg <dgoldenberg...@gmail.com>
wrote:

> I'd think that we wouldn't be "accidentally recovering from checkpoint"
> hours or even days after consumers have been restarted, plus the content is
> the fresh content that I'm feeding, not some content that had been fed
> before the last restart.
>
> The code is basically as follows:
>
> SparkConf sparkConf = createSparkConf(...);
> // We'd be 'checkpointed' because we specify a checkpoint directory
> which makes isCheckpointed true
> JavaStreamingContext jssc = params.isCheckpointed() ?
> createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
> params);jssc.start();
>
> jssc.awaitTermination();
>
> jssc.close();
>
>
>
> On Fri, Sep 4, 2015 at 8:48 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Are you sure you are not accidentally recovering from checkpoint? How are
>> you using StreamingContext.getOrCreate() in your code?
>>
>> TD
>>
>> On Fri, Sep 4, 2015 at 4:53 PM, Dmitry Goldenberg <
>> dgoldenberg...@gmail.com> wrote:
>>
>>> Tathagata,
>>>
>>> In our logs I see the batch duration millis being set first to 10 then
>>> to 20 seconds. I don't see the 20 being reflected later during ingestion.
>>>
>>> In the Spark UI under Streaming I see the below output, notice the *10
>>> second* Batch interval.  Can you think of a reason why it's stuck at
>>> 10?  It used to be 1 second by the way, then somehow over the course of a
>>> few restarts we managed to get it to be 10 seconds.  Now it won't get reset
>>> to 20 seconds.  Any ideas?
>>>
>>> Streaming
>>>
>>>- *Started at: *Thu Sep 03 10:59:03 EDT 2015
>>>- *Time since start: *1 day 8 hours 44 minutes
>>>- *Network receivers: *0
>>>- *Batch interval: *10 seconds
>>>- *Processed batches: *11790
>>>- *Waiting batches: *0
>>>- *Received records: *0
>>>- *Processed records: *0
>>>
>>>
>>>
>>> Statistics over last 100 processed batchesRe

Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-03 Thread Dmitry Goldenberg
I'm seeing an oddity where I initially set the batchdurationmillis to 1
second and it works fine:

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(batchDurationMillis));

Then I tried changing the value to 10 seconds. The change didn't seem to
take. I've bounced the Spark workers and the consumers and now I'm seeing
RDD's coming in once around 10 seconds (not always 10 seconds according to
the logs).

However, now I'm trying to change the value to 20 seconds and it's just not
taking. I've bounced Spark master, workers, and consumers and the value
seems "stuck" at 10 seconds.

Any ideas? We're running Spark 1.3.0 built for Hadoop 2.4.

Thanks.

- Dmitry
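
One thing worth ruling out, sketched below with illustrative variable names:
JavaStreamingContext.getOrCreate only invokes the factory when the checkpoint
directory holds no checkpoint data; otherwise the context, including its
original batch interval, is rebuilt from the checkpoint files, and a new
Durations value passed to the constructor is silently ignored until the
checkpoint directory is cleared or pointed somewhere else.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

// Illustrative values, not taken from the original post.
final SparkConf sparkConf = new SparkConf().setAppName("batch-duration-demo");
final long batchDurationMillis = 20000L;
final String checkpointDir = "hdfs:///tmp/checkpoints/batch-duration-demo";

JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir,
    new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        // Only reached on a cold start, i.e. when checkpointDir contains no
        // checkpoint data. This is the only path on which the (new) value of
        // batchDurationMillis is honored; on a warm start the recovered
        // context keeps whatever interval it was checkpointed with.
        JavaStreamingContext ctx = new JavaStreamingContext(sparkConf,
            Durations.milliseconds(batchDurationMillis));
        ctx.checkpoint(checkpointDir);
        return ctx;
      }
    });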


Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Our additional question on checkpointing is basically the logistics of it --

At which point does the data get written into checkpointing?  Is it written
as soon as the driver program retrieves an RDD from Kafka (or another
source)?  Or, is it written after that RDD has been processed and we're
basically moving on to the next RDD?

What I'm driving at is, what happens if the driver program is killed?  The
next time it's started, will it know, from Spark Streaming's checkpointing,
to resume from the same RDD that was being processed at the time of the
program getting killed?  In other words, will we, upon restarting the
consumer, resume from the RDD that was unfinished, or will we be looking at
the next RDD?

Will we pick up from the last known *successfully processed* topic offset?

Thanks.




On Fri, Jul 31, 2015 at 1:52 PM, Sean Owen so...@cloudera.com wrote:

 If you've set the checkpoint dir, it seems like indeed the intent is
 to use a default checkpoint interval in DStream:

 private[streaming] def initialize(time: Time) {
 ...
   // Set the checkpoint interval to be slideDuration or 10 seconds,
 which ever is larger
   if (mustCheckpoint && checkpointDuration == null) {
 checkpointDuration = slideDuration * math.ceil(Seconds(10) /
 slideDuration).toInt
 logInfo("Checkpoint interval automatically set to " +
 checkpointDuration)
   }

 Do you see that log message? what's the interval? that could at least
 explain why it's not doing anything, if it's quite long.

 It sort of seems wrong though since
 https://spark.apache.org/docs/latest/streaming-programming-guide.html
 suggests it was intended to be a multiple of the batch interval. The
 slide duration wouldn't always be relevant anyway.

 On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  I've instrumented checkpointing per the programming guide and I can tell
  that Spark Streaming is creating the checkpoint directories but I'm not
  seeing any content being created in those directories nor am I seeing the
  effects I'd expect from checkpointing.  I'd expect any data that comes
 into
  Kafka while the consumers are down, to get picked up when the consumers
 are
  restarted; I'm not seeing that.
 
  For now my checkpoint directory is set to the local file system with the
  directory URI being in this form:   file:///mnt/dir1/dir2.  I see a
  subdirectory named with a UUID being created under there but no files.
 
  I'm using a custom JavaStreamingContextFactory which creates a
  JavaStreamingContext with the directory set into it via the
  checkpoint(String) method.
 
  I'm currently not invoking the checkpoint(Duration) method on the DStream
  since I want to first rely on Spark's default checkpointing interval.  My
  streaming batch duration millis is set to 1 second.
 
  Anyone have any idea what might be going wrong?
 
  Also, at which point does Spark delete files from checkpointing?
 
  Thanks.



Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Thanks, Cody. It sounds like Spark Streaming has enough state info to know
how many batches have been processed and, if not all of them have been, that
the RDD is 'unfinished'. I wonder if it would know whether the last micro-batch has
been fully processed successfully. Hypothetically, the driver program could
terminate as the last batch is being processed...
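
Given that recovery re-runs the unfinished batch, the usual way to make that
restart-safe is to keep the per-record output action idempotent. A rough
sketch in the style of the foreachRDD code elsewhere in these threads
('messageBodies' is the JavaDStream<String> from those snippets; docStore and
extractId are hypothetical):

import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) throws Exception {
    rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
      @Override
      public void call(Iterator<String> partition) throws Exception {
        while (partition.hasNext()) {
          String body = partition.next();
          // Hypothetical sink: an upsert keyed on a stable document id, so
          // re-processing the same record after a restart overwrites the
          // earlier write instead of duplicating it.
          // docStore.upsert(extractId(body), body);
        }
      }
    });
    return null;
  }
});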

On Fri, Aug 14, 2015 at 6:17 PM, Cody Koeninger c...@koeninger.org wrote:

 You'll resume and re-process the rdd that didnt finish

 On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Our additional question on checkpointing is basically the logistics of it
 --

 At which point does the data get written into checkpointing?  Is it
 written as soon as the driver program retrieves an RDD from Kafka (or
 another source)?  Or, is it written after that RDD has been processed and
 we're basically moving on to the next RDD?

 What I'm driving at is, what happens if the driver program is killed?
 The next time it's started, will it know, from Spark Streaming's
 checkpointing, to resume from the same RDD that was being processed at the
 time of the program getting killed?  In other words, will we, upon
 restarting the consumer, resume from the RDD that was unfinished, or will
 we be looking at the next RDD?

 Will we pick up from the last known *successfully processed* topic
 offset?

 Thanks.




 On Fri, Jul 31, 2015 at 1:52 PM, Sean Owen so...@cloudera.com wrote:

 If you've set the checkpoint dir, it seems like indeed the intent is
 to use a default checkpoint interval in DStream:

 private[streaming] def initialize(time: Time) {
 ...
   // Set the checkpoint interval to be slideDuration or 10 seconds,
 which ever is larger
   if (mustCheckpoint && checkpointDuration == null) {
 checkpointDuration = slideDuration * math.ceil(Seconds(10) /
 slideDuration).toInt
 logInfo("Checkpoint interval automatically set to " +
 checkpointDuration)
   }

 Do you see that log message? what's the interval? that could at least
 explain why it's not doing anything, if it's quite long.

 It sort of seems wrong though since
 https://spark.apache.org/docs/latest/streaming-programming-guide.html
 suggests it was intended to be a multiple of the batch interval. The
 slide duration wouldn't always be relevant anyway.

 On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  I've instrumented checkpointing per the programming guide and I can
 tell
  that Spark Streaming is creating the checkpoint directories but I'm not
  seeing any content being created in those directories nor am I seeing
 the
  effects I'd expect from checkpointing.  I'd expect any data that comes
 into
  Kafka while the consumers are down, to get picked up when the
 consumers are
  restarted; I'm not seeing that.
 
  For now my checkpoint directory is set to the local file system with
 the
  directory URI being in this form:   file:///mnt/dir1/dir2.  I see a
  subdirectory named with a UUID being created under there but no files.
 
  I'm using a custom JavaStreamingContextFactory which creates a
  JavaStreamingContext with the directory set into it via the
  checkpoint(String) method.
 
  I'm currently not invoking the checkpoint(Duration) method on the
 DStream
  since I want to first rely on Spark's default checkpointing interval.
 My
  streaming batch duration millis is set to 1 second.
 
  Anyone have any idea what might be going wrong?
 
  Also, at which point does Spark delete files from checkpointing?
 
  Thanks.






How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error.  Tried increasing spark.executor.memory e.g.
from 1g to 2g but the below error still happens.

Any recommendations? Something to do with specifying -Xmx in the submit job
scripts?

Thanks.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
at
org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
at
org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
at
org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
at
org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
at org.apache.spark.rdd.RDD.<init>(RDD.scala:1365)
at org.apache.spark.streaming.kafka.KafkaRDD.<init>(KafkaRDD.scala:46)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)
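
Since the stack trace above is entirely inside checkpoint restore, which runs
in the driver, the knob that usually matters for this particular OOM is driver
memory rather than executor memory. A sketch of the submit invocation (class
name, jar path, and sizes are placeholders):

$SPARK_HOME/bin/spark-submit \
  --class com.acme.KafkaSparkStreamingDriver \
  --driver-memory 2g \
  --executor-memory 1g \
  /path/to/consumer.jar ...

Setting spark.driver.memory in spark-defaults.conf should have the same
effect; setting it programmatically on SparkConf generally does not, because
the driver JVM has already started by the time that code runs.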


Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing
directories as they're being processed by Spark Streaming?  Is it mandatory
to load the whole thing in one go?

On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory().

 If the size of checkpoint is much bigger than free memory, log warning, etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing spark.executor.memory
 e.g. from 1g to 2g but the below error still happens.

 Any recommendations? Something to do with specifying -Xmx in the submit
 job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)









Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
"You need to keep a certain number of rdds around for checkpointing" --
that seems like a hefty expense to pay in order to achieve fault
tolerance.  Why does Spark persist whole RDD's of data?  Shouldn't it be
sufficient to just persist the offsets, to know where to resume from?

Thanks.
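
For comparison, the offsets-only alternative that the direct API supports is
to read the OffsetRange values off each batch and persist just those in a
store of your own, then feed them back in on startup. A minimal sketch,
assuming the 0.8 direct API used in these threads and a hypothetical
offsetStore ('messages' is the stream returned by KafkaUtils.createDirectStream
in the earlier code):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
  @Override
  public Void call(JavaPairRDD<String, String> rdd) throws Exception {
    // The cast only works on the RDD produced directly by the Kafka stream,
    // before any transformation has been applied.
    OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
    // ... process the batch first ...
    for (OffsetRange range : ranges) {
      // Hypothetical store; record the offsets only after processing succeeds.
      // offsetStore.save(range.topic(), range.partition(), range.untilOffset());
    }
    return null;
  }
});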

On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org wrote:

 You need to keep a certain number of rdds around for checkpointing, based
 on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate the
 size of the checkpoint and compare with Runtime.getRuntime().freeMemory
 ().

 If the size of checkpoint is much bigger than free memory, log warning,
 etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:403)
 at
 org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:149)











Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDDs also contain data, don't they?

The question is, what can be so hefty in the checkpointing directory as to
cause the Spark driver to run out of memory?  It seems that this makes
checkpointing expensive, in terms of I/O and memory consumption.  Two
network hops -- to driver, then to workers.  Hefty file system usage, hefty
memory consumption...   What can we do to offset some of these costs?



On Mon, Aug 10, 2015 at 4:27 PM, Cody Koeninger c...@koeninger.org wrote:

 The rdd is indeed defined by mostly just the offsets / topic partitions.

 On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 You need to keep a certain number of rdds around for checkpointing --
 that seems like a hefty expense to pay in order to achieve fault
 tolerance.  Why does Spark persist whole RDD's of data?  Shouldn't it be
 sufficient to just persist the offsets, to know where to resume from?

 Thanks.


 On Mon, Aug 10, 2015 at 1:07 PM, Cody Koeninger c...@koeninger.org
 wrote:

 You need to keep a certain number of rdds around for checkpointing,
 based on e.g. the window size.  Those would all need to be loaded at once.

 On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Would there be a way to chunk up/batch up the contents of the
 checkpointing directories as they're being processed by Spark Streaming?
 Is it mandatory to load the whole thing in one go?

 On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote:

 I wonder during recovery from a checkpoint whether we can estimate
 the size of the checkpoint and compare with Runtime.getRuntime().
 freeMemory().

 If the size of checkpoint is much bigger than free memory, log
 warning, etc

 Cheers

 On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody, will try that. Unfortunately due to a reinstall I don't
 have the original checkpointing directory :(  Thanks for the 
 clarification
 on spark.driver.memory, I'll keep testing (at 2g things seem OK for now).

 On Mon, Aug 10, 2015 at 12:10 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That looks like it's during recovery from a checkpoint, so it'd be
 driver memory not executor memory.

 How big is the checkpoint directory that you're trying to restore
 from?

 On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 We're getting the below error.  Tried increasing
 spark.executor.memory e.g. from 1g to 2g but the below error still 
 happens.

 Any recommendations? Something to do with specifying -Xmx in the
 submit job scripts?

 Thanks.

 Exception in thread main java.lang.OutOfMemoryError: GC overhead
 limit exceeded
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
 java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
 at
 java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
 at
 java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at java.lang.StackTraceElement.toString(StackTraceElement.java:173)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1212)
 at
 org.apache.spark.util.Utils$$anonfun$getCallSite$1.apply(Utils.scala:1190)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at org.apache.spark.util.Utils$.getCallSite(Utils.scala:1190)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at
 org.apache.spark.SparkContext$$anonfun$getCallSite$2.apply(SparkContext.scala:1441)
 at scala.Option.getOrElse(Option.scala:120)
 at
 org.apache.spark.SparkContext.getCallSite(SparkContext.scala:1441)
 at org.apache.spark.rdd.RDD.init(RDD.scala:1365)
 at
 org.apache.spark.streaming.kafka.KafkaRDD.init(KafkaRDD.scala:46)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:155)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData$$anonfun$restore$2.apply(DirectKafkaInputDStream.scala:153)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream$DirectKafkaInputDStreamCheckpointData.restore(DirectKafkaInputDStream.scala:153)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:402)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:403)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
It looks like there's an issue with the 'Parameters' pojo I'm using within
my driver program. For some reason that needs to be serializable, which is
odd.

java.io.NotSerializableException: com.kona.consumer.kafka.spark.Parameters


Giving it another whirl, though having to make it serializable seems odd to me.

On Fri, Jul 31, 2015 at 1:52 PM, Sean Owen so...@cloudera.com wrote:

 If you've set the checkpoint dir, it seems like indeed the intent is
 to use a default checkpoint interval in DStream:

 private[streaming] def initialize(time: Time) {
 ...
   // Set the checkpoint interval to be slideDuration or 10 seconds,
 which ever is larger
   if (mustCheckpoint && checkpointDuration == null) {
 checkpointDuration = slideDuration * math.ceil(Seconds(10) /
 slideDuration).toInt
 logInfo("Checkpoint interval automatically set to " +
 checkpointDuration)
   }

 Do you see that log message? what's the interval? that could at least
 explain why it's not doing anything, if it's quite long.

 It sort of seems wrong though since
 https://spark.apache.org/docs/latest/streaming-programming-guide.html
 suggests it was intended to be a multiple of the batch interval. The
 slide duration wouldn't always be relevant anyway.

 On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  I've instrumented checkpointing per the programming guide and I can tell
  that Spark Streaming is creating the checkpoint directories but I'm not
  seeing any content being created in those directories nor am I seeing the
  effects I'd expect from checkpointing.  I'd expect any data that comes
 into
  Kafka while the consumers are down, to get picked up when the consumers
 are
  restarted; I'm not seeing that.
 
  For now my checkpoint directory is set to the local file system with the
  directory URI being in this form:   file:///mnt/dir1/dir2.  I see a
  subdirectory named with a UUID being created under there but no files.
 
  I'm using a custom JavaStreamingContextFactory which creates a
  JavaStreamingContext with the directory set into it via the
  checkpoint(String) method.
 
  I'm currently not invoking the checkpoint(Duration) method on the DStream
  since I want to first rely on Spark's default checkpointing interval.  My
  streaming batch duration millis is set to 1 second.
 
  Anyone have any idea what might be going wrong?
 
  Also, at which point does Spark delete files from checkpointing?
 
  Thanks.



Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell
that Spark Streaming is creating the checkpoint directories but I'm not
seeing any content being created in those directories nor am I seeing the
effects I'd expect from checkpointing.  I'd expect any data that comes into
Kafka while the consumers are down, to get picked up when the consumers are
restarted; I'm not seeing that.

For now my checkpoint directory is set to the local file system with the
directory URI being in this form:   file:///mnt/dir1/dir2.  I see a
subdirectory named with a UUID being created under there but no files.

I'm using a custom JavaStreamingContextFactory which creates a
JavaStreamingContext with the directory set into it via the
checkpoint(String) method.

I'm currently not invoking the checkpoint(Duration) method on the DStream
since I want to first rely on Spark's default checkpointing interval.  My
streaming batch duration millis is set to 1 second.

Anyone have any idea what might be going wrong?

Also, at which point does Spark delete files from checkpointing?

Thanks.
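
If the default interval is in question, it can also be set explicitly on the
DStream; a minimal sketch, assuming 'messages' is the direct stream and
batchDurationMillis is the value passed to the context (the 10x multiple is
just the commonly cited guidance, not something established in this thread):

import org.apache.spark.streaming.Durations;

// Checkpoint the DStream at an explicit multiple of the batch interval
// instead of relying on the automatically chosen default.
messages.checkpoint(Durations.milliseconds(batchDurationMillis * 10));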


Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I'll check the log info message..

Meanwhile, the code is basically

public class KafkaSparkStreamingDriver implements Serializable {

..

SparkConf sparkConf = createSparkConf(appName, kahunaEnv);

JavaStreamingContext jssc = params.isCheckpointed() ?
createCheckpointedContext(sparkConf, params) : createContext(sparkConf,
params);


jssc.start();

jssc.awaitTermination();

jssc.close();

..

  private JavaStreamingContext createCheckpointedContext(SparkConf sparkConf,
Parameters params) {

JavaStreamingContextFactory factory = new JavaStreamingContextFactory()
{

  @Override

  public JavaStreamingContext create() {

return createContext(sparkConf, params);

  }

};

return JavaStreamingContext.getOrCreate(params.getCheckpointDir(),
factory);

  }

...

  private JavaStreamingContext createContext(SparkConf sparkConf,
Parameters params) {

// Create context with the specified batch interval, in milliseconds.

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.milliseconds(params.getBatchDurationMillis()));

// Set the checkpoint directory, if we're checkpointing

if (params.isCheckpointed()) {

  jssc.checkpoint(params.getCheckpointDir());

}


Set<String> topicsSet = new HashSet<String>(Arrays.asList(params
.getTopic()));


// Set the Kafka parameters.

Map<String, String> kafkaParams = new HashMap<String, String>();

kafkaParams.put(KafkaProducerProperties.METADATA_BROKER_LIST, params
.getBrokerList());

if (StringUtils.isNotBlank(params.getAutoOffsetReset())) {

  kafkaParams.put(KafkaConsumerProperties.AUTO_OFFSET_RESET, params
.getAutoOffsetReset());

}


// Create direct Kafka stream with the brokers and the topic.

JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(

  jssc,

  String.class,

  String.class,

  StringDecoder.class,

  StringDecoder.class,

  kafkaParams,

  topicsSet);

// See if there's an override of the default checkpoint duration.

if (params.isCheckpointed() && params.getCheckpointMillis() > 0L) {

  messages.checkpoint(Durations.milliseconds(params
.getCheckpointMillis()));

}

.




On Fri, Jul 31, 2015 at 1:52 PM, Sean Owen so...@cloudera.com wrote:

 If you've set the checkpoint dir, it seems like indeed the intent is
 to use a default checkpoint interval in DStream:

 private[streaming] def initialize(time: Time) {
 ...
   // Set the checkpoint interval to be slideDuration or 10 seconds,
 which ever is larger
   if (mustCheckpoint && checkpointDuration == null) {
 checkpointDuration = slideDuration * math.ceil(Seconds(10) /
 slideDuration).toInt
 logInfo("Checkpoint interval automatically set to " +
 checkpointDuration)
   }

 Do you see that log message? what's the interval? that could at least
 explain why it's not doing anything, if it's quite long.

 It sort of seems wrong though since
 https://spark.apache.org/docs/latest/streaming-programming-guide.html
 suggests it was intended to be a multiple of the batch interval. The
 slide duration wouldn't always be relevant anyway.

 On Fri, Jul 31, 2015 at 6:16 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  I've instrumented checkpointing per the programming guide and I can tell
  that Spark Streaming is creating the checkpoint directories but I'm not
  seeing any content being created in those directories nor am I seeing the
  effects I'd expect from checkpointing.  I'd expect any data that comes
 into
  Kafka while the consumers are down, to get picked up when the consumers
 are
  restarted; I'm not seeing that.
 
  For now my checkpoint directory is set to the local file system with the
  directory URI being in this form:   file:///mnt/dir1/dir2.  I see a
  subdirectory named with a UUID being created under there but no files.
 
  I'm using a custom JavaStreamingContextFactory which creates a
  JavaStreamingContext with the directory set into it via the
  checkpoint(String) method.
 
  I'm currently not invoking the checkpoint(Duration) method on the DStream
  since I want to first rely on Spark's default checkpointing interval.  My
  streaming batch duration millis is set to 1 second.
 
  Anyone have any idea what might be going wrong?
 
  Also, at which point does Spark delete files from checkpointing?
 
  Thanks.



Re: What is a best practice for passing environment variables to Spark workers?

2015-07-10 Thread Dmitry Goldenberg
Thanks, Akhil.

We're trying the conf.setExecutorEnv() approach since we've already got
environment variables set. For system properties we'd go the
conf.set(spark.) route.

We were concerned that doing the below type of thing did not work, which
this blog post seems to confirm (
http://progexc.blogspot.com/2014/12/spark-configuration-mess-solved.html):

$SPARK_HOME/spark-submit \
  --class com.acme.Driver \
  --conf spark.executorEnv.VAR1=VAL1 \
--conf spark.executorEnv.VAR2=VAL2 \
.

The code running on the workers does not see these variables.
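
A minimal sketch of the two routes being discussed, including the
prefix-cycling idea so the driver doesn't have to enumerate every variable
(the MYAPP_ prefix and property names are made up):

import java.util.Map;
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName("my-consumer");

// Route 1: Spark properties; anything not starting with "spark." is filtered out.
conf.set("spark.myapp.solr.url", System.getenv("MYAPP_SOLR_URL"));

// Route 2: real environment variables on each executor process. Cycle through
// the driver's environment and forward anything with a well-known prefix.
for (Map.Entry<String, String> entry : System.getenv().entrySet()) {
  if (entry.getKey().startsWith("MYAPP_")) {
    conf.setExecutorEnv(entry.getKey(), entry.getValue());
  }
}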


On Fri, Jul 10, 2015 at 4:03 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 It basically filters out everything which doesn't start with spark
 https://github.com/apache/spark/blob/658814c898bec04c31a8e57f8da0103497aac6ec/core/src/main/scala/org/apache/spark/SparkConf.scala#L314.
 so it is necessary to keep spark. in the property name.

 Thanks
 Best Regards

 On Fri, Jul 10, 2015 at 12:06 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 I have about 20 environment variables to pass to my Spark workers. Even
 though they're in the init scripts on the Linux box, the workers don't see
 these variables.

 Does Spark do something to shield itself from what may be defined in the
 environment?

 I see multiple pieces of info on how to pass the env vars into workers and
 they seem dated and/or unclear.

 Here:

 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-config-variables-to-workers-tt5780.html

 SparkConf conf = new SparkConf();
 conf.set("spark.myapp.myproperty", propertyValue);
 OR
 set them in spark-defaults.conf, as in
 spark.config.one value
 spark.config.two value2

 In another posting,

 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-environment-variable-for-a-spark-job-tt3180.html
 :
 conf.setExecutorEnv("ORACLE_HOME", "myOraHome")
 conf.setExecutorEnv("SPARK_JAVA_OPTS",
 "-Djava.library.path=/my/custom/path")

 The configuration guide talks about
 spark.executorEnv.[EnvironmentVariableName] -- Add the environment
 variable
 specified by EnvironmentVariableName to the Executor process. The user can
 specify multiple of these to set multiple environment variables.

 Then there are mentions of SPARK_JAVA_OPTS which seems to be deprecated
 (?)

 What is the easiest/cleanest approach here?  Ideally, I'd not want to
 burden
 my driver program with explicit knowledge of all the env vars that are
 needed on the worker side.  I'd also like to avoid having to jam them into
 spark-defaults.conf since they're already set in the system init scripts,
 so
 why duplicate.

 I suppose one approach would be to namespace all my vars to start with a
 well-known prefix, then cycle through the env in the driver and stuff all
 these variables into the Spark context.  If I'm doing that, would I want
 to

 conf.set("spark.myapp.myproperty", propertyValue);

 and is "spark." necessary? or was that just part of the example?

 or would I want to

 conf.setExecutorEnv("MYPREFIX_MY_VAR_1", "some-value");

 Thanks.







 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-best-practice-for-passing-environment-variables-to-Spark-workers-tp23751.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard,

That's exactly the strategy I've been trying, which is a wrapper singleton
class. But I was seeing the inner object being created multiple times.

I wonder if the problem has to do with the way I'm processing the RDD's.
I'm using JavaDStream to stream data (from Kafka). Then I'm processing the
RDD's like so

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(...);
JavaDStream<String> messageBodies = messages.map(...);
messageBodies.foreachRDD(new MyFunction());

where MyFunction implements Function<JavaRDD<String>, Void> {
  ...
  rdd.map / rdd.filter ...
  rdd.foreach(... perform final action ...)
}

Perhaps the multiple singletons I'm seeing are the per-executor instances?
Judging by the streaming programming guide, perhaps I should follow the
connection sharing example:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

So I'd pre-create my singletons in the foreachPartition call which would
let them be the per-JVM singletons, to be passed into MyFunction which
would now be a partition processing function rather than an RDD processing
function.
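
In Java that would look something like the sketch below (ThirdPartyApi and its
process() method are hypothetical stand-ins for the non-serializable 3rd party
object; imports from org.apache.spark.api.java.function and java.util elided):

// Holder lives on each executor; the instance is created at most once per JVM.
public class ThirdPartyHolder {
    private static ThirdPartyApi instance;
    public static synchronized ThirdPartyApi get() {
        if (instance == null) {
            instance = new ThirdPartyApi();  // expensive, non-serializable init
        }
        return instance;
    }
}

messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
            @Override
            public void call(Iterator<String> records) throws Exception {
                ThirdPartyApi api = ThirdPartyHolder.get();  // per-JVM, not per-record
                while (records.hasNext()) {
                    api.process(records.next());
                }
            }
        });
        return null;
    }
});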

I wonder whether these singletons would still be created on every call as
the master sends RDD data over to the workers?

I also wonder whether using foreachPartition would be more efficient anyway
and prevent some of the over-network data shuffling effects that I imagine
may happen with just doing a foreachRDD?











On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher rmarsc...@localytics.com
wrote:

 Would it be possible to have a wrapper class that just represents a
 reference to a singleton holding the 3rd party object? It could proxy over
 calls to the singleton object which will instantiate a private instance of
 the 3rd party object lazily? I think something like this might work if the
 workers have the singleton object in their classpath.

 here's a rough sketch of what I was thinking:

 object ThirdPartySingleton {
   private lazy val thirdPartyObj = ...

   def someProxyFunction() = thirdPartyObj.()
 }

 class ThirdPartyReference extends Serializable {
   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
 }

 also found this SO post:
 http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers


 On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 I am seeing a lot of posts on singletons vs. broadcast variables, such as
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219

 What's the best approach to instantiate an object once and have it be
 reused
 by the worker(s).

 E.g. I have an object that loads some static state such as e.g. a
 dictionary/map, is a part of 3rd party API and is not serializable.  I
 can't
 seem to get it to be a singleton on the worker side as the JVM appears to
 be
 wiped on every request so I get a new instance.  So the singleton doesn't
 stick.

 Is there an approach where I could have this object or a wrapper of it be
 a
 broadcast var? Can Kryo get me there? would that basically mean writing a
 custom serializer?  However, the 3rd party object may have a bunch of
 member
 vars hanging off it, so serializing it properly may be non-trivial...

 Any pointers/hints greatly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like.
So with 4 workers running on the box, we're creating one singleton per
worker process/jvm, which seems OK.

Still curious about foreachPartition vs. foreachRDD though...

On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher rmarsc...@localytics.com
wrote:

 Would it be possible to have a wrapper class that just represents a
 reference to a singleton holding the 3rd party object? It could proxy over
 calls to the singleton object which will instantiate a private instance of
 the 3rd party object lazily? I think something like this might work if the
 workers have the singleton object in their classpath.

 here's a rough sketch of what I was thinking:

 object ThirdPartySingleton {
   private lazy val thirdPartyObj = ...

   def someProxyFunction() = thirdPartyObj.()
 }

 class ThirdPartyReference extends Serializable {
   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
 }

 also found this SO post:
 http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers


 On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 I am seeing a lot of posts on singletons vs. broadcast variables, such as
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219

 What's the best approach to instantiate an object once and have it be
 reused
 by the worker(s).

 E.g. I have an object that loads some static state such as e.g. a
 dictionary/map, is a part of 3rd party API and is not serializable.  I
 can't
 seem to get it to be a singleton on the worker side as the JVM appears to
 be
 wiped on every request so I get a new instance.  So the singleton doesn't
 stick.

 Is there an approach where I could have this object or a wrapper of it be
 a
 broadcast var? Can Kryo get me there? would that basically mean writing a
 custom serializer?  However, the 3rd party object may have a bunch of
 member
 vars hanging off it, so serializing it properly may be non-trivial...

 Any pointers/hints greatly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives.

Sean, different operations as they are, they can certainly be used on the
same data set.  In that sense, they are alternatives. Code can be written
using one or the other which reaches the same effect - likely at a
different efficiency cost.

The question is, what are the effects of applying one vs. the other?

My specific scenario is, I'm streaming data out of Kafka.  I want to
perform a few transformations then apply an action which results in e.g.
writing this data to Solr.  According to Evo, my best bet is
foreachPartition because of the increased parallelism (which I'd still need
to grok in detail).

Another scenario is, I've done a few transformations and then send a result
somewhere, e.g. I write a message into a socket.  Let's say I have one
socket per client of my streaming app, I get the host:port of that
socket as part of the message, and I want to send the response via that
socket.  Is foreachPartition still a better choice?








On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen so...@cloudera.com wrote:

 These are quite different operations. One operates on RDDs in  DStream and
 one operates on partitions of an RDD. They are not alternatives.

 On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com wrote:

 Is there a set of best practices for when to use foreachPartition vs.
 foreachRDD?

 Is it generally true that using foreachPartition avoids some of the
 over-network data shuffling overhead?

 When would I definitely want to use one method vs. the other?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Sean.

are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing

This is basically what I was looking for.
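
To make it concrete for the Solr case, the difference inside the foreachRDD
callback comes down to roughly this (just a sketch; SolrClientFactory and
client.add() are made-up stand-ins for whatever expensive resource is involved):

// foreach: the resource is created and torn down once per record.
rdd.foreach(new VoidFunction<String>() {
    @Override
    public void call(String record) throws Exception {
        SolrClient client = SolrClientFactory.create();  // per record -- wasteful
        client.add(record);
        client.close();
    }
});

// foreachPartition: the resource is created once per partition and reused.
rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
    @Override
    public void call(Iterator<String> records) throws Exception {
        SolrClient client = SolrClientFactory.create();  // once per partition
        while (records.hasNext()) {
            client.add(records.next());
        }
        client.close();
    }
});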


On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen so...@cloudera.com wrote:

 @Evo There is no foreachRDD operation on RDDs; it is a method of
 DStream. It gives each RDD in the stream. RDD has a foreach, and
 foreachPartition. These give elements of an RDD. What do you mean it
 'works' to call foreachRDD on an RDD?

 @Dmitry are you asking about foreach vs foreachPartition? that's quite
 different. foreachPartition does not give more parallelism but lets
 you operate on a whole batch of data at once, which is nice if you
 need to allocate some expensive resource to do the processing.

 On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  These are quite different operations. One operates on RDDs in  DStream
 and
  one operates on partitions of an RDD. They are not alternatives.
 
  Sean, different operations as they are, they can certainly be used on the
  same data set.  In that sense, they are alternatives. Code can be written
  using one or the other which reaches the same effect - likely at a
 different
  efficiency cost.
 
  The question is, what are the effects of applying one vs. the other?
 
  My specific scenario is, I'm streaming data out of Kafka.  I want to
 perform
  a few transformations then apply an action which results in e.g. writing
  this data to Solr.  According to Evo, my best bet is foreachPartition
  because of increased parallelism (which I'd need to grok to understand
 the
  details of what that means).
 
  Another scenario is, I've done a few transformations and send a result
  somewhere, e.g. I write a message into a socket.  Let's say I have one
  socket per a client of my streaming app and I get a host:port of that
 socket
  as part of the message and want to send the response via that socket.  Is
  foreachPartition still a better choice?
 
 
 
 
 
 
 
 
  On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen so...@cloudera.com wrote:
 
  These are quite different operations. One operates on RDDs in  DStream
 and
  one operates on partitions of an RDD. They are not alternatives.
 
 
  On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com
 wrote:
 
  Is there a set of best practices for when to use foreachPartition vs.
  foreachRDD?
 
  Is it generally true that using foreachPartition avoids some of the
  over-network data shuffling overhead?
 
  When would I definitely want to use one method vs. the other?
 
  Thanks.
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The good boy comment wasn't from me :)  I was the one
asking for help.



On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger c...@koeninger.org wrote:

 Sean already answered your question.  foreachRDD and foreachPartition are
 completely different, there's nothing fuzzy or insufficient about that
 answer.  The fact that you can call foreachPartition on an rdd within the
 scope of foreachRDD should tell you that they aren't in any way comparable.

 I'm not sure if your rudeness (be a good boy...really?) is intentional
 or not.  If you're asking for help from people that are in most cases
 donating their time, I'd suggest that you'll have more success with a
 little more politeness.

 On Wed, Jul 8, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 That was a) fuzzy b) insufficient – one can certainly use foreach (only)
 on DStream RDDs – it works, as an empirical observation



 As another empirical observation:



 foreachPartition results in having one instance of the lambda/closure
 per partition when e.g. publishing to output systems like message brokers,
 databases and file systems - that increases the level of parallelism of
 your output processing



 As an architect I deal with gazillions of products and don’t have time to
 read the source code of all of them to make up for documentation
 deficiencies. On the other hand I believe you have been involved in writing
 some of the code so be a good boy and either answer this question properly
 or enhance the product documentation of that area of the system



 *From:* Sean Owen [mailto:so...@cloudera.com]
 *Sent:* Wednesday, July 8, 2015 2:52 PM
 *To:* dgoldenberg; user@spark.apache.org
 *Subject:* Re: foreachRDD vs. forearchPartition ?



 These are quite different operations. One operates on RDDs in  DStream
 and one operates on partitions of an RDD. They are not alternatives.



 On Wed, Jul 8, 2015, 2:43 PM dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Is there a set of best practices for when to use foreachPartition vs.
 foreachRDD?

 Is it generally true that using foreachPartition avoids some of the
 over-network data shuffling overhead?

 When would I definitely want to use one method vs. the other?

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Registering custom metrics

2015-06-22 Thread Dmitry Goldenberg
Great, thank you, Silvio. In your experience, is there any way to instrument
a callback into Coda Hale or the Spark consumers from the metrics sink?  If
the sink performs some steps once it has received the metrics, I'd like to
be able to make the consumers aware of that via some sort of a callback...
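
For anyone else landing on this thread: as I understand it, the Coda Hale side
of what Silvio describes below boils down to a MetricRegistry with Gauges or
Counters registered on it. A minimal, illustrative sketch (the metric name is
made up, and docsProcessed would be e.g. an AtomicLong maintained by the app):

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

MetricRegistry registry = new MetricRegistry();

// Expose an application-level number as a gauge; the name parts are arbitrary.
registry.register(MetricRegistry.name("myapp", "docs", "processed"),
    new Gauge<Long>() {
        @Override
        public Long getValue() {
            return docsProcessed.get();
        }
    });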

On Mon, Jun 22, 2015 at 10:14 AM, Silvio Fiorito 
silvio.fior...@granturing.com wrote:

 Sorry, replied to Gerard’s question vs yours.

 See here:

 Yes, you have to implement your own custom Metrics Source using the Coda
 Hale library. See here for some examples:
 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala

 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/ApplicationSource.scala

 The source gets registered, then you have to configure a sink for it just
 as the JSON servlet you mentioned.

 I had done it in the past but don’t have access to the source for that
 project anymore, unfortunately.

 Thanks,
 Silvio






 On 6/22/15, 9:57 AM, dgoldenberg dgoldenberg...@gmail.com wrote:

 Hi Gerard,
 
 Have there been any responses? Any insights as to what you ended up doing
 to
 enable custom metrics? I'm thinking of implementing a custom metrics sink,
 not sure how doable that is yet...
 
 Thanks.
 
 
 
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Registering-custom-metrics-tp17765p23426.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
If I want to restart my consumers into an updated cluster topology after
the cluster has been expanded or contracted, would I need to call stop() on
them, then call start() on them, or would I need to instantiate and start
new context objects (new JavaStreamingContext(...))?  I'm thinking of
actually quiescing these streaming consumers, but letting them finish
their current batch first.

Right now I'm doing

jssc.start();
jssc.awaitTermination();

Must jssc.close() be called as well, after awaitTermination(), to avoid
potentially leaking contexts?  I don't see that in things
like JavaDirectKafkaWordCount but wondering if that's needed.
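
The pattern I'm experimenting with (just a sketch, assuming the cluster has
already been resized, and assuming I'm reading stop(stopSparkContext,
stopGracefully) correctly) is a graceful stop followed by building a brand new
context, rather than re-starting the old one:

// From a separate control path (the main thread is blocked in awaitTermination()):
// let the in-flight batch finish, and stop the underlying SparkContext too.
jssc.stop(true, true);

// The thread blocked in awaitTermination() is released once the context stops.
// Then build a fresh context against the resized cluster and re-wire the stream;
// newSparkConf and batchDurationMillis are whatever the relaunch uses.
JavaStreamingContext newJssc = new JavaStreamingContext(newSparkConf,
    Durations.milliseconds(batchDurationMillis));
// ... re-create the Kafka direct stream and the processing pipeline here ...
newJssc.start();
newJssc.awaitTermination();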

On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Makes sense especially if you have a cloud with “infinite” resources /
 nodes which allows you to double, triple etc in the background/parallel the
 resources of the currently running cluster



 I was thinking more about the scenario where you have e.g. 100 boxes and
 want to / can add e.g. 20 more



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:46 PM
 *To:* Evo Eftimov
 *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo,



 One of the ideas is to shadow the current cluster. This way there's no
 extra latency incurred due to shutting down of the consumers. If two sets
 of consumers are running, potentially processing the same data, that is OK.
 We phase out the older cluster and gradually flip over to the new one,
 insuring no downtime or extra latency.  Thoughts?



 On Wed, Jun 3, 2015 at 11:27 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You should monitor vital performance / job clogging stats of the Spark
 Streaming Runtime not “kafka topics”



 You should be able to bring new worker nodes online and make them contact
 and register with the Master without bringing down the Master (or any of
 the currently running worker nodes)



 Then just shutdown your currently running spark streaming job/app and
 restart it with new params to take advantage of the larger cluster



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:14 PM
 *To:* Cody Koeninger
 *Cc:* Andrew Or; Evo Eftimov; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Would it be possible to implement Spark autoscaling somewhat along these
 lines? --



 1. If we sense that a new machine is needed, by watching the data load in
 Kafka topic(s), then

 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
 and get a machine);

 3. Create a shadow/mirror Spark master running alongside the initial
 version which talks to N machines. The new mirror version is aware of N+1
 machines (or N+M if we had decided we needed M new boxes).

 4. The previous version of the Spark runtime is
 acquiesced/decommissioned.  We possibly get both clusters working on the
 same data which may actually be OK (at least for our specific use-cases).

 5. Now the new Spark cluster is running.



 Similarly, the decommissioning of M unused boxes would happen, via this
 notion of a mirror Spark runtime.  How feasible would it be for such a
 mirrorlike setup to be created, especially created programmatically?
 Especially point #3.



 The other idea we'd entertained was to bring in a new machine, acquiesce
 down all currently running workers by telling them to process their current
 batch then shut down, then restart the consumers now that Spark is aware of
 a modified cluster.  This has the drawback of a downtime that may not be
 tolerable in terms of latency, by the system's clients waiting for their
 responses in a synchronous fashion.



 Thanks.



 On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org
 wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.



 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,



 As the author of the dynamic allocation feature I can offer a few insights
 here.



 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:



 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.



 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch
 duration, then you'll

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
 to do anything further.

 (b) Receiver: This is definitely tricky. If you don't need to increase the
 number of receivers, then a new executor will start getting used for
 computations (shuffles, writing out, etc.), but the parallelism in
 receiving will not increase. If you need to increase that, then it's best to
 shut down the context gracefully (so that no data is lost), and a new
 StreamingContext can be started with more receivers (# receivers = #
 executors), and maybe more #partitions for shuffles. You have to call stop on
 the currently running streaming context, to start a new one. If a context is
 stopped, any thread stuck in awaitTermination will get unblocked.

 Does that clarify things?







 On Thu, Jun 11, 2015 at 7:30 AM, Cody Koeninger c...@koeninger.org
 wrote:

 Depends on what you're reusing multiple times (if anything).

 Read
 http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

 On Wed, Jun 10, 2015 at 12:18 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 At which point would I call cache()?  I just want the runtime to spill
 to disk when necessary without me having to know when the necessary is.


 On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org
 wrote:

 direct stream isn't a receiver, it isn't required to cache data
 anywhere unless you want it to.

 If you want it, just call cache.

 On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 set the storage policy for the DStream RDDs to MEMORY AND DISK - it
 appears the storage level can be specified in the createStream methods but
 not createDirectStream...


 On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
 exhausted spark streaming will resort to keeping new RDDs on disk which
 will prevent it from crashing and hence losing them. Then some memory will
 get freed and it will resort back to RAM and so on and so forth





 Sent from Samsung Mobile

  Original message 

 From: Evo Eftimov

 Date:2015/05/28 13:22 (GMT+00:00)

 To: Dmitry Goldenberg

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 You can always spin new boxes in the background and bring them into
 the cluster fold when fully operational and time that with job relaunch 
 and
 param change



 Kafka offsets are managed automatically for you by the Kafka clients,
 which keep them in ZooKeeper -- don't worry about that as long as you shut
 down your job gracefully. Besides, managing the offsets explicitly is not a
 big deal if necessary





 Sent from Samsung Mobile



  Original message 

 From: Dmitry Goldenberg

 Date:2015/05/28 13:16 (GMT+00:00)

 To: Evo Eftimov

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 Thanks, Evo.  Per the last part of your comment, it sounds like we
 will need to implement a job manager which will be in control of starting
 the jobs, monitoring the status of the Kafka topic(s), shutting jobs down
 and marking them as ones to relaunch, scaling the cluster up/down by
 adding/removing machines, and relaunching the 'suspended' (shut down) 
 jobs.



 I suspect that relaunching the jobs may be tricky since that means
 keeping track of the starter offsets in Kafka topic(s) from which the 
 jobs
 started working on.



 Ideally, we'd want to avoid a re-launch.  The 'suspension' and
 relaunching of jobs, coupled with the wait for the new machines to come
 online may turn out quite time-consuming which will make for lengthy
 request times, and our requests are not asynchronous.  Ideally, the
 currently running jobs would continue to run on the machines currently
 available in the cluster.



 In the scale-down case, the job manager would want to signal to
 Spark's job scheduler not to send work to the node being taken out, find
 out when the last job has finished running on the node, then take the 
 node
 out.



 This is somewhat like changing the number of cylinders in a car
 engine while the car is running...



 Sounds like a great candidate for a set of enhancements in Spark...



 On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 @DG; The key metrics should be



 -  Scheduling delay – its ideal state is to remain constant
 over time and ideally be less than the time of the microbatch window

 -  The average job processing time should remain less than
 the micro-batch window

 -  Number of Lost Jobs – even if there is a single Job lost

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-09 Thread Dmitry Goldenberg
At which point would I call cache()?  I just want the runtime to spill to
disk when necessary, without me having to know when that is necessary.


On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org wrote:

 direct stream isn't a receiver, it isn't required to cache data anywhere
 unless you want it to.

 If you want it, just call cache.

 On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 set the storage policy for the DStream RDDs to MEMORY AND DISK - it
 appears the storage level can be specified in the createStream methods but
 not createDirectStream...


 On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
 exhausted spark streaming will resort to keeping new RDDs on disk which
 will prevent it from crashing and hence losing them. Then some memory will
 get freed and it will resort back to RAM and so on and so forth





 Sent from Samsung Mobile

  Original message 

 From: Evo Eftimov

 Date:2015/05/28 13:22 (GMT+00:00)

 To: Dmitry Goldenberg

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 You can always spin new boxes in the background and bring them into the
 cluster fold when fully operational and time that with job relaunch and
 param change



 Kafka offsets are managed automatically for you by the Kafka clients
 which keep them in ZooKeeper -- don't worry about that as long as you shut down
 your job gracefully. Besides, managing the offsets explicitly is not a big
 deal if necessary





 Sent from Samsung Mobile



  Original message 

 From: Dmitry Goldenberg

 Date:2015/05/28 13:16 (GMT+00:00)

 To: Evo Eftimov

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 Thanks, Evo.  Per the last part of your comment, it sounds like we will
 need to implement a job manager which will be in control of starting the
 jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
 marking them as ones to relaunch, scaling the cluster up/down by
 adding/removing machines, and relaunching the 'suspended' (shut down) jobs.



 I suspect that relaunching the jobs may be tricky since that means
 keeping track of the starter offsets in Kafka topic(s) from which the jobs
 started working on.



 Ideally, we'd want to avoid a re-launch.  The 'suspension' and
 relaunching of jobs, coupled with the wait for the new machines to come
 online may turn out quite time-consuming which will make for lengthy
 request times, and our requests are not asynchronous.  Ideally, the
 currently running jobs would continue to run on the machines currently
 available in the cluster.



 In the scale-down case, the job manager would want to signal to Spark's
 job scheduler not to send work to the node being taken out, find out when
 the last job has finished running on the node, then take the node out.



 This is somewhat like changing the number of cylinders in a car engine
 while the car is running...



 Sounds like a great candidate for a set of enhancements in Spark...



 On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 @DG; The key metrics should be



 -  Scheduling delay – its ideal state is to remain constant
 over time and ideally be less than the time of the microbatch window

 -  The average job processing time should remain less than the
 micro-batch window

 -  Number of Lost Jobs – even if there is a single Job lost
 that means that you have lost all messages for the DStream RDD processed by
 that job due to the previously described spark streaming memory leak
 condition and subsequent crash – described in previous postings submitted
 by me



 You can even go one step further and periodically issue “get/check free
 memory” to see whether it is decreasing relentlessly at a constant rate –
 if it touches a predetermined RAM threshold that should be your third
 metric



 Re the “back pressure” mechanism – this is a Feedback Loop mechanism and
 you can implement one on your own without waiting for Jiras and new
 features whenever they might be implemented by the Spark dev team –
 moreover you can avoid using slow mechanisms such as ZooKeeper and even
 incorporate some Machine Learning in your Feedback Loop to make it handle
 the message consumption rate more intelligently and benefit from ongoing
 online learning – BUT this is STILL about voluntarily sacrificing your
 performance in the name of keeping your system stable

Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-05 Thread Dmitry Goldenberg
Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I
can't seem to locate it exactly on Github. (Yes, to your point, our project
is Spark streaming based). Thank you.
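
In the meantime, the broadcast route Olivier describes below is what we're
prototyping -- roughly the sketch below (loadDictionary() is a hypothetical
loader, and jsc is the JavaSparkContext, e.g. jssc.sparkContext()):

final Map<String, String> dictionary = loadDictionary();   // load once, in the driver
final Broadcast<Map<String, String>> dictBc = jsc.broadcast(dictionary);

messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.foreach(new VoidFunction<String>() {
            @Override
            public void call(String record) throws Exception {
                // Read-only lookup against the broadcast copy cached on the worker.
                String mapped = dictBc.value().get(record);
                // ... use 'mapped' ...
            }
        });
        return null;
    }
});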

On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
 Batch Jobs (besides, anyone can put something like that together in 5 min), while I
 am under the impression that Dmitry is working on a Spark Streaming app



 Besides the Job Server is essentially for sharing the Spark Context
 between multiple threads



 Re Dmitry's initial question – you can load large data sets as Batch
 (Static) RDDs from any Spark Streaming App and then join DStream RDDs
 against them to emulate “lookups”; you can also try the “Lookup RDD” –
 there is a GitHub project



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Friday, June 5, 2015 12:12 AM
 *To:* Yiannis Gkoufas
 *Cc:* Olivier Girardot; user@spark.apache.org
 *Subject:* Re: How to share large resources like dictionaries while
 processing data with Spark ?



 Thanks so much, Yiannis, Olivier, Huang!



 On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Hi there,



 I would recommend checking out
 https://github.com/spark-jobserver/spark-jobserver which I think gives
 the functionality you are looking for.

 I haven't tested it though.



 BR



 On 5 June 2015 at 01:35, Olivier Girardot ssab...@gmail.com wrote:

 You can use it as a broadcast variable, but if it's too large (more than
 1GB I guess), you may need to share it by joining it to the other RDDs
 using some kind of key.

 But this is the kind of thing broadcast variables were designed for.



 Regards,



 Olivier.



 Le jeu. 4 juin 2015 à 23:50, dgoldenberg dgoldenberg...@gmail.com a
 écrit :

 We have some pipelines defined where sometimes we need to load potentially
 large resources such as dictionaries.

 What would be the best strategy for sharing such resources among the
 transformations/actions within a consumer?  Can they be shared somehow
 across the RDD's?

 I'm looking for a way to load such a resource once into the cluster memory
 and have it be available throughout the lifecycle of a consumer...

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: How to share large resources like dictionaries while processing data with Spark ?

2015-06-04 Thread Dmitry Goldenberg
Thanks so much, Yiannis, Olivier, Huang!

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 Hi there,

 I would recommend checking out
 https://github.com/spark-jobserver/spark-jobserver which I think gives
 the functionality you are looking for.
 I haven't tested it though.

 BR

 On 5 June 2015 at 01:35, Olivier Girardot ssab...@gmail.com wrote:

 You can use it as a broadcast variable, but if it's too large (more
 than 1GB I guess), you may need to share it by joining it to the other RDDs
 using some kind of key.
 But this is the kind of thing broadcast variables were designed for.

 Regards,

 Olivier.

 Le jeu. 4 juin 2015 à 23:50, dgoldenberg dgoldenberg...@gmail.com a
 écrit :

 We have some pipelines defined where sometimes we need to load
 potentially
 large resources such as dictionaries.

 What would be the best strategy for sharing such resources among the
 transformations/actions within a consumer?  Can they be shared somehow
 across the RDD's?

 I'm looking for a way to load such a resource once into the cluster
 memory
 and have it be available throughout the lifecycle of a consumer...

 Thanks.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: StreamingListener, anyone?

2015-06-04 Thread Dmitry Goldenberg
Shixiong,

Thanks, interesting point. So if we want to only process one batch then
terminate the consumer, what's the best way to achieve that? Presumably the
listener could set a flag on the driver notifying it that it can terminate.
But the driver is not in a loop, it's basically blocked in
awaitTermination.  So what would be a way to trigger the termination in the
driver?

context.awaitTermination() allows the current thread to wait for the
termination of a context by stop() or by an exception - presumably, we
need to call stop() somewhere or perhaps throw.
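
One shape this could take (a rough sketch; the latch-taking JobListener
constructor is a hypothetical change to our listener) is to have the listener
only signal, and let the main thread do the stopping -- which should also avoid
the dead-lock Shixiong describes below:

final java.util.concurrent.CountDownLatch firstBatchDone =
    new java.util.concurrent.CountDownLatch(1);

// In the listener: signal only -- do NOT call jssc.stop() from the callback.
@Override
public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
    firstBatchDone.countDown();
}

// In the driver:
jssc.addStreamingListener(new JobListener(firstBatchDone));
jssc.start();
firstBatchDone.await();   // instead of blocking in awaitTermination()
jssc.stop(true);          // now the stop happens on the main thread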

Cheers,
- Dmitry

On Thu, Jun 4, 2015 at 3:55 AM, Shixiong Zhu zsxw...@gmail.com wrote:

 You should not call `jssc.stop(true);` in a StreamingListener. It will
 cause a dead-lock: `jssc.stop` won't return until `listenerBus` exits. But
 since `jssc.stop` blocks `StreamingListener`, `listenerBus` cannot exit.

 Best Regards,
 Shixiong Zhu

 2015-06-04 0:39 GMT+08:00 dgoldenberg dgoldenberg...@gmail.com:

 Hi,

 I've got a Spark Streaming driver job implemented and in it, I register a
 streaming listener, like so:

 JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
     Durations.milliseconds(params.getBatchDurationMillis()));
 jssc.addStreamingListener(new JobListener(jssc));

 where JobListener is defined like so

 private static class JobListener implements StreamingListener {

     private JavaStreamingContext jssc;

     JobListener(JavaStreamingContext jssc) {
         this.jssc = jssc;
     }

     @Override
     public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
         System.out.println("Batch completed.");
         jssc.stop(true);
         System.out.println("The job has been stopped.");
     }
 

 I do not seem to be seeing onBatchCompleted being triggered.  Am I doing
 something wrong?

 In this particular case, I was trying to implement a bulk ingest type of
 logic where the first batch is all we're interested in (reading out of a
 Kafka topic with offset reset set to smallest).




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/StreamingListener-anyone-tp23140.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-04 Thread Dmitry Goldenberg
set the storage policy for the DStream RDDs to MEMORY AND DISK - it
appears the storage level can be specified in the createStream methods but
not createDirectStream...
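
(If I'm reading the API right, the equivalent for the direct stream would be to
set the level on the returned DStream rather than passing it to the factory
method -- a sketch:)

JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(/* ... same arguments as before ... */);

// No storage-level parameter on createDirectStream, but a level can be set
// on the resulting DStream; it applies to the RDDs the stream generates.
messages.persist(StorageLevel.MEMORY_AND_DISK());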


On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
 exhausted spark streaming will resort to keeping new RDDs on disk which
 will prevent it from crashing and hence losing them. Then some memory will
 get freed and it will resort back to RAM and so on and so forth





 Sent from Samsung Mobile

  Original message 

 From: Evo Eftimov

 Date:2015/05/28 13:22 (GMT+00:00)

 To: Dmitry Goldenberg

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
 in Kafka or Spark's metrics?



 You can always spin new boxes in the background and bring them into the
 cluster fold when fully operational and time that with job relaunch and
 param change



 Kafka offsets are managed automatically for you by the Kafka clients which
 keep them in ZooKeeper -- don't worry about that as long as you shut down your
 job gracefully. Besides, managing the offsets explicitly is not a big deal if
 necessary





 Sent from Samsung Mobile



  Original message 

 From: Dmitry Goldenberg

 Date:2015/05/28 13:16 (GMT+00:00)

 To: Evo Eftimov

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
 in Kafka or Spark's metrics?



 Thanks, Evo.  Per the last part of your comment, it sounds like we will
 need to implement a job manager which will be in control of starting the
 jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
 marking them as ones to relaunch, scaling the cluster up/down by
 adding/removing machines, and relaunching the 'suspended' (shut down) jobs.



 I suspect that relaunching the jobs may be tricky since that means keeping
 track of the starter offsets in Kafka topic(s) from which the jobs started
 working on.



 Ideally, we'd want to avoid a re-launch.  The 'suspension' and relaunching
 of jobs, coupled with the wait for the new machines to come online may turn
 out quite time-consuming which will make for lengthy request times, and our
 requests are not asynchronous.  Ideally, the currently running jobs would
 continue to run on the machines currently available in the cluster.



 In the scale-down case, the job manager would want to signal to Spark's
 job scheduler not to send work to the node being taken out, find out when
 the last job has finished running on the node, then take the node out.



 This is somewhat like changing the number of cylinders in a car engine
 while the car is running...



 Sounds like a great candidate for a set of enhancements in Spark...



 On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 @DG; The key metrics should be



 -  Scheduling delay – its ideal state is to remain constant over
 time and ideally be less than the time of the microbatch window

 -  The average job processing time should remain less than the
 micro-batch window

 -  Number of Lost Jobs – even if there is a single Job lost that
 means that you have lost all messages for the DStream RDD processed by that
 job due to the previously described spark streaming memory leak condition
 and subsequent crash – described in previous postings submitted by me



 You can even go one step further and periodically issue “get/check free
 memory” to see whether it is decreasing relentlessly at a constant rate –
 if it touches a predetermined RAM threshold that should be your third
 metric



 Re the “back pressure” mechanism – this is a Feedback Loop mechanism and
 you can implement one on your own without waiting for Jiras and new
 features whenever they might be implemented by the Spark dev team –
 moreover you can avoid using slow mechanisms such as ZooKeeper and even
 incorporate some Machine Learning in your Feedback Loop to make it handle
 the message consumption rate more intelligently and benefit from ongoing
 online learning – BUT this is STILL about voluntarily sacrificing your
 performance in the name of keeping your system stable – it is not about
 scaling your system/solution



 In terms of how to scale the Spark Framework Dynamically – even though
 this is not supported at the moment out of the box I guess you can have a
 sys management framework spin dynamically a few more boxes (spark worker
 nodes), stop dynamically your currently running Spark Streaming Job,
 relaunch it with new params e.g. more Receivers, larger number of
 Partitions (hence tasks), more RAM per executor

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Great.

"You should monitor vital performance / job clogging stats of the Spark
Streaming Runtime not “kafka topics”" -- anything specific you were thinking
of?

On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Makes sense especially if you have a cloud with “infinite” resources /
 nodes which allows you to double, triple etc in the background/parallel the
 resources of the currently running cluster



 I was thinking more about the scenario where you have e.g. 100 boxes and
 want to / can add e.g. 20 more



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:46 PM
 *To:* Evo Eftimov
 *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo,



 One of the ideas is to shadow the current cluster. This way there's no
 extra latency incurred due to shutting down of the consumers. If two sets
 of consumers are running, potentially processing the same data, that is OK.
 We phase out the older cluster and gradually flip over to the new one,
 insuring no downtime or extra latency.  Thoughts?



 On Wed, Jun 3, 2015 at 11:27 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You should monitor vital performance / job clogging stats of the Spark
 Streaming Runtime not “kafka topics”



 You should be able to bring new worker nodes online and make them contact
 and register with the Master without bringing down the Master (or any of
 the currently running worker nodes)



 Then just shutdown your currently running spark streaming job/app and
 restart it with new params to take advantage of the larger cluster



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:14 PM
 *To:* Cody Koeninger
 *Cc:* Andrew Or; Evo Eftimov; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Would it be possible to implement Spark autoscaling somewhat along these
 lines? --



 1. If we sense that a new machine is needed, by watching the data load in
 Kafka topic(s), then

 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
 and get a machine);

 3. Create a shadow/mirror Spark master running alongside the initial
 version which talks to N machines. The new mirror version is aware of N+1
 machines (or N+M if we had decided we needed M new boxes).

 4. The previous version of the Spark runtime is
 acquiesced/decommissioned.  We possibly get both clusters working on the
 same data which may actually be OK (at least for our specific use-cases).

 5. Now the new Spark cluster is running.



 Similarly, the decommissioning of M unused boxes would happen, via this
 notion of a mirror Spark runtime.  How feasible would it be for such a
 mirrorlike setup to be created, especially created programmatically?
 Especially point #3.



 The other idea we'd entertained was to bring in a new machine, acquiesce
 down all currently running workers by telling them to process their current
 batch then shut down, then restart the consumers now that Spark is aware of
 a modified cluster.  This has the drawback of a downtime that may not be
 tolerable in terms of latency, by the system's clients waiting for their
 responses in a synchronous fashion.



 Thanks.



 On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org
 wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.



 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,



 As the author of the dynamic allocation feature I can offer a few insights
 here.



 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:



 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.



 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch
 duration, then you'll end up having to constantly kill and restart
 executors. If your idle timeout is greater than the batch duration, then
 you'll never kill executors.



 Long answer short, with Spark streaming there is currently no
 straightforward way to scale the size of your cluster. I had a long
 discussion with TD (Spark streaming lead) about what needs to be done to
 provide some semblance of dynamic scaling to streaming applications, e.g.
 take into account the batch queue instead. We came up

Re: Objects serialized before foreachRDD/foreachPartition ?

2015-06-03 Thread Dmitry Goldenberg
So Evo, option b) is to singleton the Param, as in your modified snippet,
i.e. instantiate it once per RDD.

But if I understand correctly, option a) is broadcast, meaning
instantiation happens in the Driver once, before any transformations and
actions, correct?  That's where my serialization cost concerns were.  There's
Kryo serialization, but Param might still be too heavy.  If some of its
member variables are lazily loaded we may be OK, but it seems that on every
worker node the lazy initialization would then have to happen to load those
lazily loaded resources into Param - ?

public class Param {
   // == potentially a very hefty resource to load
   private Map<String, String> dictionary = new HashMap<String, String>();
   ...
}

I'm grokking that Spark will serialize Param right before the call to
foreachRDD, if we're to broadcast...
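
One variation I'm considering, to keep the shipped closure small either way (a
sketch; loadDictionary() is hypothetical): make the heavy map transient and
lazily loaded, so only the cheap shell of Param travels with the closure and
the load happens on the worker on first use:

public class Param implements Serializable {
    // Not serialized with the closure; re-created lazily on each worker JVM.
    private transient Map<String, String> dictionary;

    public synchronized Map<String, String> getDictionary() {
        if (dictionary == null) {
            dictionary = loadDictionary();   // heavy load, runs where it's first needed
        }
        return dictionary;
    }
}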



On Wed, Jun 3, 2015 at 9:58 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Dmitry was concerned about the “serialization cost” NOT the “memory
 footprint” – hence option a) is still viable since a Broadcast is performed
 only ONCE for the lifetime of the Driver instance



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 2:44 PM
 *To:* Evo Eftimov
 *Cc:* dgoldenberg; user
 *Subject:* Re: Objects serialized before foreachRDD/foreachPartition ?



 Considering memory footprint of param as mentioned by Dmitry, option b
 seems better.



 Cheers



 On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Hmmm, a spark streaming app's code doesn't execute in the linear fashion
 assumed in your previous code snippet - to achieve your objectives you
 should do something like the following

 in terms of your second objective - saving the initialization and
 serialization of the params you can:

 a) broadcast them
 b) have them as a Singleton (initialized from e.g. params in a file on
 HDFS)
 on each Executor

 messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {

     Param param = new Param();
     param.initialize();

     @Override
     public Void call(JavaRDD<String> rdd) throws Exception {
         ProcessPartitionFunction func = new ProcessPartitionFunction(param);
         rdd.foreachPartition(func);
         return null;
     }

 });

 //put this in e.g. the object destructor
 param.deinitialize();


 -Original Message-
 From: dgoldenberg [mailto:dgoldenberg...@gmail.com]
 Sent: Wednesday, June 3, 2015 1:56 PM
 To: user@spark.apache.org
 Subject: Objects serialized before foreachRDD/foreachPartition ?

 I'm looking at https://spark.apache.org/docs/latest/tuning.html.
 Basically
 the takeaway is that all objects passed into the code processing RDD's must
 be serializable. So if I've got a few objects that I'd rather initialize
 once and deinitialize once outside of the logic processing the RDD's, I'd
 need to think twice about the costs of serializing such objects, it would
 seem.

 In the below, does the Spark serialization happen before calling foreachRDD
 or before calling foreachPartition?

 Param param = new Param();
 param.initialize();
 messageBodies.foreachRDD(new Function<JavaRDD<String>, Void>() {
     @Override
     public Void call(JavaRDD<String> rdd) throws Exception {
         ProcessPartitionFunction func = new ProcessPartitionFunction(param);
         rdd.foreachPartition(func);
         return null;
     }
 });
 param.deinitialize();

 If param gets initialized to a significant memory footprint, are we better
 off creating/initializing it before calling new ProcessPartitionFunction()
 or perhaps in the 'call' method within that function?

 I'm trying to avoid calling expensive init()/deinit() methods while
 balancing against the serialization costs. Thanks.



 --
 View this message in context:

 http://apache-spark-user-list.1001560.n3.nabble.com/Objects-serialized-before-foreachRDD-foreachPartition-tp23134.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
If we have a hand-off between the older consumer and the newer consumer, I
wonder if we need to manually manage the offsets in Kafka so as not to miss
some messages as the hand-off is happening.

Or, if we let the new consumer run for a bit and then let the old consumer know
the 'new guy is in town', the old consumer can be shut off.  Some
overlap is OK...

On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Makes sense especially if you have a cloud with “infinite” resources /
 nodes which allows you to double, triple etc in the background/parallel the
 resources of the currently running cluster



 I was thinking more about the scenario where you have e.g. 100 boxes and
 want to / can add e.g. 20 more



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:46 PM
 *To:* Evo Eftimov
 *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo,



 One of the ideas is to shadow the current cluster. This way there's no
 extra latency incurred due to shutting down of the consumers. If two sets
 of consumers are running, potentially processing the same data, that is OK.
 We phase out the older cluster and gradually flip over to the new one,
 insuring no downtime or extra latency.  Thoughts?



 On Wed, Jun 3, 2015 at 11:27 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You should monitor vital performance / job clogging stats of the Spark
 Streaming Runtime not “kafka topics”



 You should be able to bring new worker nodes online and make them contact
 and register with the Master without bringing down the Master (or any of
 the currently running worker nodes)



 Then just shutdown your currently running spark streaming job/app and
 restart it with new params to take advantage of the larger cluster



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:14 PM
 *To:* Cody Koeninger
 *Cc:* Andrew Or; Evo Eftimov; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Would it be possible to implement Spark autoscaling somewhat along these
 lines? --



 1. If we sense that a new machine is needed, by watching the data load in
 Kafka topic(s), then

 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
 and get a machine);

 3. Create a shadow/mirror Spark master running alongside the initial
 version which talks to N machines. The new mirror version is aware of N+1
 machines (or N+M if we had decided we needed M new boxes).

 4. The previous version of the Spark runtime is
 acquiesced/decommissioned.  We possibly get both clusters working on the
 same data which may actually be OK (at least for our specific use-cases).

 5. Now the new Spark cluster is running.



 Similarly, the decommissioning of M unused boxes would happen, via this
 notion of a mirror Spark runtime.  How feasible would it be for such a
 mirrorlike setup to be created, especially created programmatically?
 Especially point #3.



 The other idea we'd entertained was to bring in a new machine, acquiesce
 down all currently running workers by telling them to process their current
 batch then shut down, then restart the consumers now that Spark is aware of
 a modified cluster.  This has the drawback of a downtime that may not be
 tolerable in terms of latency, by the system's clients waiting for their
 responses in a synchronous fashion.



 Thanks.



 On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org
 wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.



 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,



 As the author of the dynamic allocation feature I can offer a few insights
 here.



 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:



 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.



 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch
 duration, then you'll end up having to constantly kill and restart
 executors. If your idle timeout is greater than the batch duration, then
 you'll never kill executors.



 Long answer short, with Spark streaming there is currently no
 straightforward way to scale the size of your cluster. I had a long
 discussion

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
Would it be possible to implement Spark autoscaling somewhat along these
lines? --

1. If we sense that a new machine is needed, by watching the data load in
Kafka topic(s), then
2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
and get a machine);
3. Create a shadow/mirror Spark master running alongside the initial
version which talks to N machines. The new mirror version is aware of N+1
machines (or N+M if we had decided we needed M new boxes).
4. The previous version of the Spark runtime is quiesced/decommissioned.
We possibly get both clusters working on the same data which may actually
be OK (at least for our specific use-cases).
5. Now the new Spark cluster is running.

Similarly, the decommissioning of M unused boxes would happen, via this
notion of a mirror Spark runtime.  How feasible would it be for such a
mirrorlike setup to be created, especially created programmatically?
Especially point #3.
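
For illustration, a minimal sketch of what such a Provisioner interface and the
mirror-cluster step might look like; every name here (Provisioner,
MirrorClusterController, the methods) is hypothetical, not an existing API:

import java.util.List;

/** Hypothetical abstraction over the cloud/VM layer used by the autoscaler. */
public interface Provisioner {
  /** Bring up 'count' worker boxes (e.g. via AWS) and block until they are reachable. */
  List<String> provisionWorkers(int count);

  /** Tear down the given boxes once they have been drained. */
  void decommissionWorkers(List<String> hostnames);
}

/** Hypothetical controller implementing steps 1-5 above via a shadow cluster. */
class MirrorClusterController {
  private final Provisioner provisioner;

  MirrorClusterController(Provisioner provisioner) {
    this.provisioner = provisioner;
  }

  void scaleUp(int extraBoxes) {
    List<String> newHosts = provisioner.provisionWorkers(extraBoxes);  // steps 1-2
    // Step 3: start a mirror Spark master plus workers that include newHosts
    //         (e.g. by running the standalone start-master.sh / start-slave.sh remotely).
    // Step 4: launch the streaming job against the mirror master, quiesce the old cluster.
    // Step 5: the mirror cluster becomes the primary.
  }
}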

The other idea we'd entertained was to bring in a new machine, quiesce all
currently running workers by telling them to process their current
batch then shut down, then restart the consumers now that Spark is aware of
a modified cluster.  This has the drawback of a downtime that may not be
tolerable in terms of latency, by the system's clients waiting for their
responses in a synchronous fashion.

Thanks.

On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.

 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,

 As the author of the dynamic allocation feature I can offer a few
 insights here.

 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:

 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.

 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch
 duration, then you'll end up having to constantly kill and restart
 executors. If your idle timeout is greater than the batch duration, then
 you'll never kill executors.

 Long answer short, with Spark streaming there is currently no
 straightforward way to scale the size of your cluster. I had a long
 discussion with TD (Spark streaming lead) about what needs to be done to
 provide some semblance of dynamic scaling to streaming applications, e.g.
 take into account the batch queue instead. We came up with a few ideas that
 I will not detail here, but we are looking into this and do intend to
 support it in the near future.

 -Andrew



 2015-05-28 8:02 GMT-07:00 Evo Eftimov evo.efti...@isecc.com:

 Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK
 – it will be your insurance policy against sys crashes due to memory leaks.
 As long as there is free RAM, Spark Streaming will NOT resort to disk –
 and of course it will resort to disk from time to time (i.e. when there is no
 free RAM), taking a performance hit from that, but only while there is no
 free RAM



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Thursday, May 28, 2015 2:34 PM
 *To:* Evo Eftimov
 *Cc:* Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo, good points.



 On the dynamic resource allocation, I'm surmising this only works within
 a particular cluster setup.  So it improves the usage of current cluster
 resources but it doesn't make the cluster itself elastic. At least, that's
 my understanding.



 Memory + disk would be good and hopefully it'd take *huge* load on the
 system to start exhausting the disk space too.  I'd guess that falling onto
 disk will make things significantly slower due to the extra I/O.



 Perhaps we'll really want all of these elements eventually.  I think
 we'd want to start with memory only, keeping maxRate low enough not to
 overwhelm the consumers; implement the cluster autoscaling.  We might
 experiment with dynamic resource allocation before we get to implement the
 cluster autoscale.







 On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-03 Thread Dmitry Goldenberg
I think what we'd want to do is track the ingestion rate in the consumer(s)
via Spark's aggregation functions and such. If we're at a critical level
(load too high / load too low) then we issue a request into our
Provisioning Component to add/remove machines. Once it comes back with an
OK, each consumer can finish its current batch, then terminate itself,
and restart with a new context.  The new context would be aware of the
updated cluster - correct?  Therefore the refreshed consumer would restart
on the updated cluster.

Could we even terminate the consumer immediately upon sensing a critical
event?  When it would restart, could it resume right where it left off?
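
A rough sketch of the sensing half of that: counting records per batch with a
Spark aggregation inside foreachRDD and calling out when thresholds are crossed.
ProvisioningClient and the watermark values are hypothetical, and 'stream' is
assumed to be the direct Kafka input DStream:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

final long highWatermark = 500000;  // made-up "too hot" threshold, records per batch
final long lowWatermark = 10000;    // made-up "too cold" threshold

stream.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
  @Override
  public Void call(JavaPairRDD<String, String> rdd) {
    long recordsInBatch = rdd.count();          // Spark aggregation over this batch
    if (recordsInBatch > highWatermark) {
      ProvisioningClient.requestScaleUp();      // hypothetical call into the provisioning component
    } else if (recordsInBatch < lowWatermark) {
      ProvisioningClient.requestScaleDown();    // hypothetical
    }
    return null;
  }
});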

On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 Makes sense especially if you have a cloud with “infinite” resources /
 nodes which allows you to double, triple etc in the background/parallel the
 resources of the currently running cluster



 I was thinking more about the scenario where you have e.g. 100 boxes and
 want to / can add e.g. 20 more



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:46 PM
 *To:* Evo Eftimov
 *Cc:* Cody Koeninger; Andrew Or; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo,



 One of the ideas is to shadow the current cluster. This way there's no
 extra latency incurred due to shutting down of the consumers. If two sets
 of consumers are running, potentially processing the same data, that is OK.
 We phase out the older cluster and gradually flip over to the new one,
 ensuring no downtime or extra latency.  Thoughts?



 On Wed, Jun 3, 2015 at 11:27 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You should monitor vital performance / job clogging stats of the Spark
 Streaming Runtime not “kafka topics”



 You should be able to bring new worker nodes online and make them contact
 and register with the Master without bringing down the Master (or any of
 the currently running worker nodes)



 Then just shutdown your currently running spark streaming job/app and
 restart it with new params to take advantage of the larger cluster



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Wednesday, June 3, 2015 4:14 PM
 *To:* Cody Koeninger
 *Cc:* Andrew Or; Evo Eftimov; Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Would it be possible to implement Spark autoscaling somewhat along these
 lines? --



 1. If we sense that a new machine is needed, by watching the data load in
 Kafka topic(s), then

 2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
 and get a machine);

 3. Create a shadow/mirror Spark master running alongside the initial
 version which talks to N machines. The new mirror version is aware of N+1
 machines (or N+M if we had decided we needed M new boxes).

 4. The previous version of the Spark runtime is
 quiesced/decommissioned.  We possibly get both clusters working on the
 same data which may actually be OK (at least for our specific use-cases).

 5. Now the new Spark cluster is running.



 Similarly, the decommissioning of M unused boxes would happen, via this
 notion of a mirror Spark runtime.  How feasible would it be for such a
 mirrorlike setup to be created, especially created programmatically?
 Especially point #3.



 The other idea we'd entertained was to bring in a new machine, quiesce all
 currently running workers by telling them to process their current
 batch then shut down, then restart the consumers now that Spark is aware of
 a modified cluster.  This has the drawback of a downtime that may not be
 tolerable in terms of latency, by the system's clients waiting for their
 responses in a synchronous fashion.



 Thanks.



 On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org
 wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.



 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,



 As the author of the dynamic allocation feature I can offer a few insights
 here.



 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:



 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.



 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis.

- Dmitry


On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 I think you can use SPM - http://sematext.com/spm - it will give you all
 Spark and all Kafka metrics, including offsets broken down by topic, etc.
 out of the box.  I see more and more people using it to monitor various
 components in data processing pipelines, a la
 http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/

 Otis

 On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 What are some of the good/adopted approached to monitoring Spark Streaming
 from Kafka?  I see that there are things like
 http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
 assume that Receiver-based streaming is used?

 Then there's the note that one disadvantage of this approach (the receiverless
 approach, #2) is that it does not update offsets in ZooKeeper, hence
 ZooKeeper-based Kafka monitoring tools will not show progress. However, you can
 access the offsets processed by this approach in each batch and update
 ZooKeeper yourself.

 The code sample, however, seems sparse. What do you need to do here? -
  directKafkaStream.foreachRDD(
    new Function<JavaPairRDD<String, String>, Void>() {
      @Override
      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        // offsetRanges.length = # of Kafka partitions being consumed
        ...
        return null;
      }
    }
  );

 and if these are updated, will KafkaOffsetMonitor work?

 Monitoring seems to center around the notion of a consumer group.  But in
 the receiverless approach, code on the Spark consumer side doesn't seem to
 expose a consumer group parameter.  Where does it go?  Can I/should I just
 pass in group.id as part of the kafkaParams HashMap?
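
 (For illustration, here is roughly what passing group.id through kafkaParams
 looks like with the direct stream; the broker, group and topic names are made
 up, and note the direct stream itself does not commit offsets to ZooKeeper
 under that group, which is exactly the open question:)

 import java.util.*;
 import kafka.serializer.StringDecoder;
 import org.apache.spark.streaming.api.java.*;
 import org.apache.spark.streaming.kafka.KafkaUtils;

 // assuming 'jssc' is the JavaStreamingContext
 Map<String, String> kafkaParams = new HashMap<String, String>();
 kafkaParams.put("metadata.broker.list", "broker1:9092");  // made-up broker list
 kafkaParams.put("group.id", "my-consumer-group");         // made-up group id

 Set<String> topics = new HashSet<String>(Arrays.asList("my-topic"));  // made-up topic

 JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
     jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
     kafkaParams, topics);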

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Which would imply that if there were a load manager type of service, it
could signal to the driver(s) that they need to quiesce, i.e. process
what's at hand and terminate.  Then bring up a new machine, then restart
the driver(s)...  Same deal with removing machines from the cluster. Send a
signal for the drivers to pipe down and terminate, then restart them.
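
In code, the "process what's at hand and terminate" part maps onto the
streaming context's stop flags; a minimal sketch, where shutdownRequested()
stands in for however the hypothetical load manager actually delivers the
signal (a ZooKeeper node, a REST call, a marker file):

jssc.start();
// Poll for the load manager's signal while the job runs.
while (!shutdownRequested()) {               // hypothetical helper
  jssc.awaitTerminationOrTimeout(10000);     // stay alive, re-checking every 10s
}
// Finish the batches already received, then stop the streaming and Spark contexts.
jssc.stop(true /* stopSparkContext */, true /* stopGracefully */);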

On Thu, May 28, 2015 at 5:15 PM, Cody Koeninger c...@koeninger.org wrote:

 I'm not sure that points 1 and 2 really apply to the kafka direct stream.
 There are no receivers, and you know at the driver how big each of your
 batches is.

 On Thu, May 28, 2015 at 2:21 PM, Andrew Or and...@databricks.com wrote:

 Hi all,

 As the author of the dynamic allocation feature I can offer a few
 insights here.

 Gerard's explanation was both correct and concise: dynamic allocation is
 not intended to be used in Spark streaming at the moment (1.4 or before).
 This is because of two things:

 (1) Number of receivers is necessarily fixed, and these are started in
 executors. Since we need a receiver for each InputDStream, if we kill these
 receivers we essentially stop the stream, which is not what we want. It
 makes little sense to close and restart a stream the same way we kill and
 relaunch executors.

 (2) Records come in every batch, and when there is data to process your
 executors are not idle. If your idle timeout is less than the batch
 duration, then you'll end up having to constantly kill and restart
 executors. If your idle timeout is greater than the batch duration, then
 you'll never kill executors.

 Long answer short, with Spark streaming there is currently no
 straightforward way to scale the size of your cluster. I had a long
 discussion with TD (Spark streaming lead) about what needs to be done to
 provide some semblance of dynamic scaling to streaming applications, e.g.
 take into account the batch queue instead. We came up with a few ideas that
 I will not detail here, but we are looking into this and do intend to
 support it in the near future.

 -Andrew



 2015-05-28 8:02 GMT-07:00 Evo Eftimov evo.efti...@isecc.com:

 Probably you should ALWAYS keep the RDD storage policy to MEMORY AND DISK
 – it will be your insurance policy against sys crashes due to memory leaks.
 As long as there is free RAM, Spark Streaming will NOT resort to disk –
 and of course it will resort to disk from time to time (i.e. when there is no
 free RAM), taking a performance hit from that, but only while there is no
 free RAM



 *From:* Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
 *Sent:* Thursday, May 28, 2015 2:34 PM
 *To:* Evo Eftimov
 *Cc:* Gerard Maas; spark users
 *Subject:* Re: FW: Re: Autoscaling Spark cluster based on topic
 sizes/rate of growth in Kafka or Spark's metrics?



 Evo, good points.



 On the dynamic resource allocation, I'm surmising this only works within
 a particular cluster setup.  So it improves the usage of current cluster
 resources but it doesn't make the cluster itself elastic. At least, that's
 my understanding.



 Memory + disk would be good and hopefully it'd take *huge* load on the
 system to start exhausting the disk space too.  I'd guess that falling onto
 disk will make things significantly slower due to the extra I/O.



 Perhaps we'll really want all of these elements eventually.  I think
 we'd want to start with memory only, keeping maxRate low enough not to
 overwhelm the consumers; implement the cluster autoscaling.  We might
 experiment with dynamic resource allocation before we get to implement the
 cluster autoscale.







 On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
 exhausted spark streaming will resort to keeping new RDDs on disk which
 will prevent it from crashing and hence losing them. Then some memory will
 get freed and it will resort back to RAM and so on and so forth





 Sent from Samsung Mobile

  Original message 

 From: Evo Eftimov

 Date:2015/05/28 13:22 (GMT+00:00)

 To: Dmitry Goldenberg

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 You can always spin new boxes in the background and bring them into the
 cluster fold when fully operational and time that with job relaunch and
 param change



 Kafka offsets are managed automatically for you by the Kafka clients,
 which keep them in ZooKeeper; don't worry about that as long as you shut down
 your job gracefully. Besides, managing the offsets explicitly is not a big
 deal if necessary





 Sent from Samsung Mobile



  Original message 

 From

Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thank you, Gerard.

We're looking at the receiver-less setup with Kafka Spark streaming so I'm
not sure how to apply your comments to that case (not that we have to use
receiver-less but it seems to offer some advantages over the
receiver-based).

As far as the number of Kafka receivers is fixed for the lifetime of your
DStream -- this may be OK to start with. What I'm researching is the
ability to add worker nodes to the Spark cluster when needed and remove
them when no longer needed.  Do I understand correctly that a single
receiver may cause work to be farmed out to multiple 'slave'
machines/worker nodes?  If that's the case, we're less concerned with
multiple receivers; we're concerned with the worker node cluster itself.

If we use the ConsumerOffsetChecker class in Kafka that Rajesh mentioned
and instrument dynamic adding/removal of machines, my subsequent questions
then are, a) will Spark sense the addition of a new node / is it sufficient
that the cluster manager is aware, then work just starts flowing there?
 and  b) what would be a way to gracefully remove a worker node when the
load subsides, so that no currently running Spark job is killed?

- Dmitry

On Thu, May 28, 2015 at 7:36 AM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi,

 tl;dr At the moment (with a BIG disclaimer *) elastic scaling of spark
 streaming processes is not supported.


 *Longer version.*

 I assume that you are talking about Spark Streaming as the discussion is
 about handing Kafka streaming data.

 Then you have two things to consider: the Streaming receivers and the
 Spark processing cluster.

 Currently, the receiving topology is static. One receiver is allocated
 with each DStream instantiated and it will use 1 core in the cluster. Once
 the StreamingContext is started, this topology cannot be changed, therefore
 the number of Kafka receivers is fixed for the lifetime of your DStream.
 What we do is to calculate the cluster capacity and use that as a fixed
 upper bound (with a margin) for the receiver throughput.

 There's work in progress to add a reactive model to the receiver, where
 backpressure can be applied to handle overload conditions. See
 https://issues.apache.org/jira/browse/SPARK-7398

 Once the data is received, it will be processed in a 'classical' Spark
 pipeline, so previous posts on spark resource scheduling might apply.

 Regarding metrics, the standard metrics subsystem of spark will report
 streaming job performance. Check the driver's metrics endpoint to peruse
 the available metrics:

 driver:ui-port/metrics/json

 -kr, Gerard.


 (*) Spark is a project that moves so fast that statements might be
 invalidated by new work every minute.

 On Thu, May 28, 2015 at 1:21 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 I'm trying to understand if there are design patterns for autoscaling
 Spark
 (add/remove slave machines to the cluster) based on the throughput.

 Assuming we can throttle Spark consumers, the respective Kafka topics we
 stream data from would start growing.  What are some of the ways to
 generate
 the metrics on the number of new messages and the rate they are piling up?
 This perhaps is more of a Kafka question; I see a pretty sparse javadoc
 with
 the Metric interface and not much else...

 What are some of the ways to expand/contract the Spark cluster? Someone
 has
 mentioned Mesos...

 I see some info on Spark metrics in  the Spark monitoring guide
 https://spark.apache.org/docs/latest/monitoring.html  .  Do we want to
 perhaps implement a custom sink that would help us autoscale up or down
 based on the throughput?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tp23062.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Thanks, Evo.  Per the last part of your comment, it sounds like we will
need to implement a job manager which will be in control of starting the
jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
marking them as ones to relaunch, scaling the cluster up/down by
adding/removing machines, and relaunching the 'suspended' (shut down) jobs.

I suspect that relaunching the jobs may be tricky since that means keeping
track of the starting offsets in the Kafka topic(s) the jobs had been
working on.

Ideally, we'd want to avoid a re-launch.  The 'suspension' and relaunching
of jobs, coupled with the wait for the new machines to come online may turn
out quite time-consuming which will make for lengthy request times, and our
requests are not asynchronous.  Ideally, the currently running jobs would
continue to run on the machines currently available in the cluster.

In the scale-down case, the job manager would want to signal to Spark's job
scheduler not to send work to the node being taken out, find out when the
last job has finished running on the node, then take the node out.

This is somewhat like changing the number of cylinders in a car engine
while the car is running...

Sounds like a great candidate for a set of enhancements in Spark...

On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 @DG; The key metrics should be



 -  Scheduling delay – its ideal state is to remain constant over
 time and ideally be less than the time of the microbatch window

 -  The average job processing time should remain less than the
 micro-batch window

 -  Number of Lost Jobs – even if there is a single Job lost that
 means that you have lost all messages for the DStream RDD processed by that
 job due to the previously described spark streaming memory leak condition
 and subsequent crash – described in previous postings submitted by me



 You can even go one step further and periodically issue “get/check free
 memory” to see whether it is decreasing relentlessly at a constant rate –
 if it touches a predetermined RAM threshold that should be your third
 metric



 Re the “back pressure” mechanism – this is a Feedback Loop mechanism and
 you can implement one on your own without waiting for Jiras and new
 features whenever they might be implemented by the Spark dev team –
 moreover you can avoid using slow mechanisms such as ZooKeeper and even
 incorporate some Machine Learning in your Feedback Loop to make it handle
 the message consumption rate more intelligently and benefit from ongoing
 online learning – BUT this is STILL about voluntarily sacrificing your
 performance in the name of keeping your system stable – it is not about
 scaling your system/solution



 In terms of how to scale the Spark Framework Dynamically – even though
 this is not supported at the moment out of the box I guess you can have a
 sys management framework spin dynamically a few more boxes (spark worker
 nodes), stop dynamically your currently running Spark Streaming Job,
 relaunch it with new params e.g. more Receivers, larger number of
 Partitions (hence tasks), more RAM per executor etc. Obviously this will
 cause some temporary delay in fact interruption in your processing but if
 the business use case can tolerate that then go for it



 *From:* Gerard Maas [mailto:gerard.m...@gmail.com]
 *Sent:* Thursday, May 28, 2015 12:36 PM
 *To:* dgoldenberg
 *Cc:* spark users
 *Subject:* Re: Autoscaling Spark cluster based on topic sizes/rate of
 growth in Kafka or Spark's metrics?



 Hi,



 tl;dr At the moment (with a BIG disclaimer *) elastic scaling of spark
 streaming processes is not supported.





 *Longer version.*



 I assume that you are talking about Spark Streaming as the discussion is
 about handling Kafka streaming data.



 Then you have two things to consider: the Streaming receivers and the
 Spark processing cluster.



 Currently, the receiving topology is static. One receiver is allocated
 with each DStream instantiated and it will use 1 core in the cluster. Once
 the StreamingContext is started, this topology cannot be changed, therefore
 the number of Kafka receivers is fixed for the lifetime of your DStream.

 What we do is to calculate the cluster capacity and use that as a fixed
 upper bound (with a margin) for the receiver throughput.



 There's work in progress to add a reactive model to the receiver, where
 backpressure can be applied to handle overload conditions. See
 https://issues.apache.org/jira/browse/SPARK-7398



 Once the data is received, it will be processed in a 'classical' Spark
 pipeline, so previous posts on spark resource scheduling might apply.



 Regarding metrics, the standard metrics subsystem of spark will report
 streaming job performance. Check the driver's metrics endpoint to peruse
 the available metrics:



 driver:ui-port/metrics/json



 -kr, Gerard.





 (*) Spark is a project that moves so fast that statements might be
 

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-05-28 Thread Dmitry Goldenberg
Evo, good points.

On the dynamic resource allocation, I'm surmising this only works within a
particular cluster setup.  So it improves the usage of current cluster
resources but it doesn't make the cluster itself elastic. At least, that's
my understanding.
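
For reference, turning dynamic allocation on is just configuration (per the
linked 1.3.1 page it is YARN-only and needs the external shuffle service); a
sketch with made-up executor bounds:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .setAppName("streaming-with-dynamic-allocation")
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")        // external shuffle service is required
    .set("spark.dynamicAllocation.minExecutors", "2")    // made-up bounds
    .set("spark.dynamicAllocation.maxExecutors", "20")
    .set("spark.dynamicAllocation.executorIdleTimeout", "60");  // seconds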

Memory + disk would be good and hopefully it'd take *huge* load on the
system to start exhausting the disk space too.  I'd guess that falling onto
disk will make things significantly slower due to the extra I/O.
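
For completeness, the memory + disk policy Evo mentions below is just the
DStream's storage level; a one-liner sketch, assuming 'stream' is the Kafka
input DStream:

import org.apache.spark.storage.StorageLevel;

// Spill the stream's RDD blocks to disk rather than dropping them when memory runs out.
stream.persist(StorageLevel.MEMORY_AND_DISK());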

Perhaps we'll really want all of these elements eventually.  I think we'd
want to start with memory only, keeping maxRate low enough not to overwhelm
the consumers; implement the cluster autoscaling.  We might experiment with
dynamic resource allocation before we get to implement the cluster
autoscale.



On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:

 You can also try Dynamic Resource Allocation




 https://spark.apache.org/docs/1.3.1/job-scheduling.html#dynamic-resource-allocation



 Also re the Feedback Loop for automatic message consumption rate
 adjustment – there is a “dumb” solution option – simply set the storage
 policy for the DStream RDDs to MEMORY AND DISK – when the memory gets
 exhausted spark streaming will resort to keeping new RDDs on disk which
 will prevent it from crashing and hence losing them. Then some memory will
 get freed and it will resort back to RAM and so on and so forth





 Sent from Samsung Mobile

  Original message 

 From: Evo Eftimov

 Date:2015/05/28 13:22 (GMT+00:00)

 To: Dmitry Goldenberg

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
 in Kafka or Spark's metrics?



 You can always spin new boxes in the background and bring them into the
 cluster fold when fully operational and time that with job relaunch and
 param change



 Kafka offsets are managed automatically for you by the Kafka clients, which
 keep them in ZooKeeper; don't worry about that as long as you shut down your
 job gracefully. Besides, managing the offsets explicitly is not a big deal if
 necessary





 Sent from Samsung Mobile



  Original message 

 From: Dmitry Goldenberg

 Date:2015/05/28 13:16 (GMT+00:00)

 To: Evo Eftimov

 Cc: Gerard Maas ,spark users

 Subject: Re: Autoscaling Spark cluster based on topic sizes/rate of growth
 in Kafka or Spark's metrics?



 Thanks, Evo.  Per the last part of your comment, it sounds like we will
 need to implement a job manager which will be in control of starting the
 jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
 marking them as ones to relaunch, scaling the cluster up/down by
 adding/removing machines, and relaunching the 'suspended' (shut down) jobs.



 I suspect that relaunching the jobs may be tricky since that means keeping
 track of the starting offsets in the Kafka topic(s) the jobs had been
 working on.



 Ideally, we'd want to avoid a re-launch.  The 'suspension' and relaunching
 of jobs, coupled with the wait for the new machines to come online may turn
 out quite time-consuming which will make for lengthy request times, and our
 requests are not asynchronous.  Ideally, the currently running jobs would
 continue to run on the machines currently available in the cluster.



 In the scale-down case, the job manager would want to signal to Spark's
 job scheduler not to send work to the node being taken out, find out when
 the last job has finished running on the node, then take the node out.



 This is somewhat like changing the number of cylinders in a car engine
 while the car is running...



 Sounds like a great candidate for a set of enhancements in Spark...



 On Thu, May 28, 2015 at 7:52 AM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 @DG; The key metrics should be



 -  Scheduling delay – its ideal state is to remain constant over
 time and ideally be less than the time of the microbatch window

 -  The average job processing time should remain less than the
 micro-batch window

 -  Number of Lost Jobs – even if there is a single Job lost that
 means that you have lost all messages for the DStream RDD processed by that
 job due to the previously described spark streaming memory leak condition
 and subsequent crash – described in previous postings submitted by me



 You can even go one step further and periodically issue “get/check free
 memory” to see whether it is decreasing relentlessly at a constant rate –
 if it touches a predetermined RAM threshold that should be your third
 metric



 Re the “back pressure” mechanism – this is a Feedback Loop mechanism and
 you can implement one on your own without waiting for Jiras and new
 features whenever they might be implemented by the Spark dev team –
 moreover you can avoid using slow mechanisms such as ZooKeeper and even
 incorporate some Machine Learning in your Feedback Loop to make it handle
 the message consumption rate more intelligently

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted.

Could you comment on my other question
http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html
as well?  Basically, I'm trying to get a handle on a good approach to
throttling, on the one hand, and autoscaling the cluster, on the
other.  Are there any recommended approaches or design patterns for
autoscaling that you have implemented or could point me at? Thanks!

On Wed, May 27, 2015 at 8:08 PM, Tathagata Das t...@databricks.com wrote:

 You can throttle the no receiver direct Kafka stream using
 spark.streaming.kafka.maxRatePerPartition
 http://spark.apache.org/docs/latest/configuration.html#spark-streaming
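
 (For example, with a made-up cap of 1000 records per partition per second:)

 SparkConf conf = new SparkConf()
     .setAppName("throttled-direct-stream")
     .set("spark.streaming.kafka.maxRatePerPartition", "1000");  // made-up cap, records/sec/partition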


 On Wed, May 27, 2015 at 4:34 PM, Ted Yu yuzhih...@gmail.com wrote:

 Have you seen
 http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
 ?

 Cheers

 On Wed, May 27, 2015 at 4:11 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Hi,

 With the no receivers approach to streaming from Kafka, is there a way to
 set something like spark.streaming.receiver.maxRate so as not to
 overwhelm
 the Spark consumers?

 What would be some of the ways to throttle the streamed messages so that
 the
 consumers don't run out of memory?





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-from-Kafka-no-receivers-and-spark-streaming-receiver-maxRate-tp23061.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark Streaming and reducing latency

2015-05-18 Thread Dmitry Goldenberg
Thanks, Akhil. So what do folks typically do to increase/contract the capacity? 
Do you plug in some cluster auto-scaling solution to make this elastic?

Does Spark have any hooks for instrumenting auto-scaling?

In other words, how do you avoid overwhelming the receivers in a scenario where
your system's input can be unpredictable, based on users' activity?

 On May 17, 2015, at 11:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 With receiver based streaming, you can actually specify 
 spark.streaming.blockInterval which is the interval at which the receiver 
 will fetch data from the source. Default value is 200ms and hence if your 
 batch duration is 1 second, it will produce 5 blocks of data. And yes, with
 Spark Streaming, when your processing time goes beyond your batch duration and
 you have a higher data consumption rate, you will overwhelm the
 receiver's memory and it will throw 'block not found' exceptions.
 
 Thanks
 Best Regards
 
 On Sun, May 17, 2015 at 7:21 PM, dgoldenberg dgoldenberg...@gmail.com 
 wrote:
 I keep hearing the argument that the way Discretized Streams work with Spark
 Streaming is a lot more of a batch processing algorithm than true streaming.
 For streaming, one would expect a new item, e.g. in a Kafka topic, to be
 available to the streaming consumer immediately.
 
 With the discretized streams, streaming is done with batch intervals i.e.
 the consumer has to wait the interval to be able to get at the new items. If
 one wants to reduce latency it seems the only way to do this would be by
 reducing the batch interval window. However, that may lead to a great deal
 of churn, with many requests going into Kafka out of the consumers,
 potentially with no results whatsoever as there's nothing new in the topic
 at the moment.
 
 Is there a counter-argument to this reasoning? What are some of the general
 approaches to reduce latency  folks might recommend? Or, perhaps there are
 ways of dealing with this at the streaming API level?
 
 If latency is of great concern, is it better to look into streaming from
 something like Flume where data is pushed to consumers rather than pulled by
 them? Are there techniques, in that case, to ensure the consumers don't get
 overwhelmed with new data?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-reducing-latency-tp22922.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're
doing this: https://www.rabbitmq.com/mqtt.html.

On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 I found two examples Java version
 https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java,
 and Scala version. https://github.com/d1eg0/spark-streaming-toy

 Thanks
 Best Regards

 On Tue, May 12, 2015 at 2:31 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Are there existing or under development versions/modules for streaming
 messages out of RabbitMQ with SparkStreaming, or perhaps a RabbitMQ RDD?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-RabbitMQ-tp22852.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Sean,

How does this model actually work? Let's say we want to run one job as N
threads executing one particular task, e.g. streaming data out of Kafka
into a search engine.  How do we configure our Spark job execution?

Right now, I'm seeing this job running as a single thread. And it's quite a
bit slower than just running a simple utility with a thread executor with a
thread pool of N threads doing the same task.

The performance I'm seeing of running the Kafka-Spark Streaming job is 7
times slower than that of the utility.  What's pulling Spark back?

Thanks.


On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote:

 You have one worker with one executor with 32 execution slots.

 On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:
  Hi,
 
  Is there anything special one must do, running locally and submitting a
 job
  like so:
 
  spark-submit \
  --class com.myco.Driver \
  --master local[*]  \
  ./lib/myco.jar
 
  In my logs, I'm only seeing log messages with the thread identifier of
  Executor task launch worker-0.
 
  There are 4 cores on the machine so I expected 4 threads to be at play.
  Running with local[32] did not yield 32 worker threads.
 
  Any recommendations? Thanks.
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Thanks, Sean. This was not yet digested data for me :)

The number of partitions in a streaming RDD is determined by the
block interval and the batch interval.  I have seen the bit on
spark.streaming.blockInterval
in the doc but I didn't connect it with the batch interval and the number
of partitions.
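
For instance, with receiver-based input (this knob does not apply to the direct
Kafka stream) the arithmetic works out like this; the values are only
illustrative:

SparkConf conf = new SparkConf()
    .setAppName("block-interval-example")
    .set("spark.streaming.blockInterval", "1000");  // milliseconds, i.e. 1s blocks

// 10s batch interval / 1s block interval => ~10 blocks, hence ~10 partitions per batch RDD
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));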

On Mon, May 11, 2015 at 5:34 PM, Sean Owen so...@cloudera.com wrote:

 You might have a look at the Spark docs to start. 1 batch = 1 RDD, but
 1 RDD can have many partitions. And should, for scale. You do not
 submit multiple jobs to get parallelism.

 The number of partitions in a streaming RDD is determined by the block
 interval and the batch interval. If you have a batch interval of 10s
 and block interval of 1s you'll get 10 partitions of data in the RDD.

 On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  Understood. We'll use the multi-threaded code we already have..
 
  How are these execution slots filled up? I assume each slot is dedicated
 to
  one submitted task.  If that's the case, how is each task distributed
 then,
  i.e. how is that task run in a multi-node fashion?  Say 1000
 batches/RDD's
  are extracted out of Kafka, how does that relate to the number of
 executors
  vs. task slots?
 
  Presumably we can fill up the slots with multiple instances of the same
  task... How do we know how many to launch?
 
  On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote:
 
  BTW I think my comment was wrong as marcelo demonstrated. In
  standalone mode you'd have one worker, and you do have one executor,
  but his explanation is right. But, you certainly have execution slots
  for each core.
 
  Are you talking about your own user code? you can make threads, but
  that's nothing do with Spark then. If you run code on your driver,
  it's not distributed. If you run Spark over an RDD with 1 partition,
  only one task works on it.
 
  On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
  dgoldenberg...@gmail.com wrote:
   Sean,
  
   How does this model actually work? Let's say we want to run one job
 as N
   threads executing one particular task, e.g. streaming data out of
 Kafka
   into
   a search engine.  How do we configure our Spark job execution?
  
   Right now, I'm seeing this job running as a single thread. And it's
   quite a
   bit slower than just running a simple utility with a thread executor
   with a
   thread pool of N threads doing the same task.
  
   The performance I'm seeing of running the Kafka-Spark Streaming job
 is 7
   times slower than that of the utility.  What's pulling Spark back?
  
   Thanks.
  
  
   On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com
 wrote:
  
   You have one worker with one executor with 32 execution slots.
  
   On Mon, May 11, 2015 at 9:52 PM, dgoldenberg 
 dgoldenberg...@gmail.com
   wrote:
Hi,
   
Is there anything special one must do, running locally and
 submitting
a
job
like so:
   
spark-submit \
--class com.myco.Driver \
--master local[*]  \
./lib/myco.jar
   
In my logs, I'm only seeing log messages with the thread identifier
of
Executor task launch worker-0.
   
There are 4 cores on the machine so I expected 4 threads to be at
play.
Running with local[32] did not yield 32 worker threads.
   
Any recommendations? Thanks.
   
   
   
--
View this message in context:
   
   
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.
   
   
 -
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
   
  
  
 
 



Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Understood. We'll use the multi-threaded code we already have..

How are these execution slots filled up? I assume each slot is dedicated to
one submitted task.  If that's the case, how is each task distributed then,
i.e. how is that task run in a multi-node fashion?  Say 1000 batches/RDD's
are extracted out of Kafka, how does that relate to the number of executors
vs. task slots?

Presumably we can fill up the slots with multiple instances of the same
task... How do we know how many to launch?

On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote:

 BTW I think my comment was wrong as marcelo demonstrated. In
 standalone mode you'd have one worker, and you do have one executor,
 but his explanation is right. But, you certainly have execution slots
 for each core.

 Are you talking about your own user code? you can make threads, but
 that's nothing do with Spark then. If you run code on your driver,
 it's not distributed. If you run Spark over an RDD with 1 partition,
 only one task works on it.

 On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  Sean,
 
  How does this model actually work? Let's say we want to run one job as N
  threads executing one particular task, e.g. streaming data out of Kafka
 into
  a search engine.  How do we configure our Spark job execution?
 
  Right now, I'm seeing this job running as a single thread. And it's
 quite a
  bit slower than just running a simple utility with a thread executor
 with a
  thread pool of N threads doing the same task.
 
  The performance I'm seeing of running the Kafka-Spark Streaming job is 7
  times slower than that of the utility.  What's pulling Spark back?
 
  Thanks.
 
 
  On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com wrote:
 
  You have one worker with one executor with 32 execution slots.
 
  On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com
  wrote:
   Hi,
  
   Is there anything special one must do, running locally and submitting
 a
   job
   like so:
  
   spark-submit \
   --class com.myco.Driver \
   --master local[*]  \
   ./lib/myco.jar
  
   In my logs, I'm only seeing log messages with the thread identifier of
   Executor task launch worker-0.
  
   There are 4 cores on the machine so I expected 4 threads to be at
 play.
   Running with local[32] did not yield 32 worker threads.
  
   Any recommendations? Thanks.
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
 



Re: Running Spark in local mode seems to ignore local[N]

2015-05-11 Thread Dmitry Goldenberg
Seems to be running OK with 4 threads, 16 threads... While running with 32
threads I started getting the below.

15/05/11 19:48:46 WARN executor.Executor: Issue communicating with driver
in heartbeater
org.apache.spark.SparkException: Error sending message [message =
Heartbeat(driver,[Lscala.Tuple2;@7668b255,BlockManagerId(driver,
localhost, 43318))]
at
org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209)
at
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:427)
Caused by: akka.pattern.AskTimeoutException:
Recipient[Actor[akka://sparkDriver/user/HeartbeatReceiver#-677986522]] had
already been terminated.
at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
at
org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:194)
... 1 more


On Mon, May 11, 2015 at 5:34 PM, Sean Owen so...@cloudera.com wrote:

 You might have a look at the Spark docs to start. 1 batch = 1 RDD, but
 1 RDD can have many partitions. And should, for scale. You do not
 submit multiple jobs to get parallelism.

 The number of partitions in a streaming RDD is determined by the block
 interval and the batch interval. If you have a batch interval of 10s
 and block interval of 1s you'll get 10 partitions of data in the RDD.

 On Mon, May 11, 2015 at 10:29 PM, Dmitry Goldenberg
 dgoldenberg...@gmail.com wrote:
  Understood. We'll use the multi-threaded code we already have..
 
  How are these execution slots filled up? I assume each slot is dedicated
 to
  one submitted task.  If that's the case, how is each task distributed
 then,
  i.e. how is that task run in a multi-node fashion?  Say 1000
 batches/RDD's
  are extracted out of Kafka, how does that relate to the number of
 executors
  vs. task slots?
 
  Presumably we can fill up the slots with multiple instances of the same
  task... How do we know how many to launch?
 
  On Mon, May 11, 2015 at 5:20 PM, Sean Owen so...@cloudera.com wrote:
 
  BTW I think my comment was wrong as marcelo demonstrated. In
  standalone mode you'd have one worker, and you do have one executor,
  but his explanation is right. But, you certainly have execution slots
  for each core.
 
  Are you talking about your own user code? you can make threads, but
  that's nothing do with Spark then. If you run code on your driver,
  it's not distributed. If you run Spark over an RDD with 1 partition,
  only one task works on it.
 
  On Mon, May 11, 2015 at 10:16 PM, Dmitry Goldenberg
  dgoldenberg...@gmail.com wrote:
   Sean,
  
   How does this model actually work? Let's say we want to run one job
 as N
   threads executing one particular task, e.g. streaming data out of
 Kafka
   into
   a search engine.  How do we configure our Spark job execution?
  
   Right now, I'm seeing this job running as a single thread. And it's
   quite a
   bit slower than just running a simple utility with a thread executor
   with a
   thread pool of N threads doing the same task.
  
   The performance I'm seeing of running the Kafka-Spark Streaming job
 is 7
   times slower than that of the utility.  What's pulling Spark back?
  
   Thanks.
  
  
   On Mon, May 11, 2015 at 4:55 PM, Sean Owen so...@cloudera.com
 wrote:
  
   You have one worker with one executor with 32 execution slots.
  
   On Mon, May 11, 2015 at 9:52 PM, dgoldenberg 
 dgoldenberg...@gmail.com
   wrote:
Hi,
   
Is there anything special one must do, running locally and
 submitting
a
job
like so:
   
spark-submit \
--class com.myco.Driver \
--master local[*]  \
./lib/myco.jar
   
In my logs, I'm only seeing log messages with the thread identifier
of
Executor task launch worker-0.
   
There are 4 cores on the machine so I expected 4 threads to be at
play.
Running with local[32] did not yield 32 worker threads.
   
Any recommendations? Thanks.
   
   
   
--
View this message in context:
   
   
 http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-local-mode-seems-to-ignore-local-N-tp22851.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.
   
   
 -
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
   
  
  
 
 



Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Yes, and Kafka topics are basically queues. So perhaps what's needed is just 
KafkaRDD with starting offset being 0 and finish offset being a very large 
number...

Sent from my iPhone

 On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote:
 
 I guess what you mean is not streaming.  If you create a stream context at 
 time t, you will receive data coming through starting time t++, not before 
 time t.
 
 Looks like you want a queue. Let Kafka write to a queue, consume msgs from 
 the queue and stop when queue is empty.
 
 On 29 Apr 2015 14:35, dgoldenberg dgoldenberg...@gmail.com wrote:
 Hi,
 
 I'm wondering about the use-case where you're not doing continuous,
 incremental streaming of data out of Kafka but rather want to publish data
 once with your Producer(s) and consume it once, in your Consumer, then
 terminate the consumer Spark job.
 
 JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
 Durations.milliseconds(...));
 
 The batchDuration parameter is The time interval at which streaming data
 will be divided into batches. Can this be worked somehow to cause Spark
 Streaming to just get all the available data, then let all the RDD's within
 the Kafka discretized stream get processed, and then just be done and
 terminate, rather than wait another period and try and process any more data
 from Kafka?
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-stream-all-data-out-of-a-Kafka-topic-once-then-terminate-job-tp22698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Part of the issue is that when you read messages in a topic, the messages are
peeked, not polled, so there is no notion of "when the queue is empty", as I
understand it.

So it would seem I'd want to do KafkaUtils.createRDD, which takes an array
of OffsetRange's. Each OffsetRange is characterized by topic, partition,
fromOffset, and untilOffset. In my case, I want to read all data, i.e. from
all partitions and I don't know how many partitions there may be, nor do I
know the 'untilOffset' values.

In essence, I just want something like createRDD(new OffsetRangeAllData());

In addition, I'd ideally want the option of not peeking but polling the
messages off the topics involved.  But I'm not sure whether Kafka API's
support it and then whether Spark does/will support that as well...
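
For what it's worth, a sketch of the createRDD call as it stands; the broker,
topic, partition count and offsets are all made up, since discovering them
programmatically is exactly the gap described above:

import java.util.HashMap;
import java.util.Map;
import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

// assuming 'jsc' is an existing JavaSparkContext
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "broker1:9092");   // made-up broker

OffsetRange[] ranges = new OffsetRange[] {
    // topic, partition, fromOffset, untilOffset (all values made up)
    OffsetRange.create("my-topic", 0, 0L, 1000000L),
    OffsetRange.create("my-topic", 1, 0L, 1000000L)
};

JavaPairRDD<String, String> rdd = KafkaUtils.createRDD(
    jsc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, ranges);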



On Wed, Apr 29, 2015 at 1:52 AM, ayan guha guha.a...@gmail.com wrote:

 I guess what you mean is not streaming.  If you create a stream context at
 time t, you will receive data coming through starting time t++, not before
 time t.

 Looks like you want a queue. Let Kafka write to a queue, consume msgs from
 the queue and stop when queue is empty.
 On 29 Apr 2015 14:35, dgoldenberg dgoldenberg...@gmail.com wrote:

 Hi,

 I'm wondering about the use-case where you're not doing continuous,
 incremental streaming of data out of Kafka but rather want to publish data
 once with your Producer(s) and consume it once, in your Consumer, then
 terminate the consumer Spark job.

 JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
 Durations.milliseconds(...));

 The batchDuration parameter is The time interval at which streaming data
 will be divided into batches. Can this be worked somehow to cause Spark
 Streaming to just get all the available data, then let all the RDD's
 within
 the Kafka discretized stream get processed, and then just be done and
 terminate, rather than wait another period and try and process any more
 data
 from Kafka?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-stream-all-data-out-of-a-Kafka-topic-once-then-terminate-job-tp22698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread Dmitry Goldenberg
Thanks for the comments, Cody.

Granted, Kafka topics aren't queues.  I was merely wishing that Kafka's
topics had some queue behaviors supported because often that is exactly
what one wants. The ability to poll messages off a topic seems like what
lots of use-cases would want.

I'll explore both of these approaches you mentioned.

For now, I see that using the KafkaRDD approach means finding partitions
and offsets. My thinking was that it'd be nice if there was a convenience
in the API that would wrap this logic and expose it as a method.  For the
second approach, I'll need to see where the listener is grafted on and
whether it would have enough ability to kill the whole job.  There's the
stop method on the context so perhaps if the listener could grab hold of
the context it'd invoke stop() on it.


On Wed, Apr 29, 2015 at 10:26 AM, Cody Koeninger c...@koeninger.org wrote:

 The idea of peek vs poll doesn't apply to kafka, because kafka is not a
 queue.

 There are two ways of doing what you want, either using KafkaRDD or a
 direct stream

 The Kafka rdd approach would require you to find the beginning and ending
 offsets for each partition.  For an example of this, see
 getEarliestLeaderOffsets and getLatestLeaderOffsets in
 https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala

 For usage examples see the tests.  That code isn't public so you'd need to
 either duplicate it, or build a version of spark with all of the
 'private[blah]' restrictions removed.

 The direct stream approach would require setting the kafka parameter
 auto.offset.reset to smallest, in order to start at the beginning.  If you
 haven't set any rate limiting parameters, then the first batch will contain
 all the messages.  You can then kill the job after the first batch.  It's
 possible you may be able to kill the job from a
 StreamingListener.onBatchCompleted, but I've never tried and don't know
 what the consequences may be.

 On Wed, Apr 29, 2015 at 8:52 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Part of the issues is, when you read messages in a topic, the messages
 are peeked, not polled, so there'll be no when the queue is empty, as I
 understand it.

 So it would seem I'd want to do KafkaUtils.createRDD, which takes an
 array of OffsetRange's. Each OffsetRange is characterized by topic,
 partition, fromOffset, and untilOffset. In my case, I want to read all
 data, i.e. from all partitions and I don't know how many partitions there
 may be, nor do I know the 'untilOffset' values.

 In essence, I just want something like createRDD(new
 OffsetRangeAllData());

 In addition, I'd ideally want the option of not peeking but polling the
 messages off the topics involved.  But I'm not sure whether Kafka API's
 support it and then whether Spark does/will support that as well...



 On Wed, Apr 29, 2015 at 1:52 AM, ayan guha guha.a...@gmail.com wrote:

 I guess what you mean is not streaming.  If you create a stream context
 at time t, you will receive data coming through starting time t++, not
 before time t.

 Looks like you want a queue. Let Kafka write to a queue, consume msgs
 from the queue and stop when queue is empty.
 On 29 Apr 2015 14:35, dgoldenberg dgoldenberg...@gmail.com wrote:

 Hi,

 I'm wondering about the use-case where you're not doing continuous,
 incremental streaming of data out of Kafka but rather want to publish
 data
 once with your Producer(s) and consume it once, in your Consumer, then
 terminate the consumer Spark job.

 JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
 Durations.milliseconds(...));

 The batchDuration parameter is The time interval at which streaming
 data
 will be divided into batches. Can this be worked somehow to cause Spark
 Streaming to just get all the available data, then let all the RDD's
 within
 the Kafka discretized stream get processed, and then just be done and
 terminate, rather than wait another period and try and process any more
 data
 from Kafka?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-stream-all-data-out-of-a-Kafka-topic-once-then-terminate-job-tp22698.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
It seems those archives are not necessarily easy to find things in. Is
there a search engine on top of them, so as to find e.g. your own posts
easily?

On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Sure, you can use Nabble or search-hadoop or whatever you prefer.

 My point is just that the source of truth are the Apache archives, and
 these other sites may or may not be in sync with that truth.

 On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent
 out over the list. Nothing guarantees it will be in sync with the real
 mailing list.

 To get the truth on what was sent over this, Apache-managed list, you
 unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com
  wrote:

 Sorry if this is a total noob question but is there a reason why I'm
 only
 seeing folks' responses to my posts in emails but not in the browser
 view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a
 matter of
 setting your preferences such that your responses only go to email
 and never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
Interesting points. Yes, I just tried
http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view
and I see responses there now.  I believe Ted was right in that there's a
delay before they show up there (probably due to some indexing latency, etc.).

On Thu, Mar 19, 2015 at 10:56 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Yes, that is mostly why these third-party sites have sprung up around the
 official archives--to provide better search. Did you try the link Ted
 posted?

 On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 It seems that those archives are not necessarily easy to find stuff in.
 Is there a search engine on top of them? so as to find e.g. your own posts
 easily?

 On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Sure, you can use Nabble or search-hadoop or whatever you prefer.

 My point is just that the source of truth are the Apache archives, and
 these other sites may or may not be in sync with that truth.

 On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search
 capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent
 out over the list. Nothing guarantees it will be in sync with the real
 mailing list.

 To get the truth on what was sent over this, Apache-managed list,
 you unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg 
 dgoldenberg...@gmail.com wrote:

 Sorry if this is a total noob question but is there a reason why
 I'm only
 seeing folks' responses to my posts in emails but not in the
 browser view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a
 matter of
 setting your preferences such that your responses only go to email
 and never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
Yup, I did see that. Good point though, Cody. The mismatch was happening
for me when I was trying to get the 'new JdbcRDD' approach going. Once I
switched to the 'create' method things are working just fine. I was just able
to refactor the 'get connection' logic into a 'DbConnection implements
JdbcRDD.ConnectionFactory' and my 'map row' class is still 'MapRow
implements org.apache.spark.api.java.function.Function<ResultSet, Row>'.

This works fine and makes the driver program tighter. Of course, my next
question is how to work with the lower and upper bound parameters. As in,
what if I don't know what the min and max ID values are and just want to
extract all data from the table, then what should the params be, if that's
even supported? And furthermore, what if the primary key on the table is not
numeric, or if there's no primary key altogether?

The method works fine with lowerBound=0 and upperBound=100, for
example. But there doesn't seem to be a way to say 'no upper bound' (-1 didn't
work).
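
To make that refactoring concrete, here is a minimal sketch of what the two
named classes might look like when used with JdbcRDD.create. The DbConnection
and MapRow names mirror the ones mentioned above, but the SQL, the connection
URL, and the mapping into a hypothetical Employee POJO (rather than Row) are
purely illustrative:

import java.io.Serializable;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

// Simple POJO to hold one row's worth of data.
class Employee implements Serializable {
    final long id;
    final String name;
    Employee(long id, String name) { this.id = id; this.name = name; }
}

// Named connection factory instead of an anonymous inner class.
class DbConnection implements JdbcRDD.ConnectionFactory {
    public Connection getConnection() throws Exception {
        return DriverManager.getConnection("jdbc:derby:target/ExampleDb"); // placeholder URL
    }
}

// Named row mapper instead of an anonymous inner class.
class MapRow implements Function<ResultSet, Employee> {
    public Employee call(ResultSet rs) throws Exception {
        return new Employee(rs.getLong(1), rs.getString(2));
    }
}

public class JdbcRddExample {
    public static JavaRDD<Employee> load(JavaSparkContext sc, long maxId) {
        return JdbcRDD.create(
            sc,
            new DbConnection(),
            "SELECT ID, NAME FROM EMPLOYEES WHERE ? <= ID AND ID <= ?", // placeholder query
            0L, maxId, 10, // lowerBound, upperBound, numPartitions
            new MapRow());
    }
}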

On Wed, Feb 18, 2015 at 11:59 PM, Cody Koeninger c...@koeninger.org wrote:

 Look at the definition of JdbcRDD.create:

   def create[T](

   sc: JavaSparkContext,

   connectionFactory: ConnectionFactory,

   sql: String,

   lowerBound: Long,

   upperBound: Long,

   numPartitions: Int,

   mapRow: JFunction[ResultSet, T]): JavaRDD[T] = {


 JFunction here is the interface org.apache.spark.api.java.function.Function,
 not scala Function0

 LIkewise, ConnectionFactory is an interface defined inside JdbcRDD, not
 scala Function0

 On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 That's exactly what I was doing. However, I ran into runtime issues with
 doing that. For instance, I had a

   public class DbConnection extends AbstractFunction0Connection
 implements Serializable

 I got a runtime error from Spark complaining that DbConnection wasn't an
 instance of scala.Function0.

 I also had a

   public class MapRow extends
 scala.runtime.AbstractFunction1java.sql.ResultSet, Row implements
 Serializable

 with which I seemed to have more luck.

 On Wed, Feb 18, 2015 at 5:32 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Cant you implement the

 org.apache.spark.api.java.function.Function

 interface and pass an instance of that to JdbcRDD.create ?

 On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Cody, you were right, I had a copy and paste snag where I ended up with
 a vanilla SparkContext rather than a Java one.  I also had to *not* use my
 function subclasses, rather just use anonymous inner classes for the
 Function stuff and that got things working. I'm fully following
 the JdbcRDD.create approach from JavaJdbcRDDSuite.java basically verbatim.

 Is there a clean way to refactor out the custom Function classes such
 as the one for getting a db connection or mapping ResultSet data to your
 own POJO's rather than doing it all inline?


 On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Is sc there a SparkContext or a JavaSparkContext?  The compilation
 error seems to indicate the former, but JdbcRDD.create expects the latter

 On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 I have tried that as well, I get a compile error --

 [ERROR] ...SparkProto.java:[105,39] error: no suitable method found
 for create(SparkContext,anonymous
 ConnectionFactory,String,int,int,int,anonymous
 FunctionResultSet,Integer)

 The code is a copy and paste:

 JavaRDDInteger jdbcRDD = JdbcRDD.create(
   sc,
   new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
   return
 DriverManager.getConnection(jdbc:derby:target/JavaJdbcRDDSuiteDb);
 }
   },
   SELECT DATA FROM FOO WHERE ? = ID AND ID = ?,
   1, 100, 1,
   new FunctionResultSet, Integer() {
 public Integer call(ResultSet r) throws Exception {
   return r.getInt(1);
 }
   }
 );

 The other thing I've tried was to define a static class locally for
 GetConnection and use the JdbcCreate constructor. This got around the
 compile issues but blew up at runtime with NoClassDefFoundError:
 scala/runtime/AbstractFunction0 !

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 sc,
 (AbstractFunction0Connection) new DbConn(), // had to cast or a
 compile error
 SQL_QUERY,
 0L,
 1000L,
 10,
 new MapRow(),
 ROW_CLASS_TAG);
 // DbConn is defined as public static class DbConn extends
 AbstractFunction0Connection implements Serializable

 On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That test I linked


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90

 is calling a static method JdbcRDD.create, not new JdbcRDD.  Is that
 what you tried doing?

 On Wed, Feb 18, 2015 at 12:00 PM, Dmitry

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-19 Thread Dmitry Goldenberg
That's a good point, thanks. Is there a way to instrument continuous,
realtime streaming of data out of a database? If the data keeps changing,
one way to implement extraction would be to keep track of something like
the last-modified timestamp and instrument the query to be 'where
lastmodified > ?'.

That would imply running the Spark program repetitively on a scheduled
basis. I wonder if it's possible to just continuously stream any updates
out instead, using Spark...
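
Following Cody's suggestion below of looking up the current maximum ID first
rather than guessing at an upper bound, a bare-bones sketch might look like
the following; the table, columns, and connection URL are placeholders, and
a last-modified variant could use the same shape with epoch-millisecond
bounds instead of IDs:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

public class FullTableExtract {

    // Plain JDBC query, run once on the driver, to find the real upper bound.
    static long currentMaxId(String jdbcUrl) throws Exception {
        Connection conn = DriverManager.getConnection(jdbcUrl);
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT MAX(ID) FROM EMPLOYEES"); // placeholder table
            rs.next();
            return rs.getLong(1);
        } finally {
            conn.close();
        }
    }

    public static JavaRDD<String> extractAll(JavaSparkContext sc, final String jdbcUrl)
            throws Exception {
        long maxId = currentMaxId(jdbcUrl);

        return JdbcRDD.create(
            sc,
            new JdbcRDD.ConnectionFactory() {
                public Connection getConnection() throws Exception {
                    return DriverManager.getConnection(jdbcUrl);
                }
            },
            // Both '?' placeholders are bound to the numeric lower/upper bounds.
            "SELECT NAME FROM EMPLOYEES WHERE ? <= ID AND ID <= ?",
            0L, maxId, 10,
            new Function<ResultSet, String>() {
                public String call(ResultSet r) throws Exception {
                    return r.getString(1);
                }
            });
    }
}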

On Thu, Feb 19, 2015 at 10:23 AM, Cody Koeninger c...@koeninger.org wrote:

 At the beginning of the code, do a query to find the current maximum ID

 Don't just put in an arbitrarily large value, or all of your rows will end
 up in 1 spark partition at the beginning of the range.

 The question of keys is up to you... all that you need to be able to do is
 write a sql statement that takes 2 numbers to specify the bounds.  Of
 course, a numeric primary key is going to be the most efficient way to do
 that.

 On Thu, Feb 19, 2015 at 8:57 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Yup, I did see that. Good point though, Cody. The mismatch was happening
 for me when I was trying to get the 'new JdbcRDD' approach going. Once I
 switched to the 'create' method things are working just fine. Was just able
 to refactor the 'get connection' logic into a 'DbConnection implements
 JdbcRDD.ConnectionFactory' and my 'map row' class is still 'MapRow
 implements org.apache.spark.api.java.function.FunctionResultSet, Row'.

 This works fine and makes the driver program tighter. Of course, my next
 question is, how to work with the lower and upper bound parameters. As in,
 what if I don't know what the min and max ID values are and just want to
 extract all data from the table, what should the params be, if that's even
 supported. And furthermore, what if the primary key on the table is not
 numeric? or if there's no primary key altogether?

 The method works fine with lowerBound=0 and upperBound=100, for
 example. But doesn't seem to have a way to say, 'no upper bound' (-1 didn't
 work).

 On Wed, Feb 18, 2015 at 11:59 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Look at the definition of JdbcRDD.create:

   def create[T](

   sc: JavaSparkContext,

   connectionFactory: ConnectionFactory,

   sql: String,

   lowerBound: Long,

   upperBound: Long,

   numPartitions: Int,

   mapRow: JFunction[ResultSet, T]): JavaRDD[T] = {


 JFunction here is the interface org.apache.spark.api.java.function.Function,
 not scala Function0

 LIkewise, ConnectionFactory is an interface defined inside JdbcRDD, not
 scala Function0

 On Wed, Feb 18, 2015 at 4:50 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 That's exactly what I was doing. However, I ran into runtime issues
 with doing that. For instance, I had a

   public class DbConnection extends AbstractFunction0Connection
 implements Serializable

 I got a runtime error from Spark complaining that DbConnection wasn't
 an instance of scala.Function0.

 I also had a

   public class MapRow extends
 scala.runtime.AbstractFunction1java.sql.ResultSet, Row implements
 Serializable

 with which I seemed to have more luck.

 On Wed, Feb 18, 2015 at 5:32 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Cant you implement the

 org.apache.spark.api.java.function.Function

 interface and pass an instance of that to JdbcRDD.create ?

 On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Cody, you were right, I had a copy and paste snag where I ended up
 with a vanilla SparkContext rather than a Java one.  I also had to *not*
 use my function subclasses, rather just use anonymous inner classes for 
 the
 Function stuff and that got things working. I'm fully following
 the JdbcRDD.create approach from JavaJdbcRDDSuite.java basically 
 verbatim.

 Is there a clean way to refactor out the custom Function classes such
 as the one for getting a db connection or mapping ResultSet data to your
 own POJO's rather than doing it all inline?


 On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Is sc there a SparkContext or a JavaSparkContext?  The compilation
 error seems to indicate the former, but JdbcRDD.create expects the 
 latter

 On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 I have tried that as well, I get a compile error --

 [ERROR] ...SparkProto.java:[105,39] error: no suitable method found
 for create(SparkContext,anonymous
 ConnectionFactory,String,int,int,int,anonymous
 FunctionResultSet,Integer)

 The code is a copy and paste:

 JavaRDDInteger jdbcRDD = JdbcRDD.create(
   sc,
   new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
   return
 DriverManager.getConnection(jdbc:derby:target/JavaJdbcRDDSuiteDb);
 }
   },
   SELECT DATA FROM FOO WHERE ? = ID AND ID

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Thanks, Cody. Yes, I originally started off by looking at that but I get a
compile error if I try to use that approach: "constructor JdbcRDD in class
JdbcRDD<T> cannot be applied to given types".  Not to mention that
JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
argument).

Wonder if it's a JDK version issue; I'm using 1.7.

So I've got this, which doesn't compile:

JdbcRDD<Row> jdbcRDD = new JdbcRDD<Row>(
    new SparkContext(conf),
    new JdbcRDD.ConnectionFactory() {
        public Connection getConnection() throws SQLException {
            Connection conn = null;
            try {
                Class.forName(JDBC_DRIVER);
                conn = DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASSWORD);
            } catch (ClassNotFoundException ex) {
                throw new RuntimeException("Error while loading JDBC driver.", ex);
            }
            return conn;
        }
    },
    "SELECT * FROM EMPLOYEES",
    0L,
    1000L,
    10,
    new Function<ResultSet, Row>() {
        public Row call(ResultSet r) throws Exception {
            return null; // have some actual logic here...
        }
    },
    scala.reflect.ClassManifestFactory$.MODULE$.fromClass(Row.class));

The other approach was mimicking the DbConnection class from this post:
http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/.
It got around any of the compilation issues but then I got the runtime
error where Spark wouldn't recognize the db connection class as a
scala.Function0.



On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger c...@koeninger.org wrote:

 Take a look at


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java



 On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 I'm reading data from a database using JdbcRDD, in Java, and I have an
 implementation of Function0<Connection> whose instance I supply as the
 'getConnection' parameter into the JdbcRDD constructor. Compiles fine.

 The definition of the class/function is as follows:

   public class GetDbConnection extends AbstractFunction0<Connection>
 implements Serializable

 where scala.runtime.AbstractFunction0 extends scala.Function0.

 At runtime, I get an exception as below. Does anyone have an idea as to
 how
 to resolve this/work around it? Thanks.

 I'm running Spark 1.2.1 built for Hadoop 2.4.


 Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure:
 Lost task 3.0 in stage 0.0 (TID 3, localhost):
 java.lang.ClassCastException:
 cannot assign instance of com.kona.motivis.spark.proto.GetDbConnection to
 field
 org.apache.spark.rdd.JdbcRDD.org$apache$spark$rdd$JdbcRDD$$getConnection
 of
 type scala.Function0 in instance of org.apache.spark.rdd.JdbcRDD
 at

 java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
 at
 java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at

 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at

 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 Driver stacktrace:
 at
 org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at

 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at

 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at

 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at

 

Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
I have tried that as well, I get a compile error --

[ERROR] ...SparkProto.java:[105,39] error: no suitable method found for
create(SparkContext,<anonymous ConnectionFactory>,String,int,int,int,
<anonymous Function<ResultSet,Integer>>)

The code is a copy and paste:

JavaRDD<Integer> jdbcRDD = JdbcRDD.create(
    sc,
    new JdbcRDD.ConnectionFactory() {
        public Connection getConnection() throws SQLException {
            return DriverManager.getConnection("jdbc:derby:target/JavaJdbcRDDSuiteDb");
        }
    },
    "SELECT DATA FROM FOO WHERE ? <= ID AND ID <= ?",
    1, 100, 1,
    new Function<ResultSet, Integer>() {
        public Integer call(ResultSet r) throws Exception {
            return r.getInt(1);
        }
    }
);

The other thing I've tried was to define a static class locally for
GetConnection and use the JdbcRDD constructor. This got around the
compile issues but blew up at runtime with NoClassDefFoundError:
scala/runtime/AbstractFunction0 !

JdbcRDD<Row> jdbcRDD = new JdbcRDD<Row>(
    sc,
    (AbstractFunction0<Connection>) new DbConn(), // had to cast or a compile error
    SQL_QUERY,
    0L,
    1000L,
    10,
    new MapRow(),
    ROW_CLASS_TAG);
// DbConn is defined as: public static class DbConn extends
// AbstractFunction0<Connection> implements Serializable

On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org wrote:

 That test I linked


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90

 is calling a static method JdbcRDD.create, not new JdbcRDD.  Is that what
 you tried doing?

 On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody. Yes, I originally started off by looking at that but I get
 a compile error if I try and use that approach: constructor JdbcRDD in
 class JdbcRDDT cannot be applied to given types.  Not to mention that
 JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
 argument).

 Wonder if it's a JDK version issue, I'm using 1.7.

 So I've got this, which doesn't compile

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 new SparkContext(conf),
 new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
 Connection conn = null;
 try {
 Class.forName(JDBC_DRIVER);
 conn = DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASSWORD);
 } catch (ClassNotFoundException ex) {
 throw new RuntimeException(Error while loading JDBC driver., ex);
 }
 return conn;
 }
 },
 SELECT * FROM EMPLOYEES,
 0L,
 1000L,
 10,
 new FunctionResultSet, Row() {
 public Row call(ResultSet r) throws Exception {
 return null; // have some actual logic here...
 }
 },
 scala.reflect.ClassManifestFactory$.MODULE$.fromClass(Row.class));

 The other approach was mimicing the DbConnection class from this post:
 http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/.
 It got around any of the compilation issues but then I got the runtime
 error where Spark wouldn't recognize the db connection class as a
 scala.Function0.



 On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Take a look at


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java



 On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 I'm reading data from a database using JdbcRDD, in Java, and I have an
 implementation of Function0Connection whose instance I supply as the
 'getConnection' parameter into the JdbcRDD constructor. Compiles fine.

 The definition of the class/function is as follows:

   public class GetDbConnection extends AbstractFunction0Connection
 implements Serializable

 where scala.runtime.AbstractFunction0 extends scala.Function0.

 At runtime, I get an exception as below. Does anyone have an idea as to
 how
 to resolve this/work around it? Thanks.

 I'm running Spark 1.2.1 built for Hadoop 2.4.


 Exception in thread main org.apache.spark.SparkException: Job aborted
 due
 to stage failure: Task 3 in stage 0.0 failed 1 times, most recent
 failure:
 Lost task 3.0 in stage 0.0 (TID 3, localhost):
 java.lang.ClassCastException:
 cannot assign instance of com.kona.motivis.spark.proto.GetDbConnection
 to
 field
 org.apache.spark.rdd.JdbcRDD.org$apache$spark$rdd$JdbcRDD$$getConnection
 of
 type scala.Function0 in instance of org.apache.spark.rdd.JdbcRDD
 at

 java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
 at
 java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
 at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
 at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at

 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I'm not sure what "on the driver" means, but I've tried
setting spark.files.userClassPathFirst to true,
both in $SPARK-HOME/conf/spark-defaults.conf and in the SparkConf
programmatically; it appears to be ignored. The solution was to follow
Emre's recommendation and downgrade the selected Solrj distro to 4.0.0.
That did the trick, as it appears to use the same HttpClient as the one
used by Spark/Hadoop.

The Spark program I'm running is a jar I submit via a spark-submit
invocation.



On Wed, Feb 18, 2015 at 1:57 PM, Marcelo Vanzin van...@cloudera.com wrote:

 Hello,

 On Tue, Feb 17, 2015 at 8:53 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:
  I've tried setting spark.files.userClassPathFirst to true in SparkConf
 in my
  program, also setting it to true in
 $SPARK-HOME/conf/spark-defaults.conf as

 Is the code in question running on the driver or in some executor?
 spark.files.userClassPathFirst only applies to executors. To override
 classes in the driver's classpath, you need to modify
 spark.driver.extraClassPath (or --driver-class-path in spark-submit's
 command line).

 In 1.3 there's an option similar to spark.files.userClassPathFirst
 that works for the driver too.

 --
 Marcelo



Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
Cody, you were right, I had a copy-and-paste snag where I ended up with a
vanilla SparkContext rather than a Java one.  I also had to *not* use my
function subclasses and instead use anonymous inner classes for the
Function stuff, and that got things working. I'm following
the JdbcRDD.create approach from JavaJdbcRDDSuite.java basically verbatim.

Is there a clean way to refactor out the custom Function classes, such as
the one for getting a db connection or mapping ResultSet data to your own
POJOs, rather than doing it all inline?


On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger c...@koeninger.org wrote:

 Is sc there a SparkContext or a JavaSparkContext?  The compilation error
 seems to indicate the former, but JdbcRDD.create expects the latter

 On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 I have tried that as well, I get a compile error --

 [ERROR] ...SparkProto.java:[105,39] error: no suitable method found for
 create(SparkContext,anonymous
 ConnectionFactory,String,int,int,int,anonymous
 FunctionResultSet,Integer)

 The code is a copy and paste:

 JavaRDDInteger jdbcRDD = JdbcRDD.create(
   sc,
   new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
   return
 DriverManager.getConnection(jdbc:derby:target/JavaJdbcRDDSuiteDb);
 }
   },
   SELECT DATA FROM FOO WHERE ? = ID AND ID = ?,
   1, 100, 1,
   new FunctionResultSet, Integer() {
 public Integer call(ResultSet r) throws Exception {
   return r.getInt(1);
 }
   }
 );

 The other thing I've tried was to define a static class locally for
 GetConnection and use the JdbcCreate constructor. This got around the
 compile issues but blew up at runtime with NoClassDefFoundError:
 scala/runtime/AbstractFunction0 !

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 sc,
 (AbstractFunction0Connection) new DbConn(), // had to cast or a compile
 error
 SQL_QUERY,
 0L,
 1000L,
 10,
 new MapRow(),
 ROW_CLASS_TAG);
 // DbConn is defined as public static class DbConn extends
 AbstractFunction0Connection implements Serializable

 On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That test I linked


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90

 is calling a static method JdbcRDD.create, not new JdbcRDD.  Is that
 what you tried doing?

 On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody. Yes, I originally started off by looking at that but I
 get a compile error if I try and use that approach: constructor JdbcRDD in
 class JdbcRDDT cannot be applied to given types.  Not to mention that
 JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
 argument).

 Wonder if it's a JDK version issue, I'm using 1.7.

 So I've got this, which doesn't compile

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 new SparkContext(conf),
 new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
 Connection conn = null;
 try {
 Class.forName(JDBC_DRIVER);
 conn = DriverManager.getConnection(JDBC_URL, JDBC_USER, JDBC_PASSWORD);
 } catch (ClassNotFoundException ex) {
 throw new RuntimeException(Error while loading JDBC driver., ex);
 }
 return conn;
 }
 },
 SELECT * FROM EMPLOYEES,
 0L,
 1000L,
 10,
 new FunctionResultSet, Row() {
 public Row call(ResultSet r) throws Exception {
 return null; // have some actual logic here...
 }
 },
 scala.reflect.ClassManifestFactory$.MODULE$.fromClass(Row.class));

 The other approach was mimicing the DbConnection class from this post:
 http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/.
 It got around any of the compilation issues but then I got the runtime
 error where Spark wouldn't recognize the db connection class as a
 scala.Function0.



 On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Take a look at


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java



 On Wed, Feb 18, 2015 at 11:14 AM, dgoldenberg 
 dgoldenberg...@gmail.com wrote:

 I'm reading data from a database using JdbcRDD, in Java, and I have an
 implementation of Function0Connection whose instance I supply as the
 'getConnection' parameter into the JdbcRDD constructor. Compiles fine.

 The definition of the class/function is as follows:

   public class GetDbConnection extends AbstractFunction0Connection
 implements Serializable

 where scala.runtime.AbstractFunction0 extends scala.Function0.

 At runtime, I get an exception as below. Does anyone have an idea as
 to how
 to resolve this/work around it? Thanks.

 I'm running Spark 1.2.1 built for Hadoop 2.4.


 Exception in thread main org.apache.spark.SparkException: Job
 aborted due
 to stage failure: Task 3 in stage 0.0

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Are you proposing I downgrade Solrj's httpclient dependency to be on par with 
that of Spark/Hadoop? Or upgrade Spark/Hadoop's httpclient to the latest?

Solrj has to stay with its selected version. I could try and rebuild Spark with 
the latest httpclient but I've no idea what effects that may cause on Spark.

Sent from my iPhone

 On Feb 18, 2015, at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com 
 wrote:
 
 Hi
 
 Did you try to make maven pick the latest version
 
 http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management
 
 That way solrj won't cause any issue, you can try this and check if the part 
 of your code where you access HDFS works fine?
 
 
 
 On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg dgoldenberg...@gmail.com 
 wrote:
 I'm getting the below error when running spark-submit on my class. This class
 has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
 4.10.3 from within the class.
 
 This is in conflict with the older version, HttpClient 3.1 that's a
 dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).
 
 I've tried setting spark.files.userClassPathFirst to true in SparkConf in my
 program, also setting it to true in  $SPARK-HOME/conf/spark-defaults.conf as
 
 spark.files.userClassPathFirst true
 
 No go, I'm still getting the error, as below. Is there anything else I can
 try? Are there any plans in Spark to support multiple class loaders?
 
 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
 at
 org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
 at
 org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
 at
 org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
 at
 org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
 at
 org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
 at
 org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:168)
 at
 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:141)
 ...
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Class-loading-issue-spark-files-userClassPathFirst-doesn-t-seem-to-be-working-tp21693.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 -- 
 
 
 Arush Kharbanda || Technical Teamlead
 
 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thank you, Emre. It seems SolrJ 4.0.0 still depends on HttpClient 4.1.3; would
that not collide with Spark/Hadoop's default dependency on HttpClient, which is
set to 4.2.6? If it doesn't, that might just solve the problem.

Would SolrJ 4.0.0 work with the latest Solr, 4.10.3?

On Wed, Feb 18, 2015 at 10:50 AM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello Dmitry,

 I had almost the same problem and solved it by using version 4.0.0 of
 SolrJ:

  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>4.0.0</version>
  </dependency>

 In my case, I was lucky that version 4.0.0 of SolrJ had all the
 functionality I needed.

 --
 Emre Sevinç
 http://www.bigindustries.be/



 On Wed, Feb 18, 2015 at 4:39 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 I think I'm going to have to rebuild Spark with
 commons.httpclient.version set to 4.3.1 which looks to be the version
 chosen by Solrj, rather than the 4.2.6 that Spark's pom mentions. Might
 work.

 On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 Did you try to make maven pick the latest version


 http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management

 That way solrj won't cause any issue, you can try this and check if the
 part of your code where you access HDFS works fine?



 On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 I'm getting the below error when running spark-submit on my class. This
 class
 has a transitive dependency on HttpClient v.4.3.1 since I'm calling
 SolrJ
 4.10.3 from within the class.

 This is in conflict with the older version, HttpClient 3.1 that's a
 dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).

 I've tried setting spark.files.userClassPathFirst to true in SparkConf
 in my
 program, also setting it to true in
 $SPARK-HOME/conf/spark-defaults.conf as

 spark.files.userClassPathFirst true

 No go, I'm still getting the error, as below. Is there anything else I
 can
 try? Are there any plans in Spark to support multiple class loaders?

 Exception in thread main java.lang.NoSuchMethodError:

 org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
 at

 org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
 at

 org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
 at

 org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:168)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:141)
 ...





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Class-loading-issue-spark-files-userClassPathFirst-doesn-t-seem-to-be-working-tp21693.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com





 --
 Emre Sevinc



Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
Thanks, Emre! Will definitely try this.

On Wed, Feb 18, 2015 at 11:00 AM, Emre Sevinc emre.sev...@gmail.com wrote:


 On Wed, Feb 18, 2015 at 4:54 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thank you, Emre. It seems solrj still depends on HttpClient 4.1.3; would
 that not collide with Spark/Hadoop's default dependency on HttpClient set
 to 4.2.6? If that's the case that might just solve the problem.

 Would Solrj 4.0.0 work with the latest Solr, 4.10.3?


 In my case, it worked; I mean I was trying to send some documents to the
 latest version of Solr server (v4.10.3), and using v4.0.0 of SolrJ worked
 without any problems so far. I couldn't find any other way to deal with
 this old httpclient dependency problem in Spark.

 --
 Emre Sevinç
 http://www.bigindustries.be/





Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-18 Thread Dmitry Goldenberg
I think I'm going to have to rebuild Spark with commons.httpclient.version
set to 4.3.1, which looks to be the version chosen by SolrJ, rather than the
4.2.6 that Spark's pom mentions. That might work.

On Wed, Feb 18, 2015 at 1:37 AM, Arush Kharbanda ar...@sigmoidanalytics.com
 wrote:

 Hi

 Did you try to make maven pick the latest version


 http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management

 That way solrj won't cause any issue, you can try this and check if the
 part of your code where you access HDFS works fine?



 On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 I'm getting the below error when running spark-submit on my class. This
 class
 has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
 4.10.3 from within the class.

 This is in conflict with the older version, HttpClient 3.1 that's a
 dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).

 I've tried setting spark.files.userClassPathFirst to true in SparkConf in
 my
 program, also setting it to true in  $SPARK-HOME/conf/spark-defaults.conf
 as

 spark.files.userClassPathFirst true

 No go, I'm still getting the error, as below. Is there anything else I can
 try? Are there any plans in Spark to support multiple class loaders?

 Exception in thread main java.lang.NoSuchMethodError:

 org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
 at

 org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
 at

 org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
 at

 org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:168)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:141)
 ...





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Class-loading-issue-spark-files-userClassPathFirst-doesn-t-seem-to-be-working-tp21693.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com



Re: JdbcRDD, ClassCastException with scala.Function0

2015-02-18 Thread Dmitry Goldenberg
That's exactly what I was doing. However, I ran into runtime issues with
doing that. For instance, I had a

  public class DbConnection extends AbstractFunction0<Connection>
implements Serializable

I got a runtime error from Spark complaining that DbConnection wasn't an
instance of scala.Function0.

I also had a

  public class MapRow extends
scala.runtime.AbstractFunction1<java.sql.ResultSet, Row> implements
Serializable

with which I seemed to have more luck.

On Wed, Feb 18, 2015 at 5:32 PM, Cody Koeninger c...@koeninger.org wrote:

 Cant you implement the

 org.apache.spark.api.java.function.Function

 interface and pass an instance of that to JdbcRDD.create ?

 On Wed, Feb 18, 2015 at 3:48 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Cody, you were right, I had a copy and paste snag where I ended up with a
 vanilla SparkContext rather than a Java one.  I also had to *not* use my
 function subclasses, rather just use anonymous inner classes for the
 Function stuff and that got things working. I'm fully following
 the JdbcRDD.create approach from JavaJdbcRDDSuite.java basically verbatim.

 Is there a clean way to refactor out the custom Function classes such as
 the one for getting a db connection or mapping ResultSet data to your own
 POJO's rather than doing it all inline?


 On Wed, Feb 18, 2015 at 1:52 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Is sc there a SparkContext or a JavaSparkContext?  The compilation error
 seems to indicate the former, but JdbcRDD.create expects the latter

 On Wed, Feb 18, 2015 at 12:30 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 I have tried that as well, I get a compile error --

 [ERROR] ...SparkProto.java:[105,39] error: no suitable method found for
 create(SparkContext,anonymous
 ConnectionFactory,String,int,int,int,anonymous
 FunctionResultSet,Integer)

 The code is a copy and paste:

 JavaRDDInteger jdbcRDD = JdbcRDD.create(
   sc,
   new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
   return
 DriverManager.getConnection(jdbc:derby:target/JavaJdbcRDDSuiteDb);
 }
   },
   SELECT DATA FROM FOO WHERE ? = ID AND ID = ?,
   1, 100, 1,
   new FunctionResultSet, Integer() {
 public Integer call(ResultSet r) throws Exception {
   return r.getInt(1);
 }
   }
 );

 The other thing I've tried was to define a static class locally for
 GetConnection and use the JdbcCreate constructor. This got around the
 compile issues but blew up at runtime with NoClassDefFoundError:
 scala/runtime/AbstractFunction0 !

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 sc,
 (AbstractFunction0Connection) new DbConn(), // had to cast or a
 compile error
 SQL_QUERY,
 0L,
 1000L,
 10,
 new MapRow(),
 ROW_CLASS_TAG);
 // DbConn is defined as public static class DbConn extends
 AbstractFunction0Connection implements Serializable

 On Wed, Feb 18, 2015 at 1:20 PM, Cody Koeninger c...@koeninger.org
 wrote:

 That test I linked


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java#L90

 is calling a static method JdbcRDD.create, not new JdbcRDD.  Is that
 what you tried doing?

 On Wed, Feb 18, 2015 at 12:00 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Cody. Yes, I originally started off by looking at that but I
 get a compile error if I try and use that approach: constructor JdbcRDD 
 in
 class JdbcRDDT cannot be applied to given types.  Not to mention that
 JavaJdbcRDDSuite somehow manages to not pass in the class tag (the last
 argument).

 Wonder if it's a JDK version issue, I'm using 1.7.

 So I've got this, which doesn't compile

 JdbcRDDRow jdbcRDD = new JdbcRDDRow(
 new SparkContext(conf),
 new JdbcRDD.ConnectionFactory() {
 public Connection getConnection() throws SQLException {
 Connection conn = null;
 try {
 Class.forName(JDBC_DRIVER);
 conn = DriverManager.getConnection(JDBC_URL, JDBC_USER,
 JDBC_PASSWORD);
 } catch (ClassNotFoundException ex) {
 throw new RuntimeException(Error while loading JDBC driver., ex);
 }
 return conn;
 }
 },
 SELECT * FROM EMPLOYEES,
 0L,
 1000L,
 10,
 new FunctionResultSet, Row() {
 public Row call(ResultSet r) throws Exception {
 return null; // have some actual logic here...
 }
 },
 scala.reflect.ClassManifestFactory$.MODULE$.fromClass(Row.class));

 The other approach was mimicing the DbConnection class from this
 post:
 http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/.
 It got around any of the compilation issues but then I got the runtime
 error where Spark wouldn't recognize the db connection class as a
 scala.Function0.



 On Wed, Feb 18, 2015 at 12:37 PM, Cody Koeninger c...@koeninger.org
 wrote:

 Take a look at


 https://github.com/apache/spark/blob/v1.2.1/core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java



 On Wed, Feb 18, 2015
