Re: Unsubscribe

2019-01-30 Thread Soumitra Johri
unsubscribe
On Wed, Jan 30, 2019 at 10:05 PM wang wei 
wrote:

> Unsubscribe
>


Re: Kafka failover with multiple data centers

2017-03-27 Thread Soumitra Johri
Hi, did you guys figure it out?

Thanks
Soumitra
On Sun, Mar 5, 2017 at 9:51 PM nguyen duc Tuan  wrote:

> Hi everyone,
> We are deploying a Kafka cluster for ingesting streaming data. But
> sometimes some of the nodes in the cluster have trouble (a node dies, the
> Kafka daemon is killed...). Recovering data in Kafka can be very slow:
> it takes several hours to recover from a disaster. I saw a slide deck here
> suggesting using multiple data centers (
> https://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka).
> But I wonder, how can we detect the problem and switch between data centers
> in Spark Streaming? Since Kafka 0.10.1 supports a timestamp index, how can
> we seek to the right offsets?
> Are there any open-source libraries out there that support handling the
> problem on the fly?
> Thanks.
>
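Since Kafka 0.10.1 the consumer API exposes offsetsForTimes, which maps a
timestamp to the earliest offset whose timestamp is equal or greater. A
minimal sketch of seeking by time after a failover (the broker address,
topic, partition, and group id are all illustrative):

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "backup-dc-broker:9092") // illustrative backup-DC broker
props.put("group.id", "failover-demo")                  // illustrative group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("events", 0) // illustrative topic/partition
consumer.assign(Collections.singletonList(tp))

// Translate "last timestamp processed in the failed DC" into an offset in
// the new DC: offsetsForTimes returns, per partition, the earliest offset
// whose timestamp is >= the requested one.
val lastProcessedTs = java.lang.Long.valueOf(System.currentTimeMillis() - 60000L)
val offsets = consumer.offsetsForTimes(Map(tp -> lastProcessedTs).asJava)
Option(offsets.get(tp)).foreach(ot => consumer.seek(tp, ot.offset()))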


Re: Is executor computing time affected by network latency?

2016-09-22 Thread Soumitra Johri
If your job involves a shuffle, then the compute time for the entire batch
will increase with network latency. What would be interesting is to see how
much time each task/job/stage takes.
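A minimal sketch of a listener that logs per-stage wall-clock time (register
it with sc.addSparkListener before running the job; the class name is
illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

class StageTimeListener extends SparkListener {
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    val info = event.stageInfo
    for {
      start <- info.submissionTime // set when the stage is submitted
      end   <- info.completionTime // set when the last task finishes
    } println(s"Stage ${info.stageId} (${info.name}): ${end - start} ms over ${info.numTasks} tasks")
  }
}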
On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi 
wrote:

> It seems to me they must communicate for joins, sorts, grouping, and so
> forth, where the original data partitioning needs to change.  You could
> repeat your experiment for different code snippets.  I'll bet it depends on
> what you do.
>
> On Thu, Sep 22, 2016 at 8:54 AM, gusiri  wrote:
>
>> Hi,
>>
>> When I increase the network latency among spark nodes,
>>
>> I see compute time (=executor computing time in Spark Web UI) also
>> increases.
>>
>> In the graph attached, left = latency 1ms vs right = latency 500ms.
>>
>> Is there any communication between the workers and the driver/master even
>> during executor computing? Or any ideas about this result?
>>
>>
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27779/Screen_Shot_2016-09-21_at_5.png>
>>
>> Thank you very much in advance.
>>
>> //gusiri
>>
>


How does MapWithStateRDD distribute the data

2016-08-03 Thread Soumitra Johri
Hi,

I am running a streaming job with 4 executors and 16 cores so that each
executor has two cores to work with. The input Kafka topic has 4 partitions.
With this configuration I was expecting the MapWithStateRDD to be evenly
distributed across all executors; however, I see that its data is
distributed across only two executors. Sometimes the data goes to only one
executor.

How can this be explained? I am pretty sure there is some math behind this
behavior.

I am using the standard standalone 1.6.2 cluster.
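For what it's worth, StateSpec lets you pin the state partitioning
explicitly. A minimal sketch against the 1.6 API (the running-count mapping
function and the keyedStream name are illustrative):

import org.apache.spark.streaming.{State, StateSpec}

// Illustrative running count per key.
val mappingFunc = (key: String, value: Option[Int], state: State[Long]) => {
  val sum = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(sum)
  (key, sum)
}

// Force the MapWithStateRDD into 8 partitions (2 cores x 4 executors)
// instead of letting it inherit the parent DStream's partitioning.
val spec = StateSpec.function(mappingFunc).numPartitions(8)
// val stateStream = keyedStream.mapWithState(spec)  // keyedStream is assumed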

Thanks
Soumitra


Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach

2016-05-10 Thread Soumitra Johri
I think a better partitioning scheme can help you too.
On Tue, May 10, 2016 at 10:31 AM Cody Koeninger  wrote:

> maxRate is not used by the direct stream.
>
> Significant skew in rate across different partitions for the same
> topic is going to cause you all kinds of problems, not just with spark
> streaming.
>
> You can turn on backpressure, but you're better off addressing the
> underlying issue if you can.
>
> On Tue, May 10, 2016 at 8:08 AM, Soumitra Siddharth Johri
>  wrote:
> > Also look at enabling backpressure. Both of these can be used to limit
> > the rate.
> >
> > Sent from my iPhone
> >
> > On May 10, 2016, at 8:02 AM, chandan prakash 
> > wrote:
> >
> > Hi,
> > I am using Spark Streaming with the direct Kafka approach.
> > I want to limit the number of event records coming into my batches.
> > I have a question regarding the following 2 parameters:
> > 1. spark.streaming.receiver.maxRate
> > 2. spark.streaming.kafka.maxRatePerPartition
> >
> >
> > The documentation
> > (
> > http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications
> > ) says:
> > "spark.streaming.receiver.maxRate for receivers and
> > spark.streaming.kafka.maxRatePerPartition for Direct Kafka approach"
> >
> > Does it mean that spark.streaming.receiver.maxRate is valid only for the
> > receiver-based approach (and not for the direct Kafka approach as well)?
> >
> > If yes, then how do we control the total number of records/sec in the
> > direct Kafka approach? spark.streaming.kafka.maxRatePerPartition only
> > controls the max rate per partition, not the total. There might be many
> > partitions, some with a very fast rate and some with a very slow rate.
> >
> > Regards,
> > Chandan
> >
> >
> >
> > --
> > Chandan Prakash
> >
>
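For reference, a sketch of capping the direct stream through configuration
alone (the numbers are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("DirectKafkaRateLimit") // illustrative name
  // Cap each Kafka partition at 1,000 records/sec; a batch then holds at
  // most 1,000 * number of partitions * batch-interval-seconds records.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Let Spark lower the effective rate further when processing lags.
  .set("spark.streaming.backpressure.enabled", "true")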


inter spark application communication

2016-04-18 Thread Soumitra Johri
Hi,

I have two applications : App1 and App2.
On a single cluster I have to spawn 5 instances of App1 and 1 instance of
App2.

What would be the best way to send data from the 5 App1 instances to the
single App2 instance?

Right now I am using Kafka to send data from one Spark application to the
other, but the setup doesn't seem right and I hope there is a better way to
do this.

Warm Regards
Soumitra


UpdateStateByKey : Partitioning and Shuffle

2016-01-05 Thread Soumitra Johri
Hi,

I am relatively new to Spark and am using the updateStateByKey() operation to
maintain state in my Spark Streaming application. The input data is coming
through a Kafka topic.

   1. I want to understand how DStreams are partitioned.
   2. How does the partitioning work with the mapWithState() or
   updateStateByKey() methods?
   3. In updateStateByKey(), are the old state and the new values for a
   given key processed on the same node?
   4. How frequent is the shuffle for the updateStateByKey() method?

The state I have to maintain contains ~10 keys, and I want to avoid a
shuffle every time I update the state. Any tips on how to do that?
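For example, updateStateByKey accepts an explicit partitioner, which fixes
where each key's state lives across batches. A sketch (the running-count
update function and the keyedStream name are illustrative):

import org.apache.spark.HashPartitioner

// Illustrative running count per key.
val updateFunc = (values: Seq[Int], state: Option[Long]) =>
  Some(state.getOrElse(0L) + values.sum)

// With ~10 keys, a small fixed partitioner keeps the state co-located:
// val stateStream = keyedStream.updateStateByKey(updateFunc, new HashPartitioner(4))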

Warm Regards
Soumitra


Spark Streaming : Output frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Soumitra Johri
Hi, in the example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

the streaming frequency is 1 second; however, I want to print the contents
of the word counts only once every minute and reset the word counts back to
0 every minute. How can I do that?

I have to print per-minute word counts with a streaming frequency of 1
second. I thought of using Scala schedulers, but then there can be
concurrency issues.

My algorithm is as follows:

   1. Read the words every 1 second
   2. Keep a cumulative word count for 60 seconds
   3. At the end of every 60 seconds (1 minute), print the word counts and
   reset the counters to zero.

Any help would be appreciated!
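For instance, a 60-second tumbling window (window length equal to the slide
interval) gives non-overlapping per-minute counts with no manual reset. A
sketch, where words stands for the assumed 1-second input DStream:

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// words: DStream[String] read once per second (assumed).
def perMinuteCounts(words: DStream[String]): DStream[(String, Int)] =
  words.map(w => (w, 1))
       .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))

// perMinuteCounts(words).print()  // emits once per minute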

Thanks

Warm Regards


RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi All,

I have a Graph with millions of edges. Edges are represented by
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]]
= MappedRDD[4]. I have two questions:

1) How can I fetch all the nodes with a given edge label (all edges with a
given property)?

2) Is it possible to create indexes on the RDDs or a specific column of the
RDD to make the lookup faster?

Please excuse me for the triviality of the question; I am new to the
language and it is taking me some time to get used to it.
Warm Regards
Soumitra


Re: RDD Indexes and how to fetch all edges with a given label

2014-10-14 Thread Soumitra Johri
Hi,

With respect to the first issue, one possible way is to filter the graph
via 'graph.subgraph(epred = e => e.attr == edgeLabel)', but I am still
curious whether we can index RDDs.
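Concretely, a sketch of both routes, assuming a Graph[VD, String] whose edge
attribute carries the label:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Plain RDD filter over the edges: returns matching edges via a full scan.
def edgesWithLabel[VD](graph: Graph[VD, String], label: String): RDD[Edge[String]] =
  graph.edges.filter(_.attr == label)

// subgraph keeps the graph structure (all vertices, only matching edges).
def subgraphWithLabel[VD](graph: Graph[VD, String], label: String): Graph[VD, String] =
  graph.subgraph(epred = t => t.attr == label)

As far as I know there is no user-facing index on edge attributes, so both
of these scan the edge partitions.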


Warm Regards
Soumitra

On Tue, Oct 14, 2014 at 2:46 PM, Soumitra Johri 
soumitra.siddha...@gmail.com wrote:

> Hi All,
>
> I have a Graph with millions of edges. Edges are represented by
> org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]]
> = MappedRDD[4]. I have two questions:
>
> 1) How can I fetch all the nodes with a given edge label (all edges with a
> given property)?
>
> 2) Is it possible to create indexes on the RDDs or a specific column of
> the RDD to make the lookup faster?
>
> Please excuse me for the triviality of the question; I am new to the
> language and it is taking me some time to get used to it.
> Warm Regards
> Soumitra