Re: Kafka failover with multiple data centers
Hi, did you guys figure it out? Thanks
Soumitra

On Sun, Mar 5, 2017 at 9:51 PM nguyen duc Tuan wrote:
> Hi everyone,
> We are deploying a Kafka cluster for ingesting streaming data, but
> sometimes nodes in the cluster run into trouble (a node dies, the Kafka
> daemon is killed, ...), and recovering data in Kafka can be very slow. It
> can take several hours to recover from a disaster. I saw a slide deck here
> suggesting the use of multiple data centers
> (https://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka).
> But I wonder: how can we detect the problem and switch between data
> centers in Spark Streaming? Since Kafka 0.10.1 supports a timestamp index,
> how can we seek to the right offsets?
> Are there any open-source libraries out there that handle this problem on
> the fly?
> Thanks.
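On the timestamp question: since Kafka 0.10.1 the consumer exposes `offsetsForTimes`, which maps a timestamp to the earliest offset at or after it in each partition. A minimal sketch of using it after failing over to a backup data center, where the broker address, topic name, and the `offsetsAtTimestamp` helper are placeholders of mine, not anything from the thread:

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// After failover, look up the offsets in the backup cluster that correspond
// to the timestamp of the last successfully processed batch, then start the
// direct stream from those offsets.
def offsetsAtTimestamp(brokers: String, topic: String, ts: Long): Map[TopicPartition, Long] = {
  val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try {
    val partitions = consumer.partitionsFor(topic).asScala
      .map(p => new TopicPartition(p.topic, p.partition))
    val query = partitions.map(tp => tp -> java.lang.Long.valueOf(ts)).toMap.asJava
    // offsetsForTimes returns null for partitions with no message at/after ts
    consumer.offsetsForTimes(query).asScala
      .collect { case (tp, oat) if oat != null => tp -> oat.offset }
      .toMap
  } finally consumer.close()
}
```

The detection and switchover logic itself would still have to live outside Spark Streaming (e.g. restarting the context pointed at the other cluster).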
Re: Is executor computing time affected by network latency?
If your job involves a shuffle, then the compute time for the entire batch
will increase with network latency. What would be interesting is to see how
much time each task/job/stage takes.

On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi wrote:
> It seems to me they must communicate for joins, sorts, grouping, and so
> forth, where the original data partitioning needs to change. You could
> repeat your experiment for different code snippets. I'll bet it depends on
> what you do.
>
> On Thu, Sep 22, 2016 at 8:54 AM, gusiri wrote:
>> Hi,
>> When I increase the network latency among Spark nodes, I see that compute
>> time (= executor computing time in the Spark Web UI) also increases.
>> In the graph attached, left = latency 1 ms vs. right = latency 500 ms.
>> Is there any communication between workers and the driver/master even
>> *during* executor computing? Or any idea on this result?
>> (Screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n27779/Screen_Shot_2016-09-21_at_5.png)
>> Thank you very much in advance.
>> //gusiri
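The experiment suggested above can be sketched as two minimal pipelines: one narrow (no shuffle) and one wide (forces a shuffle). Under added network latency, only the second should slow down noticeably. This assumes an existing `SparkContext` named `sc`; the numbers are arbitrary:

```scala
// Narrow transformation: each task reads and writes its own partition
// locally, so no cross-node traffic during compute.
val data = sc.parallelize(1 to 1000000, numSlices = 16)
data.map(_ * 2).count()

// Wide transformation: reduceByKey repartitions records by key, so tasks
// exchange shuffle blocks over the network, and that transfer time is
// folded into the stage's measured duration.
data.map(x => (x % 100, x)).reduceByKey(_ + _).count()
```

Comparing per-stage times for these two jobs in the Web UI, with and without the injected latency, would separate shuffle cost from pure compute.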
How does MapWithStateRDD distribute the data
Hi, I am running a streaming job with 4 executors and 16 cores, so that
each executor has two cores to work with. The input Kafka topic has 4
partitions. With this configuration I was expecting the MapWithStateRDD to
be evenly distributed across all executors; however, I see that the
MapWithStateRDD data is distributed across only two executors. Sometimes
the data goes to only one executor. How can this behavior be explained? I
am pretty sure there is some math behind it. I am using a standard
standalone 1.6.2 cluster.
Thanks
Soumitra
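One likely factor: the state RDD is partitioned by key hash (defaulting to `spark.default.parallelism` partitions), so with few distinct keys, or keys whose hashes collide into a few partitions, the state can land on a subset of executors. A sketch of pinning the state to an explicit partitioner via `StateSpec`, where `mappingFunc` and the partition count are placeholders, not the poster's actual code:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

// Example mapping function: keep a running sum per key.
def mappingFunc(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, sum))
}

// Force the state RDD onto 8 partitions; with enough distinct keys this
// spreads the state across all executors instead of a subset.
val spec = StateSpec.function(mappingFunc _).partitioner(new HashPartitioner(8))
// keyedStream.mapWithState(spec)
```

`StateSpec.numPartitions(n)` is the shorthand for the same thing with a hash partitioner.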
Re: Spark Streaming : is spark.streaming.receiver.maxRate valid for DirectKafkaApproach
I think a better partitioning scheme can help you too.

On Tue, May 10, 2016 at 10:31 AM Cody Koeninger wrote:
> maxRate is not used by the direct stream.
>
> Significant skew in rate across different partitions for the same topic is
> going to cause you all kinds of problems, not just with Spark Streaming.
>
> You can turn on backpressure, but you're better off addressing the
> underlying issue if you can.
>
> On Tue, May 10, 2016 at 8:08 AM, Soumitra Siddharth Johri wrote:
>> Also look at backpressure enabled. Both of these can be used to limit the
>> rate.
>>
>> Sent from my iPhone
>>
>> On May 10, 2016, at 8:02 AM, chandan prakash wrote:
>> Hi,
>> I am using Spark Streaming with the direct Kafka approach and want to
>> limit the number of event records coming in my batches. I have a question
>> regarding the following 2 parameters:
>> 1. spark.streaming.receiver.maxRate
>> 2. spark.streaming.kafka.maxRatePerPartition
>>
>> The documentation
>> (http://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications)
>> says: "spark.streaming.receiver.maxRate for receivers and
>> spark.streaming.kafka.maxRatePerPartition for Direct Kafka approach".
>>
>> Does it mean that spark.streaming.receiver.maxRate is valid only for the
>> receiver-based approach (and not the direct Kafka approach as well)?
>>
>> If yes, then how do we control the total number of records/sec with
>> direct Kafka? spark.streaming.kafka.maxRatePerPartition only controls the
>> max rate per partition, not the total; there might be many partitions,
>> some with a very fast rate and some with a very slow rate.
>>
>> Regards,
>> Chandan
>>
>> --
>> Chandan Prakash
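Putting the thread's answer together as a config sketch: with the direct approach the only hard cap is per partition, so the total ceiling is (number of partitions × maxRatePerPartition), and backpressure then adjusts the actual rate below that ceiling. The rate value here is an arbitrary example:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Hard cap: at most 1000 records/sec read from EACH Kafka partition,
  // so a 4-partition topic is capped at 4000 records/sec total.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Let Spark adapt the actual ingestion rate to processing speed,
  // staying at or below the per-partition cap above.
  .set("spark.streaming.backpressure.enabled", "true")
```

`spark.streaming.receiver.maxRate` can be left unset here; as noted above, the direct stream ignores it.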
inter spark application communication
Hi, I have two applications: App1 and App2. On a single cluster I have to
spawn 5 instances of App1 and 1 instance of App2. What would be the best
way to send data from the 5 App1 instances to the single App2 instance?
Right now I am using Kafka to send data from one Spark application to the
other, but the setup doesn't seem right and I hope there is a better way to
do this.
Warm Regards
Soumitra
UpdateStateByKey : Partitioning and Shuffle
Hi, I am relatively new to Spark and am using the updateStateByKey()
operation to maintain state in my Spark Streaming application. The input
data is coming through a Kafka topic.
1. I want to understand how DStreams are partitioned.
2. How does the partitioning work with the mapWithState() or
updateStateByKey() method?
3. In updateStateByKey(), are the old state and the new values for a given
key processed on the same node?
4. How frequent is the shuffle for the updateStateByKey() method?
The state I have to maintain contains ~10 keys, and I want to avoid a
shuffle every time I update the state; any tips on how to do that?
Warm Regards
Soumitra
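On questions 3 and 4: updateStateByKey accepts an explicit partitioner, and Spark cogroups each new batch with the previous state RDD under that partitioner, so the old state and new values for a key meet on the same partition, while the (small) new batch is shuffled to it every interval. A sketch under those assumptions, with a placeholder `updateFunc` and partition count:

```scala
import org.apache.spark.HashPartitioner

// Example update function: running count per key.
def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

// With ~10 keys, a small fixed partitioner keeps the state RDD's layout
// stable across batches; only the incoming batch moves during the cogroup.
// keyedStream is assumed to be a DStream[(String, Int)].
// val stateDStream = keyedStream.updateStateByKey(updateFunc _, new HashPartitioner(2))
```

mapWithState (available from 1.6) avoids rewriting untouched keys each batch, which tends to matter more than the shuffle itself when state is small.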
Spark Streaming : Output frequency different from read frequency in StatefulNetworkWordCount
Hi, in the example
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
the streaming frequency is 1 second. However, I want to print the contents
of the word counts only once every minute, and reset the word counts back
to 0 every minute. How can I do that? I have to print per-minute word
counts with a streaming frequency of 1 second. I thought of using Scala
schedulers, but then there can be concurrency issues. My algorithm is as
follows:
1. Read the words every 1 second.
2. Do a cumulative word count for 60 seconds.
3. At the end of every 60 seconds (1 minute), print the word counts and
reset the counters to zero.
Any help would be appreciated!
Thanks
Warm Regards
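The three steps above map naturally onto a tumbling window rather than the stateful operator in the linked example: with window duration equal to slide duration (60 s), counts are emitted once per minute and implicitly reset, with no external scheduler. A sketch assuming an existing `SparkConf` named `conf` and a socket source as in the example:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1-second batch interval, as in StatefulNetworkWordCount.
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map(w => (w, 1))

// Tumbling window: window = slide = 60 s, so each output covers exactly
// one minute and the counts start from zero for the next minute.
val perMinuteCounts = words.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(60))
perMinuteCounts.print()
```

The inverse-function overload of `reduceByKeyAndWindow` only pays off for sliding (overlapping) windows; for a tumbling window the plain form shown here is enough.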
Re: RDD Indexes and how to fetch all edges with a given label
Hi,
With respect to the first issue, one possible way is to filter the graph
via graph.subgraph(epred = e => e.attr == "edgeLabel"), but I am still
curious whether we can index RDDs.
Warm Regards
Soumitra

On Tue, Oct 14, 2014 at 2:46 PM, Soumitra Johri <soumitra.siddha...@gmail.com> wrote:
> Hi All,
> I have a Graph with millions of edges. Edges are represented by
> org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = MappedRDD[4].
> I have two questions:
> 1) How can I fetch all the nodes with a given edge label (all edges with a
> given property)?
> 2) Is it possible to create indexes on the RDDs, or on a specific column
> of the RDD, to make the lookup faster?
> Please excuse the triviality of the question; I am new to the language and
> it is taking me some time to get used to it.
> Warm Regards
> Soumitra
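The subgraph filter mentioned above, spelled out: `subgraph` keeps every vertex (the default vertex predicate is true) and only the edges whose attribute matches, so GraphX still scans every edge partition rather than using an index. A sketch assuming an existing `Graph[_, String]` named `graph` and the hypothetical label `"edgeLabel"`:

```scala
// Keep only edges carrying the requested label; vertices survive,
// possibly left isolated.
val labeled = graph.subgraph(epred = e => e.attr == "edgeLabel")

// The node pairs connected by those edges, answering question 1.
val nodePairs = labeled.edges.map(e => (e.srcId, e.dstId))
```

For question 2, there is no user-facing secondary index on RDD fields; the usual workaround is to pre-key the data (e.g. an RDD keyed by label) and partition it, so lookups touch only the relevant partitions.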
RDD Indexes and how to fetch all edges with a given label
Hi All, I have a Graph with millions of edges. Edges are represented by
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = MappedRDD[4].
I have two questions:
1) How can I fetch all the nodes with a given edge label (all edges with a
given property)?
2) Is it possible to create indexes on the RDDs, or on a specific column of
the RDD, to make the lookup faster?
Please excuse the triviality of the question; I am new to the language and
it is taking me some time to get used to it.
Warm Regards
Soumitra