Help with collect() in Spark Streaming

2015-09-11 Thread allonsy
Hi everyone, I have a JavaPairDStream object and I'd like the Driver to create a text file (on HDFS) containing all of its elements. At the moment, I use the /coalesce(1, true)/ method: JavaPairDStream unified = [partitioned stuff] unified.foreachRDD(new
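A minimal sketch of the coalesce-and-save pattern being described, assuming Spark 1.x, String keys/values, and a hypothetical output path; each batch is shuffled into a single partition and written as one part file per batch interval:

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;

    // Write every batch as a single part file under a timestamped HDFS directory.
    // The path is hypothetical; coalesce(1, true) forces a shuffle into one partition.
    unified.foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
      @Override
      public Void call(JavaPairRDD<String, String> rdd, Time time) {
        if (!rdd.isEmpty()) {
          rdd.coalesce(1, true)
             .saveAsTextFile("hdfs:///user/allonsy/output/batch-" + time.milliseconds());
        }
        return null;
      }
    });

Note that later Spark versions take a VoidFunction2 in foreachRDD instead, and that collecting whole batches to the driver is usually avoided when batches are large.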

spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread allonsy
Hello everyone, in the new Kafka Direct API, what are the benefits of setting a value for *spark.streaming.maxRatePerPartition*? In my case, I have 2-second batches consuming ~15k tuples from a topic split into 48 partitions (4 workers, 16 total cores). Is there any particular value I should
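For reference, a hedged example of where the knob goes; the value 200 records/second/partition is only an illustration (with 48 partitions and 2-second batches it would cap each batch at 48 * 200 * 2 = 19,200 records):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf()
        .setAppName("kafka-direct-example")                    // hypothetical app name
        .set("spark.streaming.maxRatePerPartition", "200")     // records/sec per Kafka partition
        .set("spark.streaming.backpressure.enabled", "true");  // optional, Spark 1.5+
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));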

Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread allonsy
Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I understand, each Kafka partition /is/ an RDD partition that will be processed by a single core. What I don't understand is the relation between those partitions and the blocks generated every blockInterval.
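As far as I can tell (hedged, not authoritative): blocks and spark.streaming.blockInterval only apply to receiver-based streams; the direct approach has no receivers, so no blocks are generated, and each batch's RDD simply gets one partition per Kafka partition (48 in a setup like the one above). A sketch with the Spark 1.x spark-streaming-kafka API, with hypothetical broker list and topic name, and jssc being your JavaStreamingContext:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical brokers

    Set<String> topics = Collections.singleton("my-topic");               // hypothetical topic

    // One RDD partition per Kafka partition; blockInterval plays no role here.
    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);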

Total delay per batch in a CSV file

2015-08-04 Thread allonsy
Hi everyone, I'm working with Spark Streaming, and I need to perform some offline performance measurements. What I'd like is a CSV file that reports something like this: *Batch number/timestamp, Input Size, Total Delay*, which is in fact similar to what the UI outputs. I tried
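The figures the UI shows come from a StreamingListener (onBatchCompleted exposes the batch's scheduling delay, processing time and total delay), which may be the cleaner route. As a rough, self-contained alternative, a hedged Java sketch that appends one CSV row per batch from the driver; pairStream is a hypothetical JavaPairDStream<String, String>, the file path is hypothetical, and the measured lag only approximates the UI's "Total Delay":

    import java.io.FileWriter;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;

    // Append "batchTime,count,lagMillis" for every batch. The foreachRDD closure
    // runs on the driver, so the CSV ends up on the driver's local filesystem.
    pairStream.foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
      @Override
      public Void call(JavaPairRDD<String, String> rdd, Time time) throws Exception {
        long count = rdd.count();
        long lagMillis = System.currentTimeMillis() - time.milliseconds();
        FileWriter writer = new FileWriter("/tmp/batch-stats.csv", true); // append mode
        try {
          writer.write(time.milliseconds() + "," + count + "," + lagMillis + "\n");
        } finally {
          writer.close();
        }
        return null;
      }
    });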

Spark: configuration file 'metrics.properties'

2015-07-24 Thread allonsy
Hi, Spark's configuration file for metrics, namely /conf/metrics.properties/, states the following: Within an instance, a source specifies a particular set of grouped metrics. There are two kinds of sources: 1. Spark internal sources, like /MasterSource/, /WorkerSource/, etc.,
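For context, a hedged example of what enabling an extra source plus a sink looks like in conf/metrics.properties; the keys mirror the commented template shipped with Spark, and the CSV output directory is hypothetical:

    # Attach the JVM source to the master and worker instances.
    master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource

    # Dump every instance's metrics to CSV every 10 seconds.
    *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
    *.sink.csv.period=10
    *.sink.csv.unit=seconds
    *.sink.csv.directory=/tmp/spark-metrics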

Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread allonsy
Hi everybody, is there anything in Spark that shares the philosophy of Storm's field grouping? I'd like to manage data partitioning across the workers by sending tuples that share the same key to the very same worker in the cluster, but I did not find any method to do that. Suggestions? :)
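One rough analogue (a hedged sketch, not a Storm-equivalent guarantee): hash-partition the pair RDD by key with a HashPartitioner, so all tuples with the same key land in the same partition and are therefore processed by the same task. Variable names and the partition count are hypothetical:

    import org.apache.spark.HashPartitioner;
    import org.apache.spark.api.java.JavaPairRDD;

    // 'pairs' is a hypothetical JavaPairRDD<String, Integer>.
    // All tuples sharing a key end up in the same partition of 'partitioned'.
    JavaPairRDD<String, Integer> partitioned =
        pairs.partitionBy(new HashPartitioner(16));  // 16 = total cores, for example

On DStreams, the key-based transformations (reduceByKey, groupByKey, and so on) also have overloads that accept a Partitioner.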

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-05-07 Thread allonsy
Has there been any follow-up on this topic? Here http://search-hadoop.com/m/q3RTtm4EtI1hoD8d2 there were suggestions that someone was going to publish some code, but there has been no news since (TD himself looked pretty interested). Did anybody come up with something in the last few months?

[Spark Streaming] Help with updateStateByKey()

2015-04-23 Thread allonsy
Hi everybody, I think I could use some help with the /updateStateByKey()/ JAVA method in Spark Streaming. *Context:* I have a /JavaReceiverInputDStreamDataUpdate du/ DStream, where object /DataUpdate/ mainly has 2 fields of interest (in my case), namely du.personId (an Integer) and