Re: How to execute a function from class in distributed jar on each worker node?

2014-11-11 Thread aaronjosephs
I'm not sure that this will work but it makes sense to me. Basically you write the functionality in a static block in a class and broadcast that class. Not sure what your use case is but I need to load a native library and want to avoid running the init in mapPartitions if it's not necessary (just

Re: Spark Streaming with long batch / window duration

2014-07-21 Thread aaronjosephs
So I think I may end up using hourglass (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop) a hadoop framework for incremental data processing, it would be very cool if spark (not streaming ) could support something like this -- View this message in

Spark Streaming with long batch / window duration

2014-07-18 Thread aaronjosephs
Would it be a reasonable use case of spark streaming to have a very large window size (lets say on the scale of weeks). In this particular case the reduce function would be invertible so that would aid in efficiency. I assume that having a larger batch size since the window is so large would also

Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread aaronjosephs
I think you probably want to use `AvroSequenceFileOutputFormat` with `newAPIHadoopFile`. I'm not even sure that in hadoop you would use SequenceFileInput format to read an avro sequence file -- View this message in context:

Re: Spark Streaming with long batch / window duration

2014-07-18 Thread aaronjosephs
Unfortunately for reasons I won't go into my options for what I can use are limited, it was more of a curiosity to see if spark could handle a use case like this since the functionality I wanted fit perfectly into the reduceByKeyAndWindow frame of thinking. Anyway thanks for answering. -- View

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread aaronjosephs
The only other thing to keep in mind is that window duration and slide duration have to be multiples of batch duration, IDK if you made that fully clear -- View this message in context: