Spark Streaming: Async action scheduling inside foreachRDD

2017-08-02 Thread Andrii Biletskyi
Hi all, What is the correct way to schedule multiple jobs inside the foreachRDD method and, importantly, await their results to ensure those jobs have completed successfully? E.g.: kafkaDStream.foreachRDD { rdd => val rdd1 = rdd.map(...) val rdd2 = rdd1.map(...) val job1Future = Future {
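The pattern asked about above can be sketched with plain Scala Futures: wrap each Spark action in a Future so the jobs are submitted concurrently, then block until all of them finish before the batch ends. This is a minimal stand-in, not the thread's actual code: the "jobs" below are ordinary functions in place of RDD actions such as rdd1.saveAsTextFile or rdd2.count, and runJobsAndAwait is a hypothetical helper name.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run all jobs concurrently and block until every one has completed.
// If any job throws, Await.result rethrows, so the failure surfaces in
// the batch instead of a job being silently dropped.
def runJobsAndAwait[A](jobs: Seq[() => A]): Seq[A] = {
  val futures = jobs.map(job => Future(job()))
  Await.result(Future.sequence(futures), 60.seconds)
}

val results = runJobsAndAwait(Seq(() => 1 + 1, () => 2 * 21))
println(results) // List(2, 42)
```

In a real foreachRDD body you would pass closures that invoke Spark actions; the timeout should comfortably exceed the batch interval so slow batches fail loudly rather than overlap silently.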

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
> number of partitions, you’re potentially doing less work in parallel > depending on your cluster setup. > > On May 23, 2017, at 4:23 PM, Andrii Biletskyi <andrii.bilets...@yahoo.com.invalid> wrote: > > > No, I didn't t

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
ll also reduce parallelism of the preceding computation. Have you tried using repartition instead? On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi <andrii.bilets...@yahoo.com.invalid> wrote: Hi all, I'm trying to understand the impact of the coalesce operation on Spark job performance. As a side

Re: Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
m of the preceding > computation. Have you tried using repartition instead? > > On Tue, May 23, 2017 at 12:14 PM, Andrii Biletskyi < > andrii.bilets...@yahoo.com.invalid> wrote: > >> Hi all, >> >> I'm trying to understand the impact of coalesce operation on sp

Impact of coalesce operation before writing dataframe

2017-05-23 Thread Andrii Biletskyi
Hi all, I'm trying to understand the impact of the coalesce operation on Spark job performance. As a side note: we are using EMRFS (i.e. AWS S3) as the source and target for the job. Omitting unnecessary details, the job can be described as: join a 200M-record DataFrame stored in ORC format on EMRFS with
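The advice given in the replies (coalesce also reduces parallelism of the preceding computation; consider repartition instead) can be sketched as follows. This is a hedged illustration, not the thread's actual job: the SparkSession, DataFrames, paths, and partition count are all hypothetical, and it needs a Spark cluster to run.

```scala
// Sketch only: assumes a Spark deployment; `df`, `other`, and the S3 paths
// are illustrative.
import org.apache.spark.sql.SparkSession

val spark  = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()
val df     = spark.read.orc("s3://bucket/input")   // large ORC dataset on EMRFS
val other  = spark.read.orc("s3://bucket/lookup")
val joined = df.join(other, "key")

// coalesce(10) adds no shuffle boundary, so Spark folds it into the
// preceding stage: the join itself then runs with only 10 tasks.
joined.coalesce(10).write.orc("s3://bucket/out-coalesce")

// repartition(10) inserts a shuffle: the join keeps its full parallelism,
// and only the post-shuffle write stage runs with 10 tasks.
joined.repartition(10).write.orc("s3://bucket/out-repartition")
```

The trade-off is the cost of the extra shuffle versus the lost parallelism in the join stage; for a heavy upstream computation feeding a small output, repartition before the write is usually the safer choice.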

Re: MapWithState partitioning

2016-10-31 Thread Andrii Biletskyi
Koeninger <c...@koeninger.org>: > You may know that those streams share the same keys, but Spark doesn't > unless you tell it. > > mapWithState takes a StateSpec, which should allow you to specify a > partitioner. > > On Mon, Oct 31, 2016 at 9:40 AM, Andrii Biletskyi &
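As the reply notes, mapWithState takes a StateSpec, which lets you specify a partitioner so the state RDDs share the partitioning of the keyed input stream. A minimal sketch under stated assumptions: keyedDStream is a hypothetical DStream[(String, Int)] built from the Kafka source, and the mapping function and partition count are illustrative.

```scala
// Sketch only: assumes a StreamingContext and a keyed DStream `keyedDStream`.
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

// Running sum per key; any (key, value, state) => output function fits here.
def trackSum(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  Some((key, sum))
}

// The partitioner on StateSpec controls how the state RDDs are partitioned;
// matching it to the upstream partitioning avoids an extra shuffle per batch.
val spec = StateSpec
  .function(trackSum _)
  .partitioner(new HashPartitioner(8)) // 8 is illustrative

val stateful = keyedDStream.mapWithState(spec)
```

StateSpec also offers numPartitions(n) when a default HashPartitioner with n partitions is enough; a custom Partitioner is only needed when the stream's keys are distributed by some other scheme.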

MapWithState partitioning

2016-10-31 Thread Andrii Biletskyi
Hi all, I'm using the Spark Streaming mapWithState operation to do a stateful operation on my Kafka stream (though I think similar arguments would apply to any source). I'm trying to understand how to control mapWithState's partitioning scheme. My transformations are simple: 1) create a KafkaDStream