Fwd: Re: spark-sql force parallel union

2018-11-20 Thread onmstester onmstester
Thanks Kathleen, 1. So if I've got 4 DataFrames and I want "dfv1 union dfv2 union dfv3 union dfv4", would it first compute "dfv1 union dfv2" and "dfv3 union dfv4" independently and simultaneously, and then union their results? 2. It's going to be hundreds of partitions to union, and creating a temp view for each…
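
For reference, a minimal sketch of unioning many DataFrames without temp views, assuming the per-partition DataFrames are collected in a Seq named dfs (my name) and that a column "id" exists for the aggregation; Catalyst's CombineUnions rule should flatten the chained unions into a single Union node, so the inputs are scanned together rather than strictly pairwise:

    // dfs: Seq[DataFrame], one per Cassandra partition (assumed to exist)
    val unioned = dfs.reduce(_ union _)
    // downstream aggregation runs over the single flattened Union
    unioned.groupBy("id").count().show()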

Re: spark-sql force parallel union

2018-11-20 Thread kathleen li
You might first write code to construct the query statement with "union all", like below:

    scala> val query = "select * from dfv1 union all select * from dfv2 union all select * from dfv3"
    query: String = select * from dfv1 union all select * from dfv2 union all select * from dfv3

then write a loop to…
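
A rough sketch of the loop being described, assuming the views dfv1 … dfvN have already been registered with createOrReplaceTempView (the count and view names below are illustrative):

    // build "select * from dfv1 union all ... union all select * from dfvN"
    val n = 4
    val query = (1 to n).map(i => s"select * from dfv$i").mkString(" union all ")
    // run the combined statement once; Spark plans all inputs together
    val result = spark.sql(query)
    result.show()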

spark-sql force parallel union

2018-11-20 Thread onmstester onmstester
I'm using Spark SQL to query Cassandra tables. In Cassandra I've partitioned my data by a time bucket and an id, so depending on the query I need to union multiple partitions with Spark SQL and do the aggregations/group-by on the union result, something like this: for(all cassandra partitions){ DataSet…
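
A minimal sketch of that pattern using the spark-cassandra-connector DataFrame reader; the keyspace, table, column names, and bucket values below are made up for illustration:

    import spark.implicits._

    // one DataFrame per time-bucket partition (hypothetical filter columns)
    val buckets = Seq("2018-11-19", "2018-11-20")
    val parts = buckets.map { bucket =>
      spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "ks", "table" -> "events"))
        .load()
        .where($"bucket" === bucket && $"id" === "someId")
    }

    // union all partitions, then aggregate over the combined result
    val all = parts.reduce(_ union _)
    all.groupBy($"id").count().show()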

Monthly Apache Spark Newsletter

2018-11-20 Thread Ankur Gupta
Hey guys, just launched a monthly Apache Spark newsletter: https://newsletterspot.com/apache-spark/ Cheers, Ankur

Is there any window operation for RDDs in Pyspark? like for DStreams

2018-11-20 Thread zakhavan
Hello, I have two RDDs and my goal is to calculate the Pearson correlation between them using a sliding window. I want to have 200 samples in each window from rdd1 and rdd2, calculate the correlation between them, then slide the window by 120 samples and calculate the correlation between the n…
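
For what it's worth, a Scala sketch of one way to do this on RDDs, using the sliding helper from Spark MLlib's RDD utilities (a DeveloperApi). It assumes rdd1 and rdd2 are RDD[Double] with matching partitioning so that zip is valid; otherwise align them via zipWithIndex and join first. The Pearson coefficient is computed locally per window:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    def pearson(w: Array[(Double, Double)]): Double = {
      val n  = w.length.toDouble
      val mx = w.map(_._1).sum / n
      val my = w.map(_._2).sum / n
      val cov = w.map { case (x, y) => (x - mx) * (y - my) }.sum
      val sx  = math.sqrt(w.map { case (x, _) => (x - mx) * (x - mx) }.sum)
      val sy  = math.sqrt(w.map { case (_, y) => (y - my) * (y - my) }.sum)
      cov / (sx * sy)
    }

    // windows of 200 paired samples, advancing 120 samples at a time
    val correlations = rdd1.zip(rdd2).sliding(200, 120).map(pearson)
    correlations.collect().foreach(println)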

Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Gourav Sengupta
Hi, this is interesting. Can you please share the code for this and, if possible, the source schema? It would also be great if you could kindly share a sample file. Regards, Gourav Sengupta. On Tue, Nov 20, 2018 at 9:50 AM Michael Shtelma wrote: > You can also cache the data frame on disk, if it…

Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Michael Shtelma
You can also cache the data frame on disk if it does not fit into memory. An alternative would be to write the data frame out as Parquet and then read it back; you can check whether the whole pipeline runs faster that way than with the standard cache. Best, Michael. On Tue, Nov 20, 2018 at 9:14 AM Dipl.-In…
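
A quick sketch of the two options mentioned above (the output path is illustrative):

    import org.apache.spark.storage.StorageLevel

    // option 1: keep the cached blocks on disk instead of in memory
    df.persist(StorageLevel.DISK_ONLY)

    // option 2: materialize as Parquet and read it back for the rest of the pipeline
    df.write.mode("overwrite").parquet("/tmp/intermediate")
    val reread = spark.read.parquet("/tmp/intermediate")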

Re: Spark DataSets and multiple write(.) calls

2018-11-20 Thread Dipl.-Inf. Rico Bergmann
Hi! Thanks Vadim for your answer. But this would be like caching the dataset, right? Or is checkpointing faster than persisting to memory or disk? I attach a PDF of my dataflow program. If I could compute outputs 1-5 in parallel, the output of flatmap1 and groupBy could be reused, av…
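
For comparison, a sketch of the two alternatives being discussed; shared stands for the reused intermediate (e.g. the flatmap1/groupBy result in the attached dataflow) and the paths are illustrative. checkpoint(eager = true) materializes the data and truncates the lineage, while persist keeps the lineage and is only materialized lazily on the first action:

    import org.apache.spark.storage.StorageLevel

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    val cached       = shared.persist(StorageLevel.MEMORY_AND_DISK)
    val checkpointed = shared.checkpoint(eager = true)

    // either one can feed several independent write(...) calls without recomputing 'shared'
    cached.write.parquet("/tmp/out1")
    cached.write.json("/tmp/out2")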