Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Such as: df.withWatermark("time","window size").dropDuplicates("id").withWatermark("time","real watermark").groupBy(window("time","window size","window size")).agg(count("id")). Can this make count(distinct id) succeed? lec ssmi wrote on Fri, Feb 28, 2020 at 1:11 PM: > Such as : >
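For reference, a minimal sketch of the chain being asked about, written with a single watermark and the built-in rate source as a stand-in stream; the column names, watermark, and window sizes are placeholders, and whether this dedup-then-aggregate chain is fully supported and yields a correct per-window distinct count is exactly the open question of this thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, window}

    val spark = SparkSession.builder.appName("DedupThenWindowCount").getOrCreate()

    // Stand-in stream: the rate source provides `timestamp` and `value`;
    // `value % 10` plays the role of the `id` column from the thread.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
      .selectExpr("timestamp AS time", "CAST(value % 10 AS STRING) AS id")

    val perWindowCounts = events
      .withWatermark("time", "10 minutes")          // bounds the deduplication state
      .dropDuplicates("id")                         // first stateful operator: dedup
      .groupBy(window(col("time"), "10 minutes"))   // second stateful operator: windowed count
      .agg(count("id").as("id_count"))

    perWindowCounts.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()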

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Such as: df.withWatermark("time","window size").dropDuplicates("id").withWatermark("time","real watermark").groupBy(window("time","window size","window size")).agg(count("id")). Can this make count(distinct id) succeed? Tathagata Das wrote on Fri, Feb 28, 2020 at 10:25 AM: > 1. Yes. All

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread Tathagata Das
1. Yes. All times are in event time, not processing time. So you may get 10 AM event-time data at 11 AM processing time, but it will still be compared against all data within the 9-10 AM event-time range. 2. Show us your code. On Thu, Feb 27, 2020 at 2:30 AM lec ssmi wrote: > Hi: > I'm new to structured

Re: Spark Streaming: Aggregating values across batches

2020-02-27 Thread Tathagata Das
Use Structured Streaming. Its aggregation, by definition, is across batches. On Thu, Feb 27, 2020 at 3:17 PM Something Something < mailinglist...@gmail.com> wrote: > We've a Spark Streaming job that calculates some values in each batch. > What we need to do now is aggregate values across ALL
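A minimal sketch of such a continuously updated aggregation in Structured Streaming, using the socket source and a running word count as a stand-in for the real per-batch calculation; the host, port, and grouping key are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("RunningCounts").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The engine carries the aggregation state across micro-batches itself,
    // so no driver-side accumulators are needed.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete")   // emit the full, continuously updated result each trigger
      .format("console")
      .start()
      .awaitTermination()

With a checkpoint location configured, that running state also survives restarts, which accumulators alone do not give you.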

Spark Streaming: Aggregating values across batches

2020-02-27 Thread Something Something
We have a Spark Streaming job that calculates some values in each batch. What we need to do now is aggregate values across ALL batches. What is the best strategy to do this in Spark Streaming? Should we use Spark Accumulators for this?

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread prosp4300
There seems to be no obvious relationship between the partitions or tables, so maybe try putting them in different jobs that can run at the same time, to make full use of the cluster resources. On 02/27/2020 22:50, Manjunath

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Enrico Minack
Manjunath, You can define your DataFrames in parallel in a multi-threaded driver. Enrico On 27.02.20 at 15:50, Manjunath Shetty H wrote: Hi Enrico, In that case how do I make effective use of all nodes in the cluster? And also, what is your opinion on the below * Create 10 Dataframes
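A rough sketch of the multi-threaded driver approach, assuming ten JDBC source tables whose results can be written to HDFS independently; the table names, connection options, and output paths are placeholders:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("ParallelTableLoads").getOrCreate()

    val tables = (1 to 10).map(i => s"table_$i")   // placeholder table names

    // Each Future submits its own Spark job from the driver, so the scheduler
    // can run the ten loads/writes concurrently and keep more executors busy.
    val jobs = tables.map { table =>
      Future {
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://shard-host:3306/db")   // placeholder shard URL
          .option("dbtable", table)
          .option("user", "user")
          .option("password", "secret")
          .load()

        df.write.mode("overwrite").orc(s"hdfs:///staging/$table")   // placeholder output path
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)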

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hi Enrico, In that case how do I make effective use of all nodes in the cluster? And also, what is your opinion on the options below? * Create 10 DataFrames sequentially in the driver program and transform/write to HDFS one after the other * Or the current approach mentioned in the previous mail What

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Enrico Minack
Hi Manjunath, why not create 10 DataFrames loading the different tables in the first place? Enrico On 27.02.20 at 14:53, Manjunath Shetty H wrote: Hi Vinodh, Thanks for the quick response. Didn't get what you meant exactly; any reference or snippet will be helpful. To explain the
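A minimal sketch of that suggestion, loading each table directly as its own DataFrame and running a table-specific query against it; the table names, connection options, and query are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("PerTableDataFrames").getOrCreate()

    val tables = Seq("table_1", "table_2")   // placeholder names, up to table_10

    // One DataFrame per source table, each keeping the schema of its own table.
    val frames = tables.map { table =>
      table -> spark.read
        .format("jdbc")
        .option("url", "jdbc:mysql://shard-host:3306/db")   // placeholder connection
        .option("dbtable", table)
        .load()
    }.toMap

    // Per-table SQL, since every table may have a different schema.
    frames.foreach { case (table, df) =>
      df.createOrReplaceTempView(table)
      spark.sql(s"SELECT * FROM $table")   // placeholder for the table-specific transformation
        .write.mode("overwrite").orc(s"hdfs:///staging/$table")
    }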

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hi Vinodh, Thanks for the quick response. Didn't get what you meant exactly; any reference or snippet will be helpful. To explain the problem more: * I have 10 partitions; each partition loads the data from a different table and a different SQL shard. * Most of the partitions will have

Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Charles vinodh
Just split the single RDD into multiple individual RDDs using a filter operation, then convert each individual RDD to its respective DataFrame. On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H wrote: > > Hello All, > > In spark i am creating the custom partitions with Custom RDD, each >
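A rough sketch of the filter-and-convert idea, assuming each record in the custom RDD is tagged with the table it came from and that the per-table schemas are known on the driver; the tag, schemas, and sample rows below are illustrative only:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

    val spark = SparkSession.builder.appName("SplitRddByTable").getOrCreate()

    // Stand-in for the custom RDD: each record carries a tag naming its source table.
    val tagged = spark.sparkContext.parallelize(Seq(
      ("table_1", Row(1, "a")),
      ("table_2", Row("x", 3.14))
    ))

    // Assumed: the schema of each source table is known up front.
    val schemas = Map(
      "table_1" -> StructType(Seq(StructField("id", IntegerType), StructField("name", StringType))),
      "table_2" -> StructType(Seq(StructField("code", StringType), StructField("value", DoubleType)))
    )

    // Split by filtering on the tag, then convert each split to its own DataFrame.
    val frames = schemas.map { case (table, schema) =>
      val rows = tagged.filter(_._1 == table).map(_._2)
      table -> spark.createDataFrame(rows, schema)
    }

    frames.foreach { case (_, df) => df.show() }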

Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hello All, In Spark I am creating custom partitions with a custom RDD; each partition will have a different schema. Now in the transformation step we need to get the schema and run some DataFrame SQL queries per partition, because each partition's data has a different schema. How to get the

Unsubscribe

2020-02-27 Thread Phillip Pienaar
On Thu, 27 Feb 2020, 9:30 pm lec ssmi wrote: > Hi: > I'm new to structured streaming. Because the built-in API cannot > perform the Count Distinct operation of Window, I want to use > dropDuplicates first, and then perform the window count. > But in the process of using, there are two

dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Hi: I'm new to Structured Streaming. Because the built-in API cannot perform a count-distinct operation within a window, I want to use dropDuplicates first and then perform the windowed count. But in the process of using it, there are two problems: 1. Because it is streaming computing,