Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
  Such as:
df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .withWatermark("time", "real watermark")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))
   Can it make count(distinct id) succeed?
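For concreteness, a minimal runnable sketch of the pipeline being described, using the built-in rate source as a stand-in input and placeholder durations ("10 minutes" standing in for the window size, "1 hour" for the real watermark). Whether the second withWatermark takes effect as hoped is exactly the open question of this thread, so treat this as an illustration of the query shape, not a confirmed answer:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, window}

    val spark = SparkSession.builder.appName("dedup-then-window-count").getOrCreate()

    // Toy streaming source standing in for the real input; it produces
    // columns "timestamp" and "value", renamed here to "time" and "id".
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()
      .withColumnRenamed("timestamp", "time")
      .withColumnRenamed("value", "id")

    val counted = df
      .withWatermark("time", "10 minutes")          // placeholder for "window size"
      .dropDuplicates("id", "time")                 // keeping the event-time column in the keys lets state be evicted
      .withWatermark("time", "1 hour")              // placeholder for the "real watermark"
      .groupBy(window(col("time"), "10 minutes"))   // tumbling window of the same size
      .agg(count(col("id")).as("id_count"))

    counted.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()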


lec ssmi wrote on Fri, Feb 28, 2020 at 1:11 PM:

>   Such as:
> df.withWatermark("time", "window size")
>   .dropDuplicates("id")
>   .withWatermark("time", "real watermark")
>   .groupBy(window("time", "window size", "window size"))
>   .agg(count("id"))
>    Can it make count(distinct id) succeed?
>
> Tathagata Das wrote on Fri, Feb 28, 2020 at 10:25 AM:
>
>> 1. Yes. All times are in event time, not processing time. So you may get 10 AM
>> event-time data at 11 AM processing time, but it will still be compared
>> against all data within 9-10 AM event times.
>>
>> 2. Show us your code.
>>
>> On Thu, Feb 27, 2020 at 2:30 AM lec ssmi  wrote:
>>
>>> Hi:
>>> I'm new to Structured Streaming. Because the built-in API cannot
>>> perform a windowed count-distinct operation, I want to use
>>> dropDuplicates first, and then perform the window count.
>>>    But there are two problems with this approach:
>>>    1. Because this is streaming computation, the deduplication state
>>> needs to be cleared in time, which requires a watermark. Assuming my
>>> event-time field is monotonically increasing and I set the watermark to
>>> 1 hour, does that mean the data at 10 o'clock will only be compared
>>> against the data from 9 o'clock to 10 o'clock, and the data before
>>> 9 o'clock will be cleared?
>>>    2. Because the deduplication is per window, I set the watermark
>>> before deduplication to the window size. But after deduplication, I need
>>> to call withWatermark() again to set the watermark to the real
>>> watermark. Will setting the watermark again take effect?
>>>
>>>  Thanks a lot!
>>>
>>


Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
  Such as:
df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .withWatermark("time", "real watermark")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))
   Can it make count(distinct id) succeed?

Tathagata Das wrote on Fri, Feb 28, 2020 at 10:25 AM:

> 1. Yes. All times are in event time, not processing time. So you may get 10 AM
> event-time data at 11 AM processing time, but it will still be compared
> against all data within 9-10 AM event times.
>
> 2. Show us your code.
>
> On Thu, Feb 27, 2020 at 2:30 AM lec ssmi  wrote:
>
>> Hi:
>> I'm new to Structured Streaming. Because the built-in API cannot
>> perform a windowed count-distinct operation, I want to use
>> dropDuplicates first, and then perform the window count.
>>    But there are two problems with this approach:
>>    1. Because this is streaming computation, the deduplication state
>> needs to be cleared in time, which requires a watermark. Assuming my
>> event-time field is monotonically increasing and I set the watermark to
>> 1 hour, does that mean the data at 10 o'clock will only be compared
>> against the data from 9 o'clock to 10 o'clock, and the data before
>> 9 o'clock will be cleared?
>>    2. Because the deduplication is per window, I set the watermark
>> before deduplication to the window size. But after deduplication, I need
>> to call withWatermark() again to set the watermark to the real
>> watermark. Will setting the watermark again take effect?
>>
>>  Thanks a lot!
>>
>


Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread Tathagata Das
1. Yes. All times are in event time, not processing time. So you may get 10 AM
event-time data at 11 AM processing time, but it will still be compared
against all data within 9-10 AM event times.

2. Show us your code.
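To make point 1 concrete, a short hedged sketch (the DataFrame and column names here -- events, eventTime, id -- are hypothetical):

    // With a 1-hour watermark on the event-time column, dedup state is matched
    // by event time: a record with event time 10:00 that arrives at 11:00
    // processing time is still compared against state whose event times fall
    // within the watermark, while state older than the watermark (roughly
    // before 9:00 once the max event time reaches 10:00) becomes eligible
    // for eviction.
    val deduped = events
      .withWatermark("eventTime", "1 hour")
      .dropDuplicates("id", "eventTime")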

On Thu, Feb 27, 2020 at 2:30 AM lec ssmi  wrote:

> Hi:
> I'm new to Structured Streaming. Because the built-in API cannot
> perform a windowed count-distinct operation, I want to use
> dropDuplicates first, and then perform the window count.
>    But there are two problems with this approach:
>    1. Because this is streaming computation, the deduplication state
> needs to be cleared in time, which requires a watermark. Assuming my
> event-time field is monotonically increasing and I set the watermark to
> 1 hour, does that mean the data at 10 o'clock will only be compared
> against the data from 9 o'clock to 10 o'clock, and the data before
> 9 o'clock will be cleared?
>    2. Because the deduplication is per window, I set the watermark
> before deduplication to the window size. But after deduplication, I need
> to call withWatermark() again to set the watermark to the real
> watermark. Will setting the watermark again take effect?
>
>  Thanks a lot!
>


Re: Spark Streaming: Aggregating values across batches

2020-02-27 Thread Tathagata Das
Use Structured Streaming. Its aggregation, by definition, is across batches.
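For example, a minimal sketch of a Structured Streaming aggregation (the rate source and the "key"/"value" columns are stand-ins for a real input); the engine keeps the aggregation state for you, so each emitted result reflects every batch seen so far:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, sum}

    val spark = SparkSession.builder.appName("agg-across-batches").getOrCreate()

    // Toy streaming source; in practice this would be Kafka, files, etc.
    // The rate source produces columns "timestamp" and "value".
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 5)
      .load()
      .withColumn("key", col("value") % 10)   // hypothetical grouping key

    // The state behind this groupBy/agg is maintained by the engine across
    // micro-batches, so no accumulators are needed.
    val totals = events
      .groupBy("key")
      .agg(sum("value").as("running_total"))

    totals.writeStream
      .outputMode("complete")   // emit the full, updated result each trigger
      .format("console")
      .start()
      .awaitTermination()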

On Thu, Feb 27, 2020 at 3:17 PM Something Something
<mailinglist...@gmail.com> wrote:

> We have a Spark Streaming job that calculates some values in each batch.
> What we need to do now is aggregate values across ALL batches. What is the
> best strategy to do this in Spark Streaming? Should we use 'Spark
> Accumulators' for this?
>


Spark Streaming: Aggregating values across batches

2020-02-27 Thread Something Something
We have a Spark Streaming job that calculates some values in each batch. What
we need to do now is aggregate values across ALL batches. What is the best
strategy to do this in Spark Streaming? Should we use 'Spark Accumulators'
for this?


Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread prosp4300
It looks like there is no obvious relationship between the partitions or tables,
so maybe try handling them in different jobs; that way they can run at the same
time and make full use of the cluster resources.




prosp4300

On 02/27/2020 22:50, Manjunath Shetty H wrote:
Hi Enrico,


In that case, how do we make effective use of all the nodes in the cluster?


Also, what is your opinion on the options below?
  *   Create 10 DataFrames sequentially in the driver program and
transform/write to HDFS one after the other
  *   Or keep the current approach mentioned in the previous mail
What will be the performance implications?


Regards
Manjunath


From: Enrico Minack 
Sent: Thursday, February 27, 2020 7:57 PM
To: user@spark.apache.org
Subject: Re: Convert each partition of RDD to Dataframe
 
Hi Manjunath,


why not create 10 DataFrames, loading the different tables in the first place?


Enrico




On 27.02.20 at 14:53, Manjunath Shetty H wrote:

Hi Vinodh,


Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.


To explain the problem more:
  *   I have 10 partitions; each partition loads data from a different table
and a different SQL shard.
  *   Most of the partitions will have different schemas.
  *   Before persisting the data I want to do some column-level manipulation
using a DataFrame.
That is why I want to create 10 DataFrames (based on the partitions) that map
to 10 different tables/shards from an RDD.


Regards
Manjunath
From: Charles vinodh 
Sent: Thursday, February 27, 2020 7:04 PM
To: manjunathshe...@live.com 
Cc: user 
Subject: Re: Convert each partition of RDD to Dataframe
 
Just split the single RDD into multiple individual RDDs using a filter
operation, and then convert each individual RDD to its respective DataFrame.


On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H  
wrote:



Hello All,



In Spark I am creating custom partitions with a custom RDD; each partition
will have a different schema. Now, in the transformation step, we need to get
the schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the Iterable to
a List, and converting that to a DataFrame. But the problem is that converting
the Iterable to a List will bring all the data into memory, and it might crash
the process.

Is there any known way to do this? Or is there any way to handle custom
partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance







Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Enrico Minack

Manjunath,

You can define your DataFrames in parallel in a multi-threaded driver.

Enrico
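A rough sketch of what that could look like in a Spark 1.6-era driver (the shard URLs, table names, and output paths below are made up; the point is only that each Future submits an independent job, so the scheduler can run them concurrently and keep the whole cluster busy instead of processing the tables one after another):

    import java.util.Properties
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ParallelTableLoads {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parallel-table-loads"))
        val sqlContext = new SQLContext(sc)

        // One (JDBC URL, table) pair per shard/table -- hypothetical values.
        val tables = (1 to 10).map(i => (s"jdbc:mysql://shard$i/db", "my_table"))

        // Each Future builds a DataFrame and writes it out; the jobs are
        // submitted from separate driver threads and can run at the same time.
        val jobs = tables.zipWithIndex.map { case ((url, table), i) =>
          Future {
            val df = sqlContext.read.jdbc(url, table, new Properties())
            // per-table column-level transformations would go here
            df.write.mode("overwrite").parquet(s"hdfs:///tmp/out/table_$i")
          }
        }

        Await.result(Future.sequence(jobs), Duration.Inf)
        sc.stop()
      }
    }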

On 27.02.20 at 15:50, Manjunath Shetty H wrote:

Hi Enrico,

In that case, how do we make effective use of all the nodes in the cluster?

Also, what is your opinion on the options below?

  * Create 10 DataFrames sequentially in the driver program and
transform/write to HDFS one after the other
  * Or keep the current approach mentioned in the previous mail

What will be the performance implications?

Regards
Manjunath


From: Enrico Minack
Sent: Thursday, February 27, 2020 7:57 PM
To: user@spark.apache.org
Subject: Re: Convert each partition of RDD to Dataframe
Hi Manjunath,

why not create 10 DataFrames, loading the different tables in the
first place?


Enrico


On 27.02.20 at 14:53, Manjunath Shetty H wrote:

Hi Vinodh,

Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.


To explain the problem more:

  * I have 10 partitions; each partition loads data from a different
table and a different SQL shard.
  * Most of the partitions will have different schemas.
  * Before persisting the data I want to do some column-level
manipulation using a DataFrame.

That is why I want to create 10 DataFrames (based on the partitions)
that map to 10 different tables/shards from an RDD.


Regards
Manjunath

From: Charles vinodh
Sent: Thursday, February 27, 2020 7:04 PM
To: manjunathshe...@live.com
Cc: user
Subject: Re: Convert each partition of RDD to Dataframe
Just split the single RDD into multiple individual RDDs using a
filter operation, and then convert each individual RDD to its
respective DataFrame.


On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:



Hello All,

In Spark I am creating custom partitions with a custom RDD;
each partition will have a different schema. Now, in the
transformation step, we need to get the schema and run some
DataFrame SQL queries per partition, because each partition's data
has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the
Iterable to a List, and converting that to a DataFrame. But the
problem is that converting the Iterable to a List will bring all the
data into memory, and it might crash the process.

Is there any known way to do this? Or is there any way to handle
custom partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance








Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hi Enrico,

In that case, how do we make effective use of all the nodes in the cluster?

Also, what is your opinion on the options below?

  *   Create 10 DataFrames sequentially in the driver program and transform/write
to HDFS one after the other
  *   Or keep the current approach mentioned in the previous mail

What will be the performance implications?

Regards
Manjunath


From: Enrico Minack 
Sent: Thursday, February 27, 2020 7:57 PM
To: user@spark.apache.org 
Subject: Re: Convert each partition of RDD to Dataframe

Hi Manjunath,

why not create 10 DataFrames, loading the different tables in the first place?

Enrico


On 27.02.20 at 14:53, Manjunath Shetty H wrote:
Hi Vinodh,

Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.

To explain the problem more:

  *   I have 10 partitions; each partition loads data from a different table
and a different SQL shard.
  *   Most of the partitions will have different schemas.
  *   Before persisting the data I want to do some column-level manipulation
using a DataFrame.

That is why I want to create 10 DataFrames (based on the partitions) that map
to 10 different tables/shards from an RDD.

Regards
Manjunath

From: Charles vinodh 
Sent: Thursday, February 27, 2020 7:04 PM
To: manjunathshe...@live.com 

Cc: user 
Subject: Re: Convert each partition of RDD to Dataframe

Just split the single RDD into multiple individual RDDs using a filter
operation, and then convert each individual RDD to its respective DataFrame.

On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:

Hello All,


In Spark I am creating custom partitions with a custom RDD; each partition
will have a different schema. Now, in the transformation step, we need to get
the schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the Iterable to
a List, and converting that to a DataFrame. But the problem is that converting
the Iterable to a List will bring all the data into memory, and it might crash
the process.

Is there any known way to do this? Or is there any way to handle custom
partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance




Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Enrico Minack

Hi Manjunath,

why not create 10 DataFrames, loading the different tables in the first
place?


Enrico
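Something like this is presumably what Enrico has in mind -- a sketch only, assuming the Spark 1.6 spark-shell (where sqlContext is already defined) and made-up shard URLs and table names; each read yields a DataFrame with that table's own schema, so no custom RDD or per-partition conversion is needed:

    import java.util.Properties

    // One (JDBC URL, table) pair per shard/table -- hypothetical values,
    // 10 of them in the scenario described in this thread.
    val shards = Seq(
      ("jdbc:sqlserver://shard1;database=db1", "dbo.table1"),
      ("jdbc:sqlserver://shard2;database=db2", "dbo.table2")
      // ...
    )

    // One DataFrame per table, each carrying that table's schema.
    val frames = shards.map { case (url, table) =>
      sqlContext.read.jdbc(url, table, new Properties())
    }

    // Column-level manipulation can then be applied per DataFrame, e.g.:
    frames.foreach(df => df.printSchema())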


On 27.02.20 at 14:53, Manjunath Shetty H wrote:

Hi Vinodh,

Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.


To explain the problem more:

  * I have 10 partitions; each partition loads data from a different
table and a different SQL shard.
  * Most of the partitions will have different schemas.
  * Before persisting the data I want to do some column-level
manipulation using a DataFrame.

That is why I want to create 10 DataFrames (based on the partitions)
that map to 10 different tables/shards from an RDD.


Regards
Manjunath

From: Charles vinodh
Sent: Thursday, February 27, 2020 7:04 PM
To: manjunathshe...@live.com
Cc: user
Subject: Re: Convert each partition of RDD to Dataframe
Just split the single RDD into multiple individual RDDs using a filter
operation, and then convert each individual RDD to its respective
DataFrame.


On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:



Hello All,

In Spark I am creating custom partitions with a custom RDD; each
partition will have a different schema. Now, in the transformation
step, we need to get the schema and run some DataFrame SQL queries
per partition, because each partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the
Iterable to a List, and converting that to a DataFrame. But the
problem is that converting the Iterable to a List will bring all the
data into memory, and it might crash the process.

Is there any known way to do this? Or is there any way to handle
custom partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance






Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H
Hi Vinodh,

Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.

To explain the problem more:

  *   I have 10 partitions; each partition loads data from a different table
and a different SQL shard.
  *   Most of the partitions will have different schemas.
  *   Before persisting the data I want to do some column-level manipulation
using a DataFrame.

That is why I want to create 10 DataFrames (based on the partitions) that map
to 10 different tables/shards from an RDD.

Regards
Manjunath

From: Charles vinodh 
Sent: Thursday, February 27, 2020 7:04 PM
To: manjunathshe...@live.com 
Cc: user 
Subject: Re: Convert each partition of RDD to Dataframe

Just split the single RDD into multiple individual RDDs using a filter
operation, and then convert each individual RDD to its respective DataFrame.

On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H
<manjunathshe...@live.com> wrote:

Hello All,


In Spark I am creating custom partitions with a custom RDD; each partition
will have a different schema. Now, in the transformation step, we need to get
the schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the Iterable to
a List, and converting that to a DataFrame. But the problem is that converting
the Iterable to a List will bring all the data into memory, and it might crash
the process.

Is there any known way to do this? Or is there any way to handle custom
partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance



Re: Convert each partition of RDD to Dataframe

2020-02-27 Thread Charles vinodh
Just split the single RDD into multiple individual RDDs using a filter
operation, and then convert each individual RDD to its respective
DataFrame.
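A rough sketch of that suggestion in Spark 1.6-style Scala (sourceRdd, tableIds, and schemaFor are hypothetical names; it assumes each record is tagged with the table/shard it came from and that the schema per table is known):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row, SQLContext}
    import org.apache.spark.sql.types.StructType

    def splitByTable(sqlContext: SQLContext,
                     sourceRdd: RDD[(String, Row)],        // (tableId, row)
                     tableIds: Seq[String],
                     schemaFor: String => StructType): Map[String, DataFrame] = {
      tableIds.map { id =>
        // One filtered RDD per table; the rows stay distributed, so nothing
        // has to be collected into an in-memory List.
        val rowsForTable: RDD[Row] = sourceRdd.filter(_._1 == id).map(_._2)
        id -> sqlContext.createDataFrame(rowsForTable, schemaFor(id))
      }.toMap
    }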

On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H 
wrote:

>
> Hello All,
>
> In Spark I am creating custom partitions with a custom RDD; each
> partition will have a different schema. Now, in the transformation step, we
> need to get the schema and run some DataFrame SQL queries per partition,
> because each partition's data has a different schema.
>
> How can I get a DataFrame per partition of an RDD?
>
> As of now I am doing foreachPartition on the RDD, converting the Iterable
> to a List, and converting that to a DataFrame. But the problem is that
> converting the Iterable to a List will bring all the data into memory, and it
> might crash the process.
>
> Is there any known way to do this? Or is there any way to handle custom
> partitions in DataFrames instead of using RDDs?
>
> I am using Spark version 1.6.2.
>
> Any pointers would be helpful. Thanks in advance
>
>


Convert each partition of RDD to Dataframe

2020-02-27 Thread Manjunath Shetty H

Hello All,


In Spark I am creating custom partitions with a custom RDD; each partition
will have a different schema. Now, in the transformation step, we need to get
the schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.

How can I get a DataFrame per partition of an RDD?

As of now I am doing foreachPartition on the RDD, converting the Iterable to
a List, and converting that to a DataFrame. But the problem is that converting
the Iterable to a List will bring all the data into memory, and it might crash
the process.

Is there any known way to do this? Or is there any way to handle custom
partitions in DataFrames instead of using RDDs?

I am using Spark version 1.6.2.

Any pointers would be helpful. Thanks in advance



Unsubscribe

2020-02-27 Thread Phillip Pienaar
On Thu, 27 Feb 2020, 9:30 pm lec ssmi,  wrote:

> Hi:
> I'm new to Structured Streaming. Because the built-in API cannot
> perform a windowed count-distinct operation, I want to use
> dropDuplicates first, and then perform the window count.
>    But there are two problems with this approach:
>    1. Because this is streaming computation, the deduplication state
> needs to be cleared in time, which requires a watermark. Assuming my
> event-time field is monotonically increasing and I set the watermark to
> 1 hour, does that mean the data at 10 o'clock will only be compared
> against the data from 9 o'clock to 10 o'clock, and the data before
> 9 o'clock will be cleared?
>    2. Because the deduplication is per window, I set the watermark
> before deduplication to the window size. But after deduplication, I need
> to call withWatermark() again to set the watermark to the real
> watermark. Will setting the watermark again take effect?
>
>  Thanks a lot!
>


dropDuplicates and watermark in structured streaming

2020-02-27 Thread lec ssmi
Hi:
I'm new to Structured Streaming. Because the built-in API cannot
perform a windowed count-distinct operation, I want to use
dropDuplicates first, and then perform the window count.
   But there are two problems with this approach:
   1. Because this is streaming computation, the deduplication state
needs to be cleared in time, which requires a watermark. Assuming my
event-time field is monotonically increasing and I set the watermark to
1 hour, does that mean the data at 10 o'clock will only be compared
against the data from 9 o'clock to 10 o'clock, and the data before
9 o'clock will be cleared?
   2. Because the deduplication is per window, I set the watermark
before deduplication to the window size. But after deduplication, I need
to call withWatermark() again to set the watermark to the real
watermark. Will setting the watermark again take effect?

 Thanks a lot!