Such as:

df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .withWatermark("time", "real watermark")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))

Can it make count(distinct id) succeed?
lec ssmi wrote on Friday, February 28, 2020 at 1:11 PM:
Such as:

df.withWatermark("time", "window size")
  .dropDuplicates("id")
  .withWatermark("time", "real watermark")
  .groupBy(window("time", "window size", "window size"))
  .agg(count("id"))

Can it make count(distinct id) succeed?
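For reference, a minimal, untested sketch of the pipeline this question describes, in the Scala API. The socket source, the column layout, and the concrete durations are assumptions for illustration only; whether the two watermark calls interact as hoped is exactly what is being asked here.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, window}

object DedupThenWindowCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dedup-window-count").getOrCreate()

    // Assumed input: text lines of the form "<timestamp>,<id>" from a socket.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()
      .selectExpr(
        "cast(split(value, ',')[0] as timestamp) as time",
        "split(value, ',')[1] as id")

    val counts = events
      .withWatermark("time", "10 minutes")   // the "window size" watermark from the mail
      .dropDuplicates("id")                  // docs suggest also including the event-time column to bound state
      .withWatermark("time", "30 minutes")   // the "real watermark"
      .groupBy(window(col("time"), "10 minutes", "10 minutes"))
      .agg(count("id").as("distinct_ids"))

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}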
Tathagata Das wrote on Friday, February 28, 2020 at 10:25 AM:
1. Yes. All times are in event time, not processing time. So you may get 10 AM
event-time data at 11 AM processing time, but it will still be compared
against all data within the 9-10 AM event-time range.
2. Show us your code.
On Thu, Feb 27, 2020 at 2:30 AM lec ssmi wrote:
Use Structured Streaming. Its aggregation, by definition, is across batches.
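For illustration, a minimal sketch of a Structured Streaming aggregation that accumulates across micro-batches. This is not the poster's job; the socket source and column names are placeholders.

import org.apache.spark.sql.SparkSession

object RunningAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("running-aggregation").getOrCreate()

    // Placeholder source: values arriving on a socket.
    val lines = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999).load()

    // groupBy + count is a running aggregate over every batch seen so far,
    // kept in streaming state -- no accumulators needed.
    val counts = lines.groupBy("value").count()

    counts.writeStream
      .outputMode("complete")   // emit the full updated result each trigger
      .format("console")
      .start()
      .awaitTermination()
  }
}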
On Thu, Feb 27, 2020 at 3:17 PM Something Something <mailinglist...@gmail.com> wrote:
We have a Spark Streaming job that calculates some values in each batch. What
we need to do now is aggregate values across ALL batches. What is the best
strategy to do this in Spark Streaming? Should we use Spark Accumulators
for this?
There looks to be no obvious relationship between the partitions or tables, so
maybe try putting them in different jobs; that way they can run at the same
time and fully use the cluster resources.
prosp4300
On 02/27/2020 22:50, Manjunath Shetty H wrote:
Manjunath,
You can define your DataFrames in parallel in a multi-threaded driver.
Enrico
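For illustration, a rough sketch of the multi-threaded driver idea using Scala Futures; Enrico's mail contains no code, and the generated data and output paths below are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object MultiThreadedDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("multi-threaded-driver").getOrCreate()

    // One Future per logical table/partition; each thread defines its own
    // DataFrame and triggers its own job, so the jobs run concurrently.
    val work = (1 to 10).map { i =>
      Future {
        val df = spark.range(0, 1000000).withColumn("source", lit(i))  // placeholder data
        df.write.mode("overwrite").parquet(s"hdfs:///tmp/out_$i")      // placeholder path
      }
    }
    Await.result(Future.sequence(work), Duration.Inf)
    spark.stop()
  }
}

When several jobs run concurrently from one application, the fair scheduler (spark.scheduler.mode=FAIR) is often worth considering so one job does not starve the others.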
On 27 Feb 2020 at 15:50, Manjunath Shetty H wrote:
Hi Enrico,
In that case how do we make effective use of all the nodes in the cluster?
Also, what's your opinion on the options below:
* Create 10 DataFrames sequentially in the driver program and transform/write
  them to HDFS one after the other
* Or the current approach mentioned in the previous mail
What
Hi Manjunath,
why not create 10 DataFrames that load the different tables in the first
place?
Enrico
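For illustration, a rough sketch of loading each table into its own DataFrame; the JDBC URL, credentials, and table names are made-up placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}

object PerTableDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-table-dataframes").getOrCreate()

    // Placeholder table names -- the thread mentions 10 tables across SQL shards.
    val tables = Seq("table_1", "table_2", "table_3")

    // One DataFrame per table, each carrying its own schema, instead of a
    // single custom RDD mixing rows of different schemas.
    val dfs: Map[String, DataFrame] = tables.map { t =>
      t -> spark.read
        .format("jdbc")
        .option("url", "jdbc:mysql://shard-1:3306/db")   // placeholder shard URL
        .option("dbtable", t)
        .option("user", "user")
        .option("password", "secret")                    // placeholder credentials
        .load()
    }.toMap

    // Each DataFrame can then be transformed and written independently.
    dfs.foreach { case (name, df) =>
      df.write.mode("overwrite").parquet(s"hdfs:///out/$name")  // placeholder path
    }
  }
}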
On 27 Feb 2020 at 14:53, Manjunath Shetty H wrote:
Hi Vinodh,
Thanks for the quick response. I didn't get what you meant exactly; any
reference or snippet would be helpful.
To explain the problem more:
* I have 10 partitions; each partition loads the data from a different table
  and a different SQL shard.
* Most of the partitions will have
Just split the single RDD into multiple individual RDDs using a filter
operation and then convert each individual RDD to its respective DataFrame.
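For illustration, a rough sketch of that suggestion; the tag/payload layout and the two schemas are invented for the example.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SplitRddByFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("split-rdd-by-filter").getOrCreate()
    val sc = spark.sparkContext

    // Pretend each record carries a tag saying which table/schema its row uses.
    val mixed = sc.parallelize(Seq(
      ("tableA", Row(1, "alice")),
      ("tableB", Row("2020-02-27", 42)),
      ("tableA", Row(2, "bob"))))

    val schemaA = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
    val schemaB = StructType(Seq(StructField("day", StringType), StructField("cnt", IntegerType)))

    // One filter per logical table, then createDataFrame with that table's schema.
    val dfA = spark.createDataFrame(mixed.filter(_._1 == "tableA").map(_._2), schemaA)
    val dfB = spark.createDataFrame(mixed.filter(_._1 == "tableB").map(_._2), schemaB)

    dfA.show()
    dfB.show()
  }
}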
On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H wrote:
Hello All,
In Spark I am creating custom partitions with a custom RDD; each partition
will have a different schema. Now, in the transformation step, we need to get
the schema and run some DataFrame SQL queries per partition, because each
partition's data has a different schema.
How to get the
On Thu, 27 Feb 2020, 9:30 pm lec ssmi wrote:
Hi:
I'm new to Structured Streaming. Because the built-in API cannot perform a
count-distinct operation within a window, I want to use dropDuplicates first
and then perform the windowed count.
But in the process of using it, there are two problems:
1. Because it is streaming computing,