There are expected to be about 5 million UUIDs in a day. I need to use this
field to drop duplicate records and count the number of unique visits. If I
simply count the records without using dropDuplicates, it uses less than 1g
of memory. I believe most of the memory is occupied by the state store for
keeping the UUIDs that have already been seen.
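For reference, a minimal sketch of how the deduplication can be bounded with a
watermark so the state store does not grow indefinitely; the column names
(uuid, eventTime) and the 24-hour bound are assumptions, not the query actually
being run:

import org.apache.spark.sql.functions.{col, window}

// events: a streaming DataFrame with a string column "uuid" and a timestamp
// column "eventTime" parsed from the input; both names are assumed here.
val uniqueVisits = events
  .withWatermark("eventTime", "24 hours")       // lets Spark evict dedup state older than a day
  .dropDuplicates("uuid", "eventTime")          // dedup key plus the watermarked event-time column
  .groupBy(window(col("eventTime"), "1 day"))   // one bucket per day
  .count()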
How many UUIDs do you expect to have in a day? That is likely where all
the memory is being used. Does it work without that?
On Tue, Sep 12, 2017 at 8:42 PM, 张万新 wrote:
*Yes, my code is shown below (I also posted my code in another mail).*
/**
 * input
 */
val logs = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVER)
  .option("subscribe", TOPIC)
  .option("startingOffsets", "latest")   // start reading from the latest offsets
  .load()
/**
*
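A rough sketch of the step that typically follows the Kafka read, decoding the
binary value column into typed fields before deduplicating; the JSON schema and
field names below are assumptions for illustration, not the code from the
original mail:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

// Assumed payload layout: a JSON object with a uuid and an event timestamp.
val schema = new StructType()
  .add("uuid", StringType)
  .add("eventTime", TimestampType)

val events = logs
  .selectExpr("CAST(value AS STRING) AS json")          // Kafka value arrives as binary
  .select(from_json(col("json"), schema).as("data"))    // parse into a struct
  .select("data.*")                                     // flatten to uuid, eventTime columns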
Can you show the full query you are running?
On Tue, Sep 12, 2017 at 10:11 AM, 张万新 wrote:
Hi,
I'm using Structured Streaming to count unique visits to our website. I run
Spark on YARN with 4 executor instances, and I have tried from 2 cores * 5g
of memory up to 4 cores * 10g of memory per executor, but there are frequent
full GCs, and once the count rises to more than about 4.5 million the
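As a point of reference, the larger of the two resource configurations
described above can be expressed as Spark conf roughly like this (in practice
these values are usually passed to spark-submit on YARN; the application name
is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unique-visits")                    // placeholder name
  .config("spark.executor.instances", "4")     // 4 executor instances
  .config("spark.executor.cores", "4")         // 4 cores per executor
  .config("spark.executor.memory", "10g")      // 10g heap per executor
  .getOrCreate()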