Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread 张万新
There are expected to be about 5 million UUIDs in a day. I need to use this field to drop duplicate records and count the number. If I simply count without using dropDuplicates, it occupies less than 1 GB of memory. I believe most of the memory is occupied by the state store for keeping the
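
A rough back-of-envelope estimate of what that dedup state could weigh. The per-entry overhead and the number of retained state versions below are assumptions for illustration, not figures from this thread:

// A minimal sketch, assuming roughly 200 bytes of heap per state-store entry
// (UUID string + UnsafeRow + hash-map overhead) and that the state store
// keeps several recent versions of its map on the heap. Versions share
// objects, so the multiplied figure is an upper bound, not a measurement.
object DedupStateEstimate {
  def main(args: Array[String]): Unit = {
    val uuidsPerDay      = 5000000L   // figure quoted in the thread
    val bytesPerEntry    = 200L       // assumption
    val retainedVersions = 10         // assumption

    val perVersionGb = uuidsPerDay * bytesPerEntry / math.pow(1024, 3)
    println(f"single state version ~ $perVersionGb%.2f GB")
    println(f"upper bound with $retainedVersions retained versions ~ ${perVersionGb * retainedVersions}%.2f GB")
  }
}

At this scale a single state version already approaches 1 GB, which lines up with the observation that the plain count (without dropDuplicates) stays under 1 GB while the deduplicating query does not.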

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-14 Thread Michael Armbrust
How many UUIDs do you expect to have in a day? That is likely where all the memory is being used. Does it work without that? On Tue, Sep 12, 2017 at 8:42 PM, 张万新 wrote: > *Yes, my code is shown below (I also posted my code in another mail)* > /** > * input > */ >

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-12 Thread 张万新
*Yes, my code is shown below (I also posted my code in another mail)*

/**
 * input
 */
val logs = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVER)
  .option("subscribe", TOPIC)
  .option("startingOffsets", "latest")
  .load()
/** *
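
The snippet is cut off by the archive. A hypothetical continuation in the same style — the JSON schema, the uuid and eventTime field names, and the console sink are assumptions, not the original code — could look like:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Assumed: each Kafka value is a JSON log line carrying a visitor uuid and an event timestamp.
val logSchema = new StructType()
  .add("uuid", "string")
  .add("eventTime", "timestamp")

val uv = logs
  .select(from_json(col("value").cast("string"), logSchema).as("log"))
  .select("log.*")
  // The watermark plus the event-time column in the dedup key lets Spark
  // eventually evict old entries from the state store; deduplicating on
  // uuid alone keeps the state forever.
  .withWatermark("eventTime", "1 day")
  .dropDuplicates("uuid", "eventTime")
  .groupBy(window(col("eventTime"), "1 day"))
  .count()

val query = uv.writeStream
  .outputMode("update")
  .format("console")
  .start()

Note that even with watermark-based eviction, a day-long watermark still keeps a full day's worth of UUIDs (about 5 million entries here) in the state store at any one time.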

Re: [SS] Any way to optimize memory consumption of SS?

2017-09-12 Thread Michael Armbrust
Can you show the full query you are running? On Tue, Sep 12, 2017 at 10:11 AM, 张万新 wrote: > Hi, > > I'm using Structured Streaming to count unique visits to our website. I > run Spark in YARN mode with 4 executor instances and from 2 cores * 5g > memory to 4 cores * 10g

[SS] Any way to optimize memory consumption of SS?

2017-09-12 Thread 张万新
Hi, I'm using Structured Streaming to count unique visits to our website. I run Spark in YARN mode with 4 executor instances, and have tried from 2 cores * 5g memory to 4 cores * 10g memory for each executor, but there are frequent full GCs, and once the count rises to more than about 4.5 million the
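
For reference, the larger of the two resource configurations described above could be expressed roughly like this when building the session (the app name is a placeholder, and on YARN these values are normally passed to spark-submit instead of being set in code):

import org.apache.spark.sql.SparkSession

// A minimal sketch of the larger configuration mentioned above:
// 4 executors, 4 cores and 10g memory each, running on YARN.
val spark = SparkSession.builder()
  .appName("uv-count")                      // placeholder name
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "10g")
  .getOrCreate()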