Use case:

250 million records per day, partitioned by date.

Out of the 250 million records, 20 to 30% are updates to previous days' 
partitions, spanning 100 days.

Currently we rebuild the entire 100 days of partitions.

My goal is to build a similar 25-billion-row table and do an upsert with 250 
million records that span 100 days.

My key:

Concat(2019-01-01 00:00:15, 123456)

The first column value increases by 15 minutes, giving 96 records per day, 
while the other value remains the same.

That is, 96 records for a given meter ID per day.
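
For reference, a minimal Spark sketch of how such a key can be built; the 
DataFrame name (readings) and column names (reading_ts, meter_id) are 
placeholders of mine:

    // Build the composite record key by concatenating the 15-minute
    // reading timestamp with the meter ID (names are placeholders).
    import org.apache.spark.sql.functions.{col, concat_ws}

    val keyed = readings.withColumn(
      "record_key",
      concat_ws("_", col("reading_ts"), col("meter_id"))
    )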


This use case involves smart meter readings from household and business 
users.

I have created 2.5 million meter IDs and extrapolated by joining each with 96 
intervals per day to create close to 250 million records per day.
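
A rough sketch of that extrapolation in Spark; meterIds is an assumed 
DataFrame with one meter_id per row (2.5M meters x 96 intervals ≈ 240 million 
rows):

    import org.apache.spark.sql.functions.expr

    // 96 fifteen-minute reading slots for one day, starting at 00:00:15
    val intervals = spark.range(96).withColumn(
      "reading_ts",
      expr("cast(unix_timestamp('2019-01-01 00:00:15') + id * 15 * 60 as timestamp)")
    )

    // Cross-join every meter with every slot: ~2.5M x 96 ≈ 240M rows/day
    val oneDay = meterIds.crossJoin(intervals)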

Let me know if you have any suggestions on the key setup. Also, is there a 
way to set multiple columns as the key when using the data source write 
option?
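
For context, here is a hedged sketch of what I am imagining; the option names 
are from the Hudi docs, but the key generator class/package may differ by 
version, and the column names and path are placeholders:

    // Upsert with a composite record key (timestamp + meter ID);
    // reading_date stands in for the date-partition column.
    keyed.write
      .format("org.apache.hudi")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "reading_ts,meter_id")
      .option("hoodie.datasource.write.partitionpath.field", "reading_date")
      .option("hoodie.datasource.write.keygenerator.class",
              "org.apache.hudi.keygen.ComplexKeyGenerator")
      .option("hoodie.datasource.write.precombine.field", "reading_ts")
      .mode("append")
      .save("/path/to/table")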






> On Jul 19, 2019, at 7:51 AM, Vinoth Chandar <vin...@apache.org> wrote:
> 
> sg!
> 
> As with any database-like system, performance depends on key design
> and configuration.
> Happy to share more tips on tuning if you can give more details on:
> 
> - the use case: what operation are you using?
> - the % of the 25 billion records updated in each run (e.g., if you are
> upserting the entire dataset, it will of course be slower than just
> bulk_inserting)
> - whether you can make the key prefixed by some increasing/ordered value,
> like a timestamp
> 
> A lot of this is also covered in the two links I sent.
> 
> 
> On Thu, Jul 18, 2019 at 10:37 PM Amarnath Venkataswamy <
> amarnath.venkatasw...@gmail.com> wrote:
> 
>> After I set the shuffle parallelism, I was able to complete the job
>> without failure, but there is one more challenge: reducing the GC time.
>> Currently it takes 20 to 30% of the overall run time per task.
>> 
>> I plan to test GC with extra Java options by tomorrow.
>> 
>> My goal is to update 25 billion rows spanning 100 days of partitions,
>> with 240 million records (2 GB) in each partition, where 50% of the
>> updates hit the previous day's partition and the rest spread across the
>> remaining 99 days.
>> 
>> Currently it takes 30 to 40 minutes just to write into one partition;
>> of this, 20 to 30% of the time goes to GC.
>> 
>> If we can do this in less than one to two hours (incremental update of
>> 240 million records daily) after tuning the memory and other parameters,
>> I would be very happy.
>> 
>> 
>> 
>> 
>> On Fri, Jul 19, 2019 at 12:19 AM Amarnath Venkataswamy <
>> amarnath.venkatasw...@gmail.com> wrote:
>> 
>>> Yes, I am looking for exactly that.
>>> 
>>> On Thu, Jul 18, 2019 at 9:20 PM Vinoth Chandar <vin...@apache.org>
>>> wrote:
>>> 
>>>> No real reason. If you notice, a sample configuration is presented
>>>> under the “GC tuning” section, which asks the user to add it to
>>>> extraJavaOptions. It is kept separate because it is for CMS, and
>>>> someone else may want to use G1.
>>>> 
>>>> On Thu, Jul 18, 2019 at 5:26 PM Gary Li <yanjia.gary...@gmail.com>
>>>> wrote:
>>>> 
>>>>> One related question. The GC tuning part says [must] use the G1/CMS
>>>>> collector, but the recommended production config doesn’t specify any
>>>>> GC. Is there a reason behind this?
>>>>> 
>>>>> On Thu, Jul 18, 2019 at 9:37 AM Vinoth Chandar <vin...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
>>>>>> https://hudi.apache.org/performance.html
>>>>>> are good resources for what you need.
>>>>>> 
>>>>>> On Thu, Jul 18, 2019 at 7:37 AM Amarnath Venkataswamy <
>>>>>> amarnath.venkatasw...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> Can any one of you share the Spark configuration used at Uber? I
>>>>>>> didn't save that link to my favorites.
>>>>>>> 
>>>>>>> I am currently doing some performance tests against 240 million
>>>>>>> records, and the job is failing for one memory-related reason or
>>>>>>> another.
>>>>>>> 
>>>>>>> Regards
>>>>>>> Amarnath
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
