Hello Prabhakar,

For the first question, I recommend trying a performance tool such as "nmon" 
to measure the machine's CPU usage.

I think you can work out how many nodes you need by scaling the number of 
drillbits to 2 or 3 and observing how the query speed changes.
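
For example, once the new drillbits are up, you can confirm that they have all 
registered with the cluster by querying the built-in `sys.drillbits` table (a 
minimal check; run it from any connected client):

    SELECT * FROM sys.drillbits;

One row comes back per active drillbit, so the row count is your current 
cluster size.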

You can try reducing the file size to 512 MB. I can't promise that it will 
have a big impact, but it should reduce the cost of the JVM's GC.
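
If you later rewrite the data as Parquet with CTAS, you can also tune the size 
of the output files through the `store.parquet.block-size` option (a sketch; 
the value below is 512 MB expressed in bytes):

    ALTER SESSION SET `store.parquet.block-size` = 536870912;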

Parquet is good for analytical queries but not for full-row retrieval, because 
Parquet is a columnar storage format.

If your typical query looks like `select * from table1`, then Parquet is not a 
good fit.

If your typical query looks like `select max(f1), min(f2) from table1`, then 
Parquet is a great solution.

Next, you will have to weigh the one-time cost of a CTAS conversion against 
the cost of querying the JSON files directly.
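
As a rough sketch of the CTAS route (the `dfs.tmp` workspace and the input 
path below are placeholders for your own storage configuration):

    ALTER SESSION SET `store.format` = 'parquet';
    CREATE TABLE dfs.tmp.`table1_parquet` AS
    SELECT * FROM dfs.`/data/json/table1`;

As far as I know, Drill does not detect "File Create" events by itself, so you 
would trigger such a CTAS from an external scheduler or file watcher.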


- luoc

> On Jul 10, 2022, at 1:22 PM, Prabhakar Bhosale <bhosale....@gmail.com> wrote:
> 
> Dear Luoc,
> Thanks for the insights. This is just a week's data. Production will have 15 
> times more data. So in line with that, I have the following questions:
> 
> 1. Is there any template or calculator which will help me size the production 
> server (CPU, memory and IO) based on size of data?
> 2. For such a huge size of data, what are the best practices to be followed 
> to store and retrieve the data?
> 3. What should be the optimal size of the file? Currently the uncompressed 
> size of the file is 2GB. So how do we balance between the number of files and 
> file size?
> 4. Do you think the parquet format will perform better than JSON?
> 5. Is there any way in Drill to detect the "File Create" event and then 
> convert JSON to parquet using CTAS?
> 
> Thanks And Regards
> Prabhakar
> 
> On Sat, Jul 9, 2022 at 8:41 PM luoc <l...@apache.org> wrote:
> 
> Hello Prabhakar,
> 
> I will walk through my check process and hope to give you some advice:
> 
> 1. I imported the file in your attachment using the `View` button on the 
> right side of the `Profile` page.
> 
> 2. The fragment profile records that the major fragment (02-xx-xx) cost more 
> than 45 minutes.
> 
> 3. The 02-xx-xx phase ran with 3 parallel minor fragments, and the json-scan 
> (JSON Reader) cost most of the time.
> 
> 4. Each minor fragment reads nearly 0.12 billion records. Killer!
> 
> As a result, three JSON readers read a total of 338,398,798 records.
> 
> In addition, your JSON files are in GZ compression format, 297 files in 
> total, which means Drill needs a lot of CPU to decompress them.
> 
> Simply put, your hardware resources are the bottleneck and cannot query 
> large-scale records any faster, so I recommend scaling out with more nodes 
> and using a distributed cluster.
> 
> - luoc
> 
> 
>> On Jul 9, 2022, at 1:01 AM, Prabhakar Bhosale <bhosale....@gmail.com> wrote:
>> 
>> the
> 
