Hi Yun,

Could you please provide more details on the data structure of your 400 MB JSON 
file? Which of the following structures does it match?


Structure 1:


{ "key": [obj1, obj2, obj3, ..., objn] }


Structure 2:
[ {obj1}, {obj2}, ..., {objn} ]

Structure 3:
{obj1}
{obj1}
...
{objn}



Thanks,


Arjun


________________________________
From: Yun Liu <y....@castsoftware.com>
Sent: Saturday, November 4, 2017 1:49 AM
To: user@drill.apache.org
Subject: RE: Drill Capacity

Hi Paul,

Thanks for your detailed explanation. First off, I have 2 issues and I wanted to 
clear them up before continuing.

Current settings: planner.memory.max_query_memory_per_node = 10 GB, heap = 12 GB, 
direct memory = 32 GB, perm gen = 1024 MB, and planner.width.max_per_node = 5
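
(For reference, the two planner options above can be confirmed from within Drill 
itself; this is a generic query against the sys.options system table, with column 
names as in Drill 1.x:)

SELECT name, num_val, string_val
FROM sys.options
WHERE name IN ('planner.memory.max_query_memory_per_node',
               'planner.width.max_per_node')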

Issue #1:
When loading a 400 MB JSON file I keep getting a DATA_READ ERROR.
Each record in the file is about 64 KB, and since it's a JSON file there are only 
4 fields per record. I'm not sure how many records the file contains, as it's too 
large to open with any tool, but I am guessing about 3k rows.
None of the recommendations provided by the various experts has worked so far.

Issue #2:
While processing a query that is a join of 2 functional .json files, I am getting 
a RESOURCE ERROR: One or more nodes ran out of memory while executing the query. 
These 2 JSON files process fine on their own, but when they are joined together 
Drill throws that error.
JSON #1 is about 11,000 KB (~11 MB), has 8 fields, and contains 74,091 rows.
JSON #2 is 752 KB, has 8 fields, and contains 4,245 rows.

Besides breaking them up into smaller files, I'm not sure what else I could do.

Thanks for the help so far!

Yun

-----Original Message-----
From: Paul Rogers [mailto:prog...@mapr.com]
Sent: Thursday, November 2, 2017 11:06 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

I’m going to give you multiple ways to understand the issue based on the 
information you’ve provided. I generally like to see the full logs to diagnose 
such problems, but we’ll start with what you’ve provided thus far.

How large is each record in your file? How many fields? How many bytes? 
(Alternatively, how big is a single input file and how many records does it 
contain?)

You mention the limit of 64K columns in CSV. This makes me wonder if you have a 
“jumbo” record. If each individual record is large, then there won’t be enough 
space in the sort to take even a single batch of records, and you’ll get the 
sv2 error that you saw.

We can guess the size, however, from the info you provided:

batchGroups.size 1
spilledBatchGroups.size 0
allocated memory 42768000
allocator limit 41943040

This says you have a batch in memory and are trying to allocate some memory 
(the “sv2”). The allocated memory number tells us that each batch size is 
probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs at 
least two batches in memory to make progress, hence the out-of-memory errors.
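
(Converting the raw numbers: the allocated 42,768,000 bytes is about 42.8 MB, 
while the allocator limit of 41,943,040 bytes is exactly 40 MiB, roughly 41.9 MB, 
so the sv2 allocation pushes the sort past its budget with only one batch in 
memory.)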

It would be nice to confirm this from the logs, but unfortunately, Drill does 
not normally log the size of each batch. As it turns out, however, the 
“managed” version that Boaz mentioned added more logging around this problem: 
it will tell you how large it thinks each batch is, and will warn if you have, 
say, a 43 MB batch but only 42 MB in which to sort.

(If you do want to use the “managed” version of the sort, I suggest you try 
Drill 1.12 when it is released as that version contains additional fixes to 
handle constrained memory.)
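
(For reference, the "managed" sort mentioned above is toggled through the 
exec.sort.disable_managed option; the option name and default vary a bit between 
releases, but in recent 1.x builds setting it to false enables the managed path:)

ALTER SESSION SET `exec.sort.disable_managed` = false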

Also, at present, the JSON record reader loads 4096 records into each batch. If 
your file has at least that many records, then we can guess each record is 
about 43 MB / 4096 =~ 10K in size. (You can confirm, as noted above, by 
dividing total file size by record count.)

We are doing work to handle such large batches, but the work is not yet 
available in a release. Unfortunately, in the meanwhile, we also don’t let you 
control the batch size. But, we can provide another solution.

Let's explain why the message you provided said that the “allocator limit” was 
42 MB. Drill does the following to allocate memory to the sort:

* Take the “max query memory per node” (default of 2 GB regardless of actual 
direct memory),
* Divide by the number of sort operators in the plan (as shown in the 
visualized query profile)
* Divide by the “planner width” which is, by default, 70% of the number of 
cores on your system.
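
Putting those three steps together, the arithmetic is roughly:

    memory per sort ≈ max_query_memory_per_node / (number of sorts × per-node width)

where the per-node width defaults to 70% of the core count.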

In your case, if you are using the default 2 GB total, but getting 41 MB per 
sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 70% =~ 
45.) Or some other combination.

We can’t reduce the number of sorts; that’s determined by your query. But, we 
can play with the other numbers.

First, we can increase the memory per query:

ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296

That is, 4 GB. This obviously means you must have at least 6 GB of direct 
memory; more is better.

And/or, we can reduce the number of fragments:

ALTER SESSION SET `planner.width.max_per_node` = <a number>

The value is a bit tricky. Drill normally creates a number of fragments equal 
to 70% of the number of CPUs on your system. Let’s say you have 32 cores. If 
so, change the max_per_node to, say, 10 or even 5. This will mean fewer sorts 
and so more memory per sort, helping compensate for the “jumbo” batches in your 
query. Pick a number based on your actual number of cores.
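
(As an illustration rather than a prescription: with the 4 GB setting above and 2 
sorts in the plan, capping a 32-core node at a width of 10 would give each sort 
roughly 4 GB / (2 × 10) ≈ 200 MB:)

ALTER SESSION SET `planner.width.max_per_node` = 10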

As an alternative, as Ted suggested, you could create a larger number of 
smaller files as this would solve the batch size problem while also getting the 
parallelization benefits that Kunal mentioned.

That is three separate possible solutions. Try them one by one or (carefully) 
together.

- Paul

>> On 11/2/17, 12:31 PM, "Yun Liu" <y....@castsoftware.com> wrote:
>>
>>    Hi Kunal and Andries,
>>
>>    Thanks for your reply. We need json in this case because Drill
>> only supports up to 65536 columns in a csv file.
