Hi Paul,

I've already done this: alter session set `store.json.all_text_mode`=true;

I don't believe this is an accurate error message, because when I reduce the 
number of rows in the Compliance.json file by half (while all fields and 
queries stay the same), everything works with no issues. I've tried the same 
with another dataset (same format, same fields, but a smaller pool of rows), 
and there was no issue there either. So I am still convinced it's a size issue.

Please let me know what else I could provide to troubleshoot this.

Thanks for all your help so far.

Yun

-----Original Message-----
From: Paul Rogers [mailto:prog...@mapr.com] 
Sent: Tuesday, November 7, 2017 7:55 PM
To: user@drill.apache.org
Subject: Re: Drill Capacity

Hi Yun,

I looked at the sqlline.log file you posted. (Thanks much for doing so.) Here’s 
what I noted:

The log shows a failed query, but this one is different than the one we 
discussed earlier. Query:

SELECT * FROM `dfs`.`Inputs`.`./Compliance.json` LIMIT 100

Since this is a LIMIT query, with no ORDER BY, we got a different plan than the 
query we discussed earlier. The earlier one had a stack trace that suggested 
the query had an ORDER BY that used the legacy (non-managed) version of the 
sort.

Although this query is different, it did in fact fail, just for a different 
reason.

JsonReader - User Error Occurred: You tried to write a VarChar type when you 
are using a ValueWriter of type NullableBitWriterImpl. (You tried to write a 
VarChar type when you are using a ValueWriter of type NullableBitWriterImpl.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried to 
write a VarChar type when you are using a ValueWriter of type 
NullableBitWriterImpl.

This is saying that there is a schema change, but one within a single batch. 
Variations of this bug can occur if you have several values of the form 10, 20, 
30 but then, later, a value like “Hi”: we create a numeric vector, then try to 
write a string into it.

Here, it appears you have values that are boolean, followed by a string:

… “a”: true …
… “a”: false …
… “a”: “a string!”

The JSON writer sees the boolean and locates a bit vector. Then, it sees the 
string, tries to write that into a bit vector, and gets the error displayed 
above.

You can work around this by using “all text mode”, which reads all fields as 
text. Or, you can clean up your data.
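
For example, with the made-up field “a” from above, you could turn on all text 
mode for the session and re-run the query; every value is then read as text 
(VarChar), which avoids the vector type conflict. This is just a sketch, not 
verified against your actual file:

alter session set `store.json.all_text_mode`=true;
SELECT t.a FROM `dfs`.`Inputs`.`./Compliance.json` t LIMIT 100;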

Once this file works, perhaps you can try another run to recreate the original 
memory issue with the sort so we can track that one down.

Thanks,

- Paul

> On Nov 7, 2017, at 1:49 PM, Kunal Khatua <kkha...@mapr.com> wrote:
> 
> Hi Yun
> 
> The new release might not address this issue, as we don't have a repro for 
> it. Any chance you can provide a sample anonymized data set? The JSON data 
> doesn't have to be meaningful, but we need to be able to reproduce the 
> problem to ensure that we are indeed addressing the issue you faced. 
> 
> Thanks
> ~K
> -----Original Message-----
> From: Yun Liu [mailto:y....@castsoftware.com]
> Sent: Tuesday, November 07, 2017 7:17 AM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
> 
> Hi Arjun,
> 
> That was already altered and schema was not changed. I've reduced the json 
> size and everything works fine. I believe it was giving a false error. Seems 
> that's the only way to bypass this error until your new release comes out?
> 
> Thanks,
> Yun
> 
> -----Original Message-----
> From: Arjun kr [mailto:arjun...@outlook.com]
> Sent: Monday, November 6, 2017 7:39 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> 
> Looking at the log shared, you seem to be running the query below.
> 
> 
> 2017-11-06 15:09:37,383 [25ff3e7e-39ef-a175-93e7-e4e62b284add:foreman] 
> INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 25ff3e7e-39ef-a175-93e7-e4e62b284add: SELECT * FROM 
> `dfs`.`Inputs`.`./Compliance.json` LIMIT 100
> 
> 
> Below is the exception from the query failure.
> 
> 
> 2017-11-06 15:09:45,852 
> [25ff3e7e-39ef-a175-93e7-e4e62b284add:frag:0:0] INFO  
> o.a.d.e.vector.complex.fn.JsonReader - User Error Occurred: You tried 
> to write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl. (You tried to write a VarChar type when you are 
> using a ValueWriter of type NullableBitWriterImpl.)
> org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: You tried 
> to write a VarChar type when you are using a ValueWriter of type 
> NullableBitWriterImpl.
> 
> It could be related to a schema change. Can you try setting the session 
> parameter below, if you haven't tried it already?
> 
> 
> alter session set `store.json.all_text_mode`=true;
> 
> 
> 
> Thanks,
> 
> Arjun
> ________________________________
> From: Yun Liu <y....@castsoftware.com>
> Sent: Tuesday, November 7, 2017 1:46 AM
> To: user@drill.apache.org
> Subject: RE: Drill Capacity
> 
> Hi Arjun and Paul,
> 
> Yep, those are turned on and I am reading from sqlline.log. The only max 
> allocation number I am seeing is 10,000,000,000. I posted the logs in my 
> Dropbox:
> https://www.dropbox.com/sh/5akxrzm078jsabw/AADuD92swH6c9jwijTjkkac_a?dl=0
> 
> Thank you!
> Yun
> 
> -----Original Message-----
> From: Arjun kr [mailto:arjun...@outlook.com]
> Sent: Monday, November 6, 2017 1:20 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> 
> Are you running in Drill embedded mode? If so, the logs will be available in 
> sqlline.log and drillbit.log will not be populated. You can enable DEBUG 
> logging in logback.xml, run the query, and share the log file as Paul suggested.
> 
> 
> Edit $DRILL_HOME/conf/logback.xml to enable DEBUG level logging.
> 
> 
> <logger name="org.apache.drill" additivity="false">
>    <level value="debug" />
>    <appender-ref ref="FILE" />
>  </logger>
> 
> 
> Thanks,
> 
> 
> Arjun
> 
> ________________________________
> From: Paul Rogers <prog...@mapr.com>
> Sent: Monday, November 6, 2017 10:56 PM
> To: user@drill.apache.org
> Subject: Re: Drill Capacity
> 
> Hi Yun,
> 
> Sorry, it is a bit confusing. The log will contain two kinds of JSON. One is 
> the query profile, which is what you found. The other is the physical plan 
> used to run the query. It is the physical plan you want to find; that is the 
> one that has the max allocation.
> 
> If you can post your logs somewhere, I'll d/l them and take a look.
> 
> - Paul
> 
>> On Nov 6, 2017, at 7:27 AM, Yun Liu <y....@castsoftware.com> wrote:
>> 
>> Hi Paul,
>> 
>> I am using Drill v 1.11.0 so I am only seeing sqlline.log and 
>> sqlline_queries.log. Hopefully that's the same thing.
>> 
>> I am following your instructions and I am not seeing any maxAllocation other 
>> than 10,000,000,000; no other (smaller) number appears. The query profile 
>> reads the following:
>> 
>> {"queryId":"25ff81fc-3b7a-a840-b557-d2194cc6819a","schema":"","queryT
>> e xt":"SELECT * FROM `dfs`.`Inputs`.`./ Compliance.json` LIMIT 
>> 100","start":1509981699406,"finish":1509981707544,"outcome":"FAILED","
>> username":"","remoteAddress":"localhost"}
>> 
>> Is this what you're looking for?
>> 
>> Thanks,
>> Yun
>> 
>> -----Original Message-----
>> From: Paul Rogers [mailto:prog...@mapr.com]
>> Sent: Friday, November 3, 2017 6:45 PM
>> To: user@drill.apache.org
>> Subject: Re: Drill Capacity
>> 
>> Thanks for the info. Clearly you are way ahead of me.
>> 
>> In issue 1, although you have only four (top level) fields, your example 
>> shows that you have many nested fields. It is the total field count (across 
>> all maps) that drives total width. And, it is the total amount of data that 
>> drives memory consumption.
>> 
>> You mentioned each record is 64KB and 3K rows. That suggests a total size of 
>> around 200MB. But, you mention the total file size is 400MB. So, either the 
>> rows are twice as large, or there are twice as many. If you have 3K rows of 
>> 128KB each, then each batch of data is 400MB, which is pretty large.
>> 
>> If your records are 64K in size, and we read 4K records per batch, then the 
>> total size is 256MB, which is also large.
>> 
>> So, we are dealing with jumbo records and you really want the "batch size 
>> control" feature that we are working on, but have not yet shipped.
>> 
>> Let's work out the math. How many sorts in your query? What other operators 
>> does the query include? Let's assume a single sort.
>> 
>> Max query memory is 10 GB. 10 GB / 1 sort / max width of 5 = 2 GB per sort. 
>> Since your batches are ~400 MB, things should work.
>> 
>> Since things don't work, I suspect that we're missing something.
>> (Note that the memory size we just calculated does not match the 
>> numbers shown in an earlier post in which the sort got just ~40 MB of
>> memory...)
>> 
>> Try this:
>> 
>> * With your current settings, enable debug-level logging. Run your query.
>> 
>> * Open the Drillbit log. Look for the JSON version of the query plan (there 
>> will be two). One will tell you how much memory is given to the sort:
>> 
>> maxAllocation: (some number)
>> 
>> * Ignore the one that says 10,000,000,000; find the one with a smaller 
>> number. What is that number?
>> 
>> * Then, look in the query profile for your query. Look at the peak memory 
>> for your JSON reader scan operator. The peak memory more-or-less reflects 
>> the batch size. What is that number?
>> 
>> With those, we can tell if the settings and sizes we think we are using are, 
>> in fact, correct.
>> 
>> Thanks,
>> 
>> - Paul
>> 
>>> On Nov 3, 2017, at 1:19 PM, Yun Liu <y....@castsoftware.com> wrote:
>>> 
>>> Hi Paul,
>>> 
>>> Thanks for your detailed explanation. First off, I have 2 issues and I 
>>> wanted to clear them up before continuing.
>>> 
>>> Current setting: planner.memory.max_query_memory_per_node = 10GB, 
>>> HEAP = 12G, Direct memory = 32G, Perm 1024M, and 
>>> planner.width.max_per_node = 5
>>> 
>>> Issue #1:
>>> When loading a 400MB json file I keep getting a DATA_READ ERROR.
>>> Each record in the file is about 64KB. Since it's a json file, there are 
>>> only 4 fields per record. Not sure how many records this file contains 
>>> as it's too large to open with any tool, but I am guessing about 3k rows.
>>> With all the recommendations provided by various experts, nothing has 
>>> worked.
>>> 
>>> Issue #2:
>>> While processing a query which is a join of 2 functional .json files, I am 
>>> getting a RESOURCE ERROR: One or more nodes ran out of memory while 
>>> executing the query. These 2 json files alone process fine, but when joined 
>>> together, Drill throws me that error.
>>> Json#1 is 11k KB, has 8 fields with 74091 rows
>>> Json#2 is 752kb, has 8 fields with 4245 rows
>>> 
>>> Besides breaking them up to smaller files, not sure what else I could do.
>>> 
>>> Thanks for the help so far!
>>> 
>>> Yun
>>> 
>>> -----Original Message-----
>>> From: Paul Rogers [mailto:prog...@mapr.com]
>>> Sent: Thursday, November 2, 2017 11:06 PM
>>> To: user@drill.apache.org
>>> Subject: Re: Drill Capacity
>>> 
>>> Hi Yun,
>>> 
>>> I'm going to give you multiple ways to understand the issue based on the 
>>> information you've provided. I generally like to see the full logs to 
>>> diagnose such problems, but we'll start with what you've provided thus far.
>>> 
>>> How large is each record in your file? How many fields? How many 
>>> bytes? (Alternatively, how big is a single input file and how many 
>>> records does it contain?)
>>> 
>>> You mention the limit of 64K columns in CSV. This makes me wonder if you 
>>> have a "jumbo" record. If each individual record is large, then there won't 
>>> be enough space in the sort to take even a single batch of records, and 
>>> you'll get the sv2 error that you saw.
>>> 
>>> We can guess the size, however, from the info you provided:
>>> 
>>> batchGroups.size 1
>>> spilledBatchGroups.size 0
>>> allocated memory 42768000
>>> allocator limit 41943040
>>> 
>>> This says you have a batch in memory and are trying to allocate some memory 
>>> (the "sv2"). The allocated memory number tells us that each batch size is 
>>> probably ~43 MB. But, the sort only has 42 MB to play with. The sort needs 
>>> at least two batches in memory to make progress, hence the out-of-memory 
>>> errors.
>>> 
>>> It would be nice to confirm this from the logs, but unfortunately, Drill 
>>> does not normally log the size of each batch. As it turns out, however, the 
>>> "managed" version that Boaz mentioned added more logging around this 
>>> problem: it will tell you how large it thinks each batch is, and will warn 
>>> if you have, say, a 43 MB batch but only 42 MB in which to sort.
>>> 
>>> (If you do want to use the "managed" version of the sort, I suggest 
>>> you try Drill 1.12 when it is released as that version contains 
>>> additional fixes to handle constrained memory.)
>>> 
>>> Also, at present, The JSON record reader loads 4096 records into 
>>> each batch. If your file has at least that many records, then we can 
>>> guess each record is about 43 MB / 4096 =~ 10K in size. (You can 
>>> confirm, as noted above, by dividing total file size by record 
>>> count.)
>>> 
>>> We are doing work to handle such large batches, but the work is not yet 
>>> available in a release. Unfortunately, in the meanwhile, we also don't let 
>>> you control the batch size. But, we can provide another solution.
>>> 
>>> Let's explain why the message you provided said that the "allocator limit" 
>>> was 42 MB. Drill does the following to allocate memory to the sort:
>>> 
>>> * Take the "max query memory per node" (default of 2 GB regardless 
>>> of actual direct memory),
>>> * Divide by the number of sort operators in the plan (as shown in 
>>> the visualized query profile)
>>> * Divide by the "planner width" which is, by default, 70% of the number of 
>>> cores on your system.
>>> 
>>> In your case, if you are using the default 2 GB total, but getting 41 MB 
>>> per sort, the divisor is 50. Maybe you have 2 sorts and 32 cores? (2 * 32 * 
>>> 70% =~ 45.) Or some other combination.
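>>> 
>>> Putting those three steps into one rough formula (the 2 sorts and ~22 
>>> fragments below are just the guess above, not confirmed numbers):
>>> 
>>>   memory per sort = max_query_memory_per_node / (sorts * width per node)
>>>                  =~ 2 GB / (2 * 22) =~ 45 MB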
>>> 
>>> We can't reduce the number of sorts; that's determined by your query. But, 
>>> we can play with the other numbers.
>>> 
>>> First, we can increase the memory per query:
>>> 
>>> ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 4294967296
>>> 
>>> That is, 4 GB. This obviously means you must have at least 6 GB of direct 
>>> memory; more is better.
>>> 
>>> And/or, we can reduce the number of fragments:
>>> 
>>> ALTER SESSION SET `planner.width.max_per_node` = <a number>
>>> 
>>> The value is a bit tricky. Drill normally creates a number of fragments 
>>> equal to 70% of the number of CPUs on your system. Let's say you have 32 
>>> cores. If so, change the max_per_node to, say, 10 or even 5. This will mean 
>>> fewer sorts and so more memory per sort, helping compensate for the "jumbo" 
>>> batches in your query. Pick a number based on your actual number of cores.
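>>> 
>>> For instance, on the hypothetical 32-core box above you might try (again, 
>>> only a sketch; pick the value that fits your machine):
>>> 
>>> ALTER SESSION SET `planner.width.max_per_node` = 10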
>>> 
>>> As an alternative, as Ted suggested, you could create a larger number of 
>>> smaller files as this would solve the batch size problem while also getting 
>>> the parallelization benefits that Kunal mentioned.
>>> 
>>> That is three separate possible solutions. Try them one by one or 
>>> (carefully) together.
>>> 
>>> - Paul
>>> 
>>>>> On 11/2/17, 12:31 PM, "Yun Liu" <y....@castsoftware.com> wrote:
>>>>> 
>>>>> Hi Kunal and Andries,
>>>>> 
>>>>> Thanks for your reply. We need json in this case because Drill 
>>>>> only supports up to 65536 columns in a csv file.
>> 
> 
