I’m actually one of the contributors to the forthcoming O’Reilly book on Drill 
(along with Ted and Ellen), and this is specific functionality I’m planning to 
write a chapter about (not the buffers themselves, but how to get Drill to 
ingest other file formats).



 
> On Jan 27, 2017, at 11:50, Paul Rogers <prog...@mapr.com> wrote:
> 
> Hi Charles,
> 
> Congrats! Unfortunately, no, there is no documentation. Drill seems to be of 
> the “code speaks for itself” persuasion. I try to document the bits I’ve had 
> to learn on my GitHub wiki, but (until now) I’ve not looked at this 
> particular area.
> 
> IMHO, now that the plugins basically work, the API could use a good scrubbing 
> to make it simpler, easier to document, and easier to use. As it is, you have 
> to be an expert on Drill internals to understand all the little knick-knacks 
> that have to be in your code to make various Drill subsystems happy.
> 
> That said, perhaps you can use your own GitHub wiki to document what you’ve 
> learned, so that we capture it for the next plugin developer.
> 
> Thanks,
> 
> - Paul
> 
>> On Jan 27, 2017, at 8:42 AM, Charles Givre <cgi...@gmail.com> wrote:
>> 
>> Hi Paul,
>> VICTORY!!  I just set the buffer size to 4096 and it worked perfectly 
>> without truncating my data! 
>> Is this documented anywhere?  I’ve been trying to really wrap my head around 
>> the mechanics of how Drill reads data and how the format plugins work and 
>> really haven’t found much.  I’ve hacked together a few other plugins like 
>> this—which work—but if I could find some docs, that would be great.
>> Thanks,
>> — Charles
>> 
>> 
>> 
>>> On Jan 27, 2017, at 02:11, Paul Rogers <prog...@mapr.com> wrote:
>>> 
>>> Looks like I gave you advice that was a bit off. The method you want is one 
>>> of the following:
>>> 
>>>          this.buffer = fragmentContext.getManagedBuffer();
>>> 
>>> The above allocates a 256-byte buffer. You can instead allocate a larger 
>>> one up front:
>>> 
>>>          this.buffer = fragmentContext.getManagedBuffer(4096);
>>> 
>>> Or, to reallocate:
>>> 
>>>         buffer = fragmentContext.replace(buffer, 8192);
>>> 
>>> Again, I’ve not used these methods myself, but they seem like they might do 
>>> the trick.
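>>> 
>>> Putting those together, a rough sketch of how a reader might manage its 
>>> buffer (untested; the bytes variable and surrounding setup are assumed):
>>> 
>>>         // In setup(): allocate a managed buffer once. Drill’s allocator
>>>         // frees managed buffers with the fragment, so the reader never
>>>         // frees it directly.
>>>         this.buffer = fragmentContext.getManagedBuffer(4096);
>>> 
>>>         // Per value: grow the buffer rather than truncate. replace()
>>>         // swaps in a larger managed buffer and disposes of the old one.
>>>         if (bytes.length > buffer.capacity()) {
>>>             buffer = fragmentContext.replace(buffer, bytes.length);
>>>         }
>>>         buffer.setBytes(0, bytes, 0, bytes.length);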
>>> 
>>> - Paul
>>> 
>>>> On Jan 26, 2017, at 9:51 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>> 
>>>> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  
>>>> I tried your changes and now I’m getting this error:
>>>> 
>>>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>>>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>>>> 
>>>> Fragment 0:0
>>>> 
>>>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
>>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jan 26, 2017, at 23:08, Paul Rogers <prog...@mapr.com> wrote:
>>>>> 
>>>>> Hi Charles,
>>>>> 
>>>>> Very cool plugin!
>>>>> 
>>>>> My knowledge in this area is a bit sketchy… That said, the problem 
>>>>> appears to be that the code does not extend the DrillBuf to ensure it has 
>>>>> sufficient capacity. Try calling reallocIfNeeded, something like this:
>>>>> 
>>>>>   // reallocIfNeeded() returns the (possibly new) buffer; keep the result.
>>>>>   this.buffer = this.buffer.reallocIfNeeded(stringLength);
>>>>>   this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>> 
>>>>> Then, comment out the 256-byte truncation hack and see if it works.
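>>>>> 
>>>>> In context, the inner loop body would then look something like this (an 
>>>>> untested sketch reusing your variable names, with the truncation removed):
>>>>> 
>>>>>   byte[] bytes = fieldValue.getBytes("UTF-8");
>>>>>   int stringLength = bytes.length;
>>>>>   // Grow the buffer to fit the full value instead of clamping to 256.
>>>>>   this.buffer = this.buffer.reallocIfNeeded(stringLength);
>>>>>   this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>   map.varChar(fieldName).writeVarChar(0, stringLength, buffer);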
>>>>> 
>>>>> To avoid memory fragmentation, maybe change your loop along these lines:
>>>>> 
>>>>>        int maxRecords = MAX_RECORDS_PER_BATCH;
>>>>>        int maxWidth = 256;
>>>>>        while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
>>>>>           …
>>>>>           if (stringLength > maxWidth) {
>>>>>              maxWidth = stringLength;
>>>>>              // Cap the batch so the widest vector stays near 16 MB.
>>>>>              maxRecords = 16 * 1024 * 1024 / maxWidth;
>>>>>           }
>>>>>        }
>>>>> 
>>>>> The above is not perfect: the last record added might be much larger than 
>>>>> the others, causing the corresponding vector to grow beyond 16 MB. But 
>>>>> the occasional oversized vector should be OK.
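>>>>> 
>>>>> (You would probably also want a floor, e.g. maxRecords = Math.max(1, 
>>>>> 16 * 1024 * 1024 / maxWidth), so the loop still makes progress if a 
>>>>> single value ever exceeds 16 MB.)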
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> - Paul
>>>>> 
>>>>> On Jan 26, 2017, at 5:31 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>>> 
>>>>> Hi Paul,
>>>>> Would you mind taking a look at my code? I’m wondering if I’m doing this 
>>>>> correctly. Just for context, I’m working on a generic log file reader for 
>>>>> Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered 
>>>>> some errors when working with fields that were > 256 characters long. It 
>>>>> isn’t a storage plugin, but it extends the EasyFormatPlugin.
>>>>> 
>>>>> I added some code to truncate the strings to 256 chars, and it worked. 
>>>>> Before that, it was throwing the errors shown below:
>>>>> 
>>>>> 
>>>>> 
>>>>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>>>>> 
>>>>> Fragment 0:0
>>>>> 
>>>>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
>>>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>>>> 
>>>>> 
>>>>> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, 
>>>>> how do I set the size of each row batch?
>>>>> Thank you for your help.
>>>>> — C
>>>>> 
>>>>> 
>>>>> if (m.find()) {
>>>>>     for (int i = 1; i <= m.groupCount(); i++) {
>>>>>         //TODO Add option for date fields
>>>>>         String fieldName = fieldNames.get(i - 1);
>>>>>         String fieldValue = m.group(i);
>>>>> 
>>>>>         if (fieldValue == null) {
>>>>>             fieldValue = "";
>>>>>         }
>>>>>         byte[] bytes = fieldValue.getBytes("UTF-8");
>>>>> 
>>>>>         // Added this and it worked….
>>>>>         int stringLength = bytes.length;
>>>>>         if (stringLength > 256) {
>>>>>             stringLength = 256;
>>>>>         }
>>>>> 
>>>>>         this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>         map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>>     }
>>>>> }
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 26, 2017, at 20:20, Paul Rogers <prog...@mapr.com> wrote:
>>>>> 
>>>>> Hi Charles,
>>>>> 
>>>>> The Varchar column can hold any length of data. We’ve recently been 
>>>>> working on tests that have columns up to 8K in length.
>>>>> 
>>>>> The one caveat is that, when working with data larger than 256 bytes, you 
>>>>> must be extremely careful in your reader. The out-of-the-box text reader 
>>>>> always reads 64K rows per batch. This (due to various issues) can cause 
>>>>> memory fragmentation and OOM errors when used with columns wider than 256 
>>>>> bytes.
>>>>> 
>>>>> If you are developing your own storage plugin, then adjust the size of 
>>>>> each row batch so that no single vector is larger than 16 MB in size. 
>>>>> Then you can use any size of column.
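>>>>> 
>>>>> A sketch of that rule (hypothetical; maxWidth is the widest value seen so 
>>>>> far, in bytes):
>>>>> 
>>>>>   // Keep the usual 64K row cap, but shrink the batch so the widest
>>>>>   // vector stays at or under 16 MB.
>>>>>   int maxRecords = Math.min(64 * 1024, 16 * 1024 * 1024 / maxWidth);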
>>>>> 
>>>>> Suppose your logs contain text lines up to, say, 1K in size. This means 
>>>>> that each record batch your reader produces must hold at most 16 MB / 1 KB 
>>>>> per row = 16,384 rows (rather than the usual 64K).
>>>>> 
>>>>> Once the data is in the Varchar column, the rest of Drill should “just 
>>>>> work” on that data.
>>>>> 
>>>>> - Paul
>>>>> 
>>>>> On Jan 26, 2017, at 4:11 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>>> 
>>>>> I’m working on a plugin to read log files and the data has some long 
>>>>> strings.  Is there a data type that can hold strings longer than 256 
>>>>> characters?
>>>>> Thanks,
>>>>> — Charles
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
