Hi Charles,

Congrats! Unfortunately, no, there is no documentation. Drill seems to be of 
the “code speaks for itself” persuasion. I try to document the bits I’ve had to 
learn on my GitHub wiki, but (until now) I’ve not looked at this particular 
area.

IMHO, now that the plugins basically work, the API could use a good scrubbing 
to make it simpler, easier to document, and easier to use. As it is, you have 
to be an expert on Drill internals to understand all the little knick-knacks 
that have to be in your code to make various Drill subsystems happy.

That said, perhaps you can use your own GitHub wiki to document what you’ve 
learned so that we capture it for the next plugin developer.
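
For the record, here is roughly the pattern that ended up working, pieced 
together from this thread (a sketch only; I haven’t run it end-to-end, and 
the surrounding reader structure is assumed, not taken from your plugin):

    // in the reader’s setup: allocate a managed buffer with some headroom
    this.buffer = fragmentContext.getManagedBuffer(4096);

    // per field value: grow the buffer if needed, then copy and write
    byte[] bytes = fieldValue.getBytes("UTF-8");
    int stringLength = bytes.length;
    buffer = buffer.reallocIfNeeded(stringLength);
    buffer.setBytes(0, bytes, 0, stringLength);
    map.varChar(fieldName).writeVarChar(0, stringLength, buffer);

Something like that might be a good seed for such a wiki page.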

Thanks,

- Paul

> On Jan 27, 2017, at 8:42 AM, Charles Givre <cgi...@gmail.com> wrote:
> 
> Hi Paul,
> VICTORY!!  I just set the buffer size to 4096 and it worked perfectly without 
> truncating my data! 
> Is this documented anywhere?  I’ve been trying to really wrap my head around 
> the mechanics of how Drill reads data and how the format plugins work and 
> really haven’t found much.  I’ve hacked together a few other plugins like 
> this—which work—but if I could find some docs, that would be great.
> Thanks,
> — Charles
> 
> 
> 
>> On Jan 27, 2017, at 02:11, Paul Rogers <prog...@mapr.com> wrote:
>> 
>> Looks like I gave you advice that was a bit off. The function you want is 
>> either:
>> 
>>           this.buffer = fragmentContext.getManagedBuffer();
>> 
>> The above allocates a 256-byte buffer. You can initially allocate a larger 
>> one:
>> 
>>           this.buffer = fragmentContext.getManagedBuffer(4096);
>> 
>> Or, to reallocate:
>> 
>>          buffer = fragmentContext.replace(buffer, 8192);
>> 
>> Again, I’ve not used these methods myself, but it seems they might do the 
>> trick.
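>> 
>> Putting those together, something like this pattern might work (again, 
>> untested on my end; it assumes your reader keeps buffer and 
>> fragmentContext as fields, as in your plugin):
>> 
>>           // setup: start with a 4 KB managed buffer
>>           this.buffer = fragmentContext.getManagedBuffer(4096);
>> 
>>           // per record: grow the managed buffer if the value won’t fit
>>           if (stringLength > buffer.capacity()) {
>>               buffer = fragmentContext.replace(buffer, stringLength);
>>           }
>>           buffer.setBytes(0, bytes, 0, stringLength);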
>> 
>> - Paul
>> 
>>> On Jan 26, 2017, at 9:51 PM, Charles Givre <cgi...@gmail.com> wrote:
>>> 
>>> Thanks!  I’m hoping to submit a PR eventually once I have this all done.  I 
>>> tried your changes and now I’m getting this error:
>>> 
>>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>>> 
>>> Fragment 0:0
>>> 
>>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on 
>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>> 
>>> 
>>> 
>>> 
>>>> On Jan 26, 2017, at 23:08, Paul Rogers <prog...@mapr.com> wrote:
>>>> 
>>>> Hi Charles,
>>>> 
>>>> Very cool plugin!
>>>> 
>>>> My knowledge in this area is a bit sketchy… That said, the problem appears 
>>>> to be that the code does not grow the DrillBuf to ensure it has 
>>>> sufficient capacity. Try calling reallocIfNeeded, something 
>>>> like this:
>>>> 
>>>>    this.buffer = this.buffer.reallocIfNeeded(stringLength);
>>>>    this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>    map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>> 
>>>> Then, comment out the 256 length hack and see if it works.
>>>> 
>>>> To avoid memory fragmentation, maybe change your loop like this:
>>>> 
>>>>         int maxRecords = MAX_RECORDS_PER_BATCH;
>>>>         int maxWidth = 256;
>>>>         while (recordCount < maxRecords
>>>>                && (line = this.reader.readLine()) != null) {
>>>>             …
>>>>             if (stringLength > maxWidth) {
>>>>                 maxWidth = stringLength;
>>>>                 maxRecords = 16 * 1024 * 1024 / maxWidth;
>>>>             }
>>>>         }
>>>> 
>>>> The above is not perfect (the last record added might be much larger than 
>>>> the others, causing the corresponding vector to grow larger than 16 MB), 
>>>> but the occasional large vector should be OK.
>>>> 
>>>> Thanks,
>>>> 
>>>> - Paul
>>>> 
>>>> On Jan 26, 2017, at 5:31 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>> 
>>>> Hi Paul,
>>>> Would you mind taking a look at my code?  I’m wondering if I’m doing this 
>>>> correctly.  Just for context, I’m working on a generic log file reader for 
>>>> Drill (https://github.com/cgivre/drill-logfile-plugin), and I encountered 
>>>> some errors when working with fields that were > 256 characters long.  It 
>>>> isn’t a storage plugin, but it extends the EasyFormatPlugin.
>>>> 
>>>> I added some code to truncate the strings to 256 chars, and it worked.  
>>>> Before that, it was throwing the errors shown below:
>>>> 
>>>> 
>>>> 
>>>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>>>> 
>>>> Fragment 0:0
>>>> 
>>>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on 
>>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>>> 
>>>> 
>>>> The query that generated this was just a SELECT * FROM dfs.`file`.  Also, 
>>>> how do I set the size of each row batch?
>>>> Thank you for your help.
>>>> — C
>>>> 
>>>> 
>>>> if (m.find()) {
>>>>     for (int i = 1; i <= m.groupCount(); i++) {
>>>>         //TODO Add option for date fields
>>>>         String fieldName = fieldNames.get(i - 1);
>>>>         String fieldValue = m.group(i);
>>>> 
>>>>         if (fieldValue == null) {
>>>>             fieldValue = "";
>>>>         }
>>>>         byte[] bytes = fieldValue.getBytes("UTF-8");
>>>> 
>>>>         //Added this and it worked….
>>>>         int stringLength = bytes.length;
>>>>         if (stringLength > 256) {
>>>>             stringLength = 256;
>>>>         }
>>>> 
>>>>         this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>         map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>     }
>>>> }
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Jan 26, 2017, at 20:20, Paul Rogers <prog...@mapr.com> wrote:
>>>> 
>>>> Hi Charles,
>>>> 
>>>> The Varchar column can hold any length of data. We’ve recently been 
>>>> working on tests that have columns up to 8K in length.
>>>> 
>>>> The one caveat is that, when working with data larger than 256 bytes, you 
>>>> must be extremely careful in your reader. The out-of-the-box text reader 
>>>> will always read 64K rows per batch. This (due to various issues) can 
>>>> cause memory fragmentation and OOM errors when used with columns greater 
>>>> than 256 bytes in width.
>>>> 
>>>> If you are developing your own storage plugin, then adjust the size of 
>>>> each row batch so that no single vector is larger than 16 MB. Then you 
>>>> can use columns of any size.
>>>> 
>>>> Suppose your logs contain text lines up to, say, 1K in size. This means 
>>>> each record batch your reader produces must hold fewer than 16 MB / 1K 
>>>> per row = 16,384 rows (rather than the usual 64K).
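>>>> 
>>>> In code, that cap might be computed something like this (a sketch; the 
>>>> names here are made up for illustration, not Drill API):
>>>> 
>>>>    final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;  // 16 MB per-vector target
>>>>    final int maxRowWidthBytes = 1024;              // widest expected line
>>>>    // rows per batch so the Varchar vector stays under the target
>>>>    final int maxRecordsPerBatch = MAX_VECTOR_BYTES / maxRowWidthBytes;  // 16,384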
>>>> 
>>>> Once the data is in the Varchar column, the rest of Drill should “just 
>>>> work” on that data.
>>>> 
>>>> - Paul
>>>> 
>>>> On Jan 26, 2017, at 4:11 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>> 
>>>> I’m working on a plugin to read log files and the data has some long 
>>>> strings.  Is there a data type that can hold strings longer than 256 
>>>> characters?
>>>> Thanks,
>>>> — Charles
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
