I’m actually one of the contributors to the forthcoming O’Reilly book on Drill (along with Ted and Ellen), and this is specific functionality I’m planning to write a chapter about (not the buffers, but how to get Drill to ingest other file formats).
> On Jan 27, 2017, at 11:50, Paul Rogers <prog...@mapr.com> wrote:
>
> Hi Charles,
>
> Congrats! Unfortunately, no, there is no documentation. Drill seems to be of
> the “code speaks for itself” persuasion. I try to document the bits I’ve had
> to learn on my GitHub wiki, but (until now) I’ve not looked at this
> particular area.
>
> IMHO, now that the plugins basically work, the API could use a good scrubbing
> to make it simpler, easier to document, and easier to use. As it is, you have
> to be an expert on Drill internals to understand all the little knick-knacks
> that have to be in your code to make various Drill subsystems happy.
>
> That said, perhaps you can use your own GitHub wiki to document what you’ve
> learned, so that we capture it for the next plugin developer.
>
> Thanks,
>
> - Paul
>
>> On Jan 27, 2017, at 8:42 AM, Charles Givre <cgi...@gmail.com> wrote:
>>
>> Hi Paul,
>> VICTORY!! I just set the buffer size to 4096 and it worked perfectly
>> without truncating my data!
>> Is this documented anywhere? I’ve been trying to wrap my head around the
>> mechanics of how Drill reads data and how the format plugins work, and I
>> really haven’t found much. I’ve hacked together a few other plugins like
>> this, which work, but if I could find some docs, that would be great.
>> Thanks,
>> -- Charles
>>
>>> On Jan 27, 2017, at 02:11, Paul Rogers <prog...@mapr.com> wrote:
>>>
>>> Looks like I gave you advice that was a bit off. The call you want is one
>>> of the following:
>>>
>>> this.buffer = fragmentContext.getManagedBuffer();
>>>
>>> The above allocates a 256-byte buffer. You can initially allocate a
>>> larger one:
>>>
>>> this.buffer = fragmentContext.getManagedBuffer(4096);
>>>
>>> Or, to reallocate:
>>>
>>> buffer = fragmentContext.replace(buffer, 8192);
>>>
>>> Again, I’ve not used these methods myself, but it seems they might do the
>>> trick.
>>>
>>> - Paul
>>>
>>>> On Jan 26, 2017, at 9:51 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>>
>>>> Thanks! I’m hoping to submit a PR eventually once I have this all done.
>>>> I tried your changes and now I’m getting this error:
>>>>
>>>> 0: jdbc:drill:zk=local> select * from dfs.client.`small.misolog`;
>>>> Error: DATA_READ ERROR: Tried to remove unmanaged buffer.
>>>>
>>>> Fragment 0:0
>>>>
>>>> [Error Id: 52fc846a-1d94-4300-bcb4-7000d0949b3c on
>>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>>>
>>>>> On Jan 26, 2017, at 23:08, Paul Rogers <prog...@mapr.com> wrote:
>>>>>
>>>>> Hi Charles,
>>>>>
>>>>> Very cool plugin!
>>>>>
>>>>> My knowledge in this area is a bit sketchy… That said, the problem
>>>>> appears to be that the code does not extend the DrillBuf to ensure it
>>>>> has sufficient capacity. Try calling the reallocIfNeeded method,
>>>>> something like this:
>>>>>
>>>>> this.buffer.reallocIfNeeded(stringLength);
>>>>> this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>> map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>>
>>>>> Then, comment out the 256-length hack and see if it works.
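
For readers following along, here is a minimal sketch of the write path the
thread converges on. It is an illustration, not code from the plugin;
fieldValue, fieldName, and map stand in for the reader’s own variables. One
detail worth noting: DrillBuf.reallocIfNeeded returns the possibly-reallocated
buffer, so keeping the returned reference is the safer pattern.

    // Sketch only; assumes `buffer` is a DrillBuf obtained from
    // fragmentContext.getManagedBuffer() as described above.
    byte[] bytes = fieldValue.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    int stringLength = bytes.length;

    // reallocIfNeeded returns the (possibly new) DrillBuf; keep the
    // returned reference rather than discarding it.
    buffer = buffer.reallocIfNeeded(stringLength);
    buffer.setBytes(0, bytes, 0, stringLength);
    map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
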
>>>>>
>>>>> To avoid memory fragmentation, maybe change your loop as follows:
>>>>>
>>>>> int maxRecords = MAX_RECORDS_PER_BATCH;
>>>>> int maxWidth = 256;
>>>>> while (recordCount < maxRecords && (line = this.reader.readLine()) != null) {
>>>>>     …
>>>>>     if (stringLength > maxWidth) {
>>>>>         maxWidth = stringLength;
>>>>>         maxRecords = 16 * 1024 * 1024 / maxWidth;
>>>>>     }
>>>>> }
>>>>>
>>>>> The above is not perfect (the last record added might be much larger
>>>>> than the others, causing the corresponding vector to grow larger than
>>>>> 16 MB), but the occasional large vector should be OK.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Paul
>>>>>
>>>>> On Jan 26, 2017, at 5:31 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>>>
>>>>> Hi Paul,
>>>>> Would you mind taking a look at my code? I’m wondering if I’m doing
>>>>> this correctly. For context, I’m working on a generic log file reader
>>>>> for Drill (https://github.com/cgivre/drill-logfile-plugin), and I
>>>>> encountered some errors when working with fields that were more than
>>>>> 256 characters long. It isn’t a storage plugin, but it extends
>>>>> EasyFormatPlugin.
>>>>>
>>>>> I added some code to truncate the strings to 256 chars, and it worked.
>>>>> Before that it was throwing errors as shown below:
>>>>>
>>>>> Error: DATA_READ ERROR: index: 0, length: 430 (expected: range(0, 256))
>>>>>
>>>>> Fragment 0:0
>>>>>
>>>>> [Error Id: b2250326-f983-440c-a73c-4ef4a6cf3898 on
>>>>> charless-mbp-2.fios-router.home:31010] (state=,code=0)
>>>>>
>>>>> The query that generated this was just a SELECT * FROM dfs.`file`.
>>>>> Also, how do I set the size of each row batch?
>>>>> Thank you for your help.
>>>>> -- C
>>>>>
>>>>> if (m.find()) {
>>>>>     for (int i = 1; i <= m.groupCount(); i++) {
>>>>>         // TODO: add option for date fields
>>>>>         String fieldName = fieldNames.get(i - 1);
>>>>>         String fieldValue = m.group(i);
>>>>>
>>>>>         if (fieldValue == null) {
>>>>>             fieldValue = "";
>>>>>         }
>>>>>         byte[] bytes = fieldValue.getBytes("UTF-8");
>>>>>
>>>>>         // Added this and it worked…
>>>>>         int stringLength = bytes.length;
>>>>>         if (stringLength > 256) {
>>>>>             stringLength = 256;
>>>>>         }
>>>>>
>>>>>         this.buffer.setBytes(0, bytes, 0, stringLength);
>>>>>         map.varChar(fieldName).writeVarChar(0, stringLength, buffer);
>>>>>     }
>>>>> }
>>>>>
>>>>> On Jan 26, 2017, at 20:20, Paul Rogers <prog...@mapr.com> wrote:
>>>>>
>>>>> Hi Charles,
>>>>>
>>>>> The Varchar column can hold any length of data. We’ve recently been
>>>>> working on tests that have columns up to 8K in length.
>>>>>
>>>>> The one caveat is that, when working with data larger than 256 bytes,
>>>>> you must be extremely careful in your reader. The out-of-box text
>>>>> reader always reads 64K rows, which (due to various issues) can cause
>>>>> memory fragmentation and OOM errors when used with columns more than
>>>>> 256 bytes wide.
>>>>>
>>>>> If you are developing your own storage plugin, adjust the size of each
>>>>> row batch so that no single vector is larger than 16 MB. Then you can
>>>>> use any size of column.
>>>>>
>>>>> Suppose your logs contain text lines up to, say, 1K in size. Then each
>>>>> record batch your reader produces must hold at most 16 MB / 1 KB per
>>>>> row = 16,384 rows (rather than the usual 64K).
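
As a quick check of the batch-size arithmetic above, here is a hypothetical
helper (the method name is mine; only the 16 MB-per-vector target comes from
the thread):

    // Hypothetical helper illustrating the rule above: cap the batch so a
    // VarChar vector of values up to maxWidthBytes wide stays under ~16 MB.
    static int maxRecordsForWidth(int maxWidthBytes) {
        final int VECTOR_SIZE_TARGET = 16 * 1024 * 1024;  // 16 MB per vector
        return Math.max(1, VECTOR_SIZE_TARGET / Math.max(1, maxWidthBytes));
    }

    // For 1K-wide lines: 16 * 1024 * 1024 / 1024 = 16,384 rows per batch,
    // well below the text reader's usual 64K.
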
>>>>>
>>>>> Once the data is in the Varchar column, the rest of Drill should “just
>>>>> work” on that data.
>>>>>
>>>>> - Paul
>>>>>
>>>>> On Jan 26, 2017, at 4:11 PM, Charles Givre <cgi...@gmail.com> wrote:
>>>>>
>>>>> I’m working on a plugin to read log files and the data has some long
>>>>> strings. Is there a data type that can hold strings longer than 256
>>>>> characters?
>>>>> Thanks,
>>>>> -- Charles