Great feature, and this fixes my problem. All I do in my Python script is open the file with a . prefix; when I "close" it, I rename it without the . prefix. Easy fix. Thanks for the pointer, Andries!
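For the archives, here's a rough sketch of the pattern. The function name and arguments are just illustrative, not lifted from my actual feed script:

import os

def write_json_file(path, records):
    """Write records to a dot-prefixed name, then rename, so Drill 1.2+
    ignores the file until it is complete."""
    directory, name = os.path.split(path)
    hidden = os.path.join(directory, "." + name)  # Drill skips .-prefixed files
    with open(hidden, "w") as fh:
        for record in records:
            fh.write(record + "\n")
    # rename is atomic on the same filesystem, so a query sees either the
    # hidden (ignored) file or the finished file, never a half-written one
    os.rename(hidden, path)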
John

On Tue, Nov 3, 2015 at 1:52 PM, Andries Engelbrecht <
aengelbre...@maprtech.com> wrote:

> See DRILL-2424 and DRILL-1131.
> Incomplete records/files can cause issues; in Drill 1.2 they have added the
> ability to ignore data files with a . prefix.
>
> Perhaps copy files in over NFS using a . prefix and then rename once
> copied on the DFS.
>
> I had the same issue with Flume data streaming in and incomplete records,
> but have not been able to test with Drill 1.2. However, if I copy an existing
> file to the same directory with a . prefix, I can see in the query plan that
> the hidden file is being ignored.
>
> —Andries
>
>
> > On Nov 3, 2015, at 11:07 AM, John Omernik <j...@omernik.com> wrote:
> >
> > I am doing some "active" loading of data into JSON files on MapRFS.
> > Basically I have feeds pulling from a message queue and outputting the
> > JSON messages.
> >
> > I have a query that does aggregations on all the data, and it seems to
> > work 90% of the time.
> >
> > The other 10%, I get this error:
> >
> > Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input in
> > VALUE_STRING
> >
> > File: /path/to/file
> > Record: someint
> > Column: someint
> > Fragment someint:someint
> >
> > (I replaced the actual record, column, and fragment info, obviously.)
> >
> > When I get this error, I can run the same query again, and all is well.
> >
> > My questions are these:
> >
> > 1. My "gut" is telling me this happens because I have files being written
> > in real time to MapR FS using POSIX tools over NFS, and when the error
> > occurs, it's because the Python fh.write() is "in mid stream" when Drill
> > tries to query the file, so the file isn't perfectly formatted. Does this
> > seem feasible?
> >
> > 2. Just waiting a bit fixes things, because of how Drill works, i.e. it
> > has to read all the data on an aggregate query, so if it were going to
> > fail because corrupt data had been permanently written, it would fail
> > every time. (I.e. I shouldn't be troubleshooting this, because if it's
> > working, the problem is resolved, at least until the next time I try to
> > read a half-written JSON object.) Is this accurate?
> >
> > 3. Is this always going to be the case with "realtime" data, or is there
> > a way to address this?
> >
> > 4. Is there a way to address this type of issue by skipping that
> > line/record? I know there was some talk about skipping records in other
> > posts/JIRAs, but I'm not sure if this would be taken into account there.
> >
> > 5. Am I completely off base and the actual problem is something else?
> >
> > John