I am doing some "active" loading of data into json files on MapRFS.
Basically I have feeds pulling from a message  queue and outputting the
JSON messages.
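
For context, each feed does roughly this; a simplified sketch, where the
queue object and paths are stand-ins for my actual setup:

import json

# Each feed consumes messages from the queue and appends them to a
# JSON file on the NFS mount, one object per line.
def write_feed(queue, path):
    # 'queue' is a stand-in for my consumer; it yields dicts
    with open(path, "a") as fh:
        for msg in queue:
            fh.write(json.dumps(msg) + "\n")
            fh.flush()  # flush so the file grows in near-real time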

I have a query that does aggregations over all of the data, and it seems
to work about 90% of the time.

The other 10% of the time, I get this error:

Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input in
VALUE_STRING

File: /path/to/file
Record: someint
Column: someint
Fragment someint:someint

(I replaced the actual record, column, and fragment info obviously)


When I get this error, I can run the same query again, and all is well.

My questions are these:

1. My "gut" is telling me this is because I have files being written in
real time with MapR FS using POSIX tools over NFS and when this is
occuring, it's because the python fh.write() is "in mid stream" when drill
tries to query the file, thus it's not perfectly formatted.  Does this seem
feasible?

2. Just waiting a bit fixes things. Because of how Drill works, i.e. it
has to read all the data on an aggregate query, if the failure were caused
by corrupt data that had been permanently written, the query would fail
every time. (In other words, I shouldn't be troubleshooting this: if the
query works on a retry, the problem is resolved, at least until the next
time I try to read a half-written JSON object.) Is this accurate?

3. Is this always going to be the case with "realtime" data, or is there
a way to address it? (One idea I had is the write-then-rename sketch
after these questions.)

4. Is there a way to address this type of issue by skipping the offending
line/record? I know there was some talk about skipping records in other
posts/JIRAs, but I'm not sure whether this case would be covered there.

5. Am I completely off base and the actual problem is something else?
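
In case it helps frame question 3: the only mitigation I've thought of is
to write each batch to a staging area and rename it into the directory
Drill queries once it's complete, since rename should be atomic within a
single filesystem. A rough sketch; the function and directory names are
hypothetical:

import json
import os

# Write a complete batch to a staging area, then rename it into the
# directory Drill queries, so Drill should either see a whole file
# or no file at all.
def write_batch(messages, staging_dir, live_dir, name):
    tmp_path = os.path.join(staging_dir, name)
    final_path = os.path.join(live_dir, name)
    with open(tmp_path, "w") as fh:
        for msg in messages:
            fh.write(json.dumps(msg) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # force the bytes out before the rename
    # staging_dir and live_dir must be on the same filesystem for
    # the rename to be atomic
    os.rename(tmp_path, final_path)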

John
