I am doing some "active" loading of data into JSON files on MapRFS. Basically, I have feeds pulling messages from a message queue and writing them out as JSON.
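Each feed is a small Python writer that appends one JSON object per line to a file on the NFS mount. A stripped-down sketch of the write loop (the real code has more to it; the function and variable names here are just placeholders):

    import json

    def write_feed(messages, path):
        # Append one JSON object per line to a file on the MapR NFS mount.
        # "messages" stands in for whatever the queue consumer yields.
        with open(path, "a") as fh:
            for msg in messages:
                # fh.write() is not atomic from a reader's point of view:
                # a scan that races this write can see a truncated record.
                fh.write(json.dumps(msg) + "\n")
                fh.flush()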
I have a query that does aggregations over all the data, and it seems to work about 90% of the time. The other 10%, I get this error:

Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input in VALUE_STRING File: /path/to/file Record: someint Column: someint Fragment someint:someint

(I replaced the actual record, column, and fragment info, obviously.) When I get this error, I can run the same query again and all is well. My questions are these:

1. My gut is telling me this happens because the files are being written in real time to MapR FS using POSIX tools over NFS, and when the error occurs it's because the Python fh.write() is mid-stream at the moment Drill tries to query the file, so the JSON isn't perfectly formed yet. Does this seem feasible?

2. Just waiting a bit fixes things. Because of how Drill works, i.e. it has to read all the data on an aggregate query, if it were going to fail because corrupt data had been permanently written, it would fail every time. (I.e., I shouldn't be troubleshooting this: if the query works, the problem is resolved, at least until the next time I try to read a half-written JSON object.) Is this accurate?

3. Is this always going to be the case with "realtime" data, or is there a way to address it? (One idea I've been toying with is sketched in the P.S. below.)

4. Is there a way to address this type of issue by skipping that line/record? I know there has been some talk about skipping records in other posts/JIRAs, but I'm not sure whether this case would be covered there.

5. Am I completely off base, and the actual problem is something else?

John
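P.S. Regarding question 3, the workaround I've been toying with (assuming the MapR NFS gateway preserves atomic rename semantics, which I haven't verified) is to write each batch to a staging directory on the same filesystem and rename it into the queried directory only once it's complete, so Drill never sees a partial file. Roughly, with made-up directory names:

    import json
    import os
    import tempfile

    def publish_batch(messages, staging_dir, final_dir):
        # staging_dir must live on the same filesystem as final_dir so
        # os.rename() stays a pure metadata operation; readers then see
        # either no file at all or a complete one, never a partial write.
        fd, tmp_path = tempfile.mkstemp(suffix=".json", dir=staging_dir)
        with os.fdopen(fd, "w") as fh:
            for msg in messages:
                fh.write(json.dumps(msg) + "\n")
        final_path = os.path.join(final_dir, os.path.basename(tmp_path))
        os.rename(tmp_path, final_path)
        return final_path

The tradeoff would be that data only becomes queryable at batch boundaries rather than message by message, so it trades a bit of latency for never reading a half-written object. Is that a sane direction, or is there a more Drill-native way?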