[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I have a similar data set checked in. Hope that is good enough. On Fri, Oct 7, 2016 at 10:00 AM, Paul Rogers wrote: > Base data set: > > { "p": 1, "a": { "x": 10, "y": 20, "z": 30 }, "b": 50, "c": 60 } > { "p": 2, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 } > { "p": 3, "a": { "x": 12, "y": 22, "z": 32 }, "b": 52, "c": 62 } > { "p": 4, "a": { "x": 13, "y": 23, "z": 33 }, "b": 53, "c": 63 } > { "p": 5, "a": { "x": 14, "y": 24, "z": 34 }, "b": 54, "c": 64 } > { "p": 6, "a": { "x": 15, "y": 25, "z": 35 }, "b": 55, "c": 65 } > > Create various errors: > > { "p": 2x, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 } > > and > > { "p": 2, "a": { "x": 11x, "y": 21, "z": 31 }, "b": 51, "c": 61 } > > And so on. > > In running the tests, I consistently saw that only the second (bad) row > was omitted. Other rows properly appeared, and no partial row appeared. > > â > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/drill/pull/518#issuecomment-252306011>, or mute > the thread > <https://github.com/notifications/unsubscribe-auth/ABsaH1IiZY7CSrdvx17MdDLrIyMFizqeks5qxnrDgaJpZM4IzZkt> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Paul, The code you have listed is semantically equivalent to that of what I already I have submitted for pull and will not solve handling of all malformed json records. Also the code for reporting the error records is working correctly as long as is it is reported by the Parser correctly. As I explained earlier the JSON parser is not just a simple tokenizer, it keeps track of internal state, hence the issue. SERDE's in hive etc work because they are record oriented with clean record demarkations using a new line. One solution is to submit a patch to jackson parser to expose a method to skip to new line in the event of a parsing exception. This can be parametrized so that behavior can customized. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 There is not much to do except change the JSON parser to support this functionality. - Indicate to the parser that a current record terminates when it encounters a \n (Of course this then assumes that druid also aligns to record separators using a new line). This is a change to make to the jackson parser. - Right now the current code works for all the standard cases except one case where the inner most sub-structure within a JSON is malformed. Given that this is a great recipe for approximation algorithms , I am requesting this change to be pulled in. If need be we can work on change to the jackson parser using a different request. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Apologize for getting back on this thread late, got tied up with some issues@work. Paul, The json parser is not just a tokenizer, it keeps track of the JSON structure and understands various aspects of it like root, array/objectcontext and all parsing is done under that context. - we cannot keep track of {} accurately - For eg: The counting json processor does a parser. skipChildren which tries to skip to the end of the JSON, but this can rollover to next line when there is a malformed JSON in the bottom most json sub object - see example below (missing " in last json structure). This is similar behavior with the JsonReader. {"balance": 1000.0,"num": 100,"is_vip": true,"name": "foo3","curr":{"denom":"pound","test":{"value :false}}} - One possible solution is to rewind the input source to reset the stream (which is not recommended and there is no guarentee that all streams support mark/reset semantics. Given where we are, I think the solution proposed works perfect for almost all malformed JSON's. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Paul, Thanks for taking the time out in writing out a detailed email. Here are some of my thoughts. - Drill uses the com.fasterxml.jackson.core.json.UTF8StreamJsonParser for parsing of JSON records. This parser does not rely on line delimiters for record separators but instead uses the JSON structure as a natural way to signal End of record (EOR). There are methods internal to the parser which check for line feeds but is not exposed to callers. - The CountingJsonReader uses the parser.skipChildren() method to skip the rest of the children for this record, hence it is not possible to accurately count and match the no of braces to cleanly skip that bad record. - One thought is to tap the inputsource of the parser on an exception condition, but is not encouraged. My thought process was exactly along the lines you have been thinking. On an exception scenario the code attempts to locate a closing bracket(}) followed by a opening bracket ({). This is what is being done in the BaseJsonProcessor.processJSONException method. Please note that it works in all cases except when we do not have proper brackets to signify end of a JSON record. Hope this explanation helps clarify. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Folks, I have tests on my local drill setup with various combinations of invalid json format. The only caveat is that some records may be skipped if JSON is not properly delimitted by }. Can this be reviewed for next release ? I can work on the documentation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I have made several modifications to get accurate line nos: - The main change is to move the json parser to the end of the current record being processed since previously multiple exceptions were thrown. - BaseJsonProcessor contains new method protected JsonExceptionProcessingState processJSONException(). -This is called by both CountingJsonReader and JsonProcessor whenever parser encounters a jackson parsing exception. -I have also added a new system setting store.json.reader.print_skipped_invalid_record_number so that we can suppress printing of line numbers. - Added more unit test cases. - System tested with various combinations on a local drill bit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 What is the recommendation with this bug? Should we just remove the line nos for now and just name the file name. I was not able to work on this last couple of days, not sure if there is sufficient time until the next candidate release. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I have also squashed my commits --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #518: DRILL-4653.json - Malformed JSON should not stop th...
Github user ssriniva123 commented on a diff in the pull request: https://github.com/apache/drill/pull/518#discussion_r67431369 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java --- @@ -189,39 +191,33 @@ private long currentRecordNumberInFile() { public int next() { writer.allocate(); writer.reset(); - recordCount = 0; ReadState write = null; -//Stopwatch p = new Stopwatch().start(); -try{ - outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) { -writer.setPosition(recordCount); -write = jsonReader.write(writer); - -if(write == ReadState.WRITE_SUCCEED) { -// logger.debug("Wrote record."); - recordCount++; -}else{ -// logger.debug("Exiting."); - break outside; -} - +outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){ +try + { +writer.setPosition(recordCount); --- End diff -- Aman, maven checkstyle:checkstyle did not report any errors before I did my last check in. I have changed to reflect 2 spaces for indendation. On Thu, Jun 16, 2016 at 2:22 PM, Aman Sinha wrote: > In > exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java > <https://github.com/apache/drill/pull/518#discussion_r67426381>: > > > - outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) { > > -writer.setPosition(recordCount); > > -write = jsonReader.write(writer); > > - > > -if(write == ReadState.WRITE_SUCCEED) { > > -// logger.debug("Wrote record."); > > - recordCount++; > > -}else{ > > -// logger.debug("Exiting."); > > - break outside; > > -} > > - > > +outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){ > > +try > > + { > > +writer.setPosition(recordCount); > > seems this is still doing indent of 4. We use 2 spaces (see > https://drill.apache.org/docs/apache-drill-contribution-guidelines/ > scroll down to Step 2). Did it pass the mvn command line build without > checkstyle violations ? > > â > You are receiving this because you commented. > Reply to this email directly, view it on GitHub > <https://github.com/apache/drill/pull/518/files/56a16fe3f2bbd1554d65cef6faeaeb63ba41f9a2#r67426381>, > or mute the thread > <https://github.com/notifications/unsubscribe/ABsaHw_13TKQS-3mIEDhxkCiglpCBA3Sks5qMb6IgaJpZM4IzZkt> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I have made changes are recommended by reviewers: - Changed JSON_READER_SKIP_INVALID_RECORDS_FLAG constant - Modified unit test to use builder framework - Code indendation changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I updated the pull request with more changes requested by the reviewers. On Wed, Jun 15, 2016 at 5:28 PM, Aman Sinha wrote: > Yes, it does in fact have a conflict with DRILL-3764 which has changes to > the JsonRecordReader, although this issue is still in progress. I noticed > that @adeneche <https://github.com/adeneche> mentioned this in the JIRA. > @ssriniva123 <https://github.com/ssriniva123> did you get a chance to > look at DRILL-3764 ? > > â > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/drill/pull/518#issuecomment-226357921>, or mute > the thread > <https://github.com/notifications/unsubscribe/ABsaH_Xd1WBKpu396uYeUVluiQ4iN0_Rks5qMJihgaJpZM4IzZkt> > . > --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 For some reason the build terminated- What needs to be done to fire it again? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 I have added support so that it prints the offending file/line nos. Error parsing JSON in DRILL-4653.json : line nos :13 Error parsing JSON in DRILL-4653.json : line nos :14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Yes, I looked at DRILL-3764. The changes made to JSONProcessing do not conflict with my changes. The only conflict I see is these two constants, I have defined my own. String ENABLE_SKIP_INVALID_RECORD_KEY = "exec.enable_skip_invalid_record"; + BooleanValidator ENABLE_SKIP_INVALID_RECORD = new BooleanValidator(ENABLE_SKIP_INVALID_RECORD_KEY, false); --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...
Github user ssriniva123 commented on the issue: https://github.com/apache/drill/pull/518 Do u anticipate this to be logged using log4j? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #518: DRILL-4653.json - Malformed JSON should not stop th...
GitHub user ssriniva123 opened a pull request: https://github.com/apache/drill/pull/518 DRILL-4653.json - Malformed JSON should not stop the entire query from progressing https://issues.apache.org/jira/browse/DRILL-4653 - The default is to stop processing as is today when JSON parser encounters an exception - Setting store.json.reader.skip_malformed_records will ensure that query progresses after skipping the bad records - Added two unit tests - Also did testing after deploying the new build: Both positive and negative tests were done. - Negative test result: org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error parsing JSON - Unexpected character ('{' (code 123)): was expecting comma to separate OBJECT entries You can merge this pull request into a Git repository by running: $ git pull https://github.com/ssriniva123/drill master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/518.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #518 commit 4fc70faf0e5ad5d434b944d084bc0b0e90d41c39 Author: Subbu Srinivasan Date: 2016-06-10T22:58:49Z Fixes for DRILL-4653.json --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---