[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-10-07 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have a similar data set checked in. Hope that is good enough.


On Fri, Oct 7, 2016 at 10:00 AM, Paul Rogers <notificati...@github.com>
wrote:

> Base data set:
>
> { "p": 1, "a": { "x": 10, "y": 20, "z": 30 }, "b": 50, "c": 60 }
> { "p": 2, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
> { "p": 3, "a": { "x": 12, "y": 22, "z": 32 }, "b": 52, "c": 62 }
> { "p": 4, "a": { "x": 13, "y": 23, "z": 33 }, "b": 53, "c": 63 }
> { "p": 5, "a": { "x": 14, "y": 24, "z": 34 }, "b": 54, "c": 64 }
> { "p": 6, "a": { "x": 15, "y": 25, "z": 35 }, "b": 55, "c": 65 }
>
> Create various errors:
>
> { "p": 2x, "a": { "x": 11, "y": 21, "z": 31 }, "b": 51, "c": 61 }
>
> and
>
> { "p": 2, "a": { "x": 11x, "y": 21, "z": 31 }, "b": 51, "c": 61 }
>
> And so on.
>
> In running the tests, I consistently saw that only the second (bad) row
> was omitted. Other rows properly appeared, and no partial row appeared.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/drill/pull/518#issuecomment-252306011>, or mute
> the thread
> 
<https://github.com/notifications/unsubscribe-auth/ABsaH1IiZY7CSrdvx17MdDLrIyMFizqeks5qxnrDgaJpZM4IzZkt>
> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-09-27 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
The code you have listed is semantically equivalent to that of what I 
already I have submitted for pull and will not solve handling of all malformed 
json records. Also the code for reporting the 
error records is working correctly as long as is it is reported by the 
Parser correctly.

As I explained earlier the JSON parser is not just a simple tokenizer, it 
keeps track of internal state,
hence the issue. SERDE's in hive etc work because they  are record oriented 
with clean record demarkations using a new line.

One solution is to submit a patch to jackson parser to expose a method to 
skip to new line in the
event of a parsing exception. This can be parametrized so that behavior can 
customized.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-09-26 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
There is not much to do except change the JSON parser to support this 
functionality.

- Indicate to the parser that a current record terminates when it 
encounters a \n (Of course
this then assumes that druid also aligns to record separators using a new 
line).
This is a change to make to the jackson parser.

- Right now the current code works for all the standard cases except one 
case where the inner 
most sub-structure within a JSON is malformed.

Given that this is a great recipe for approximation algorithms , I am 
requesting this change to be pulled in.

If need be we can work on change to the jackson parser using a different 
request.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-09-19 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Apologize for getting back on this thread late, got tied up with some 
issues@work.

Paul,
The json parser is not just a tokenizer, it keeps track of the JSON 
structure and understands various aspects of it like root, array/objectcontext 
and all parsing is done under that context.

- we cannot keep track of {} accurately - For eg: The counting json 
processor does a parser. skipChildren which tries to skip to the end of the 
JSON, but this can rollover to next line when
there is a malformed JSON in the bottom most json sub object - see example 
below (missing " in last json structure). This is similar behavior with the 
JsonReader.

{"balance": 1000.0,"num": 100,"is_vip": true,"name": 
"foo3","curr":{"denom":"pound","test":{"value  :false}}}

- One possible solution is to rewind the input source to reset the stream 
(which is not recommended and there is no guarentee that all streams support 
mark/reset semantics.

Given where we are, I think the solution proposed works perfect for almost 
all malformed JSON's.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-09-13 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
Thanks for taking the time out in writing out a detailed email. Here are 
some of my thoughts.

- Drill uses the com.fasterxml.jackson.core.json.UTF8StreamJsonParser for 
parsing of JSON records. This parser does not rely on line delimiters for 
record separators but instead uses
the JSON structure as a natural way to signal End of record (EOR). There 
are methods internal
to the parser which check for line feeds but is not exposed to callers.

- The CountingJsonReader uses the parser.skipChildren() method to skip the 
rest of the children for this record, hence it is not possible to accurately 
count and match the no of braces to cleanly skip that bad record.

- One thought is to tap the inputsource of the parser on an exception 
condition, but is not
encouraged.

My thought process was exactly along the lines you have been thinking. On 
an exception scenario the code attempts to locate a closing bracket(}) followed 
by a opening bracket ({).
This is what is being done in the BaseJsonProcessor.processJSONException 
method. Please note that it works in all cases except when we do not have 
proper brackets to signify end of a JSON record. 

Hope this explanation helps clarify.









---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-09-12 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Folks,
I have tests on my local drill setup with various combinations of invalid 
json format. The only caveat is that some records may be skipped if JSON is not 
properly delimitted by }. Can this be reviewed for next release ? I can work on 
the documentation.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-24 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have made several modifications to get accurate line nos:

- The main change is to move the json parser to the end of the current 
record being processed since
previously multiple exceptions were thrown.

- BaseJsonProcessor contains new method protected 
JsonExceptionProcessingState processJSONException().
-This is called by both CountingJsonReader and JsonProcessor whenever 
parser encounters a jackson parsing exception.
-I have also added a new system setting 
store.json.reader.print_skipped_invalid_record_number so that we can suppress 
printing of line numbers.
- Added more unit test cases.
- System tested with various combinations on a local drill bit.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-22 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
What is the recommendation with this bug? Should we just remove the line 
nos for now and just name the file name. I was not able to work on this last 
couple of days, not sure if there is sufficient time until the next candidate 
release.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #518: DRILL-4653.json - Malformed JSON should not stop th...

2016-06-16 Thread ssriniva123
Github user ssriniva123 commented on a diff in the pull request:

https://github.com/apache/drill/pull/518#discussion_r67431369
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
 ---
@@ -189,39 +191,33 @@ private long currentRecordNumberInFile() {
   public int next() {
 writer.allocate();
 writer.reset();
-
 recordCount = 0;
 ReadState write = null;
-//Stopwatch p = new Stopwatch().start();
-try{
-  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
-writer.setPosition(recordCount);
-write = jsonReader.write(writer);
-
-if(write == ReadState.WRITE_SUCCEED) {
-//  logger.debug("Wrote record.");
-  recordCount++;
-}else{
-//  logger.debug("Exiting.");
-  break outside;
-}
-
+outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
+try
+  {
+writer.setPosition(recordCount);
--- End diff --

Aman,
maven checkstyle:checkstyle did not report any errors before I did my last
check in. I have changed to reflect 2 spaces for indendation.

On Thu, Jun 16, 2016 at 2:22 PM, Aman Sinha <notificati...@github.com>
wrote:

> In
> 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/json/JSONRecordReader.java
> <https://github.com/apache/drill/pull/518#discussion_r67426381>:
>
> > -  outside: while(recordCount < DEFAULT_ROWS_PER_BATCH) {
> > -writer.setPosition(recordCount);
> > -write = jsonReader.write(writer);
> > -
> > -if(write == ReadState.WRITE_SUCCEED) {
> > -//  logger.debug("Wrote record.");
> > -  recordCount++;
> > -}else{
> > -//  logger.debug("Exiting.");
> > -  break outside;
> > -}
> > -
> > +outside: while(recordCount < DEFAULT_ROWS_PER_BATCH){
> > +try
> > +  {
> > +writer.setPosition(recordCount);
>
> seems this is still doing indent of 4. We use 2 spaces (see
> https://drill.apache.org/docs/apache-drill-contribution-guidelines/
> scroll down to Step 2). Did it pass the mvn command line build without
> checkstyle violations ?
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> 
<https://github.com/apache/drill/pull/518/files/56a16fe3f2bbd1554d65cef6faeaeb63ba41f9a2#r67426381>,
> or mute the thread
> 
<https://github.com/notifications/unsubscribe/ABsaHw_13TKQS-3mIEDhxkCiglpCBA3Sks5qMb6IgaJpZM4IzZkt>
> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-16 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have made changes are recommended by reviewers:

- Changed JSON_READER_SKIP_INVALID_RECORDS_FLAG constant
- Modified unit test to use builder framework
- Code indendation changes.  



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-15 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I updated the pull request with more changes requested by the reviewers.

On Wed, Jun 15, 2016 at 5:28 PM, Aman Sinha <notificati...@github.com>
wrote:

> Yes, it does in fact have a conflict with DRILL-3764 which has changes to
> the JsonRecordReader, although this issue is still in progress. I noticed
> that @adeneche <https://github.com/adeneche> mentioned this in the JIRA.
> @ssriniva123 <https://github.com/ssriniva123> did you get a chance to
> look at DRILL-3764 ?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/drill/pull/518#issuecomment-226357921>, or mute
> the thread
> 
<https://github.com/notifications/unsubscribe/ABsaH_Xd1WBKpu396uYeUVluiQ4iN0_Rks5qMJihgaJpZM4IzZkt>
> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-15 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
For some reason the build terminated- What needs to be done to fire it 
again?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-15 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
I have added support so that it prints the offending file/line nos.

Error parsing JSON in DRILL-4653.json : line nos :13
Error parsing JSON in DRILL-4653.json : line nos :14


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-15 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Yes, I looked at DRILL-3764. The changes made to JSONProcessing do not 
conflict with my changes.
The only conflict I see is these two constants, I have defined my own.

 String ENABLE_SKIP_INVALID_RECORD_KEY = "exec.enable_skip_invalid_record";
 +  BooleanValidator ENABLE_SKIP_INVALID_RECORD = new 
BooleanValidator(ENABLE_SKIP_INVALID_RECORD_KEY, false);




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #518: DRILL-4653.json - Malformed JSON should not stop the entir...

2016-06-15 Thread ssriniva123
Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Do u anticipate this to be logged using log4j?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #518: DRILL-4653.json - Malformed JSON should not stop th...

2016-06-10 Thread ssriniva123
GitHub user ssriniva123 opened a pull request:

https://github.com/apache/drill/pull/518

DRILL-4653.json - Malformed JSON should not stop the entire query from 
progressing

https://issues.apache.org/jira/browse/DRILL-4653

- The default is to stop processing as is today when JSON parser encounters 
an exception
- Setting store.json.reader.skip_malformed_records will ensure that query 
progresses after
skipping the bad records
- Added two unit tests
- Also did testing after deploying the new build: Both positive and 
negative tests were done.
- Negative test result:
org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
Error parsing JSON - Unexpected character ('{' (code 123)): was expecting comma 
to separate OBJECT entries

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ssriniva123/drill master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/518.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #518


commit 4fc70faf0e5ad5d434b944d084bc0b0e90d41c39
Author: Subbu Srinivasan <ssriniva...@zscaler.com>
Date:   2016-06-10T22:58:49Z

Fixes for DRILL-4653.json




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---