[GitHub] drill issue #593: DRILL-3178 csv reader should allow newlines inside quotes
Github user fmethot commented on the issue: https://github.com/apache/drill/pull/593 Thanks for the comments so far, See my newer changes, I suggest that we remove the flag and add an extra method instead. There is no more check for a boolean in nextChar, but instead there is an extra method call (readNext()->readNextNoNewLineCheck()) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #593: Add flag for end of line monitoring
Github user fmethot commented on a diff in the pull request: https://github.com/apache/drill/pull/593#discussion_r80170855 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextInput.java --- @@ -273,7 +278,7 @@ public final byte nextChar() throws IOException { byte byteChar = PlatformDependent.getByte(bStartMinus1 + bufferPtr); if (bufferPtr >= length) { - if (length != -1) { + if(length != -1) { --- End diff -- fixed, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #593: Add flag for end of line monitoring
Github user fmethot commented on a diff in the pull request: https://github.com/apache/drill/pull/593#discussion_r80169931 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/TextReader.java --- @@ -231,33 +231,34 @@ private void parseQuotedValue(byte prev) throws IOException { final TextInput input = this.input; final byte quote = this.quote; -ch = input.nextChar(); +try { + input.setMonitorForNewLine(false); + ch = input.nextChar(); -while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch { - if (ch != quote) { -if (prev == quote) { // unescaped quote detected - if (parseUnescapedQuotes) { -output.append(quote); -output.append(ch); -parseQuotedValue(ch); -break; - } else { -throw new TextParsingException( -context, -"Unescaped quote character '" -+ quote -+ "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input."); + while (!(prev == quote && (ch == delimiter || ch == newLine || isWhite(ch { +if (ch != quote) { + if (prev == quote) { // unescaped quote detected +if (parseUnescapedQuotes) { + output.append(quote); + output.append(ch); + parseQuotedValue(ch); + break; +} else { + throw new TextParsingException(context, "Unescaped quote character '" + quote + "' inside quoted value of CSV field. To allow unescaped quotes, set 'parseUnescapedQuotes' to 'true' in the CSV parser settings. Cannot parse CSV input."); +} } + output.append(ch); + prev = ch; +} else if (prev == quoteEscape) { + output.append(quote); + prev = NULL_BYTE; +} else { + prev = ch; } -output.append(ch); -prev = ch; - } else if (prev == quoteEscape) { -output.append(quote); -prev = NULL_BYTE; - } else { -prev = ch; +ch = input.nextChar(); } - ch = input.nextChar(); +} finally { --- End diff -- Because the finally block always runs, it is important to always set the flag back to false because the input.getChar() is called from everywhere. - We could remove the finally assuming that when an exception occurs the TextReader will just stop doing any parsing and exit. (input.getChar never gets called again) Please advice if that's the way we should do. - In a custom version of the CompliantTextRecordReader that we use internally, we are resilient to error within rows, once an exception occurs we are able to recover to a next line and keep parsing, the finally clause ensure the TextInput is in proper state after failure. - Only thing I am worried is the extra operations required to handle try finally in a method that gets called 100 000s of time per seconds per thread, I haven't tested it. Compiler must be doing a good job at optimizing these. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #593: Add flag for end of line monitoring
Github user fmethot commented on a diff in the pull request: https://github.com/apache/drill/pull/593#discussion_r80169935 --- Diff: exec/java-exec/src/test/resources/store/text/WithQuotedCrLf.tbl --- @@ -0,0 +1,6 @@ +"a +1"|a|a +a|"a +2"|a +a|a|"a +3" --- End diff -- This csv contains new lines. The test check for new line with \r\n chars, --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] drill pull request #593: Add flag for end of line monitoring
GitHub user fmethot opened a pull request: https://github.com/apache/drill/pull/593 Add flag for end of line monitoring Add a option for disabling for end-of-line character detection. You can merge this pull request into a Git repository by running: $ git pull https://github.com/fmethot/drill-1 drill-3178 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/593.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #593 commit 60ffa1dd45a5c9f9ade23da001cf73201e892713 Author: FranÂcois Methot Date: 2016-09-21T00:16:15Z Add flag for end of line monitoring --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---