[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/2060 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78203358 --- Diff: flink-core/src/main/java/org/apache/flink/types/parser/FieldParser.java --- @@ -75,8 +76,30 @@ /** Invalid Boolean value **/ BOOLEAN_INVALID } - + + private Charset charset = Charset.forName("US-ASCII"); + private ParseErrorState errorState = ParseErrorState.NONE; + + /** +* Parses the value of a field from the byte array. +* The start position within the byte array and the array's valid length is given. +* The content of the value is delimited by a field delimiter. +* +* @param bytes The byte array that holds the value. +* @param startPos The index where the field starts +* @param limit The limit unto which the byte contents is valid for the parser. The limit is the +* position one after the last valid byte. +* @param delim The field delimiter character +* @param reuse An optional reusable field to hold the value +* @param charset The charset to parse with +* +* @return The index of the next delimiter, if the field was parsed correctly. A value less than 0 otherwise. +*/ + public int parseField(byte[] bytes, int startPos, int limit, byte[] delim, T reuse, Charset charset){ + this.charset = charset; --- End diff -- Is this method needed? `GenericCsvInputFormat.open` can `setCharset` on each newly instantiated `FieldParser`, and in the case where a user decided to change charset on an open file `GenericCsvInputFormat.setCharset` could go through and `setCharset` on the list of `FieldParser`s. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78199868 --- Diff: flink-core/src/main/java/org/apache/flink/types/parser/FieldParser.java --- @@ -75,8 +76,30 @@ /** Invalid Boolean value **/ BOOLEAN_INVALID } - + + private Charset charset = Charset.forName("US-ASCII"); --- End diff -- Should this also default to `UFT-8`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78199594 --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java --- @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class[] fieldTypes) { this.fieldIncluded = includedMask; } + /** +* Gets the Charset for the parser.Default is set to UTF-8 +* +* @return The charset for the parser. +*/ + public Charset getCharset() { + return this.charset; + } + + /** +* Sets the charset of the parser. Called by subclasses of the parser to set the type of charset +* when doing a parse. +* +* @param charset The charset to set. +*/ + protected void setCharset(Charset charset){ --- End diff -- Add space between `)` and `{` (and I count two more instances). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78199365 --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java --- @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class[] fieldTypes) { this.fieldIncluded = includedMask; } + /** +* Gets the Charset for the parser.Default is set to UTF-8 +* +* @return The charset for the parser. +*/ + public Charset getCharset() { + return this.charset; + } + + /** +* Sets the charset of the parser. Called by subclasses of the parser to set the type of charset +* when doing a parse. +* +* @param charset The charset to set. +*/ + protected void setCharset(Charset charset){ + this.charset = charset != null ? charset : Charset.forName("UTF-8"); --- End diff -- Use `Preconditions.checkNotNull`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78199262 --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java --- @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class[] fieldTypes) { this.fieldIncluded = includedMask; } + /** +* Gets the Charset for the parser.Default is set to UTF-8 +* +* @return The charset for the parser. +*/ + public Charset getCharset() { + return this.charset; + } + + /** +* Sets the charset of the parser. Called by subclasses of the parser to set the type of charset +* when doing a parse. +* +* @param charset The charset to set. --- End diff -- Double space. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r78199216 --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java --- @@ -314,6 +320,25 @@ protected void setFieldsGeneric(boolean[] includedMask, Class[] fieldTypes) { this.fieldIncluded = includedMask; } + /** +* Gets the Charset for the parser.Default is set to UTF-8 --- End diff -- Should we spell out `charset` as "character set" in the documentation? Also, space after period and missing period at end. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
Github user greghogan commented on a diff in the pull request: https://github.com/apache/flink/pull/2060#discussion_r71137136 --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java --- @@ -106,6 +106,11 @@ protected GenericCsvInputFormat() { protected GenericCsvInputFormat(Path filePath) { super(filePath); } + + protected GenericCsvInputFormat(Path filePath, Charset charset) { + super(filePath); + this.charset = charset != null ? charset : Charset.forName("UTF-8"); --- End diff -- Would this be better as `this.charset = Preconditions.checkNotNull(charset);` since the user should use `GenericCsvInputFormat(Path)` when using the default charset? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request #2060: [FLINK-3921] StringParser encoding
GitHub user rekhajoshm opened a pull request: https://github.com/apache/flink/pull/2060 [FLINK-3921] StringParser encoding Corrected StringParser encoding You can merge this pull request into a Git repository by running: $ git pull https://github.com/rekhajoshm/flink FLINK-3921 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2060.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2060 commit 1e1ce9efefccf5a5585be802af4200e9d1ae7a98 Author: Rekha JoshiDate: 2016-06-01T20:03:50Z Merge pull request #1 from apache/master Apache Flink master pull commit 675b6a44e76ae71901bc6d4eaea1d09b6f789ff6 Author: Joshi Date: 2016-06-01T20:26:47Z [FLINK-3921] StringParser encoding --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---