Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2113991332 Thank you @xuzifu666. Merged to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn closed pull request #46509: [SPARK-48219][CORE] StreamReader Charset fix with UTF8 URL: https://github.com/apache/spark/pull/46509 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2113986338 XSDtoSchema would not modify it, than HiveImpl had also changed can refer recent pr: https://github.com/apache/hive/pull/5243,so I Think it is nesscery to change it? @yaooqinn @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: XSDtoSchema would not modify it, than HiveImpl had also changed can refer recent pr: https://github.com/apache/hive/pull/5243 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: XSDtoSchema would not modify it, than HiveImpl had also changed can refer recent pr: https://github.com/apache/hive/pull/5243 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
HyukjinKwon commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596440366 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: As described above, I wouldn't touch Hive side. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596437790 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: Thanks,I cite hive firstly,than feedback to this pr? @HyukjinKwon XSD not do the change,only change hive -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596437790 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: Thanks,I cite hive firstly,than feedback to this pr? @HyukjinKwon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
HyukjinKwon commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596436036 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala: ## @@ -48,7 +49,7 @@ object XSDToSchema extends Logging{ val in = ValidatorUtil.openSchemaFile(xsdPath) val xmlSchemaCollection = new XmlSchemaCollection() xmlSchemaCollection.setBaseUri(xsdPath.toString) -val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in)) +val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, StandardCharsets.UTF_8)) Review Comment: This is arguable because if you don't specify the encoding, then it will pick the system default encoding up. If your XSD file is written in other encodings, but here tries to read UTF-8, it will fail. Other places they have to be UTF-8 because Spark encodes so explicitly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596434880 ## sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java: ## @@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws IOException { FileInputStream initStream = null; BufferedReader bufferedReader = null; initStream = new FileInputStream(fileName); - bufferedReader = new BufferedReader(new InputStreamReader(initStream)); + bufferedReader = new BufferedReader(new InputStreamReader(initStream, StandardCharsets.UTF_8)); Review Comment: OK,I would address it @yaooqinn -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596434063 ## sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java: ## @@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws IOException { FileInputStream initStream = null; BufferedReader bufferedReader = null; initStream = new FileInputStream(fileName); - bufferedReader = new BufferedReader(new InputStreamReader(initStream)); + bufferedReader = new BufferedReader(new InputStreamReader(initStream, StandardCharsets.UTF_8)); Review Comment: the code here is copied from Hive, do we have the same issue in the upstream repo? Better to cite here -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
HyukjinKwon commented on code in PR #46509: URL: https://github.com/apache/spark/pull/46509#discussion_r1596434994 ## sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java: ## @@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws IOException { FileInputStream initStream = null; BufferedReader bufferedReader = null; initStream = new FileInputStream(fileName); - bufferedReader = new BufferedReader(new InputStreamReader(initStream)); + bufferedReader = new BufferedReader(new InputStreamReader(initStream, StandardCharsets.UTF_8)); Review Comment: Yeah, I wouldn't touch here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
yaooqinn commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2104161932 The change itself looks reasonable to me. I also agree with @dongjoon-hyun that we shall add a simple test, maybe in `XSDToSchemaSuite`. BTW, the PR is tagged as CORE but the changes belong to XML of sql datasource and hive thriftserver -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2103937351 @HyukjinKwon could you help to give a review? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
dongjoon-hyun commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2103908417 Sorry but I'll leave this to the other reviewers, @xuzifu666 . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2103906360 @dongjoon-hyun Could you give a final review? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]
xuzifu666 commented on PR #46509: URL: https://github.com/apache/spark/pull/46509#issuecomment-2103709532 > Do you think you can provide a test coverage to protect your contribution from potential future regression, @xuzifu666 ? > > > Not need @dongjoon-hyun Thanks for you attentions,In my option this code change not need to provide tests for it's a specification for ReadStream usage,if not set utf8 charset may occur error when system default charset not contains Chinese Chars. You can refer it in other framework such as Calcite,Hive,all set utf8 when this method be called. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-48219][core] StreamReader Charset fix with UTF8 [spark]
xuzifu666 opened a new pull request, #46509: URL: https://github.com/apache/spark/pull/46509 ### What changes were proposed in this pull request? Fix some StreamReader not set with UTF8 ### Why are the changes needed? May cause string decode not as expected ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Not need ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org