Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


yaooqinn commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2113991332

   Thank you @xuzifu666.
   
   Merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


yaooqinn closed pull request #46509: [SPARK-48219][CORE] StreamReader Charset 
fix with UTF8
URL: https://github.com/apache/spark/pull/46509


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


xuzifu666 commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2113986338

   XSDtoSchema would not modify it, than HiveImpl had also changed can refer 
recent pr: https://github.com/apache/hive/pull/5243,so I Think it is nesscery 
to change it? @yaooqinn @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   XSDtoSchema would not modify it, than HiveImpl had also changed can refer 
recent pr: https://github.com/apache/hive/pull/5243



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-15 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1602578323


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   XSDtoSchema would not modify it, than HiveImpl had also changed can refer 
recent pr: https://github.com/apache/hive/pull/5243



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


HyukjinKwon commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596440366


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   As described above, I wouldn't touch Hive side.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596437790


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   Thanks,I cite hive firstly,than feedback to this pr? @HyukjinKwon XSD not do 
the change,only change hive



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596437790


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   Thanks,I cite hive firstly,than feedback to this pr? @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


HyukjinKwon commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596436036


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala:
##
@@ -48,7 +49,7 @@ object XSDToSchema extends Logging{
 val in = ValidatorUtil.openSchemaFile(xsdPath)
 val xmlSchemaCollection = new XmlSchemaCollection()
 xmlSchemaCollection.setBaseUri(xsdPath.toString)
-val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in))
+val xmlSchema = xmlSchemaCollection.read(new InputStreamReader(in, 
StandardCharsets.UTF_8))

Review Comment:
   This is arguable because if you don't specify the encoding, then it will 
pick the system default encoding up. If your XSD file is written in other 
encodings, but here tries to read UTF-8, it will fail.
   
   Other places they have to be UTF-8 because Spark encodes so explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


xuzifu666 commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596434880


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:
##
@@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws 
IOException {
   FileInputStream initStream = null;
   BufferedReader bufferedReader = null;
   initStream = new FileInputStream(fileName);
-  bufferedReader = new BufferedReader(new InputStreamReader(initStream));
+  bufferedReader = new BufferedReader(new InputStreamReader(initStream, 
StandardCharsets.UTF_8));

Review Comment:
   OK,I would address it @yaooqinn 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


yaooqinn commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596434063


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:
##
@@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws 
IOException {
   FileInputStream initStream = null;
   BufferedReader bufferedReader = null;
   initStream = new FileInputStream(fileName);
-  bufferedReader = new BufferedReader(new InputStreamReader(initStream));
+  bufferedReader = new BufferedReader(new InputStreamReader(initStream, 
StandardCharsets.UTF_8));

Review Comment:
   the code here is copied from Hive, do we have the same issue in the upstream 
repo? Better to cite here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


HyukjinKwon commented on code in PR #46509:
URL: https://github.com/apache/spark/pull/46509#discussion_r1596434994


##
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:
##
@@ -171,7 +172,7 @@ protected BufferedReader loadFile(String fileName) throws 
IOException {
   FileInputStream initStream = null;
   BufferedReader bufferedReader = null;
   initStream = new FileInputStream(fileName);
-  bufferedReader = new BufferedReader(new InputStreamReader(initStream));
+  bufferedReader = new BufferedReader(new InputStreamReader(initStream, 
StandardCharsets.UTF_8));

Review Comment:
   Yeah, I wouldn't touch here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


yaooqinn commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2104161932

   The change itself looks reasonable to me. I also agree with @dongjoon-hyun 
that we shall add a simple test, maybe in `XSDToSchemaSuite`.
   
   BTW, the PR is tagged as CORE but the changes belong to XML of sql 
datasource and hive thriftserver 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-10 Thread via GitHub


xuzifu666 commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2103937351

   @HyukjinKwon could you help to give a review? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-09 Thread via GitHub


dongjoon-hyun commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2103908417

   Sorry but I'll leave this to the other reviewers, @xuzifu666 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-09 Thread via GitHub


xuzifu666 commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2103906360

   @dongjoon-hyun Could you give a final review? Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-48219][CORE] StreamReader Charset fix with UTF8 [spark]

2024-05-09 Thread via GitHub


xuzifu666 commented on PR #46509:
URL: https://github.com/apache/spark/pull/46509#issuecomment-2103709532

   > Do you think you can provide a test coverage to protect your contribution 
from potential future regression, @xuzifu666 ?
   > 
   > > Not need
   
   @dongjoon-hyun Thanks for you attentions,In my option this code change not 
need to provide tests for it's a specification for ReadStream usage,if not set 
utf8 charset may occur error when system default charset not contains Chinese 
Chars. You can refer it in other framework such as Calcite,Hive,all set utf8 
when this method be called.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-48219][core] StreamReader Charset fix with UTF8 [spark]

2024-05-09 Thread via GitHub


xuzifu666 opened a new pull request, #46509:
URL: https://github.com/apache/spark/pull/46509

   
   
   ### What changes were proposed in this pull request?
   Fix some StreamReader not set with UTF8
   
   
   ### Why are the changes needed?
   May cause string decode not as expected
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   
   ### How was this patch tested?
   Not need
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org