sumitsu commented on a change in pull request #23665: [SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect
URL: https://github.com/apache/spark/pull/23665#discussion_r251273243
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ##########
 @@ -55,11 +56,15 @@ class FailureSafeParser[IN](
 
   def parse(input: IN): Iterator[InternalRow] = {
     try {
-      if (skipParsing) {
-        Iterator.single(InternalRow.empty)
-      } else {
-        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
-      }
+      if (skipParsing) {
+        if (unparsedRecordIsNonEmpty(input)) {
+          Iterator.single(InternalRow.empty)
+        } else {
+          Iterator.empty
+        }
+      } else {
+        rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
+      }
 
 Review comment:
   One of my earlier revisions [worked in the way you suggest](https://github.com/sumitsu/spark/blob/13942b8da17d7e58e690b7a8b27c1a3fe20be135/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala#L132-L160) (if I've understood your point correctly); I changed it in order to avoid redundant empty-line filtering in the case where the full parser has to run anyway (i.e., where `skipParsing == false`).
   
   What do you think? Is that a valid optimization, or is it better to do the filtering in the JSON code, as in 13942b8, to avoid changes to `FailureSafeParser`?
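   
   For reference, below is a minimal, self-contained Scala sketch of the behavior the diff above introduces. This is not the PR's actual code: the input is modeled as a plain `String`, `unparsedRecordIsNonEmpty` is a hypothetical whitespace check (the real predicate operates on the parser's generic `IN` type), and a placeholder object stands in for Spark's `InternalRow.empty`.
   
   ```scala
   object SkipParsingSketch {
     // Hypothetical predicate: a record counts as non-empty if it contains
     // any non-whitespace character. The PR's real implementation may differ.
     def unparsedRecordIsNonEmpty(record: String): Boolean =
       record != null && record.exists(!_.isWhitespace)
   
     case object EmptyRow // stand-in for InternalRow.empty
   
     // Skip-parsing path from the diff above: one placeholder row per
     // non-blank record, and no rows at all for a blank line.
     def parse(input: String): Iterator[Any] =
       if (unparsedRecordIsNonEmpty(input)) Iterator.single(EmptyRow)
       else Iterator.empty
   
     def main(args: Array[String]): Unit = {
       println(parse("""{"a": 1}""").toList) // List(EmptyRow): record counted
       println(parse("   ").toList)          // List(): blank line skipped
     }
   }
   ```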
