Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/19814
To be honest, I would like to suggest disallowing it. I just ran a few tests, and it looks like we are still not able to read the data back:

Empty `quote` (`\u0000`):
```scala
Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF
  .write.mode("overwrite").option("quote", "").csv("tmp.csv")
spark.read.option("multiLine", true)
  .option("quote", "").csv("tmp.csv").collect()
```
```
Array[org.apache.spark.sql.Row] = Array([ a?,??b ], [b?,null], [ c?,??d ], [d?,null])
```
If `\u0000` really disables quoting on read, I think it should give the same results as when `quote` is set to another character, for example:
```scala
Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF
  .write.mode("overwrite").option("quote", "").csv("tmp.csv")
spark.read.option("multiLine", true)
  .option("quote", "^").csv("tmp.csv").collect()
```
```
Array[org.apache.spark.sql.Row] = Array([ a?,?b ], [b?,null], [ c?,?d ], [d?,null])
```
but the output is different, as shown above: the code points are Array(0, 98, 32) vs Array(0, 0, 98, 32) for `"?b "` vs `"??b "`.
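The two results differ only in the leading `\u0000` code points, which the console renders as `?`. A minimal pure-Scala sketch of how to compare them, using the field values copied from the outputs above:

```scala
// Code points of the second field as parsed with each reader quote setting.
// The \u0000 characters render as '?' in the console outputs above.
val readWithNullQuote  = "\u0000\u0000b "  // quote = "" (\u0000) on read
val readWithCaretQuote = "\u0000b "        // quote = "^" on read

println(readWithNullQuote.map(_.toInt).mkString("Array(", ", ", ")"))   // Array(0, 0, 98, 32)
println(readWithCaretQuote.map(_.toInt).mkString("Array(", ", ", ")"))  // Array(0, 98, 32)
```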
Default `quote`:
```scala
Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF
  .write.mode("overwrite").csv("tmp.csv")
spark.read.option("multiLine", true).csv("tmp.csv").collect()
```
```
Array[org.apache.spark.sql.Row] =
Array([a
a,b
b], [c
c,d
d])
```
Another `quote`:
```scala
Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF
  .write.mode("overwrite").option("quote", "!").csv("tmp.csv")
spark.read.option("multiLine", true)
  .option("quote", "!").csv("tmp.csv").collect()
```
```
Array[org.apache.spark.sql.Row] =
Array([a
a,b
b], [c
c,d
d])
```