[jira] [Commented] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

2018-11-05 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675110#comment-16675110 ]

Maxim Gekk commented on SPARK-25890:


I have double-checked on branch-2.4. It doesn't have the problem either.

> Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.
> --------------------------------------------------------------------------
>
> Key: SPARK-25890
> URL: https://issues.apache.org/jira/browse/SPARK-25890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.3.2
>Reporter: Lakshminarayan Kamath
>Priority: Major
>
> Reading a Ctrl-A-delimited CSV file ignores rows with all null values.
> However, a comma-delimited CSV file doesn't.
> *Reproduction in spark-shell:*
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val l = List(List(1, 2), List(null, null), List(2, 3))
> val datasetSchema = StructType(List(
>   StructField("colA", IntegerType, true),
>   StructField("colB", IntegerType, true)))
> val rdd = sc.parallelize(l).map(item => Row.fromSeq(item.toSeq))
> val df = spark.createDataFrame(rdd, datasetSchema)
>
> df.show()
> +----+----+
> |colA|colB|
> +----+----+
> |   1|   2|
> |null|null|
> |   2|   3|
> +----+----+
>
> df.write.option("delimiter", "\u0001").option("header", "true").csv("/ctrl-a-separated.csv")
> df.write.option("delimiter", ",").option("header", "true").csv("/comma-separated.csv")
>
> val commaDf = spark.read.option("header", "true").option("delimiter", ",").csv("/comma-separated.csv")
> commaDf.show
> +----+----+
> |colA|colB|
> +----+----+
> |   1|   2|
> |   2|   3|
> |null|null|
> +----+----+
>
> val ctrlaDf = spark.read.option("header", "true").option("delimiter", "\u0001").csv("/ctrl-a-separated.csv")
> ctrlaDf.show
> +----+----+
> |colA|colB|
> +----+----+
> |   1|   2|
> |   2|   3|
> +----+----+
> {code}
> As seen above, for the Ctrl-A-delimited CSV, the row containing only null values is ignored.
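For comparison outside Spark, a plain-Python round trip with the same Ctrl-A delimiter does preserve an all-empty row, since a row of empty fields still serializes to a line containing the delimiter. This is a minimal sketch using only the standard csv module; it does not exercise Spark's reader:

```python
import csv
import io

# Rows mirroring the Spark reproduction; null becomes an empty
# string, which is Spark's default nullValue representation on write.
rows = [["1", "2"], ["", ""], ["2", "3"]]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\x01")  # \x01 == Ctrl-A == \u0001
writer.writerow(["colA", "colB"])
writer.writerows(rows)

# Read it back: the all-empty row survives, because its serialized
# line is "\x01" -- not blank, so no blank-line skipping applies.
buf.seek(0)
reader = csv.reader(buf, delimiter="\x01")
header = next(reader)
print(list(reader))  # [['1', '2'], ['', ''], ['2', '3']]
```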



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

2018-11-05 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675088#comment-16675088 ]

Maxim Gekk commented on SPARK-25890:


I got the following on the commit *4afb35*:
{code:scala}
scala> val ctrlaDf = spark.read.option("header", "true").option("delimiter", "\u0001").csv("ctrl-a-separated.csv")
ctrlaDf: org.apache.spark.sql.DataFrame = [colA: string, colB: string]

scala> ctrlaDf.show
+----+----+
|colA|colB|
+----+----+
|null|null|
|   2|   3|
|   1|   2|
+----+----+
{code}

The ctrl-a-separated.csv file contains \u0001 as the delimiter:
{code}
hexdump -C ctrl-a-separated.csv/part-2-b13ced94-e5d1-406b-afd8-565acd649261-c000.csv
00000000  63 6f 6c 41 01 63 6f 6c  42 0a 31 01 32 0a        |colA.colB.1.2.|
0000000e
{code}
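The dumped bytes can be decoded to confirm that partition file's layout: a header row "colA\x01colB" and one data row "1\x012", each newline-terminated. A small Python check, with the hex string copied verbatim from the dump above:

```python
# Bytes copied from the hexdump above.
hex_dump = "63 6f 6c 41 01 63 6f 6c 42 0a 31 01 32 0a"
data = bytes.fromhex(hex_dump.replace(" ", ""))

assert data.decode("ascii") == "colA\x01colB\n1\x012\n"
assert len(data) == 0x0E  # 14 bytes, matching the final hexdump offset
print("layout confirmed")
```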




[jira] [Commented] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

2018-11-05 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675038#comment-16675038 ]

Hyukjin Kwon commented on SPARK-25890:
--

[~maxgekk], do you mean you couldn't reproduce this against the current master?




[jira] [Commented] (SPARK-25890) Null rows are ignored with Ctrl-A as a delimiter when reading a CSV file.

2018-11-03 Thread Maxim Gekk (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16674112#comment-16674112 ]

Maxim Gekk commented on SPARK-25890:


I could not reproduce the issue on the master branch.
