[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-09-11 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610164#comment-16610164
 ] 

Apache Spark commented on SPARK-17916:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22389

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-09-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608217#comment-16608217
 ] 

Apache Spark commented on SPARK-17916:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22367

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-09-08 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608215#comment-16608215
 ] 

Apache Spark commented on SPARK-17916:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22367

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-09-01 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599717#comment-16599717
 ] 

Apache Spark commented on SPARK-17916:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/22312

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584393#comment-16584393
 ] 

koert kuipers commented on SPARK-17916:
---

now the particular unit test that broke for us, where nulls come back in as 
quoted strings, is this:
{code:scala}
  val litNull: String = null
  val df = Seq(
(1, "John Doe"),
(2, ""),
(3, "-"),
(4, litNull)
  ).toDF("id", "name")

  df
.write
.format("csv")
.option("header", true)
.option("delimiter", "|")
.option("quote", "☃")
.save("/tmp/abc")

  val df1 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", "|")
.option("quote", "☃")
.load("/tmp/abc")
  df1.show

+---++
| id|name|
+---++
|  1|John Doe|
|  2|  ""|
|  4|  ""|
|  3|   -|
+---++
{code}

so here we are writing out with bar as delimiter, and quoting is not supported 
at all (so i set it to a silly character, i cannot think of better option).
i am not sure i understand yet why the nulls come back in as strings, but it 
has to do something with setting the quote character, because if i dont do that 
the test passes.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584381#comment-16584381
 ] 

koert kuipers commented on SPARK-17916:
---

we also use csv format to write files like for example bsv (bar delimited, 
interpreted by systems that do not support quoting at all). having an output 
that used to be:
{code:bash}
1|John Doe
2|
3|-
4|
{code}
become:
{code:bash}
1|John Doe
2|""
3|-
4|""
{code}
will no go down well.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584378#comment-16584378
 ] 

koert kuipers commented on SPARK-17916:
---

my first observation is that if i do this:
{code:scala}
val litNull: String = null
val df = Seq(
  (1, "John Doe"),
  (2, ""),
  (3, "-"),
  (4, litNull)
).toDF("id", "name")

df
  .write
  .csv("/tmp/abc1")
{code}
and inspect in bash
{code:bash}
cat /tmp/abc1/part-*.csv
1,John Doe
2,""
3,-
4,""
{code}
notice how that line has 4,""
if i do the same exercise in spark 2.3 i get:
{code:bash}
cat /tmp/abc1/part-*.csv
1,John Doe
2,
3,-
4,
{code}

so my actual csv data has changed upon writing. that makes me nervous about 
compatibility with other systems that read data we produce.


> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584374#comment-16584374
 ] 

koert kuipers commented on SPARK-17916:
---

hi [~maxgekk]
i saw your unit test for the old behavior. let me try to find out what is not 
working for us.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584357#comment-16584357
 ] 

Maxim Gekk commented on SPARK-17916:


> he default behavior in 2.3.x for csv format is that when i write out null 
>value, it comes back in as null. when i write out empty string, it also comes 
>back in as null.

[~koert] Please, have a look at the added test: 
[https://github.com/apache/spark/pull/21273/files#diff-219ac8201e443435499123f96e94d29fR1355]
 . It checks exactly what you described. If you have something different, 
please, leave the code here.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-08-17 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584324#comment-16584324
 ] 

koert kuipers commented on SPARK-17916:
---

the default behavior in 2.3.x for csv format is that when i write out null 
value, it comes back in as null. when i write out empty string, it also comes 
back in as null.

now my nulls are coming back in as empty strings. please advice what settings i 
need to get behavior of 2.3 back.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-05-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467998#comment-16467998
 ] 

Apache Spark commented on SPARK-17916:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21273

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Priority: Major
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2017-12-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302594#comment-16302594
 ] 

Apache Spark commented on SPARK-17916:
--

User 'aa8y' has created a pull request for this issue:
https://github.com/apache/spark/pull/20068

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-29 Thread Jork Zijlstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704736#comment-15704736
 ] 

Jork Zijlstra commented on SPARK-17916:
---

I also have the same issue in 2.0.1. This code seems to be the problem:

private def rowToString(row: InternalRow): Seq[String] = {
var i = 0
val values = new Array[String](row.numFields)
while (i < row.numFields) {
  if (!row.isNullAt(i)) {
values(i) = valueConverters(i).apply(row, i)
  } else {
values(i) = params.nullValue
  }
  i += 1
}
values
  }


def castTo(
  datum: String,
  castType: DataType,
  nullable: Boolean = true,
  options: CSVOptions = CSVOptions()): Any = {

if (nullable && datum == options.nullValue) {
  null
} else {

}

So first the missing value in the data in transformed into the nullValue. Then 
in the castTo the value is checked against the nullValue, which is always true 
for a missing value. 

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652414#comment-15652414
 ] 

Eric Liang commented on SPARK-17916:


In our case, a user wants the empty string (whether actually missing, e.g. ,, 
or quoted ,""), to resolve as the empty string. It should only turn into null 
if nullValue is set to "". There doesn't currently appear to be some option 
combination that allows this.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-09 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651172#comment-15651172
 ] 

Suresh Thalamati commented on SPARK-17916:
--

@Eric Liang   If it is possible , can you please share the data, and expected 
output with the options 

I am trying to fix this issue in PR ; https://github.com/apache/spark/pull/12904



> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-08 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649039#comment-15649039
 ] 

Eric Liang commented on SPARK-17916:


We're hitting this as a regression from 2.0 as well.

Ideally, we don't want the empty string to be treated specially in any 
scenario. The only logic that converts it to nulls should be due to the 
nullValue option.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15594046#comment-15594046
 ] 

Felix Cheung commented on SPARK-17916:
--

So here's what happen.

First, R read.csv has clearly documented that it treats empty/blank string the 
same as NA in the following condition: "Blank fields are also considered to be 
missing values in logical, integer, numeric and complex fields."

Second, in this example in R, the 2nd column is turned into "logical", instead 
of "character" (ie. string) as expected:
{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
11   NA
22   NA
> str(df)
'data.frame':   2 obs. of  2 variables:
 $ col1: int  1 2
 $ col2: logi  NA NA
{code}

And that is why the blank string is turned into NA.

Whereas if the data.frame has character/factor column instead, the blank field 
is retained as blank:
{code}
> d <- "col1,col2
+ 1,\"###\"
+ 2,\"\"
+ 3,\"this is a string\""
> df <- read.csv(text=d, quote="\"", na.strings=c("###"))
> df
  col1 col2
11 
22
33 this is a string
> str(df)
'data.frame':   3 obs. of  2 variables:
 $ col1: int  1 2 3
 $ col2: Factor w/ 2 levels "","this is a string": NA 1 2
{code}

IMO this behavior makes sense.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593300#comment-15593300
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Could I please ask what you think? cc [~felixcheung]

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593271#comment-15593271
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Oh, yes sure. I just thought the root problem is to differentiate {{""}}. Once 
we can distinguish it, we can easily transform it. Also, another point I want 
to make was.. we already have a great reference in R but it seems not handling 
this case.


> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593119#comment-15593119
 ] 

Suresh Thalamati commented on SPARK-17916:
--

Thank you for trying out the different scenarios. I think output you are 
getting after setting he quote to empty is not what is expected in the case. 
You want "" to be recognized as empty string, not actual quotes in the output.

Example (Before my changes on 2.0.1 branch):

input:
col1,col2
1,"-"
2,""
3,
4,"A,B"

val df = spark.read.format("csv").option("nullValue", "\"-\"").option("quote", 
"").option("header", true).load("/Users/suresht/sparktests/emptystring.csv")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala> df.selectExpr("length(col2)").show
++
|length(col2)|
++
|null|
|   2|
|null|
|   2|
++


scala> df.show
+++
|col1|col2|
+++
|   1|null|
|   2|  ""|
|   3|null|
|   4|  "A|
+++





> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591417#comment-15591417
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

FWIW, the codes I ran in Spark (equalvalent with the last example in R) is as 
blow.

{code}
spark.read.format("csv")
  .option("nullValue", "\"-\"")
  .option("quote", "")
  .option("header", "true")
  .load("path")
  .show()
{code}

{code}
+++
|col1|col2|
+++
|   1|null|
|   2|  ""|
+++
{code}

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591408#comment-15591408
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

[~falaki] I just have thought about this more but I started to feel we might 
not need this.
I tested {{read.csv()}} in R as well after setting the equalvalent default 
values in Spark CSV options.

{code}
> d <- "col1,col2
+ 1,\"-\"
+ 2,\"\""
> df <- read.csv(text=d, quote="\"", na.strings=c("-"))
> df
  col1 col2
11   NA
22   NA
{code}

It seems {{read.csv()}} in R seems not having this option[1] and handles 
equalvalently with the current behaviour of Spark CSV dataource.
Shouldn't we maybe do this as below?
 
{code}
> df <- read.csv(text=d, quote="", na.strings=c("\"-\""))
> df
  col1 col2
11 
22   ""
{code}

I am not saying we should exactly match {{read.csv()}} in R to {{read.csv()}} 
in Spark but maybe I guess it'd be nicer if the behaviour is matched up in 
general.
I guess we could add this option (as you pointed out) but I am worried if this 
brings some confusions between {{nullValue}} and {{emptyValue}} as pointed out 
in the PR.

[1]https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573692#comment-15573692
 ] 

Hossein Falaki commented on SPARK-17916:


Thanks for linking it. Yes they are very much same issues. However, I slightly 
disagree with the proposed solution. I will comment on the PR.

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-10-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573668#comment-15573668
 ] 

Hyukjin Kwon commented on SPARK-17916:
--

Hi [~falaki], this JIRA rings a bell to me. Do you mind if I ask to take a look 
https://github.com/apache/spark/pull/12904 and SPARK-15125. Could you confirm 
that it is a duplicate maybe?

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When user configures {{nullValue}} in CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org