[jira] [Commented] (SPARK-26645) CSV infer schema bug infers decimal(9,-1)

2020-11-25 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238798#comment-17238798
 ] 

Punit Shah commented on SPARK-26645:


Hello [~dongjoon], if we can get this PR merged, it would be tremendously helpful.

> CSV infer schema bug infers decimal(9,-1)
> -
>
> Key: SPARK-26645
> URL: https://issues.apache.org/jira/browse/SPARK-26645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 3.0.0
>
>
> We have a file /tmp/t1/file.txt that contains only one line: "1.18927098E9".
> Running:
> {code:python}
> df = spark.read.csv('/tmp/t1', header=False, inferSchema=True, sep='\t')
> print(df.dtypes)
> {code}
> causes:
> {noformat}
> ValueError: Could not parse datatype: decimal(9,-1)
> {noformat}
> I'm not sure where the bug is - inferSchema or dtypes?
> I saw that a decimal with a negative scale is treated as legal in the code 
> (CSVInferSchema.scala):
> {code:scala}
> if (bigDecimal.scale <= 0) {
>   // `DecimalType` conversion can fail when
>   //   1. The precision is bigger than 38.
>   //   2. scale is bigger than precision.
>   DecimalType(bigDecimal.precision, bigDecimal.scale)
> }
> {code}
> But what does a negative scale mean here?
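
For reference, the inferred pair (precision 9, scale -1) can be reproduced 
outside Spark; a minimal sketch using Python's decimal module, whose 
precision/scale convention matches java.math.BigDecimal for this value:

{code:python}
from decimal import Decimal

# "1.18927098E9" has 9 significant digits and exponent +1, so
# precision = 9 and scale = -exponent = -1 -- hence decimal(9,-1).
sign, digits, exponent = Decimal("1.18927098E9").as_tuple()
print(len(digits), -exponent)  # 9 -1
{code}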



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file

2020-11-18 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234919#comment-17234919
 ] 

Punit Shah commented on SPARK-33445:


Thank you very much [~dongjoon]

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: tsd.csv
>
>
> The attached file is a one-column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in an error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file

2020-11-18 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17234851#comment-17234851
 ] 

Punit Shah commented on SPARK-33445:


My apologies [~dongjoon] for the incorrect tags.  Please let me know what I 
need to do from my end, if anything, to continue to move this ticket forward.

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: tsd.csv
>
>
> The attached file is a one-column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in an error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33445) Can't parse decimal type from csv file

2020-11-17 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17233589#comment-17233589
 ] 

Punit Shah edited comment on SPARK-33445 at 11/17/20, 1:38 PM:
---

[~dongjoon]

As per the issue description, the call to mydf2.schema results in the error, 
not mydf2.printSchema().

Please test again.

And I know for a fact that in 2.4.3 printSchema succeeds, whereas schema fails.
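
The asymmetry can be checked with a snippet along these lines (a sketch: 
printSchema() renders the schema on the JVM side, while .schema re-parses the 
JVM schema string in Python, which is where the negative-scale decimal fails):

{code:python}
mydf2 = spark_session.read.csv("tsd.csv", header=True, inferSchema=True)
mydf2.printSchema()  # succeeds: the tree string is rendered by the JVM
mydf2.schema         # fails: ValueError: Could not parse datatype: decimal(6,-7)
{code}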


was (Author: bullsoverbears):
As per the issue description, the call to mydf2.schema results in the error, 
not mydf2.printSchema().

Please test again.

And I know for a fact that in 2.4.3 printSchema succeeds, whereas schema fails.

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one-column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in an error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33445) Can't parse decimal type from csv file

2020-11-17 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah reopened SPARK-33445:


As per the issue description, the call to mydf2.schema results in the error, 
not mydf2.printSchema().

Please test again.

And I know for a fact that in 2.4.3 printSchema succeeds, whereas schema fails.

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one-column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in an error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33445) Can't parse decimal type from csv file

2020-11-13 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33445:
---
Attachment: tsd.csv

> Can't parse decimal type from csv file
> --
>
> Key: SPARK-33445
> URL: https://issues.apache.org/jira/browse/SPARK-33445
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.6, 2.4.7, 3.0.0
>Reporter: Punit Shah
>Priority: Major
> Attachments: tsd.csv
>
>
> The attached file is a one-column csv file containing decimals.
> Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", 
> header=True, inferSchema=True){color}
> Then invoking {color:#de350b}mydf2.schema{color} will result in an error:
> {color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33445) Can't parse decimal type from csv file

2020-11-13 Thread Punit Shah (Jira)
Punit Shah created SPARK-33445:
--

 Summary: Can't parse decimal type from csv file
 Key: SPARK-33445
 URL: https://issues.apache.org/jira/browse/SPARK-33445
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0, 2.4.7, 2.4.6
Reporter: Punit Shah
 Attachments: tsd.csv

The attached file is a one-column csv file containing decimals.

Execute: {color:#de350b}mydf2 = spark_session.read.csv("tsd.csv", header=True, 
inferSchema=True){color}

Then invoking {color:#de350b}mydf2.schema{color} will result in an error:

{color:#ff8b00}ValueError: Could not parse datatype: decimal(6,-7){color}
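
Until inference handles this, one workaround is to skip inferSchema and supply 
an explicit schema; a minimal sketch, assuming the single column is named "val" 
(a hypothetical name, since the attachment defines the real header):

{code:python}
from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical column name "val"; reading the values as doubles avoids
# the negative-scale decimal(6,-7) that inference produces.
schema = StructType([StructField("val", DoubleType(), True)])
mydf2 = spark_session.read.csv("tsd.csv", header=True, schema=schema)
print(mydf2.schema)  # parses fine
{code}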



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Description: 
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 

  was:
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 


> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates 
> {color:#de350b}*incorrect*{color} results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query also generates 
> {color:#de350b}*incorrect*{color} results:{color}
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
> header=True).selectExpr("`User`", "cast(`FromDate` as date)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Description: 
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query {color:#de350b}*also*{color} 
generates *incorrect* results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 

  was:
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates *incorrect* 
results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 


> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates 
> {color:#de350b}*incorrect*{color} results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query {color:#de350b}*also*{color} 
> generates *incorrect* results:{color}
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
> header=True).selectExpr("`User`", "cast(`FromDate` as date)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Description: 
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates *incorrect* 
results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 

  was:
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 


> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates 
> {color:#de350b}*incorrect*{color} results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query also generates *incorrect* 
> results:{color}
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
> header=True).selectExpr("`User`", "cast(`FromDate` as date)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Description: 
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)")

 

  was:
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color}

 


> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates 
> {color:#de350b}*incorrect*{color} results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query also generates 
> {color:#de350b}*incorrect*{color} results:{color}
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, 
> header=True).selectExpr("`User`", "cast(`FromDate` as date)")
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Description: 
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates 
{color:#de350b}*incorrect*{color} results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query also generates 
{color:#de350b}*incorrect*{color} results:{color}
 * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color}

 

  was:
The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates correct results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query generates incorrect 
results:{color}
 * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color}

 


> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates 
> {color:#de350b}*incorrect*{color} results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query also generates 
> {color:#de350b}*incorrect*{color} results:{color}
>  * {color:#172b4d}outDF = spark_session.read.csv("users.csv", 
> inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as 
> date)"){color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-05 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226712#comment-17226712
 ] 

Punit Shah commented on SPARK-33327:


The correct result of running the query should be either:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21, 2013-12-13, 4

or:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4

Thanks for asking [~hyukjin.kwon].  Now I notice that both imports fail, as 
shown below.

spark_session.read.csv("users.csv", inferSchema=True, header=True) behaves 
incorrectly, returning:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4

spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)") also behaves 
incorrectly, returning:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13, 2013-02-21, 4
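
For completeness, the repro for both result sets above is (a sketch of the 
steps already described in this ticket):

{code:python}
outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
outDF.createOrReplaceTempView("table02")

# first()/last() are order-sensitive aggregates, so without an explicit
# ordering their output is one candidate source of the discrepancy
# reported here.
spark_session.sql("""
  select count(`User`) as cnt,
         first(`FromDate`) as `FromDate_First`,
         last(`FromDate`) as `FromDate_Last`,
         count(distinct(`FromDate`)) as cntdist
  from table02
  group by `User`
""").show()
{code}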

> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates correct results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query generates incorrect 
> results:{color}
>  * {color:#172b4d}outDF = spark_session.read.csv("users.csv", 
> inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as 
> date)"){color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-03 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-33327:
---
Attachment: users.csv

> grouped by first and last against date column returns incorrect results
> ---
>
> Key: SPARK-33327
> URL: https://issues.apache.org/jira/browse/SPARK-33327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 2.4.7
>Reporter: Punit Shah
>Priority: Major
> Attachments: users.csv
>
>
> The attached csv file has two columns, namely "User" and "FromDate".  The 
> import defaults the "FromDate" column as a timestamp. 
>  * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
>  * outDF.createOrReplaceTempView("table02")
> In this default case the following sql generates correct results:
> {color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
> `FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
> count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}
> {color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
> is read in as a Date), then the above sql query generates incorrect 
> results:{color}
>  * {color:#172b4d}outDF = spark_session.read.csv("users.csv", 
> inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as 
> date)"){color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33327) grouped by first and last against date column returns incorrect results

2020-11-03 Thread Punit Shah (Jira)
Punit Shah created SPARK-33327:
--

 Summary: grouped by first and last against date column returns 
incorrect results
 Key: SPARK-33327
 URL: https://issues.apache.org/jira/browse/SPARK-33327
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.7, 2.4.6
Reporter: Punit Shah


The attached csv file has two columns, namely "User" and "FromDate".  The 
import defaults the "FromDate" column as a timestamp. 
 * outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
 * outDF.createOrReplaceTempView("table02")

In this default case the following sql generates correct results:

{color:#de350b}*"select count(`User`) as cnt, first(`FromDate`) as 
`FromDate_First`, last(`FromDate`) as `FromDate_Last`, 
count(distinct(`FromDate`)) as cntdist from table02 group by `User`"*{color}

{color:#172b4d}However, if we read the dataframe like so (where the "FromDate" 
is read in as a Date), then the above sql query generates incorrect 
results:{color}
 * {color:#172b4d}outDF = spark_session.read.csv("users.csv", inferSchema=True, 
header=True).selectExpr("`User`", "cast(`FromDate` as date)"){color}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-10-01 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah reopened SPARK-32965:


The linked duplicate issue won't be fixed because it was mixed up with a 
multiline-feature issue.  However, my ticket deals exclusively with utf-16le and 
utf-16be encodings not being handled correctly via pyspark.

Therefore this issue is still open and unresolved.
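
For reference, the workaround from the description, spelled out end to end (a 
sketch; path_url is assumed to point at the utf_16le-encoded file):

{code:python}
# Decode the bytes on the Python side, then hand the RDD of lines to the
# csv reader; this sidesteps the reader's own encoding handling.
prdd = spark_session._sc.binaryFiles(path_url).values() \
    .flatMap(lambda x: x.decode("utf_16le").splitlines())
mydf = spark_session.read.csv(prdd, header=True, inferSchema=True)
mydf.show()
{code}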

> pyspark reading csv files with utf_16le encoding
> 
>
> Key: SPARK-32965
> URL: https://issues.apache.org/jira/browse/SPARK-32965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
> Attachments: 16le.csv, 32965.png
>
>
> If you have a file encoded in utf_16le or utf_16be and try to use 
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't 
> rendered properly.
> If you use python decoding like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
> x.decode("utf_16le").splitlines())
> and then call spark.read.csv(prdd), it works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-23 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200743#comment-17200743
 ] 

Punit Shah commented on SPARK-32965:


It looks similar.  I've attached a utf-16le file to this ticket.  The pyspark 
code is essentially:

spark.read.csv("16le.csv", inferSchema=True, header=True, encoding="utf_16le").

The attached picture shows the result.

> pyspark reading csv files with utf_16le encoding
> 
>
> Key: SPARK-32965
> URL: https://issues.apache.org/jira/browse/SPARK-32965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
> Attachments: 16le.csv, 32965.png
>
>
> If you have a file encoded in utf_16le or utf_16be and try to use 
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't 
> rendered properly.
> If you use python decoding like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
> x.decode("utf_16le").splitlines())
> and then call spark.read.csv(prdd), it works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-23 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-32965:
---
Attachment: 32965.png

> pyspark reading csv files with utf_16le encoding
> 
>
> Key: SPARK-32965
> URL: https://issues.apache.org/jira/browse/SPARK-32965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
> Attachments: 16le.csv, 32965.png
>
>
> If you have a file encoded in utf_16le or utf_16be and try to use 
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't 
> rendered properly.
> If you use python decoding like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
> x.decode("utf_16le").splitlines())
> and then call spark.read.csv(prdd), it works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-23 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah updated SPARK-32965:
---
Attachment: 16le.csv

> pyspark reading csv files with utf_16le encoding
> 
>
> Key: SPARK-32965
> URL: https://issues.apache.org/jira/browse/SPARK-32965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
> Attachments: 16le.csv
>
>
> If you have a file encoded in utf_16le or utf_16be and try to use 
> spark.read.csv("", encoding="utf_16le"), the dataframe isn't 
> rendered properly.
> If you use python decoding like:
> prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
> x.decode("utf_16le").splitlines())
> and then call spark.read.csv(prdd), it works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32965) pyspark reading csv files with utf_16le encoding

2020-09-22 Thread Punit Shah (Jira)
Punit Shah created SPARK-32965:
--

 Summary: pyspark reading csv files with utf_16le encoding
 Key: SPARK-32965
 URL: https://issues.apache.org/jira/browse/SPARK-32965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 2.4.7
Reporter: Punit Shah


If you have a file encoded in utf_16le or utf_16be and try to use 
spark.read.csv("", encoding="utf_16le"), the dataframe isn't rendered 
properly.

If you use python decoding like:

prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x : 
x.decode("utf_16le").splitlines())

and then call spark.read.csv(prdd), it works.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file

2020-09-22 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200163#comment-17200163
 ] 

Punit Shah commented on SPARK-32956:


That may work

> Duplicate Columns in a csv file
> ---
>
> Key: SPARK-32956
> URL: https://issues.apache.org/jira/browse/SPARK-32956
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Priority: Major
>
> Imagine a csv file shaped like:
> Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
> 1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
> 2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"
> Reading this with header=True will result in a stacktrace.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32956) Duplicate Columns in a csv file

2020-09-21 Thread Punit Shah (Jira)
Punit Shah created SPARK-32956:
--

 Summary: Duplicate Columns in a csv file
 Key: SPARK-32956
 URL: https://issues.apache.org/jira/browse/SPARK-32956
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 3.0.0, 2.4.7, 2.4.6, 2.4.5, 2.4.4, 2.4.3
Reporter: Punit Shah


Imagine a csv file shaped like:

Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"

Reading this with header=True will result in a stacktrace.
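
Spelled out, the failing read is just the following (a sketch, assuming the 
sample above is saved as a hypothetical file named dup.csv):

{code:python}
# With header=True, the duplicated "Sale_Amount" column name triggers the
# stacktrace on the affected versions.
df = spark_session.read.csv("dup.csv", header=True, inferSchema=True)
{code}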

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Punit Shah closed SPARK-32888.
--

Resolved by adding documentation

> reading a parallelized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: {color:#de350b}prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x){color}
>  * create a df like so: {color:#de350b}mydf = 
> spark.read.csv(prdd, header=True){color}
>  * {color:#de350b}mydf.count(){color} will result in a record count of zero 
> (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197069#comment-17197069
 ] 

Punit Shah commented on SPARK-32888:


Thank you for your reply [~viirya].  However, what I've noticed is that NO 
lines are removed during a straight csv import; that is why I raised the 
point.  There is a difference in results between reading the csv directly and 
reading it via an rdd: the rdd path removes lines, while a straight csv read 
does not.
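
The two paths side by side (a sketch of the difference described above, 
assuming test.csv holds the two identical lines):

{code:python}
# Direct file read: the first line becomes the header and the second
# stays as a data row.
df_file = spark_session.read.csv("test.csv", header=True)
print(df_file.count())  # 1

# RDD path: every line equal to the header is dropped, so the duplicate
# data row disappears as well (the behaviour this ticket documents).
prdd = spark_session.read.text("test.csv").rdd.flatMap(lambda x: x)
df_rdd = spark_session.read.csv(prdd, header=True)
print(df_rdd.count())   # 0
{code}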

> reading a parallelized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: {color:#de350b}prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x){color}
>  * create a df like so: {color:#de350b}mydf = 
> spark.read.csv(prdd, header=True){color}
>  * {color:#de350b}mydf.count(){color} will result in a record count of zero 
> (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196985#comment-17196985
 ] 

Punit Shah edited comment on SPARK-32888 at 9/16/20, 2:55 PM:
--

Why do we remove lines that are identical to the header?  This behaviour makes 
reading csv files directly differ from reading them via rdds.

[~viirya], I would appreciate your comment, thanks.


was (Author: bullsoverbears):
Why do we remove lines that are identical to the header?  This behaviour makes 
reading csv files directly differ from reading them via rdds.

> reading a parallelized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: {color:#de350b}prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x){color}
>  * create a df like so: {color:#de350b}mydf = 
> spark.read.csv(prdd, header=True){color}
>  * {color:#de350b}mydf.count(){color} will result in a record count of zero 
> (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-16 Thread Punit Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17196985#comment-17196985
 ] 

Punit Shah commented on SPARK-32888:


Why do we remove lines that are identical to the header?  This behaviour makes 
reading csv files directly differ from reading them via rdds.

> reading a parallelized rdd with two identical records results in a zero count 
> df when read via spark.read.csv
> ---
>
> Key: SPARK-32888
> URL: https://issues.apache.org/jira/browse/SPARK-32888
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Punit Shah
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row csv file like so (where the header and first record are 
> duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is pyspark code
>  * create a parallelized rdd like: {color:#de350b}prdd = 
> spark.read.text("test.csv").rdd.flatMap(lambda x : x){color}
>  * create a df like so: {color:#de350b}mydf = 
> spark.read.csv(prdd, header=True){color}
>  * {color:#de350b}mydf.count(){color} will result in a record count of zero 
> (when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv

2020-09-15 Thread Punit Shah (Jira)
Punit Shah created SPARK-32888:
--

 Summary: reading a parallelized rdd with two identical records 
results in a zero count df when read via spark.read.csv
 Key: SPARK-32888
 URL: https://issues.apache.org/jira/browse/SPARK-32888
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1, 3.0.0, 2.4.7, 2.4.6, 2.4.5
Reporter: Punit Shah


* Imagine a two-row csv file like so (where the header and first record are 
duplicate rows):

aaa,bbb

aaa,bbb
 * The following is pyspark code
 * create a parallelized rdd like: {color:#de350b}prdd = 
spark.read.text("test.csv").rdd.flatMap(lambda x : x){color}
 * create a df like so: {color:#de350b}mydf = 
spark.read.csv(prdd, header=True){color}
 * {color:#de350b}mydf.count(){color} will result in a record count of zero 
(when it should be 1)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org