[jira] [Created] (SPARK-33445) Can't parse decimal type from csv file
Punit Shah created SPARK-33445:
-----------------------------------

             Summary: Can't parse decimal type from csv file
                 Key: SPARK-33445
                 URL: https://issues.apache.org/jira/browse/SPARK-33445
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 3.0.0, 2.4.7, 2.4.6
            Reporter: Punit Shah
         Attachments: tsd.csv

The attached file is a one-column csv file containing decimals.

Execute:

mydf2 = spark_session.read.csv("tsd.csv", header=True, inferSchema=True)

Then invoking mydf2.schema results in the error:

ValueError: Could not parse datatype: decimal(6,-7)
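A possible workaround, offered here as a sketch rather than anything from the ticket: skip inference and give the reader an explicit schema, so no negative-scale decimal is ever produced for this column. The column name "value" is a guess, since tsd.csv's header isn't shown.

{code:python}
# Workaround sketch (an assumption, not from the ticket): an explicit schema
# sidesteps the decimal(6,-7) that inference produces for tsd.csv.
# "value" is a hypothetical column name; adjust to the real header.
from pyspark.sql import SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("value", DecimalType(38, 18), True)])
mydf2 = spark.read.csv("tsd.csv", header=True, schema=schema)
mydf2.schema  # a plain decimal(38,18) parses fine on the Python side
{code}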
[jira] [Updated] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33445:
-------------------------------
    Attachment: tsd.csv

> Can't parse decimal type from csv file
> --------------------------------------
>
>                 Key: SPARK-33445
>                 URL: https://issues.apache.org/jira/browse/SPARK-33445
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.4.6, 2.4.7, 3.0.0
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: tsd.csv
[jira] [Reopened] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah reopened SPARK-33445:
--------------------------------

As per the issue description, the call to mydf2.schema results in the error, not mydf2.printSchema(). Please test again. I know for a fact that in 2.4.3, printSchema() succeeds whereas schema fails.

> Can't parse decimal type from csv file
> --------------------------------------
>
>                 Key: SPARK-33445
>                 URL: https://issues.apache.org/jira/browse/SPARK-33445
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7, 3.0.0
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: tsd.csv
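A plausible reading of the printSchema-versus-schema split, offered as an assumption rather than something confirmed in this thread: printSchema() renders the schema tree on the JVM side, while df.schema makes PySpark parse the JVM's datatype string into Python objects, and that Python-side parser rejects a negative decimal scale. A minimal sketch:

{code:python}
# Sketch of the reported asymmetry (tsd.csv is the ticket's attachment;
# the outcomes below are as reported in this thread, not re-verified here).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
mydf2 = spark.read.csv("tsd.csv", header=True, inferSchema=True)

mydf2.printSchema()  # formatted on the JVM side -- reported to succeed
mydf2.schema         # parsed into Python StructType -- reported to raise
                     # ValueError: Could not parse datatype: decimal(6,-7)
{code}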
[jira] [Comment Edited] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233589#comment-17233589 ]

Punit Shah edited comment on SPARK-33445 at 11/17/20, 1:38 PM:
---------------------------------------------------------------

[~dongjoon] As per the issue description, the call to mydf2.schema results in the error, not mydf2.printSchema(). Please test again. I know for a fact that in 2.4.3, printSchema() succeeds whereas schema fails.

was (Author: bullsoverbears): the same text without the [~dongjoon] mention.

> Can't parse decimal type from csv file
> --------------------------------------
>
>                 Key: SPARK-33445
>                 URL: https://issues.apache.org/jira/browse/SPARK-33445
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7, 3.0.0
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: tsd.csv
[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234851#comment-17234851 ]

Punit Shah commented on SPARK-33445:
------------------------------------

My apologies, [~dongjoon], for the incorrect tags. Please let me know what, if anything, I need to do on my end to keep this ticket moving forward.

> Can't parse decimal type from csv file
> --------------------------------------
>
>                 Key: SPARK-33445
>                 URL: https://issues.apache.org/jira/browse/SPARK-33445
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>             Fix For: 3.0.0
>         Attachments: tsd.csv
[jira] [Commented] (SPARK-33445) Can't parse decimal type from csv file
[ https://issues.apache.org/jira/browse/SPARK-33445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234919#comment-17234919 ]

Punit Shah commented on SPARK-33445:
------------------------------------

Thank you very much, [~dongjoon].

> Can't parse decimal type from csv file
> --------------------------------------
>
>                 Key: SPARK-33445
>                 URL: https://issues.apache.org/jira/browse/SPARK-33445
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>             Fix For: 3.0.0
>         Attachments: tsd.csv
[jira] [Commented] (SPARK-26645) CSV infer schema bug infers decimal(9,-1)
[ https://issues.apache.org/jira/browse/SPARK-26645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238798#comment-17238798 ]

Punit Shah commented on SPARK-26645:
------------------------------------

Hello [~dongjoon], if we can get this PR in, it would be tremendously helpful.

> CSV infer schema bug infers decimal(9,-1)
> ------------------------------------------
>
>                 Key: SPARK-26645
>                 URL: https://issues.apache.org/jira/browse/SPARK-26645
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ohad Raviv
>            Assignee: Marco Gaido
>            Priority: Minor
>             Fix For: 3.0.0
>
> We have a file /tmp/t1/file.txt that contains only one line, "1.18927098E9".
> Running:
> {code:python}
> df = spark.read.csv('/tmp/t1', header=False, inferSchema=True, sep='\t')
> print df.dtypes
> {code}
> causes:
> {noformat}
> ValueError: Could not parse datatype: decimal(9,-1)
> {noformat}
> I'm not sure where the bug is - inferSchema or dtypes?
> I saw it is legal to have a decimal with negative scale in the code (CSVInferSchema.scala):
> {code:scala}
> if (bigDecimal.scale <= 0) {
>   // `DecimalType` conversion can fail when
>   // 1. The precision is bigger than 38.
>   // 2. scale is bigger than precision.
>   DecimalType(bigDecimal.precision, bigDecimal.scale)
> }
> {code}
> but what does it mean?
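For context, an illustration of where decimal(9,-1) comes from (mine, not from the ticket): BigDecimal, like Python's decimal module, stores a coefficient plus an exponent, so "1.18927098E9" retains its nine significant digits with exponent +1, i.e. scale -1, rather than expanding to 1189270980:

{code:python}
# Why "1.18927098E9" infers as decimal(9,-1): the coefficient has 9 digits
# and the exponent is +1, which is scale -1 in java.math.BigDecimal terms
# (Python's decimal module models the same coefficient/exponent layout).
from decimal import Decimal

d = Decimal("1.18927098E9")
print(d.as_tuple())
# DecimalTuple(sign=0, digits=(1, 1, 8, 9, 2, 7, 0, 9, 8), exponent=1)
# -> precision 9, scale -1
{code}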
[jira] [Created] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
Punit Shah created SPARK-32888:
-----------------------------------

             Summary: reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
                 Key: SPARK-32888
                 URL: https://issues.apache.org/jira/browse/SPARK-32888
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1, 3.0.0, 2.4.7, 2.4.6, 2.4.5
            Reporter: Punit Shah

* Imagine a two-row csv file like so (where the header and first record are duplicate rows):

aaa,bbb
aaa,bbb

* The following is pyspark code (a runnable sketch follows this message):
* Create a parallelized rdd: prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
* Create a df: mydf = spark.read.csv(prdd, header=True)
* mydf.count() will return a record count of zero (when it should be 1)
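The steps above, put together as a self-contained sketch (it writes the two-line test.csv itself and assumes local mode; the counts are the behaviour reported here):

{code:python}
# Repro sketch for SPARK-32888: a data row that duplicates the header
# survives a direct file read but vanishes when read through an RDD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open("test.csv", "w") as f:
    f.write("aaa,bbb\naaa,bbb\n")

# Reading the file path directly keeps the duplicate row:
direct = spark.read.csv("test.csv", header=True)
print(direct.count())  # 1, as expected

# Reading the same lines through an RDD drops every line equal to the
# header, so the duplicate data row disappears:
prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
via_rdd = spark.read.csv(prdd, header=True)
print(via_rdd.count())  # 0 on the affected versions, per the report
{code}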
[jira] [Commented] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985 ]

Punit Shah commented on SPARK-32888:
------------------------------------

Why do we remove lines that are the same as the header? Because of this behaviour, the results differ between reading csv files directly and reading them from rdds.

> reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32888
>                 URL: https://issues.apache.org/jira/browse/SPARK-32888
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.8, 3.0.2, 3.1.0
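The mechanics seem to be (my reading, not something stated in this thread): given an RDD there is no well-defined "first line of the file", so with header=True Spark treats every line equal to the header as a header copy and drops it. A toy sketch of that filter, with hypothetical names:

{code:python}
# Hypothetical sketch of the RDD header handling (illustrative names only;
# the real logic lives in Spark's CSVUtils on the JVM side).
def drop_header_copies(lines, header):
    # An RDD has no notion of the physical first line, so every line that
    # matches the header is assumed to be a header copy and removed.
    return [line for line in lines if line != header]

print(drop_header_copies(["aaa,bbb", "aaa,bbb"], "aaa,bbb"))  # [] -> 0 rows
{code}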
[jira] [Comment Edited] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196985#comment-17196985 ]

Punit Shah edited comment on SPARK-32888 at 9/16/20, 2:55 PM:
--------------------------------------------------------------

Why do we remove lines that are the same as the header? Because of this behaviour, the results differ between reading csv files directly and reading them from rdds. [~viirya], I would appreciate your comment. Thanks.

was (Author: bullsoverbears): the same text without the [~viirya] mention.

> reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32888
>                 URL: https://issues.apache.org/jira/browse/SPARK-32888
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.8, 3.0.2, 3.1.0
[jira] [Commented] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197069#comment-17197069 ]

Punit Shah commented on SPARK-32888:
------------------------------------

Thank you for your reply, [~viirya]. What I've noticed, however, is that NO lines are removed during a straight csv import; that is why I raised the point. The results differ between reading from csv directly and reading from an rdd: the rdd path removes lines, while the straight csv path doesn't.

> reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32888
>                 URL: https://issues.apache.org/jira/browse/SPARK-32888
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.8, 3.0.2, 3.1.0
[jira] [Closed] (SPARK-32888) reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
[ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah closed SPARK-32888.
------------------------------
Resolved by adding documentation.

> reading a parallelized rdd with two identical records results in a zero count df when read via spark.read.csv
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32888
>                 URL: https://issues.apache.org/jira/browse/SPARK-32888
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.8, 3.0.2, 3.1.0
[jira] [Created] (SPARK-32956) Duplicate Columns in a csv file
Punit Shah created SPARK-32956:
-----------------------------------

             Summary: Duplicate Columns in a csv file
                 Key: SPARK-32956
                 URL: https://issues.apache.org/jira/browse/SPARK-32956
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.1, 3.0.0, 2.4.7, 2.4.6, 2.4.5, 2.4.4, 2.4.3
            Reporter: Punit Shah

Imagine a csv file shaped like:

Id,Product,Sale_Amount,Sale_Units,Sale_Amount2,Sale_Amount,Sale_Price
1,P,"6,40,728","6,40,728","6,40,728","6,40,728","6,40,728"
2,P,"5,81,644","5,81,644","5,81,644","5,81,644","5,81,644"

Reading this with header=True results in a stack trace (the header repeats the name Sale_Amount).
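One workaround, suggested here as a sketch rather than taken from the ticket: read without a header so the duplicate names never collide, then assign unique names yourself ("dup.csv" is a hypothetical file name):

{code:python}
# Workaround sketch for duplicate header names ("dup.csv" is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv("dup.csv", header=False)   # duplicate names can't clash
header = raw.first()                            # row 0 holds the real names

# Build unique names: Sale_Amount, Sale_Amount_2, ...
names, seen = [], {}
for name in header:
    seen[name] = seen.get(name, 0) + 1
    names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

# Drop the header row, then apply the de-duplicated names.
data = raw.rdd.zipWithIndex().filter(lambda t: t[1] > 0).keys().toDF(names)
{code}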
[jira] [Commented] (SPARK-32956) Duplicate Columns in a csv file
[ https://issues.apache.org/jira/browse/SPARK-32956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200163#comment-17200163 ]

Punit Shah commented on SPARK-32956:
------------------------------------

That may work.

> Duplicate Columns in a csv file
> -------------------------------
>
>                 Key: SPARK-32956
>                 URL: https://issues.apache.org/jira/browse/SPARK-32956
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
[jira] [Created] (SPARK-32965) pyspark reading csv files with utf_16le encoding
Punit Shah created SPARK-32965:
-----------------------------------

             Summary: pyspark reading csv files with utf_16le encoding
                 Key: SPARK-32965
                 URL: https://issues.apache.org/jira/browse/SPARK-32965
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1, 3.0.0, 2.4.7
            Reporter: Punit Shah

If you have a file encoded in utf_16le or utf_16be and try spark.read.csv("<path>", encoding="utf_16le"), the dataframe isn't rendered properly.

If you instead decode in python, like:

prdd = spark_session._sc.binaryFiles(path_url).values().flatMap(lambda x: x.decode("utf_16le").splitlines())

and then do spark.read.csv(prdd), it works.
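The workaround above, written out as a self-contained sketch (using the public spark.sparkContext rather than the internal _sc attribute; "16le.csv" is the name of the attachment added below):

{code:python}
# Decode UTF-16LE in Python first, then hand Spark an RDD of decoded lines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prdd = (spark.sparkContext.binaryFiles("16le.csv")  # RDD of (path, bytes)
        .values()
        .flatMap(lambda raw: raw.decode("utf_16le").splitlines()))

df = spark.read.csv(prdd, header=True, inferSchema=True)
df.show()
{code}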
[jira] [Updated] (SPARK-32965) pyspark reading csv files with utf_16le encoding
[ https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-32965:
-------------------------------
    Attachment: 16le.csv

> pyspark reading csv files with utf_16le encoding
> -------------------------------------------------
>
>                 Key: SPARK-32965
>                 URL: https://issues.apache.org/jira/browse/SPARK-32965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: 16le.csv
[jira] [Updated] (SPARK-32965) pyspark reading csv files with utf_16le encoding
[ https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-32965:
-------------------------------
    Attachment: 32965.png

> pyspark reading csv files with utf_16le encoding
> -------------------------------------------------
>
>                 Key: SPARK-32965
>                 URL: https://issues.apache.org/jira/browse/SPARK-32965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: 16le.csv, 32965.png
[jira] [Commented] (SPARK-32965) pyspark reading csv files with utf_16le encoding
[ https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200743#comment-17200743 ]

Punit Shah commented on SPARK-32965:
------------------------------------

It looks similar. I've attached a utf-16le file to this ticket. The pyspark code is essentially:

spark.read.csv("16le.csv", inferSchema=True, header=True, encoding="utf_16le")

The attached picture shows the result.

> pyspark reading csv files with utf_16le encoding
> -------------------------------------------------
>
>                 Key: SPARK-32965
>                 URL: https://issues.apache.org/jira/browse/SPARK-32965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: 16le.csv, 32965.png
[jira] [Reopened] (SPARK-32965) pyspark reading csv files with utf_16le encoding
[ https://issues.apache.org/jira/browse/SPARK-32965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah reopened SPARK-32965:
--------------------------------

The linked duplicate issue won't be fixed because it was mixed with a multiline-feature issue. My ticket, however, deals exclusively with utf-16le and utf-16be encodings not being handled correctly via pyspark. Therefore this issue is still open and unresolved.

> pyspark reading csv files with utf_16le encoding
> -------------------------------------------------
>
>                 Key: SPARK-32965
>                 URL: https://issues.apache.org/jira/browse/SPARK-32965
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: 16le.csv, 32965.png
[jira] [Created] (SPARK-33327) grouped by first and last against date column returns incorrect results
Punit Shah created SPARK-33327:
-----------------------------------

             Summary: grouped by first and last against date column returns incorrect results
                 Key: SPARK-33327
                 URL: https://issues.apache.org/jira/browse/SPARK-33327
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.7, 2.4.6
            Reporter: Punit Shah

The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column to a timestamp.

* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
* outDF.createOrReplaceTempView("table02")

In this default case the following sql generates correct results:

"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"

However, if we read the dataframe like so (where "FromDate" is read in as a date), then the above sql query generates incorrect results:

* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)")
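A side note, mine rather than the ticket's: first() and last() are non-deterministic aggregates in Spark unless the input ordering is pinned, which may be a factor in results like these. min/max over the same column is deterministic:

{code:python}
# Deterministic variant of the ticket's query (a suggestion, not a fix from
# this thread): min/max are well-defined per group, unlike first/last,
# whose results depend on row order.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()
outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
outDF.createOrReplaceTempView("table02")

spark_session.sql("""
    select count(`User`) as cnt,
           min(`FromDate`) as `FromDate_First`,
           max(`FromDate`) as `FromDate_Last`,
           count(distinct `FromDate`) as cntdist
    from table02
    group by `User`
""").show()
{code}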
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Attachment: users.csv

> grouped by first and last against date column returns incorrect results
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33327
>                 URL: https://issues.apache.org/jira/browse/SPARK-33327
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: users.csv
[jira] [Commented] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226712#comment-17226712 ]

Punit Shah commented on SPARK-33327:
------------------------------------

The correct behaviour of running the query should be:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21, 2013-12-13, 4

or:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-02-21 00:00:00, 2013-12-13 00:00:00, 4

Thanks for asking, [~hyukjin.kwon]. Now I notice that both imports fail, as shown below.

spark_session.read.csv("users.csv", inferSchema=True, header=True) behaves incorrectly, like:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13 00:00:00, 2013-03-18 00:00:00, 4

spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)") also behaves incorrectly, like:

cnt, FromDate_First, FromDate_Last, cntdist
15, 2013-12-13, 2013-02-21, 4

> grouped by first and last against date column returns incorrect results
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33327
>                 URL: https://issues.apache.org/jira/browse/SPARK-33327
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: users.csv
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Description:

The attached csv file has two columns, namely "User" and "FromDate". The import defaults the "FromDate" column to a timestamp.

* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True)
* outDF.createOrReplaceTempView("table02")

In this default case the following sql generates incorrect results:

"select count(`User`) as cnt, first(`FromDate`) as `FromDate_First`, last(`FromDate`) as `FromDate_Last`, count(distinct(`FromDate`)) as cntdist from table02 group by `User`"

However, if we read the dataframe like so (where "FromDate" is read in as a date), then the above sql query also generates incorrect results:

* outDF = spark_session.read.csv("users.csv", inferSchema=True, header=True).selectExpr("`User`", "cast(`FromDate` as date)")

was: the previous revision said the default case generates correct results and only the date-cast case generates incorrect results.

> grouped by first and last against date column returns incorrect results
> -------------------------------------------------------------------------
>
>                 Key: SPARK-33327
>                 URL: https://issues.apache.org/jira/browse/SPARK-33327
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6, 2.4.7
>            Reporter: Punit Shah
>            Priority: Major
>         Attachments: users.csv
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Description: (markup-only edit; the wording is unchanged from the previous revision)
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Description: (markup-only edit; the wording is unchanged from the previous revision)
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Description: (markup-only edit; the wording is unchanged from the previous revision)
[jira] [Updated] (SPARK-33327) grouped by first and last against date column returns incorrect results
[ https://issues.apache.org/jira/browse/SPARK-33327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Punit Shah updated SPARK-33327:
-------------------------------
    Description: (markup-only edit; the wording is unchanged from the previous revision)