[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051839#comment-17051839 ] Suchintak Patnaik commented on SPARK-29058:
---
[~hyukjin.kwon] I agree with you on this. However, the dataframe is getting created without the second, malformed row. This can be observed from df.show():
{code}
>>> df = spark.read.csv(path="fruit.csv", mode="DROPMALFORMED", schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+
{code}
So, ideally it should return the correct row count accordingly. What do you say?

> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
>                 Key: SPARK-29058
>                 URL: https://issues.apache.org/jira/browse/SPARK-29058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Minor
>
> The Spark SQL csv reader is dropping malformed records as expected, but the record count it shows is incorrect.
> Consider this file (fruit.csv):
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Define the schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row in the file contains a floating-point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv", mode="DROPMALFORMED", schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> The malformed record is getting dropped as expected, but an incorrect record count is displayed.
> Here df.count() should return 2.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
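[Editorial note] The semantics the reporter expects can be sketched in plain Python; this is a minimal simulation of DROPMALFORMED against the fruit.csv schema above, not Spark itself, with the file contents inlined rather than read from disk:

```python
# Minimal plain-Python simulation of DROPMALFORMED for the fruit.csv example.
# A row is treated as malformed here if a field declared "int" in the schema
# does not parse as an int (this mirrors, but is not, Spark's CSV parser).
rows = [
    "apple,red,1,3",
    "banana,yellow,2,4.56",  # "4.56" cannot be cast to int -> malformed
    "orange,orange,3,5",
]

def is_well_formed(line):
    fruit, color, price, quantity = line.split(",")
    try:
        int(price)
        int(quantity)  # raises ValueError for "4.56"
        return True
    except ValueError:
        return False

kept = [r for r in rows if is_well_formed(r)]
print(len(kept))  # prints 2, the count the reporter expects from df.count()
```

Under these semantics the malformed banana row is discarded before counting, so the count is 2, matching what df.show() displays.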
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051820#comment-17051820 ] Suchintak Patnaik commented on SPARK-29058:
---
[~hyukjin.kwon] How does column pruning apply here? count() does not need any columns to perform the count; it just returns the row count.

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
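[Editorial note] The column-pruning point under discussion can be illustrated with a plain-Python analogy (an assumption about the mechanism, not Spark internals): when a query needs no columns, the reader can count raw CSV lines without casting any field, so malformed rows are never detected; a full parse applies the schema and drops the bad row.

```python
# Analogy for column pruning under count(): no columns needed -> no casting
# -> malformed rows go unnoticed. Plain Python, not Spark's actual code path.
lines = ["apple,red,1,3", "banana,yellow,2,4.56", "orange,orange,3,5"]

def parse(line):
    fruit, color, price, quantity = line.split(",")
    try:
        return (fruit, color, int(price), int(quantity))
    except ValueError:
        return None  # malformed under the "price int, quantity int" schema

pruned_count = len(lines)                         # like df.count(): 3
parsed_count = sum(1 for l in lines if parse(l))  # like what df.show() reflects: 2
```

The mismatch between the two numbers is exactly the 3-vs-2 discrepancy reported in this ticket.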
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051524#comment-17051524 ] Suchintak Patnaik commented on SPARK-29058:
---
[~hyukjin.kwon] Any update on this issue?

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31042) Error in writing a pyspark streaming dataframe created from Kafka source to a csv file
Suchintak Patnaik created SPARK-31042:
-
Summary: Error in writing a pyspark streaming dataframe created from Kafka source to a csv file
Key: SPARK-31042
URL: https://issues.apache.org/jira/browse/SPARK-31042
Project: Spark
Issue Type: Bug
Components: PySpark, Structured Streaming
Affects Versions: 2.4.5
Reporter: Suchintak Patnaik

Writing a streaming dataframe created from a Kafka source to a csv file gives the following error in PySpark. NOTE: the same streaming dataframe is displayed in the console without error.
{code}
sdf.writeStream.format("console").start().awaitTermination()  # Working

sdf.writeStream\
    .format("csv")\
    .option("path", "C://output")\
    .option("checkpointLocation", "C://Checkpoint")\
    .outputMode("append")\
    .start().awaitTermination()  # Not working
{code}
Error:
{code}
File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o63.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"logOffset":1}
=== Streaming Query ===
Identifier: [id = 6718625c-489e-44c8-b273-0da3429e97a8, runId = b64887ba-ca32-499e-9ab5-f839fd44ec26]
Current Committed Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}
Current Available Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}
Current State: ACTIVE
Thread State: RUNNABLE
{code}

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962741#comment-16962741 ] Suchintak Patnaik commented on SPARK-29621:
---
[~hyukjin.kwon] count() returns the row count; it has nothing to do with the columns. In both cases, whether it is count() or show(), filter() is first performed on the dataframe based on the corrupt record column:
{code}
df.filter(df._corrupt_record.isNotNull()).count()  # Error
df.filter(df._corrupt_record.isNotNull()).show()   # No Error
{code}

> Querying internal corrupt record column should not be allowed in filter
> operation
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-29621
>                 URL: https://issues.apache.org/jira/browse/SPARK-29621
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Major
>              Labels: PySpark, SparkSQL
>
> As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_
> But querying only the internal corrupt record column is still allowed in the case of a *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
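[Editorial note] The PERMISSIVE-mode behavior being debated can be sketched in plain Python (a simulation of the documented semantics, not Spark's parser): a row that fails the schema keeps its raw text in _corrupt_record, well-formed rows get None there, and filtering on "is not null" selects exactly the malformed rows.

```python
# Plain-Python sketch of PERMISSIVE-mode semantics (not Spark itself):
# a row failing the int schema keeps its raw line in _corrupt_record;
# well-formed rows carry _corrupt_record = None.
lines = ["apple,red,1,3", "banana,yellow,2,4.56", "orange,orange,3,5"]

def parse_permissive(line):
    name, colour, price, quantity = line.split(",")
    try:
        return {"_corrupt_record": None, "Name": name, "Colour": colour,
                "Price": int(price), "Quantity": int(quantity)}
    except ValueError:
        return {"_corrupt_record": line, "Name": None, "Colour": None,
                "Price": None, "Quantity": None}

records = [parse_permissive(l) for l in lines]
# The filter under discussion: _corrupt_record IS NOT NULL.
corrupt = [r for r in records if r["_corrupt_record"] is not None]
print(len(corrupt))  # prints 1: only the banana row is corrupt
```

Under these semantics the filter references only the corrupt record column, which is precisely the pattern the quoted Spark 2.3 restriction is meant to disallow.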
[jira] [Comment Edited] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962658#comment-16962658 ] Suchintak Patnaik edited comment on SPARK-29621 at 10/30/19 3:38 AM:
-
[~hyukjin.kwon] As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, it should not allow referencing only the internal corrupt record column, right? Then how come df.filter(df._corrupt_record.isNotNull()).count() raises an error while df.filter(df._corrupt_record.isNotNull()).show() doesn't?

was (Author: patnaik): [~gurwls223] it should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error and filter.show() doesn't?

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962658#comment-16962658 ] Suchintak Patnaik commented on SPARK-29621:
---
[~gurwls223] It should not allow referencing only the internal corrupt record column, right? Then how come filter.count() raises an error and filter.show() doesn't?

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
[ https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suchintak Patnaik updated SPARK-29621:
--
Labels: PySpark SparkSQL (was: )

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation
Suchintak Patnaik created SPARK-29621:
-
Summary: Querying internal corrupt record column should not be allowed in filter operation
Key: SPARK-29621
URL: https://issues.apache.org/jira/browse/SPARK-29621
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik

As per *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*, _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"_

But querying only the internal corrupt record column is still allowed in the case of a *filter* operation:
{code}
from pyspark.sql.types import *
schema = StructType([
    StructField("_corrupt_record", StringType(), False),
    StructField("Name", StringType(), False),
    StructField("Colour", StringType(), True),
    StructField("Price", IntegerType(), True),
    StructField("Quantity", IntegerType(), True)])
df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
{code}

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961028#comment-16961028 ] Suchintak Patnaik commented on SPARK-29234:
---
[~dongjoon] [~yumwang] Are the PRs backported to Spark versions 2.3 and 2.4?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29234
>                 URL: https://issues.apache.org/jira/browse/SPARK-29234
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile, but the files are physically created in HDFS in the format specified by the user, e.g. orc, parquet, etc.
> {code}
> df.write.format("orc").bucketBy(4, "order_status").saveAsTable("OrdersExample")
> {code}
> In Hive:
> {code}
> describe formatted ordersExample;
> OK
> # col_name              data_type               comment
> col                     array                   from deserializer
>
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:    org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> {code}
> Querying the same table in Hive gives an error:
> {code}
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile
> {code}

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937408#comment-16937408 ] Suchintak Patnaik commented on SPARK-29234:
---
Is it possible to backport the PRs to version 2.3 as well?

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937378#comment-16937378 ] Suchintak Patnaik commented on SPARK-29234:
---
[~yumwang] In which Spark version is this fixed? Isn't it backward compatible, i.e. also fixed in older versions?

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suchintak Patnaik updated SPARK-29234: -- Description: When we create a bucketed table as follows, it's input and output format are getting displayed as SequenceFile format. But physically the files are getting created in HDFS as the format specified by the user e.g. orc,parquet,etc. df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") in Hive, DESCRIBE FORMATTED OrdersExample; describe formatted ordersExample; OK # col_name data_type comment col array from deserializer # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Querying the same table in Hive is giving error. select * from OrdersExample; OK Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile was: When we create a bucketed table as follows, it's input and output format are getting displayed as SequenceFile format. But physically the files are getting created in HDFS as the format specified by the user e.g. orc,parquet,etc. df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") in Hive, DESCRIBE FORMATTED OrdersExample; describe formatted ordersExample; OK # col_name data_type comment col array from deserializer # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Querying the same table in Hive is giving error. 
select * from OrdersExample; OK Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile While reading the same table in Spark also giving error. df = spark.

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
[ https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suchintak Patnaik updated SPARK-29234: -- Description: When we create a bucketed table as follows, it's input and output format are getting displayed as SequenceFile format. But physically the files are getting created in HDFS as the format specified by the user e.g. orc,parquet,etc. df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") in Hive, DESCRIBE FORMATTED OrdersExample; describe formatted ordersExample; OK # col_name data_type comment col array from deserializer # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Querying the same table in Hive is giving error. select * from OrdersExample; OK Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile While reading the same table in Spark also giving error. df = spark. was: When we create a bucketed table as follows, it's input and output format are getting displayed as SequenceFile format. But physically the files are getting created in HDFS as the format specified by the user e.g. orc,parquet,etc. df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample") in Hive, DESCRIBE FORMATTED OrdersExample; describe formatted ordersExample; OK # col_name data_type comment col array from deserializer # Storage Information SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Querying the same table in Hive is giving error. 
select * from OrdersExample; OK Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format
Suchintak Patnaik created SPARK-29234:
-
Summary: bucketed table created by Spark SQL DataFrame is in SequenceFile format
Key: SPARK-29234
URL: https://issues.apache.org/jira/browse/SPARK-29234
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik

When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile, but the files are physically created in HDFS in the format specified by the user, e.g. orc, parquet, etc.
{code}
df.write.format("orc").bucketBy(4, "order_status").saveAsTable("OrdersExample")
{code}
In Hive:
{code}
describe formatted ordersExample;
OK
# col_name              data_type               comment
col                     array                   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:    org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
{code}
Querying the same table in Hive gives an error:
{code}
select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc not a SequenceFile
{code}

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suchintak Patnaik reopened SPARK-29058:
---
Though the workaround of caching the dataframe first and then using count() works well, it is not feasible if the base dataset is large. The dataframe count should give the correct count after discarding the corrupt records.

--
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930258#comment-16930258 ]

Suchintak Patnaik commented on SPARK-29058:
-------------------------------------------

[~hyukjin.kwon]

1) As per this ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126]), *it's disallowed if only the corrupt record column is referenced.* However, in this case I don't have any corrupt record column defined in my schema, since I am using mode DROPMALFORMED, not PERMISSIVE.

2) As you mentioned earlier, count() does not need the columns to count, but here the purpose is to count the rows.

3) Though the workaround works fine, *df.cache().count()* is not appropriate if my base dataset is too large to cache in memory; before doing a series of operations on my dataset, I want to drop corrupt records and keep track of the count.

4) My question is why the dataframe count gives the wrong row count even though it is discarding the rows.

> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
>                 Key: SPARK-29058
>                 URL: https://issues.apache.org/jira/browse/SPARK-29058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Minor
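Point 2 above is the crux of the discrepancy: count() requires no columns, so the CSV parser can skip per-field conversion entirely and never notices the malformed quantity value. A plain-Python sketch of that column-pruning effect follows; the scan helper is a stand-in for Spark's CSV scan, not its actual parser:

```python
import csv
import io

CSV_DATA = "apple,red,1,3\nbanana,yellow,2,4.56\norange,orange,3,5\n"
SCHEMA = [("Fruit", str), ("color", str), ("price", int), ("quantity", int)]

def scan(required_columns):
    """Yield rows that parse cleanly, converting ONLY the required columns
    (a stand-in for Spark's CSV column pruning under DROPMALFORMED)."""
    for row in csv.reader(io.StringIO(CSV_DATA)):
        try:
            for (name, typ), value in zip(SCHEMA, row):
                if name in required_columns:
                    typ(value)        # int("4.56") raises -> malformed
            yield row
        except ValueError:
            continue                  # DROPMALFORMED: silently skip the row

# show() needs every column, so the bad row is detected and dropped:
print(sum(1 for _ in scan({"Fruit", "color", "price", "quantity"})))  # 2
# count() needs no columns, so nothing is ever converted:
print(sum(1 for _ in scan(set())))                                    # 3
```

Under this reading, show() and count() disagree because they request different column sets from the same scan, so malformation is only ever detected on the show() path.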
[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
[ https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930232#comment-16930232 ]

Suchintak Patnaik commented on SPARK-29058:
-------------------------------------------

[~hyukjin.kwon] If quantity is defined as type int, that record gets dropped, but the count shown is incorrect. There can be situations where a few records do not match the data types defined in the schema and the requirement is to drop such records while loading.

> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
>                 Key: SPARK-29058
>                 URL: https://issues.apache.org/jira/browse/SPARK-29058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Minor
[jira] [Created] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count
Suchintak Patnaik created SPARK-29058:
------------------------------------------

             Summary: Reading csv file with DROPMALFORMED showing incorrect record count
                 Key: SPARK-29058
                 URL: https://issues.apache.org/jira/browse/SPARK-29058
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.3.0
            Reporter: Suchintak Patnaik

The Spark SQL CSV reader drops malformed records as expected, but the record count is incorrect.

Consider this file (fruit.csv):

{code}
apple,red,1,3
banana,yellow,2,4.56
orange,orange,3,5
{code}

Define the schema as follows:

{code}
schema = "Fruit string,color string,price int,quantity int"
{code}

Note that the "quantity" field is defined as integer type, but the 2nd row in the file contains a floating point value, hence it is a corrupt record.

{code}
>>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+
>>> df.count()
3
{code}

The malformed record is dropped as expected, but an incorrect record count is displayed. Here df.count() should return 2.