[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051839#comment-17051839
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] I agree with you on this.

However, the dataframe is getting created without the second row, which is 
malformed. This can be observed from df.show():

>>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+

So, ideally, it should return the correct row count accordingly. What do you say?

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051820#comment-17051820
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] How does column pruning work here? count() does not need any 
columns to perform the count; it just returns the row count.
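
A minimal sketch of the pruning in question (assuming the fruit.csv and schema 
from the description; the exact plan text varies across Spark versions):

{code}
# Sketch only -- fruit.csv and schema are the ones from the issue description.
df = spark.read.csv(path="fruit.csv", mode="DROPMALFORMED", schema=schema)

df.groupBy().count().explain()
# The FileScan csv node typically reports ReadSchema: struct<>, i.e. the scan
# reads no columns at all, so no value is parsed against the schema and the
# malformed row is never detected (or dropped) on this code path.
{code}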

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051524#comment-17051524
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] Any update on this issue?

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31042) Error in writing a pyspark streaming dataframe created from Kafka source to a csv file

2020-03-04 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-31042:
-

 Summary: Error in writing a pyspark streaming dataframe created 
from Kafka source to a csv file 
 Key: SPARK-31042
 URL: https://issues.apache.org/jira/browse/SPARK-31042
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 2.4.5
Reporter: Suchintak Patnaik


Writing a streaming dataframe created from a Kafka source to a csv file gives the 
following error in PySpark.

NOTE: The same streaming dataframe is displayed correctly in the console.

sdf.writeStream.format("console").start().awaitTermination()  # Working

sdf.writeStream \
    .format("csv") \
    .option("path", "C://output") \
    .option("checkpointLocation", "C://Checkpoint") \
    .outputMode("append") \
    .start().awaitTermination()  # Not working


Error
-
 *File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
  File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in 
get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
o63.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. 
{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"logOffset":1}
=== Streaming Query ===
Identifier: [id = 6718625c-489e-44c8-b273-0da3429e97a8, runId = 
b64887ba-ca32-499e-9ab5-f839fd44ec26]
Current Committed Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}
Current Available Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}

Current State: ACTIVE
Thread State: RUNNABLE*
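
One thing worth ruling out (an assumption on my part, not something confirmed in 
this report): the committed offsets above are in the {"logOffset":1} form, while a 
Kafka source expects per-topic/partition offsets like {"topicA":{"0":23,"1":-1}}, 
which is what the exception complains about. That pattern can appear when the 
checkpoint directory was previously used by a different (non-Kafka) source. A 
hedged retry sketch with a fresh, previously unused checkpoint location (the path 
below is hypothetical):

{code}
# Sketch only: same query, but pointed at an empty, previously unused
# checkpoint directory ("C://Checkpoint_kafka_csv" is a hypothetical path).
sdf.writeStream \
    .format("csv") \
    .option("path", "C://output") \
    .option("checkpointLocation", "C://Checkpoint_kafka_csv") \
    .outputMode("append") \
    .start() \
    .awaitTermination()
{code}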



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-29 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962741#comment-16962741
 ] 

Suchintak Patnaik commented on SPARK-29621:
---

[~hyukjin.kwon] count() returns the row count; it has nothing to do with the 
columns.

In both cases, whether it is count() or show(), filter() is first applied to the 
dataframe based on the _corrupt_record column:

df.filter(df._corrupt_record.isNotNull()).count()   # Error

df.filter(df._corrupt_record.isNotNull()).show()    # No Error
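
For reference, the workaround suggested by the exception text in the linked 
CSVFileFormat.scala (cache or save the parsed results, then run the same query) 
makes both calls work. A hedged sketch, assuming the fruit.csv and schema from 
this issue:

{code}
# Sketch only -- cache the parsed results first, as the exception text suggests.
df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE").cache()

df.filter(df._corrupt_record.isNotNull()).count()  # no longer raises
df.filter(df._corrupt_record.isNotNull()).show()   # still allowed
{code}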

> Querying internal corrupt record column should not be allowed in filter 
> operation
> -
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>  Labels: PySpark, SparkSQL
>
> As per 
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
> the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in 
> case of *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
> StructField("_corrupt_record", StringType(), False),
> StructField("Name", StringType(), False),
> StructField("Colour", StringType(), True),
> StructField("Price", IntegerType(), True),
> StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-29 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962658#comment-16962658
 ] 

Suchintak Patnaik edited comment on SPARK-29621 at 10/30/19 3:38 AM:
-

[~hyukjin.kwon] As per 
*https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*,
it should not allow referencing only the internal corrupt record column, right?

Then how come df.filter(df._corrupt_record.isNotNull()).count() shows an error while 
df.filter(df._corrupt_record.isNotNull()).show() doesn't?


was (Author: patnaik):
[~gurwls223] it should not allow referencing only internal corrupt record 
right??

Then how come filter.count() shows error and filter.show() doesn't??

> Querying internal corrupt record column should not be allowed in filter 
> operation
> -
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>  Labels: PySpark, SparkSQL
>
> As per 
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
> the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in 
> case of *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
> StructField("_corrupt_record", StringType(), False),
> StructField("Name", StringType(), False),
> StructField("Colour", StringType(), True),
> StructField("Price", IntegerType(), True),
> StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-29 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962658#comment-16962658
 ] 

Suchintak Patnaik commented on SPARK-29621:
---

[~gurwls223] It should not allow referencing only the internal corrupt record 
column, right?

Then how come filter.count() shows an error and filter.show() doesn't?

> Querying internal corrupt record column should not be allowed in filter 
> operation
> -
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>  Labels: PySpark, SparkSQL
>
> As per 
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
> the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in 
> case of *filter* operation.
> {code}
> from pyspark.sql.types import *
> schema = StructType([
> StructField("_corrupt_record", StringType(), False),
> StructField("Name", StringType(), False),
> StructField("Colour", StringType(), True),
> StructField("Price", IntegerType(), True),
> StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()  # Allowed
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-28 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29621:
--
Labels: PySpark SparkSQL  (was: )

> Querying internal corrupt record column should not be allowed in filter 
> operation
> -
>
> Key: SPARK-29621
> URL: https://issues.apache.org/jira/browse/SPARK-29621
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>  Labels: PySpark, SparkSQL
>
> As per 
> *https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126)*,
> _"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when 
> the referenced columns only include the internal corrupt record column"_
> But it's allowing while querying only the internal corrupt record column in 
> case of *filter* operation.
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_corrupt_record", StringType(), False),
>     StructField("Name", StringType(), False),
>     StructField("Colour", StringType(), True),
>     StructField("Price", IntegerType(), True),
>     StructField("Quantity", IntegerType(), True)])
> df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
> df.filter(df._corrupt_record.isNotNull()).show()   # Allowed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29621) Querying internal corrupt record column should not be allowed in filter operation

2019-10-28 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-29621:
-

 Summary: Querying internal corrupt record column should not be 
allowed in filter operation
 Key: SPARK-29621
 URL: https://issues.apache.org/jira/browse/SPARK-29621
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik


As per 
*https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126*,
_"Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the 
referenced columns only include the internal corrupt record column"_

But it is allowed when only the internal corrupt record column is referenced 
through a *filter* operation.

from pyspark.sql.types import *

schema = StructType([
    StructField("_corrupt_record", StringType(), False),
    StructField("Name", StringType(), False),
    StructField("Colour", StringType(), True),
    StructField("Price", IntegerType(), True),
    StructField("Quantity", IntegerType(), True)])

df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")

df.filter(df._corrupt_record.isNotNull()).show()   # Allowed
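
For contrast, a hedged sketch of the case that the linked check does reject 
(assuming the same df as above; behavior as I understand it on Spark 2.3.x):

{code}
# Selecting only the internal corrupt record column is the disallowed case --
# this raises an AnalysisException, while the filter() + show() above does not.
df.select("_corrupt_record").show()
{code}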



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-10-28 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961028#comment-16961028
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

[~dongjoon]
[~yumwang]

Are the PRs backported to Spark versions 2.3 and 2.4?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, it's input and output format are 
> getting displayed as SequenceFile format. But physically the files are 
> getting created in HDFS as the format specified by the user e.g. 
> orc,parquet,etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937408#comment-16937408
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

Is it possible to backport the PRs to version 2.3 as well?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, it's input and output format are 
> getting displayed as SequenceFile format. But physically the files are 
> getting created in HDFS as the format specified by the user e.g. 
> orc,parquet,etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937378#comment-16937378
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

[~yumwang] In which Spark version is this fixed? Will the fix be backported to 
older versions as well?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, it's input and output format are 
> getting displayed as SequenceFile format. But physically the files are 
> getting created in HDFS as the format specified by the user e.g. 
> orc,parquet,etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29234:
--
Description: 
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But physically the files are created in HDFS in the 
format specified by the user, e.g. orc, parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive is giving error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile




  was:
When we create a bucketed table as follows, it's input and output format are 
getting displayed as SequenceFile format. But physically the files are getting 
created in HDFS as the format specified by the user e.g. orc,parquet,etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive is giving error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile

While reading the same table in Spark also giving error.

df = spark.



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, it's input and output format are 
> getting displayed as SequenceFile format. But physically the files are 
> getting created in HDFS as the format specified by the user e.g. 
> orc,parquet,etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29234:
--
Description: 
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But physically the files are created in HDFS in the 
format specified by the user, e.g. orc, parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive is giving error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile

While reading the same table in Spark also giving error.

df = spark.


  was:
When we create a bucketed table as follows, it's input and output format are 
getting displayed as SequenceFile format. But physically the files are getting 
created in HDFS as the format specified by the user e.g. orc,parquet,etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive is giving error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, it's input and output format are 
> getting displayed as SequenceFile format. But physically the files are 
> getting created in HDFS as the format specified by the user e.g. 
> orc,parquet,etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive is giving error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile
> While reading the same table in Spark also giving error.
> df = spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-29234:
-

 Summary: bucketed table created by Spark SQL DataFrame is in 
SequenceFile format
 Key: SPARK-29234
 URL: https://issues.apache.org/jira/browse/SPARK-29234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik


When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But physically the files are created in HDFS in the 
format specified by the user, e.g. orc, parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive is giving error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile
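
One hedged way to cross-check from the Spark side (assuming the OrdersExample 
table created as above; the exact output layout differs across Spark versions):

{code}
# Sketch only: describe the table through Spark rather than Hive.
spark.sql("DESCRIBE FORMATTED OrdersExample").show(100, truncate=False)
# In the detailed table information, the Provider row (orc) and Num Buckets
# reflect the real datasource format and bucketing that Spark recorded, while
# the Hive-facing InputFormat/OutputFormat fields fall back to SequenceFile.
{code}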




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2019-09-18 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik reopened SPARK-29058:
---

Though the workaround of caching the dataframe first and then using count() 
works well, that is not feasible if the base dataset size is large.

The dataframe count should give the correct count after discarding the corrupt 
records.
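
For context, the two counts side by side (a sketch, assuming the fruit.csv and 
schema from the description; counts as reported in this issue on Spark 2.3.x):

{code}
# Sketch only -- fruit.csv and schema are the ones from the description.
df = spark.read.csv(path="fruit.csv", mode="DROPMALFORMED", schema=schema)

df.count()          # reported as 3: no column is read, so nothing is dropped
df.cache().count()  # 2: caching materializes the parsed rows, dropping the bad one
{code}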

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2019-09-15 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930258#comment-16930258
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon]

1) As per this 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126])

*it's disallowed if only the corrupt record column is referenced.*

However, in this case I don't have any corrupt record column defined in my 
schema since I am using mode as DROPMALFORMED, not PERMISSIVE.

2) As you mentioned earlier, count() does not need the columns to count, but 
here the purpose is to count the rows.

3) Though the workaround *df.cache().count()* works fine, caching in memory is not 
appropriate if my base dataset is large and, before doing a series of operations on 
my dataset, I want to drop corrupt records and keep track of the count.

4) My question is why the dataframe count gives the wrong row count even though it 
is discarding the rows.




> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2019-09-15 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930232#comment-16930232
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] If I define quantity as type int, that record is getting 
dropped, but the count shown is incorrect.

There can be situations where some records do not match the data types defined in 
the schema and the requirement is to drop such records while loading.

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2019-09-11 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-29058:
-

 Summary: Reading csv file with DROPMALFORMED showing incorrect 
record count
 Key: SPARK-29058
 URL: https://issues.apache.org/jira/browse/SPARK-29058
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik


The Spark SQL CSV reader is dropping malformed records as expected, but the 
record count it reports is incorrect.

Consider this file (fruit.csv)

apple,red,1,3
banana,yellow,2,4.56
orange,orange,3,5

Defining schema as follows:

schema = "Fruit string,color string,price int,quantity int"

Notice that the "quantity" field is defined as integer type, but the 2nd row in 
the file contains a floating point value, hence it is a corrupt record.


>>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+

>>> df.count()
3

The malformed record is getting dropped as expected, but an incorrect record count 
is displayed.

Here df.count() should return 2.




 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org