[jira] [Comment Edited] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387733#comment-17387733
 ] 

anju edited comment on SPARK-36277 at 7/27/21, 3:33 AM:


[~hyukjin.kwon] Sure, let me check and update. Which version would you suggest?


was (Author: datumgirl):
Sure let me check and update

> Issue with record count of data frame while reading in DropMalformed mode
> --------------------------------------------------------------------------
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
>
> Steps to reproduce an issue with the PySpark "count" API when reading CSV with mode "DROPMALFORMED":
> I have a sample CSV file in an S3 bucket. I read it into two dataframes with the PySpark CSV reader: once without a schema, and once with a schema and mode "DROPMALFORMED". When I display the dataframe read with the schema and "DROPMALFORMED", it looks correct: the malformed records are not shown. But when I call count() on that dataframe, it returns the record count of the whole file. I expect it to return only the number of valid records.
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> without_schema_df = spark.read.csv(
>     "s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
>
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True),
> ])
>
> with_schema_df = spark.read.csv(
>     "s3://noa-poc-lakeformation/data/test_files/sample.csv",
>     header=True, schema=schema, mode="DROPMALFORMED")
>
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
>
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> The attached screenshot 111.PNG shows the output of this code, and the input file is attached as sample.csv (screenshot: Inputfile.PNG).
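To make the expected behavior concrete, here is a small plain-Python sketch of the semantics the reporter expects (no Spark required; the sample rows are invented for illustration, not taken from the attached sample.csv): with a schema, a row whose salary does not parse as an integer is malformed and should be excluded from the count.

```python
import csv
import io

# Invented sample rows: "salary" should be an integer; one row is malformed.
raw = """firstname,middlename,lastname,id,gender,salary
James,,Smith,36636,M,3000
Anna,Rose,,40288,F,not_a_number
Robert,,Williams,42114,M,4000
"""

def is_valid(row):
    """A row is valid if its salary parses as an integer."""
    try:
        int(row["salary"])
        return True
    except ValueError:
        return False

rows = list(csv.DictReader(io.StringIO(raw)))

# "Without schema": every physical row counts.
count_without_schema = len(rows)

# "With schema, DROPMALFORMED": malformed rows are excluded.
count_with_schema = sum(1 for r in rows if is_valid(r))

print(count_without_schema)  # 3
print(count_with_schema)     # 2
```

A plausible cause of the mismatch in Spark (not confirmed in this thread) is that count() needs no column values, so the CSV reader can skip parsing individual columns and never detects the malformed rows; forcing full parsing, for example by caching the dataframe or selecting its columns before counting, tends to make the counts agree.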



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387733#comment-17387733
 ] 

anju commented on SPARK-36277:
--

Sure let me check and update




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Attachment: sample.csv




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 

[jira] [Commented] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387404#comment-17387404
 ] 

anju commented on SPARK-36277:
--

[~hyukjin.kwon] I edited my issue description . Could you please check if this 
is okay




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Attachment: Inputfile.PNG




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Attachment: 111.PNG




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Attachment: (was: 111.PNG)




[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
I am writing the steps to reproduce the issue for "count" pyspark api while 
using mode as dropmalformed.

I have a csv sample file in s3 bucket . I am reading the file using pyspark api 
for csv . I am reading the csv with schema and without schema using mode 
"dropmalformed" options  in two different dataframes . While displaying the 
with schema dataframe , the display looks good ,it is not showing the malformed 
records .But when we apply count api on the dataframe it gives the record count 
of actual file. I am expecting it should give me valid record count .

here is the code used:-
```
without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
  ])
with_schema_df = 
spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()
cnt_with_schema=with_schema_df.count()
print("The  records count from with schema df :"+str(cnt_with_schema))
cnt_without_schema=without_schema_df.count()
print("The  records count from without schema df: "+str(cnt_without_schema))
```
here is the outputs:-


 







  was:
I am writing the steps to reproduce the issue for "count" pyspark api while 
using mode as dropmalformed.

I have a csv sample file in s3 bucket . I am reading the file using pyspark api 
for csv . I am reading the csv with schema and without schema using mode 
"dropmalformed" options  in two different dataframes . While displaying the 
with schema dataframe , the display looks good ,it is not showing the malformed 
records .But when we apply count api on the dataframe it gives the record count 
of actual file. I am expecting it should give me valid record count .

here is the code used:-
```
without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
  ])
with_schema_df = 
spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()
cnt_with_schema=with_schema_df.count()
print("The  records count from with schema df :"+str(cnt_with_schema))
cnt_without_schema=without_schema_df.count()
print("The  records count from without schema df: "+str(cnt_without_schema))
```
here is the outputs:-


 !111.PNG! 








> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
> I am writing the steps to reproduce the issue for "count" pyspark api while 
> using mode as dropmalformed.
> I have a csv sample file in s3 bucket . I am reading the file using pyspark 
> api for csv . I am reading the csv with schema and without schema using mode 
> "dropmalformed" options  in two different dataframes . While displaying the 
> with schema dataframe , the display looks good ,it is not showing the 
> malformed records .But when we apply count api on the dataframe it gives the 
> record count of actual file. I am expecting it should give me valid record 
> count .
> here is the code used:-
> ```
> without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
> schema = StructType([ \
> StructField("firstname",StringType(),True), \
> StructField("middlename",StringType(),True), \
> StructField("lastname",StringType(),True), \
> StructField("id", StringType(), True), \
> StructField("gender", StringType(), True), \
> StructField("salary", IntegerType(), True) \
>   ])
> with_schema_df = 
> spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
> 

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-26 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
I am writing the steps to reproduce the issue for "count" pyspark api while 
using mode as dropmalformed.

I have a csv sample file in s3 bucket . I am reading the file using pyspark api 
for csv . I am reading the csv with schema and without schema using mode 
"dropmalformed" options  in two different dataframes . While displaying the 
with schema dataframe , the display looks good ,it is not showing the malformed 
records .But when we apply count api on the dataframe it gives the record count 
of actual file. I am expecting it should give me valid record count .

here is the code used:-
```
without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
  ])
with_schema_df = 
spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()
cnt_with_schema=with_schema_df.count()
print("The  records count from with schema df :"+str(cnt_with_schema))
cnt_without_schema=without_schema_df.count()
print("The  records count from without schema df: "+str(cnt_without_schema))
```
here is the outputs:-


 !111.PNG! 







  was:
While reading the dataframe in malformed mode ,I am not getting right record 
count. dataframe.count() is giving me the record count of actual file including 
malformed records, eventhough data frame is read in "dropmalformed" mode. Is 
there a way to overcome this in pyspark
  here is the high level overview of what i am doing. I am trying to read the 
two dataframes from one file using with/without predefined schema. Issue is 
when i read a DF with a predefined schema and with mode as "dropmalformed", the 
record count in  df is not dropping the records. The record count is same as 
actual file where i am expecting less record count,as there are few malformed 
records . But when i try to select and display the records in df ,it is not 
showing malformed records. So display is correct. output is attached in the 
aattchment

code 
 
{code} 
s3_obj =boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")

extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
cnt2=extract_without_schema_df.select("col1","col2").count()print("count of the 
records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}
{code} 

 


> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
> I am writing the steps to reproduce the issue for "count" pyspark api while 
> using mode as dropmalformed.
> I have a csv sample file in s3 bucket . I am reading the file using pyspark 
> api for csv . I am reading the csv with schema and without schema using mode 
> "dropmalformed" options  in two different dataframes . While displaying the 
> with schema dataframe , the display looks good ,it is not showing the 
> malformed records .But when we apply count api on the dataframe it gives the 
> record count of actual file. I am expecting it should give me valid record 
> count .
> here is the code used:-
> ```
> without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
> schema = StructType([ \
> StructField("firstname",StringType(),True), \
> StructField("middlename",StringType(),True), \
> StructField("lastname",StringType(),True), \
> StructField("id", StringType(), True), \
> StructField("gender", StringType(), True), \
>   

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
While reading the dataframe in malformed mode ,I am not getting right record 
count. dataframe.count() is giving me the record count of actual file including 
malformed records, eventhough data frame is read in "dropmalformed" mode. Is 
there a way to overcome this in pyspark
 here is the high level overview of what i am doing I am trying to read the two 
dataframes from one file using with/without predefined schema. Issue is when i 
read a DF with a predefined schema and with mode as "dropmalformed", the record 
count in  df is not dropping the records. The record count is same as actual 
file where i am expecting less record count,as there are few malformed records 
. But when i try to select and display the records in df ,it is not showing 
malformed records. So display is correct. output is attached in the aattchment

code 
  
 {{s3_obj =boto3.client('s3')
 s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
 s3_clientobj
 s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
 schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")

extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
 cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
 cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
the records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 

  was:
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct. 
output is attached in the aattchement

code 
 
{{s3_obj =boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
   
extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
cnt2=extract_without_schema_df.select("col1","col2").count()print("count of the 
records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 


> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
> While reading the dataframe in malformed mode ,I am not getting right record 
> count. dataframe.count() is giving me the record count of actual file 
> including malformed records, eventhough data frame is read in "dropmalformed" 
> mode. Is there a way to overcome this in pyspark
>  here is the high level overview of what i am doing I am trying to read the 
> two dataframes from one file using with/without predefined schema. Issue is 
> when i read a DF with a predefined schema and with mode as "dropmalformed", 
> the record count in  df is not dropping the records. The record count is same 
> as actual file where i am expecting less record count,as there are few 
> malformed records . But when i try to select and display the records in df 
> ,it is not showing malformed records. So display is correct. output is 
> attached in the aattchment
> code 
>   
>  {{s3_obj =boto3.client('s3')
>  s3_clientobj = s3_obj.get_object(Bucket='xyz', 
> Key='data/test_files/schema_xyz.json')
>  s3_clientobj
>  s3_clientdata = 
> s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
>  schemaFromJson 

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
While reading the dataframe in malformed mode ,I am not getting right record 
count. dataframe.count() is giving me the record count of actual file including 
malformed records, eventhough data frame is read in "dropmalformed" mode. Is 
there a way to overcome this in pyspark
  here is the high level overview of what i am doing. I am trying to read the 
two dataframes from one file using with/without predefined schema. Issue is 
when i read a DF with a predefined schema and with mode as "dropmalformed", the 
record count in  df is not dropping the records. The record count is same as 
actual file where i am expecting less record count,as there are few malformed 
records . But when i try to select and display the records in df ,it is not 
showing malformed records. So display is correct. output is attached in the 
aattchment

code 
  
 {{s3_obj =boto3.client('s3')
 s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
 s3_clientobj
 s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
 schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")

extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
 cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
 cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
the records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 

  was:
While reading the dataframe in malformed mode ,I am not getting right record 
count. dataframe.count() is giving me the record count of actual file including 
malformed records, eventhough data frame is read in "dropmalformed" mode. Is 
there a way to overcome this in pyspark
 here is the high level overview of what i am doing I am trying to read the two 
dataframes from one file using with/without predefined schema. Issue is when i 
read a DF with a predefined schema and with mode as "dropmalformed", the record 
count in  df is not dropping the records. The record count is same as actual 
file where i am expecting less record count,as there are few malformed records 
. But when i try to select and display the records in df ,it is not showing 
malformed records. So display is correct. output is attached in the aattchment

code 
  
 {{s3_obj =boto3.client('s3')
 s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
 s3_clientobj
 s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
 schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")

extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
 cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
 cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
the records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 


> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
> While reading the dataframe in malformed mode ,I am not getting right record 
> count. dataframe.count() is giving me the record count of actual file 
> including malformed records, eventhough data frame is read in "dropmalformed" 
> mode. Is there a way to overcome this in pyspark
>   here is the high level overview of what i am doing. I am trying to read the 
> two dataframes from one file using with/without predefined schema. Issue is 
> when i read a DF with a predefined schema and with mode as "dropmalformed", 
> the record count in  df is not dropping the records. The record count is same 
> as actual file where i am expecting less record count,as there are few 
> malformed records . But when i try to select and display the records in df 
> ,it is not showing malformed records. So display is correct. output is 

[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct. 
output is attached in the aattchement

code 
 
{{s3_obj =boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', 
Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata = 
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
   
extract_without_schema_df = 
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")

extract_with_schema_df.select("col1","col2").show()
cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
records with schema "+ str(cnt1))
cnt2=extract_without_schema_df.select("col1","col2").count()print("count of the 
records without schema "+str(cnt2))

cnt2=extract_without_schema_df.select("col1","col2").show()}}

 

  was:
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct. 
output is attached in the atatchement

!image-2021-07-23-16-31-46-172.png!


> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
>  I am trying to read the two dataframes from one file using with/without 
> predefined schema. Issue is when i read a DF with a predefined schema and 
> with mode as "dropmalformed", the record count in  df is not dropping the 
> records. The record count is same as actual file where i am expecting less 
> record count,as there are few malformed records . But when i try to select 
> and display the records in df ,it is not showing malformed records. So 
> display is correct. output is attached in the aattchement
> code 
>  
> {{s3_obj =boto3.client('s3')
> s3_clientobj = s3_obj.get_object(Bucket='xyz', 
> Key='data/test_files/schema_xyz.json')
> s3_clientobj
> s3_clientdata = 
> s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
> schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))
> extract_with_schema_df = 
> spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
>
> extract_without_schema_df = 
> spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")
> extract_with_schema_df.select("col1","col2").show()
> cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the 
> records with schema "+ str(cnt1))
> cnt2=extract_without_schema_df.select("col1","col2").count()print("count of 
> the records without schema "+str(cnt2))
> cnt2=extract_without_schema_df.select("col1","col2").show()}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Description: 
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct. 
output is attached in the atatchement

!image-2021-07-23-16-31-46-172.png!

  was:
 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct.

!image-2021-07-23-16-31-46-172.png!


> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
>  I am trying to read the two dataframes from one file using with/without 
> predefined schema. Issue is when i read a DF with a predefined schema and 
> with mode as "dropmalformed", the record count in  df is not dropping the 
> records. The record count is same as actual file where i am expecting less 
> record count,as there are few malformed records . But when i try to select 
> and display the records in df ,it is not showing malformed records. So 
> display is correct. output is attached in the atatchement
> !image-2021-07-23-16-31-46-172.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Attachment: 111.PNG

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
> Attachments: 111.PNG
>
>
>  I am trying to read the two dataframes from one file using with/without 
> predefined schema. Issue is when i read a DF with a predefined schema and 
> with mode as "dropmalformed", the record count in  df is not dropping the 
> records. The record count is same as actual file where i am expecting less 
> record count,as there are few malformed records . But when i try to select 
> and display the records in df ,it is not showing malformed records. So 
> display is correct. output is attached in the atatchement
> !image-2021-07-23-16-31-46-172.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-
Summary: Issue with record count of data frame while reading in 
DropMalformed mode  (was: Issue record count of data frame while reading in 
DropMalformed mode)

> Issue with record count of data frame while reading in DropMalformed mode
> -
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: anju
>Priority: Major
>
>  I am trying to read the two dataframes from one file using with/without 
> predefined schema. Issue is when i read a DF with a predefined schema and 
> with mode as "dropmalformed", the record count in  df is not dropping the 
> records. The record count is same as actual file where i am expecting less 
> record count,as there are few malformed records . But when i try to select 
> and display the records in df ,it is not showing malformed records. So 
> display is correct.
> !image-2021-07-23-16-31-46-172.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36277) Issue record count of data frame while reading in DropMalformed mode

2021-07-23 Thread anju (Jira)
anju created SPARK-36277:


 Summary: Issue record count of data frame while reading in 
DropMalformed mode
 Key: SPARK-36277
 URL: https://issues.apache.org/jira/browse/SPARK-36277
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.3
Reporter: anju


 I am trying to read the two dataframes from one file using with/without 
predefined schema. Issue is when i read a DF with a predefined schema and with 
mode as "dropmalformed", the record count in  df is not dropping the records. 
The record count is same as actual file where i am expecting less record 
count,as there are few malformed records . But when i try to select and display 
the records in df ,it is not showing malformed records. So display is correct.

!image-2021-07-23-16-31-46-172.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org