[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

anju (Jira) Mon, 26 Jul 2021 07:51:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


anju updated SPARK-36277:
-------------------------
    Description: 
These are the steps to reproduce an issue with the PySpark "count" API when 
reading with mode "DROPMALFORMED".

I have a sample CSV file in an S3 bucket. I read it with the PySpark CSV 
reader into two different dataframes: one with an explicit schema and mode 
"DROPMALFORMED", and one without a schema. When I display the dataframe read 
with the schema, the output looks correct: the malformed records are not 
shown. But when I call count() on that dataframe, it returns the record count 
of the whole file. I expect it to return the count of valid records only.

Here is the code used:
```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# `spark` is the active SparkSession (already available in the PySpark shell or notebook).
path = "s3://noa-poc-lakeformation/data/test_files/sample.csv"

# Read without a schema and without DROPMALFORMED.
without_schema_df = spark.read.csv(path, header=True)

# Explicit schema: "salary" must parse as an integer.
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# Read with the schema, dropping rows that do not match it.
with_schema_df = spark.read.csv(path, header=True, schema=schema, mode="DROPMALFORMED")

print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()

cnt_with_schema = with_schema_df.count()
print("The records count from with schema df: " + str(cnt_with_schema))
cnt_without_schema = without_schema_df.count()
print("The records count from without schema df: " + str(cnt_without_schema))
```
Here is the output:


 !111.PNG! 
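
To make the expectation concrete, the valid record count can be worked out 
outside Spark and compared with what count() returns. The following is only a 
minimal sketch in plain Python. The sample rows are hypothetical (the real 
sample.csv is not reproduced in this report), and the check is a simplified 
notion of "malformed" (wrong number of fields, or a salary that is not an 
integer), not Spark's exact parsing rules.

```python
import csv
import io

# Hypothetical stand-in for sample.csv; the real file is not shown in this report.
SAMPLE_CSV = """firstname,middlename,lastname,id,gender,salary
James,,Smith,36636,M,3000
Michael,Rose,,40288,M,4000
Robert,,Williams,42114,M,not_a_number
Maria,Anne,Jones,39192,F
"""

EXPECTED_COLUMNS = 6  # number of fields in the schema used above
SALARY_INDEX = 5      # position of the integer "salary" column


def is_well_formed(row):
    """Simplified check: six fields, and salary is empty or an integer."""
    if len(row) != EXPECTED_COLUMNS:
        return False
    salary = row[SALARY_INDEX].strip()
    if salary == "":
        return True  # the column is nullable in the schema
    try:
        int(salary)
    except ValueError:
        return False
    return True


def count_records(text):
    """Return (total data rows, rows passing the simplified check)."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    rows = [row for row in reader if row]
    valid = sum(1 for row in rows if is_well_formed(row))
    return len(rows), valid


total, valid = count_records(SAMPLE_CSV)
print("total records: " + str(total))
print("valid records: " + str(valid))
```

With this hypothetical data, the second number is what I would expect 
with_schema_df.count() to report, and the first is what it reports instead.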







  was:
While reading a dataframe in "dropmalformed" mode, I am not getting the right 
record count. dataframe.count() returns the record count of the actual file, 
including malformed records, even though the dataframe is read in 
"dropmalformed" mode. Is there a way to overcome this in PySpark?
  Here is a high-level overview of what I am doing. I am reading two 
dataframes from one file, with and without a predefined schema. When I read a 
dataframe with a predefined schema and mode "dropmalformed", the record count 
does not exclude the dropped records. It is the same as that of the actual 
file, whereas I expect a lower count because a few records are malformed. But 
when I select and display the records in the dataframe, the malformed records 
are not shown, so the display is correct. The output is in the attachment.

Code:
 
{code} 
import json

import boto3
from pyspark.sql.types import StructType

# Load the schema definition (JSON) from S3.
s3_obj = boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', Key='data/test_files/schema_xyz.json')
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
# print(s3_clientdata)
schemaSource = json.loads(s3_clientdata)
schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))

# Same file read twice: with the schema (DROPMALFORMED) and without it (permissive).
extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", schema=schemaFromJson, mode="DROPMALFORMED")

extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", mode="permissive")

extract_with_schema_df.select("col1", "col2").show()
cnt1 = extract_with_schema_df.select("col1", "col2").count()
print("count of the records with schema " + str(cnt1))
cnt2 = extract_without_schema_df.select("col1", "col2").count()
print("count of the records without schema " + str(cnt2))

# show() returns None, so its result is not assigned.
extract_without_schema_df.select("col1", "col2").show()
{code} 

 


> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
