[ 
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anju updated SPARK-36277:
-------------------------
    Description: 
While reading a dataframe in "dropmalformed" mode, I am not getting the right record
count. dataframe.count() returns the record count of the actual file, including
malformed records, even though the dataframe is read in "dropmalformed" mode. Is
there a way to overcome this in PySpark?
 Here is a high-level overview of what I am doing. I am reading two dataframes
from one file, with and without a predefined schema. The issue is that when I
read a DF with a predefined schema and with mode "dropmalformed", the record
count of the df does not drop the malformed records: the count is the same as the
actual file, whereas I expect a lower count because there are a few malformed
records. But when I select and display the records of the df, the malformed
records are not shown, so the display is correct. The output is attached.

code
  
 {{import json
 import boto3
 from pyspark.sql.types import StructType

 # Load the schema definition from S3 and build a StructType from it
 s3_obj = boto3.client('s3')
 s3_clientobj = s3_obj.get_object(Bucket='xyz', Key='data/test_files/schema_xyz.json')
 s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
 schemaSource = json.loads(s3_clientdata)
 schemaFromJson = StructType.fromJson(schemaSource)

 # Read the same file twice: with the schema in DROPMALFORMED mode,
 # and without a schema in PERMISSIVE mode
 extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", schema=schemaFromJson, mode="DROPMALFORMED")
 extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", mode="PERMISSIVE")

 extract_with_schema_df.select("col1", "col2").show()
 cnt1 = extract_with_schema_df.select("col1", "col2").count()
 print("count of the records with schema " + str(cnt1))
 cnt2 = extract_without_schema_df.select("col1", "col2").count()
 print("count of the records without schema " + str(cnt2))

 extract_without_schema_df.select("col1", "col2").show()}}
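For what it's worth, this looks like the CSV column-pruning behavior introduced in Spark 2.4: count() does not need any column values, so the parser never fully parses the rows and DROPMALFORMED has nothing to drop. A plain-Python sketch of the two counting behaviors (no Spark required; the file contents and the two-column schema below are made up for illustration):

```python
import csv
import io

# Hypothetical two-column schema: col1 must parse as int, col2 as str
schema = [("col1", int), ("col2", str)]

# Made-up file contents: the second data row is malformed ("oops" is not an int)
raw = """col1,col2
1,a
oops,b
3,c
"""

def parse_rows(text):
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return list(reader)

def is_well_formed(row):
    # A row is well-formed if it has the right arity and every field
    # converts to its schema type
    if len(row) != len(schema):
        return False
    try:
        for value, (_name, typ) in zip(row, schema):
            typ(value)  # e.g. int("oops") raises ValueError
        return True
    except ValueError:
        return False

rows = parse_rows(raw)

# "Pruned" counting only checks that a row exists -- analogous to count()
# skipping column parsing in Spark 2.4, so malformed rows are still counted
pruned_count = len(rows)

# Full parsing validates every field and drops malformed rows -- this is
# the count one would expect from DROPMALFORMED
dropmalformed_count = sum(is_well_formed(r) for r in rows)

print(pruned_count)         # 3: includes the malformed row
print(dropmalformed_count)  # 2: malformed row dropped
```

If this is indeed the column-pruning behavior, setting {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}} before the read (as described in the Spark 2.4 migration notes) should make count() agree with what show() displays.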

 


> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
