[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
anju updated SPARK-36277:
-------------------------
    Description:

While reading a DataFrame in "DROPMALFORMED" mode, I am not getting the right record count. dataframe.count() returns the record count of the actual file, including malformed records, even though the DataFrame is read in "DROPMALFORMED" mode. Is there a way to overcome this in PySpark?

Here is a high-level overview of what I am doing. I read two DataFrames from one file, with and without a predefined schema. The issue is that when I read a DataFrame with a predefined schema and mode "DROPMALFORMED", the record count does not drop the malformed records: it is the same as the actual file, whereas I expect a lower count because the file contains a few malformed records. However, when I select and display the records in the DataFrame, the malformed records are not shown, so the display is correct. The output is attached in the attachment.

Code:

{{s3_obj = boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', Key='data/test_files/schema_xyz.json')
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
# print(s3_clientdata)
schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))
extract_with_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", schema=schemaFromJson, mode="DROPMALFORMED")
extract_without_schema_df = spark.read.csv("s3:few_columns.csv", header=True, sep=",", mode="PERMISSIVE")
extract_with_schema_df.select("col1", "col2").show()
cnt1 = extract_with_schema_df.select("col1", "col2").count()
print("count of the records with schema " + str(cnt1))
cnt2 = extract_without_schema_df.select("col1", "col2").count()
print("count of the records without schema " + str(cnt2))
extract_without_schema_df.select("col1", "col2").show()}}
> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
>
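One likely explanation for the mismatch, offered here as an assumption to verify against your Spark version: since Spark 2.4 the CSV reader prunes columns, so a query that does not require parsing any column (such as df.count()) may never detect malformed rows, while show() parses the selected columns and does drop them. A commonly suggested workaround is to disable CSV column pruning; this sketch reuses the reporter's path and schemaFromJson variable and assumes a live SparkSession:

```python
# Sketch of the column-pruning workaround (requires a running SparkSession).
# The config key below exists in Spark 2.4+; confirm on your deployment.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

# With pruning disabled, every row is fully parsed against the schema,
# so DROPMALFORMED can discard malformed rows before count() sees them.
df = spark.read.csv("s3:few_columns.csv", header=True, sep=",",
                    schema=schemaFromJson, mode="DROPMALFORMED")
print(df.count())  # expected to exclude malformed rows once pruning is off
```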
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail:
issues-h...@spark.apache.org