[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17729825#comment-17729825 ]
Zach Liu commented on SPARK-36277:
----------------------------------

I see the same behavior on Spark 3.3.1. I have to create this "checkpoint":

```
spark.sql("set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ColumnPruning")
true_count = df.count()
spark.sql("set spark.sql.optimizer.excludedRules=null")
all_count = df.count()

malformed_count = all_count - true_count
if malformed_count > 0:
    raise ValueError("Self-defined schema is not compatible with the data")
```

I don't know whether disabling `ColumnPruning` has other implications, so I just re-enable it afterwards.

> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>        Attachments: 111.PNG, Inputfile.PNG, sample.csv
>
> Here are the steps to reproduce the issue with the "count" PySpark API when using mode DROPMALFORMED.
> I have a sample CSV file in an S3 bucket. I read the file with the PySpark CSV reader twice, into two different DataFrames: once without a schema, and once with a schema and the mode 'DROPMALFORMED' option. When displaying the "with schema, DROPMALFORMED" DataFrame, the output looks good: the malformed records are not shown. But when we apply the count API to that DataFrame, it returns the record count of the actual file. I expect it to return only the count of valid records.
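> To make the expected behavior concrete, here is a minimal plain-Python sketch (not Spark code, and with made-up data) of what DROPMALFORMED is supposed to mean for count(): rows whose values fail schema conversion are dropped, so the count should cover only the rows that parse.

```python
# Plain-Python illustration of DROPMALFORMED counting semantics.
# Data is invented for this sketch; "salary" plays the IntegerType column.
rows = [
    {"firstname": "James", "salary": "3000"},
    {"firstname": "Anna",  "salary": "4000"},
    {"firstname": "Maria", "salary": "oops"},  # malformed: salary is not an int
]

def parses(row):
    """Return True if the row matches the schema (salary must be an integer)."""
    try:
        int(row["salary"])
        return True
    except ValueError:
        return False

# Under DROPMALFORMED, count() should reflect only the valid rows.
valid_count = sum(1 for r in rows if parses(r))
print(valid_count)  # -> 2, not 3
```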
> Here is the code used:
> {code}
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> without_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)
> schema = StructType([
>     StructField("firstname", StringType(), True),
>     StructField("middlename", StringType(), True),
>     StructField("lastname", StringType(), True),
>     StructField("id", StringType(), True),
>     StructField("gender", StringType(), True),
>     StructField("salary", IntegerType(), True)
> ])
> with_schema_df = spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",
>                                 header=True, schema=schema, mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema = with_schema_df.count()
> print("The records count from with schema df: " + str(cnt_with_schema))
> cnt_without_schema = without_schema_df.count()
> print("The records count from without schema df: " + str(cnt_without_schema))
> {code}
> Here are the outputs: the screenshot 111.PNG shows the outputs of the code, and inputfile.csv is the input to the code.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
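The commenter's workaround above points at the ColumnPruning optimizer rule, which suggests the mechanism behind the mismatch: count() needs no column values, so with pruning the reader can skip converting fields, and malformed values are never detected. The following plain-Python sketch (not Spark internals, with made-up CSV lines) illustrates that suspected mechanism.

```python
# Illustration of the suspected cause of the count mismatch.
# With full parsing, the bad "salary" value is detected and the row dropped;
# with all columns pruned away, nothing is converted, so nothing is dropped
# and the raw line count comes back.
csv_lines = ["James,3000", "Anna,4000", "Maria,oops"]  # invented sample data

def count_rows(lines, parse_columns):
    """Count rows, optionally converting the salary column as an integer."""
    n = 0
    for line in lines:
        _, salary = line.split(",")
        if parse_columns:
            try:
                int(salary)        # conversion fails for the malformed row
            except ValueError:
                continue           # DROPMALFORMED: skip the bad row
        n += 1                     # pruned read: nothing parsed, nothing dropped
    return n

print(count_rows(csv_lines, parse_columns=True))   # 2: the expected count
print(count_rows(csv_lines, parse_columns=False))  # 3: what the bug reports
```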