[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
anju updated SPARK-36277:
-------------------------
Description:
Steps to reproduce an issue with the PySpark count() API when reading a CSV file in DROPMALFORMED mode.

I have a sample CSV file in an S3 bucket and read it with the PySpark CSV reader into two DataFrames: one with an explicit schema and mode="DROPMALFORMED", and one without a schema. When the with-schema DataFrame is displayed, the output looks correct: the malformed records are not shown. But when count() is applied to the same DataFrame, it returns the record count of the whole file. I expect it to return only the number of valid records.

Here is the code used:

```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

without_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

with_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv",
    header=True, schema=schema, mode="DROPMALFORMED")

print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()

cnt_with_schema = with_schema_df.count()
print("The records count from with schema df: " + str(cnt_with_schema))
cnt_without_schema = without_schema_df.count()
print("The records count from without schema df: " + str(cnt_without_schema))
```

Here are the output screenshots: 111.PNG shows the output of the code and Inputfile.csv is the input to the code.

was: I am writing the steps to reproduce the issue for "count" pyspark api while using mode as dropmalformed. I have a csv sample file in s3 bucket . I am reading the file using pyspark api for csv . I am reading the csv with schema and without schema using mode "dropmalformed" options in two different dataframes .
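For reference, the valid-record count that DROPMALFORMED is expected to report can be checked outside Spark. Below is a minimal sketch using only the Python standard library; the sample rows are hypothetical stand-ins for sample.csv (which is only attached as a screenshot), and a row is treated as malformed when its salary field does not parse as an integer, mirroring the IntegerType() column in the schema above.

```python
import csv
import io

# Hypothetical sample rows shaped like the schema in the report
# (firstname, middlename, lastname, id, gender, salary); the real
# sample.csv lives in S3 and is not reproduced in the issue.
raw = """firstname,middlename,lastname,id,gender,salary
James,,Smith,36636,M,3000
Anna,Rose,,40288,F,not_a_number
Robert,,Williams,42114,M,4000
"""

def is_valid(row):
    # Mirror the IntegerType() salary field: a row counts as malformed
    # here if its salary value does not parse as an integer.
    try:
        int(row["salary"])
        return True
    except ValueError:
        return False

rows = list(csv.DictReader(io.StringIO(raw)))
valid_rows = [r for r in rows if is_valid(r)]

print("total records:", len(rows))        # 3
print("valid records:", len(valid_rows))  # 2
```

For data like this, the reporter's expectation is that count() on the with-schema DataFrame returns the valid-record figure (2 here), not the total-record figure (3).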
> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG, Inputfile.PNG
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)