[
https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
anju updated SPARK-36277:
-------------------------
Description:
I am writing the steps to reproduce the issue for "count" pyspark api while
using mode as dropmalformed.
I have a csv sample file in s3 bucket . I am reading the file using pyspark api
for csv . I am reading the csv with schema and without schema using mode
"dropmalformed" options in two different dataframes . While displaying the
with schema dataframe , the display looks good ,it is not showing the malformed
records .But when we apply count api on the dataframe it gives the record count
of actual file. I am expecting it should give me valid record count .
here is the code used:-
```
without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True), \
StructField("lastname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
with_schema_df =
spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()
cnt_with_schema=with_schema_df.count()
print("The records count from with schema df :"+str(cnt_with_schema))
cnt_without_schema=without_schema_df.count()
print("The records count from without schema df: "+str(cnt_without_schema))
```
here is the outputs:-
!111.PNG!
was:
While reading the dataframe in malformed mode ,I am not getting right record
count. dataframe.count() is giving me the record count of actual file including
malformed records, eventhough data frame is read in "dropmalformed" mode. Is
there a way to overcome this in pyspark
here is the high level overview of what i am doing. I am trying to read the
two dataframes from one file using with/without predefined schema. Issue is
when i read a DF with a predefined schema and with mode as "dropmalformed", the
record count in df is not dropping the records. The record count is same as
actual file where i am expecting less record count,as there are few malformed
records . But when i try to select and display the records in df ,it is not
showing malformed records. So display is correct. output is attached in the
aattchment
code
{code}
s3_obj =boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz',
Key='data/test_files/schema_xyz.json')
s3_clientobj
s3_clientdata =
s3_clientobj['Body'].read().decode('utf-8')#print(s3_clientdata)schemaSource=json.loads(s3_clientdata)
schemaFromJson =StructType.fromJson(json.loads(s3_clientdata))
extract_with_schema_df =
spark.read.csv("s3:few_columns.csv",header=True,sep=",",schema=schemaFromJson,mode="DROPMALFORMED")
extract_without_schema_df =
spark.read.csv("s3:few_columns.csv",header=True,sep=",",mode="permissive")
extract_with_schema_df.select("col1","col2").show()
cnt1=extract_with_schema_df.select("col1","col2").count()print("count of the
records with schema "+ str(cnt1))
cnt2=extract_without_schema_df.select("col1","col2").count()print("count of the
records without schema "+str(cnt2))
cnt2=extract_without_schema_df.select("col1","col2").show()}}
{code}
> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: anju
> Priority: Major
> Attachments: 111.PNG
>
>
> I am writing the steps to reproduce the issue for "count" pyspark api while
> using mode as dropmalformed.
> I have a csv sample file in s3 bucket . I am reading the file using pyspark
> api for csv . I am reading the csv with schema and without schema using mode
> "dropmalformed" options in two different dataframes . While displaying the
> with schema dataframe , the display looks good ,it is not showing the
> malformed records .But when we apply count api on the dataframe it gives the
> record count of actual file. I am expecting it should give me valid record
> count .
> here is the code used:-
> ```
> without_schema_df=spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True)
> schema = StructType([ \
> StructField("firstname",StringType(),True), \
> StructField("middlename",StringType(),True), \
> StructField("lastname",StringType(),True), \
> StructField("id", StringType(), True), \
> StructField("gender", StringType(), True), \
> StructField("salary", IntegerType(), True) \
> ])
> with_schema_df =
> spark.read.csv("s3://noa-poc-lakeformation/data/test_files/sample.csv",header=True,schema=schema,mode="DROPMALFORMED")
> print("The dataframe with schema")
> with_schema_df.show()
> print("The dataframe without schema")
> without_schema_df.show()
> cnt_with_schema=with_schema_df.count()
> print("The records count from with schema df :"+str(cnt_with_schema))
> cnt_without_schema=without_schema_df.count()
> print("The records count from without schema df: "+str(cnt_without_schema))
> ```
> here is the outputs:-
> !111.PNG!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]