[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36277:
-
Description:

These are the steps to reproduce an incorrect "count" result from the PySpark CSV API when using mode "dropmalformed".

I have a sample CSV file in an S3 bucket and read it with the PySpark CSV API into two different DataFrames: once without a schema, and once with a schema and mode "dropmalformed". When I display the DataFrame read with the schema and mode "dropmalformed", the output looks good: the malformed records are not shown. But when I apply the count API to that same DataFrame, it returns the record count of the whole file, malformed records included. I expect it to return only the count of valid records.

Here is the code used:

{code}
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Read without a schema: every row is kept as-is.
without_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

# Read with the schema and DROPMALFORMED: malformed rows should be dropped.
with_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv",
    header=True, schema=schema, mode="DROPMALFORMED")

print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()

cnt_with_schema = with_schema_df.count()
print("The records count from with schema df: " + str(cnt_with_schema))
cnt_without_schema = without_schema_df.count()
print("The records count from without schema df: " + str(cnt_without_schema))
{code}

The screenshot 111.PNG shows the output of the code, and inputfile.csv is the input to the code.

> Issue with record count of data frame while reading in DropMalformed mode
> -
> Key: SPARK-36277
> URL: https://issues.apache.org/jira/browse/SPARK-36277
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: anju
> Priority: Major
> Attachments: 111.PNG, Inputfile.PNG, sample.csv
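The expectation in the report — that rows failing schema coercion are excluded from every action, including count() — can be sketched in plain Python. This is an illustration of the expected DROPMALFORMED semantics, not Spark's implementation; the sample data and the salary-as-integer rule are hypothetical, chosen to mirror the schema above.

```python
import csv
import io

# Hypothetical CSV mirroring the reported layout: the last column
# ("salary") is declared IntegerType, so a non-numeric salary makes
# the row malformed under the schema.
raw = """firstname,middlename,lastname,id,gender,salary
James,,Smith,36636,M,3000
Michael,Rose,,40288,M,notanumber
Robert,,Williams,42114,M,4000
"""

def read_rows(text):
    """All data rows, schema ignored (like reading without a schema)."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_malformed(rows):
    """Keep only rows whose salary parses as an integer
    (the filtering DROPMALFORMED is expected to apply)."""
    good = []
    for row in rows:
        try:
            int(row["salary"])
            good.append(row)
        except (TypeError, ValueError):
            pass  # malformed row: excluded from every downstream action
    return good

rows = read_rows(raw)
valid = drop_malformed(rows)
print(len(rows))   # count without schema: 3
print(len(valid))  # expected count with DROPMALFORMED: 2
```

In the report, Spark 2.4.3 instead returns the full-file count for the DROPMALFORMED DataFrame. One plausible explanation, not confirmed in this thread, is that count() requires no columns, so with CSV column pruning the parser never coerces "salary" and never detects the malformed rows; caching the DataFrame before counting, or disabling column pruning, may change the observed count.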
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Attachment: sample.csv

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Description: (updated)
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Description: (updated)
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Attachment: Inputfile.PNG
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Attachment: 111.PNG
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Attachment: (was: 111.PNG)
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anju updated SPARK-36277:
-
Description: (updated)
[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode
[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anju updated SPARK-36277:
-------------------------
Description:

Steps to reproduce an issue with the PySpark count() API when reading a CSV file with mode "DROPMALFORMED".

I have a sample CSV file in an S3 bucket, which I read with the PySpark CSV reader into two DataFrames: one without a schema, and one with an explicit schema and mode "DROPMALFORMED". When I display the with-schema DataFrame, the output looks correct: the malformed records are not shown. But when I call count() on that same DataFrame, it returns the record count of the actual file, including the malformed records. I expect it to return only the valid record count.

Here is the code used:

{code}
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

without_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv", header=True)

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

with_schema_df = spark.read.csv(
    "s3://noa-poc-lakeformation/data/test_files/sample.csv",
    header=True, schema=schema, mode="DROPMALFORMED")

print("The dataframe with schema")
with_schema_df.show()
print("The dataframe without schema")
without_schema_df.show()

cnt_with_schema = with_schema_df.count()
print("The record count from the with-schema df: " + str(cnt_with_schema))
cnt_without_schema = without_schema_df.count()
print("The record count from the without-schema df: " + str(cnt_without_schema))
{code}

The outputs are in the attached screenshot !111.PNG!; inputfile.csv is the input to the code.

> Issue with record count of data frame while reading in DropMalformed mode
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36277
>                 URL: https://issues.apache.org/jira/browse/SPARK-36277
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>            Reporter: anju
>            Priority: Major
>         Attachments: 111.PNG
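The counting behavior the reporter expects can be sketched without Spark at all. This is a plain-Python illustration of DROPMALFORMED semantics, not the Spark implementation; the sample rows and the salary-must-be-an-int rule are illustrative assumptions modeled on the schema above.

```python
import csv
import io

# Hypothetical sample data modeled on the report's schema; the last row is
# malformed because "salary" does not parse as an integer.
sample_csv = (
    "firstname,middlename,lastname,id,gender,salary\n"
    "James,,Smith,36636,M,3100\n"
    "Michael,Rose,,40288,M,4300\n"
    "Robert,,Williams,42114,M,not_a_number\n"
)

def is_well_formed(row):
    """A row is well formed when it has 6 fields and salary casts to int."""
    if len(row) != 6:
        return False
    try:
        int(row[5])
        return True
    except ValueError:
        return False

rows = list(csv.reader(io.StringIO(sample_csv)))[1:]  # skip the header row
total_count = len(rows)                               # what count() reported
dropmalformed_count = sum(1 for r in rows if is_well_formed(r))  # expected

print(total_count, dropmalformed_count)  # 3 2
```

Under these semantics, a DataFrame read with mode "DROPMALFORMED" should report the smaller number (2 here) from count(), matching what show() displays.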
[jira] [Updated] (SPARK-36277) Issue with record count of data frame while reading in DropMalformed mode

[ https://issues.apache.org/jira/browse/SPARK-36277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-36277:
---------------------------------
Description:

While reading the DataFrame in "DROPMALFORMED" mode, I am not getting the right record count. dataframe.count() gives me the record count of the actual file, including malformed records, even though the DataFrame is read in "dropmalformed" mode. Is there a way to overcome this in PySpark?

Here is a high-level overview of what I am doing. I am reading two DataFrames from one file, with and without a predefined schema. The issue is that when I read a DataFrame with a predefined schema and mode "dropmalformed", count() does not drop the malformed records: it matches the actual file, where I expect a smaller count because there are a few malformed records. But when I select and display the records in the DataFrame, the malformed records are not shown, so the display is correct. The output is attached in the attachment.

{code}
import json
import boto3
from pyspark.sql.types import StructType

s3_obj = boto3.client('s3')
s3_clientobj = s3_obj.get_object(Bucket='xyz', Key='data/test_files/schema_xyz.json')
s3_clientdata = s3_clientobj['Body'].read().decode('utf-8')
schemaFromJson = StructType.fromJson(json.loads(s3_clientdata))

extract_with_schema_df = spark.read.csv(
    "s3:few_columns.csv", header=True, sep=",",
    schema=schemaFromJson, mode="DROPMALFORMED")
extract_without_schema_df = spark.read.csv(
    "s3:few_columns.csv", header=True, sep=",", mode="permissive")

extract_with_schema_df.select("col1", "col2").show()
cnt1 = extract_with_schema_df.select("col1", "col2").count()
print("count of the records with schema " + str(cnt1))
cnt2 = extract_without_schema_df.select("col1", "col2").count()
print("count of the records without schema " + str(cnt2))
extract_without_schema_df.select("col1", "col2").show()
{code}
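One explanation that fits the observed behavior (offered as an assumption, not a confirmed diagnosis for this ticket): since Spark 2.4 the CSV data source prunes columns, and an action like count() needs no column values, so the parser may never materialize the fields whose failed casts would mark a row malformed; show() and select() do materialize them, which is why the display looks correct while count() does not. If that is the cause, disabling CSV column pruning before reading should make DROPMALFORMED counts consistent:

```python
# Assumption: the Spark 2.4.x CSV column-pruning flag (default "true").
# Disabling it forces full-row parsing, so malformed rows are detected and
# dropped even for actions like count() that otherwise need no columns.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
```

Here `spark` is the same SparkSession used in the snippets above; set the flag before calling spark.read.csv.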
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org