[jira] [Updated] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-49016:
-
Description: 
Spark DataSet.isEmpty behaves differently for CSV than for JSON:
 * CSV → dataSet.isEmpty returns a value for any query

 * JSON → dataSet.isEmpty throws an error when the only filter is {{_corrupt_record is null}}:
!image-2024-07-26-15-50-10-280.png!

Tested versions: Spark 3.4.3, Spark 3.5.1

Expected behaviour: either throw an error for both file types or return the correct value.

 

To demonstrate the behaviour I added a unit test.

 
test.csv
{code:java}
first,second,third{code}
test.json
{code:java}
{"first": "first", "second": "second", "third": "third"}{code}
Code:
{noformat}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

public class SparkIsEmptyTest {

    private SparkSession sparkSession;

    @BeforeEach
    void setUp() {
        sparkSession = getSpark();
    }

    @AfterEach
    void after() {
        sparkSession.close();
    }

    @Test
    void testDatasetIsEmptyForCsv() {
        var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForJson() {
        var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForJsonAnd1Eq1() {
        var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and 1=1");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForCsvAnd1Eq1() {
        var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and 1=1");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForJsonAndOtherCondition() {
        var dataSet = runJsonQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and first='first'");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForCsvAndOtherCondition() {
        var dataSet = runCsvQuery("select first, second, third, _corrupt_record from tempView where _corrupt_record is null and first='first'");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForJsonAggregation() {
        var dataSet = runJsonQuery("select count(1) from tempView where _corrupt_record is null");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForCsvAggregation() {
        var dataSet = runCsvQuery("select count(1) from tempView where _corrupt_record is null");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForJsonAggregationGroupBy() {
        var dataSet = runJsonQuery("select count(1), first from tempView where _corrupt_record is null group by first");

        assert !dataSet.isEmpty();
    }

    @Test
    void testDatasetIsEmptyForCsvAggregationGroupBy() {
        var dataSet = runCsvQuery("select count(1), first from tempView where _corrupt_record is null group by first");

        assert !dataSet.isEmpty();
    }

    private SparkSession getSpark() {
        return SparkSession.builder()
                .master("local")
                .appName("spark-dataset-isEmpty-issue")
                .config("spark.ui.enabled", "false")
                .getOrCreate();
    }

    private Dataset<Row> runJsonQuery(String query) {
        Dataset<Row> dataset = sparkSession.read()
                .schema("first STRING, second STRING, third STRING, _corrupt_record STRING")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .json("test.json");

        dataset.createOrReplaceTempView("tempView");
        var dataSet = sparkSession.sql(query);

        dataSet.show();

        return dataSet;
    }

    private Dataset<Row> runCsvQuery(String query) {
        Dataset<Row> dataset = sparkSession.read()
                .schema("first STRING, second STRING, third STRING, _corrupt_record STRING")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .csv("test.csv");

        dataset.createOrReplaceTempView("tempView");
        var dataSet = sparkSession.sql(query);

        dataSet.show();

        return dataSet;
    }
}{noformat}
Result:
!image-2024-07-26-15-50-24-308.png!
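
A possible workaround sketch, assuming the JSON error comes from Spark's restriction on querying raw JSON files when the filter references only the internal corrupt record column: caching the parsed Dataset before running the query may avoid it. runJsonQueryWithCache below is a hypothetical variant of runJsonQuery from the test above, not a confirmed fix:
{code:java}
// Hypothetical variant of runJsonQuery: cache the parsed Dataset before querying,
// so the SQL no longer runs directly against the raw JSON file.
private Dataset<Row> runJsonQueryWithCache(String query) {
    Dataset<Row> dataset = sparkSession.read()
            .schema("first STRING, second STRING, third STRING, _corrupt_record STRING")
            .option("columnNameOfCorruptRecord", "_corrupt_record")
            .json("test.json")
            .cache(); // materialize the parsed rows; may sidestep the corrupt-record-only restriction

    dataset.createOrReplaceTempView("tempView");
    return sparkSession.sql(query);
}{code}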

 

[jira] [Updated] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-49016:
-
Attachment: image-2024-07-26-15-50-10-280.png


[jira] [Updated] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-49016:
-
Attachment: image-2024-07-26-15-50-24-308.png


[jira] [Created] (SPARK-49016) Spark DataSet.isEmpty behaviour is different on CSV than JSON

2024-07-26 Thread Marius Butan (Jira)
Marius Butan created SPARK-49016:


 Summary: Spark DataSet.isEmpty behaviour is different on CSV than 
JSON
 Key: SPARK-49016
 URL: https://issues.apache.org/jira/browse/SPARK-49016
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.3, 3.5.1
Reporter: Marius Butan




[jira] [Commented] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-14 Thread Marius Butan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265022#comment-17265022
 ] 

Marius Butan commented on SPARK-34042:
--

As I said in the description, I made the tests in 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]:

All the tests run on a file with 3 columns (a sketch of the first scenario follows the list):

 * withPruningEnabledAndMapSingleColumn → with pruning enabled, when the schema has a single column and we use it in the select query, the expected result is 2, not 0

 * withPruningEnabledAndMap2ColumnsButUse1InSql → with pruning enabled, when the schema has 2 columns and we use only 1 of them in the select, the result is 2 and it is correct

 * withPruningDisableAndMap2ColumnsButUse1InSql → with pruning disabled it works correctly
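
A minimal sketch of the first scenario, assuming a hypothetical 3-column CSV with an "id,name,age" header and two data rows, and a schema that maps only one of those columns; the file contents and column names are my assumptions here, the real files are in the repository linked above:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PruningSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                // set to "false" to restore the pre-2.4 behaviour, per the migration guide quoted below
                .config("spark.sql.csv.parser.columnPruning.enabled", "true")
                .getOrCreate();

        // Map a single CSV column plus the corrupt record column.
        Dataset<Row> rows = spark.read()
                .schema("id STRING, _corrupt_record STRING")
                .option("header", "true")
                .option("mode", "PERMISSIVE")
                .option("columnNameOfCorruptRecord", "_corrupt_record")
                .csv("test.csv");

        rows.createOrReplaceTempView("tempView");
        // Per the report: with pruning enabled the rows come back marked as corrupted (count 0);
        // with pruning disabled, or with a second mapped column, the count is 2.
        long notCorrupted = spark.sql(
                "select id from tempView where _corrupt_record is null").count();
        System.out.println(notCorrupted);

        spark.stop();
    }
}{code}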

> Column pruning is not working as expected for PERMISIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode:
> Given a CSV with multiple columns per row, if your file schema has a single 
> column and you run a SELECT in SQL with a condition like 
> '_corrupt_record is null', the row is marked as corrupted.
>  
> BUT if you add an extra column to the file schema and do not put that column 
> in the SQL SELECT, the row is not marked as corrupted.
>  
> PS. I don't know exactly what the right behavior is; I didn't find it documented 
> for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  






[jira] [Updated] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-07 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
-
Description: 
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

PS. I don't know exactly what is the right behavior, I didn't find it for 
PERMISSIVE mode the documentation.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

  was:
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

PS. I don't know exactly what is the right behavior, I didn't find for 
PERMISSIVE mode the documentation.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 


> Column pruning is not working as expected for PERMISIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode
> Given a csv with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted
>  
> BUT if you add an extra column in the file schema and you are not putting 
> that column in SQL SELECT , the row is not marked as corrupted
>  
> PS. I don't know exactly what is the right behavior, I didn't find it for 
> PERMISSIVE mode the documentation.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org






[jira] [Created] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-07 Thread Marius Butan (Jira)
Marius Butan created SPARK-34042:


 Summary: Column pruning is not working as expected for PERMISIVE 
mode
 Key: SPARK-34042
 URL: https://issues.apache.org/jira/browse/SPARK-34042
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.7
Reporter: Marius Butan





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org