[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194 ] Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 6:56 AM: --- [~dongjoon] [~yucai] Here is a brief summary. We can see that * The data source tables with the hive impl always return a,B,c, no matter whether spark.sql.caseSensitive is set to true or false and no matter whether the metastore table schema is in lower case or upper case. Given that the ORC file schema is (a,B,c,C): ** Would it be better to return null in scenarios 2 and 10? ** Would it be better to return C in scenario 12? ** Would it be better to fail due to ambiguity in scenarios 15, 18, 21, and 24, rather than always returning the lower-case one? * The data source tables with the native impl handle scenarios 2, 10, and 12 more reasonably than the hive impl does; however, they handle ambiguity in the same way as the hive impl. * The hive serde tables always throw IndexOutOfBoundsException at runtime. * Since, in case-sensitive mode, analysis should fail when a column name in the query and in the metastore schema differ in case, all AnalysisExceptions meet our expectations. 
Stacktrace of IndexOutOfBoundsException: {code:java} java.lang.IndexOutOfBoundsException: toIndex = 4 at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) at java.util.ArrayList.subList(ArrayList.java:996) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code}
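The resolution policy argued for in the summary above can be sketched as follows. This is a minimal stand-in in plain Python, not Spark's actual OrcFileFormat code; the helper name `resolve_field` is hypothetical, and the sketch only assumes the ORC file schema (a,B,c,C) from the investigation.

```python
# Illustrative sketch of the proposed field-resolution policy over the
# ORC file schema (a, B, c, C). Plain Python, not Spark internals.

ORC_FIELDS = ["a", "B", "c", "C"]

def resolve_field(name, fields, case_sensitive):
    """Return the matching physical field name, None when the column is
    absent (read as null), and raise when a case-insensitive lookup is
    ambiguous instead of silently picking the lower-case field."""
    if case_sensitive:
        # Exact match only; a miss means the column should read as null
        # (e.g. scenarios 2 and 10 in the table above).
        return name if name in fields else None
    matches = [f for f in fields if f.lower() == name.lower()]
    if len(matches) > 1:
        # e.g. querying "c" against (c, C): ambiguous, arguably better to
        # fail than to always return the lower-case one (scenarios 15/18/21/24).
        raise ValueError(f"Ambiguous reference: {name} matches {matches}")
    return matches[0] if matches else None
```

Under this sketch, `resolve_field("b", ORC_FIELDS, True)` yields None rather than silently matching "B", and a case-insensitive lookup of "c" fails on the (c, C) duplicate.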
[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185 ] Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 6:45 AM: --- A thorough investigation of ORC tables: {code:java} val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C") spark.conf.set("spark.sql.caseSensitive", true) data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data") $> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc Type: struct<a:bigint,B:bigint,c:bigint,C:bigint> CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data' DESC EXTENDED orc_data_source_lower; DESC EXTENDED orc_data_source_upper; DESC EXTENDED orc_hive_serde_lower; DESC EXTENDED orc_hive_serde_upper; spark.conf.set("spark.sql.hive.convertMetastoreOrc", false) {code} ||no.||caseSensitive||table columns||select column||orc column (select via data source table, hive impl)||orc column (select via data source table, native impl)||orc column (select via hive serde table)|| |1|true|a, b, c|a|a |a|IndexOutOfBoundsException | |2| | |b|B |null|IndexOutOfBoundsException | |3| | |c|c |c|IndexOutOfBoundsException | |4| | |A|AnalysisException|AnalysisException|AnalysisException| |5| | |B|AnalysisException|AnalysisException|AnalysisException| |6| | |C|AnalysisException|AnalysisException|AnalysisException| |7| |A, B, C|a|AnalysisException 
|AnalysisException|AnalysisException| |8| | |b|AnalysisException |AnalysisException|AnalysisException | |9| | |c|AnalysisException |AnalysisException|AnalysisException | |10| | |A|a |null|IndexOutOfBoundsException | |11| | |B|B |B|IndexOutOfBoundsException | |12| | |C|c |C|IndexOutOfBoundsException | |13|false|a, b, c|a|a |a|IndexOutOfBoundsException | |14| | |b|B |B|IndexOutOfBoundsException | |15| | |c|c |c|IndexOutOfBoundsException | |16| | |A|a |a|IndexOutOfBoundsException | |17| | |B|B |B|IndexOutOfBoundsException | |18| | |C|c |c|IndexOutOfBoundsException | |19| |A, B, C|a|a |a|IndexOutOfBoundsException | |20| | |b|B |B|IndexOutOfBoundsException | |21| | |c|c |c|IndexOutOfBoundsException | |22| | |A|a |a|IndexOutOfBoundsException | |23| | |B|B |B|IndexOutOfBoundsException | |24| | |C|c |c|IndexOutOfBoundsException |
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593200#comment-16593200 ] Chenxiao Mao commented on SPARK-25175: -- Here is a similar investigation I did for Parquet tables, for your information: https://github.com/apache/spark/pull/22184/files#r212405373 > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues. Since Spark has two > OrcFileFormat implementations, we should add support for both. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat doesn't support case-insensitive field > resolution at all. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao reopened SPARK-25175: --
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194 ] Chenxiao Mao commented on SPARK-25175: -- [~dongjoon] [~yucai] Here is a brief summary. We can see that * The data source tables always return a,B,c, no matter whether spark.sql.caseSensitive is set to true or false and no matter whether the metastore table schema is in lower case or upper case. Since the ORC file schema is (a,B,c,C): ** Would it be better to return null in scenarios 2 and 10? ** Would it be better to return C in scenario 12? ** Would it be better to fail due to ambiguity in scenarios 15, 18, 21, and 24, rather than always returning the lower-case one? * The hive serde tables always throw IndexOutOfBoundsException at runtime. * Since, in case-sensitive mode, analysis should fail when a column name in the query and in the metastore schema differ in case, all AnalysisExceptions meet our expectations. Stacktrace of IndexOutOfBoundsException: {code:java} java.lang.IndexOutOfBoundsException: toIndex = 4 at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004) at java.util.ArrayList.subList(ArrayList.java:996) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161) at org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66) at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226) at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215) at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113) at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code}
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185 ] Chenxiao Mao commented on SPARK-25175: -- A thorough investigation of ORC tables: {code} val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C") spark.conf.set("spark.sql.caseSensitive", true) data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data") $> hive --orcfiledump /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc Structure for /user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc Type: struct<a:bigint,B:bigint,c:bigint,C:bigint> CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data' CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc LOCATION '/user/hive/warehouse/orc_data' DESC EXTENDED orc_data_source_lower; DESC EXTENDED orc_data_source_upper; DESC EXTENDED orc_hive_serde_lower; DESC EXTENDED orc_hive_serde_upper; spark.conf.set("spark.sql.hive.convertMetastoreOrc", false) {code} ||no.||caseSensitive||table columns||select column||orc column (select via data source table)||orc column (select via hive serde table)|| |1|true|a, b, c|a|a |IndexOutOfBoundsException | |2| | |b|B |IndexOutOfBoundsException | |3| | |c|c |IndexOutOfBoundsException | |4| | |A|AnalysisException|AnalysisException| |5| | |B|AnalysisException|AnalysisException| |6| | |C|AnalysisException|AnalysisException| |7| |A, B, C|a|AnalysisException |AnalysisException| |8| | |b|AnalysisException |AnalysisException | |9| | |c|AnalysisException |AnalysisException | |10| | |A|a |IndexOutOfBoundsException | |11| | |B|B 
|IndexOutOfBoundsException | |12| | |C|c |IndexOutOfBoundsException | |13|false|a, b, c|a|a |IndexOutOfBoundsException | |14| | |b|B |IndexOutOfBoundsException | |15| | |c|c |IndexOutOfBoundsException | |16| | |A|a |IndexOutOfBoundsException | |17| | |B|B |IndexOutOfBoundsException | |18| | |C|c |IndexOutOfBoundsException | |19| |A, B, C|a|a |IndexOutOfBoundsException | |20| | |b|B |IndexOutOfBoundsException | |21| | |c|c |IndexOutOfBoundsException | |22| | |A|a |IndexOutOfBoundsException | |23| | |B|B |IndexOutOfBoundsException | |24| | |C|c |IndexOutOfBoundsException |
[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, the query below returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After a deep dive, there are two issues, both related to different letter cases between the Hive metastore schema and the Parquet schema. 1. The wrong column is pushed down. Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) down into Parquet, but {color:#ff}ID{color} does not exist in /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}). So no records are returned. Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema to do the pushdown, which fixes this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even when spark.sql.caseSensitive is set to false. SPARK-25132 already addressed this issue. The biggest difference is that in Spark 2.1 users get an exception for the same query: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} So they will know about the issue and fix the query, but in Spark 2.3 users silently get wrong results. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? 
> wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter > cases between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in diffe
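The pushdown mismatch described in the issue above can be illustrated with a small stand-in (plain Python, not parquet-mr's FilterApi): a case-sensitive reader treats the metastore-cased column "ID" as missing, so the pushed-down predicate silently matches nothing, which is exactly the empty result shown in the report.

```python
# Sketch of the pushdown mismatch: the file physically stores column "id"
# (case sensitive), but the filter is built from the metastore name "ID".
# Stand-in code for illustration only.

rows = [{"id": i} for i in range(10)]  # physical data: column "id"

def scan_with_pushdown(rows, column, predicate):
    # A case-sensitive reader sees no column "ID", so the pushed-down
    # predicate never matches and the scan silently returns nothing.
    return [r for r in rows if column in r and predicate(r[column])]

empty = scan_with_pushdown(rows, "ID", lambda v: v > 0)   # silent empty result
correct = scan_with_pushdown(rows, "id", lambda v: v > 0) # rows 1..9
```

Building the filter from the file's own schema (as SPARK-24716 does) picks "id" and restores the expected nine rows.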
[jira] [Commented] (SPARK-25248) Audit barrier APIs for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593176#comment-16593176 ] Apache Spark commented on SPARK-25248: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/22240 > Audit barrier APIs for Spark 2.4 > > > Key: SPARK-25248 > URL: https://issues.apache.org/jira/browse/SPARK-25248 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > > Make a pass over APIs added for barrier execution mode.
[jira] [Created] (SPARK-25248) Audit barrier APIs for Spark 2.4
Xiangrui Meng created SPARK-25248: - Summary: Audit barrier APIs for Spark 2.4 Key: SPARK-25248 URL: https://issues.apache.org/jira/browse/SPARK-25248 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 2.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Make a pass over APIs added for barrier execution mode.
[jira] [Created] (SPARK-25247) Make RDDBarrier configurable
Xiangrui Meng created SPARK-25247: - Summary: Make RDDBarrier configurable Key: SPARK-25247 URL: https://issues.apache.org/jira/browse/SPARK-25247 Project: Spark Issue Type: Story Components: Spark Core Affects Versions: 2.4.0 Reporter: Xiangrui Meng Currently we only offer one method under `RDDBarrier`. Users might want better control over a barrier stage, e.g., timeout behavior, failure recovery, etc. This JIRA is to discuss what options we should provide under RDDBarrier. Note: users can use multiple RDDBarriers in a single barrier stage, so we also need to discuss how to merge the options.
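As a purely hypothetical sketch of the merging question raised above: if each RDDBarrier in a stage could carry its own options (none of these option names exist in Spark; this is only one possible rule), the stage could take the strictest value of each option across all barriers it contains.

```python
# Hypothetical sketch: merge per-barrier options within one barrier stage
# by taking the strictest value of each option. Option names are invented
# for illustration and are not real Spark configuration keys.

def merge_barrier_options(option_sets):
    merged = {}
    for opts in option_sets:
        for key, value in opts.items():
            if key in ("timeout_s", "max_task_failures"):
                # Strictest wins: shortest timeout, least failure tolerance.
                merged[key] = min(merged.get(key, value), value)
            else:
                # Last writer wins for options without an obvious ordering.
                merged[key] = value
    return merged
```

Other rules are possible (e.g., rejecting conflicting options outright); the point is only that some explicit merge policy is needed once a stage can contain more than one barrier.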
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593156#comment-16593156 ] Dongjoon Hyun commented on SPARK-25175: --- Thanks, [~yucai]. I'm highly interested in this case. I'll wait for his reopening. :)
[jira] [Resolved] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25175. --- Resolution: Cannot Reproduce I followed the same direction given by SPARK-25132, but I cannot reproduce this in Spark 2.3.1. {code} scala> spark.version res8: String = 2.3.1 scala> spark.range(5).toDF.write.mode("overwrite").format("orc").saveAsTable("t3") scala> sql("create table t4 (`ID` BIGINT) USING orc LOCATION '/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t3'") scala> sql("select * from t3").show +---+ | id| +---+ | 2| | 3| | 4| | 1| | 0| +---+ scala> sql("select * from t4").show +---+ | ID| +---+ | 2| | 3| | 4| | 1| | 0| +---+ {code} Please reopen this with a reproducible example. Thanks, [~seancxmao].
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593152#comment-16593152 ] yucai commented on SPARK-25175: --- I pinged [~seancxmao] offline; he will give more details.
[jira] [Updated] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: wrong records are returned when Hive metastore schema and parquet schema are in different letter cases (was: data issue when Hive metastore schema and parquet schema are in different letter cases) > wrong records are returned when Hive metastore schema and parquet schema are > in different letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to different letter > cases between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) to parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false.
> SPARK-25132 addressed this issue already. > > The biggest difference is that in Spark 2.1 the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > so they will know about the issue and fix the query. > But in Spark 2.3, the user silently gets wrong results. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
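The pushdown half of the bug can be modeled with a toy sketch. This is not Parquet's real reader code; it is a minimal Python model under the assumption that a pushed filter naming a non-existent physical column matches nothing (the silent Spark 2.3 behavior described above, whereas Spark 2.1 instead raised IllegalArgumentException).

```python
def pushed_filter_rows(rows, column, predicate):
    """Simulate a reader-level pushed filter: a row survives only if the
    named physical column exists in the row and satisfies the predicate."""
    return [r for r in rows if column in r and predicate(r[column])]

# The parquet file physically stores lower-case 'id' (case sensitive).
rows = [{"id": i} for i in range(10)]

# Pushing the metastore-cased name 'ID': no physical column matches,
# so every row is filtered out -> silently empty result.
assert pushed_filter_rows(rows, "ID", lambda v: v > 0) == []

# SPARK-24716-style fix: resolve the filter against the file schema
# first, then push the physical name 'id'.
assert len(pushed_filter_rows(rows, "id", lambda v: v > 0)) == 9
```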
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter cases between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 addressed this issue already. The biggest difference is, in Spark 2.1, user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} So they will know the issue and fix the query. But in Spark 2.3, user will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. 
{code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue when Hive metastore schema and parquet schema are in different > letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to different letter > cases between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) to parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 addressed this issue already. > > The biggest diff
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. 
Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue when Hive metastore schema and parquet schema are in different > letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter case > between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. 
> Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) to parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 solved this issue. > > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID]
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema are in different letter cases
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when Hive metastore schema and parquet schema are in different letter cases (was: data issue when Hive metastore schema and parquet schema have different letter case) > data issue when Hive metastore schema and parquet schema are in different > letter cases > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter case > between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. 
Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
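The two fixes named above work together: first resolve the metastore column against the file schema case-insensitively (the SPARK-25132 side), then push the filter using the resolved physical name (the SPARK-24716 side). A minimal standalone sketch of that combined flow follows; the function names `physical_name` and `scan` are invented for illustration and do not correspond to actual Spark internals.

```python
def physical_name(meta_name, file_fields):
    """Case-insensitively map a metastore column to its physical file
    field (SPARK-25132-style); None when absent or ambiguous."""
    matches = [f for f in file_fields if f.lower() == meta_name.lower()]
    return matches[0] if len(matches) == 1 else None

def scan(rows, file_fields, meta_col, predicate):
    """Sketch of a corrected scan: resolve the column first, then apply
    the pushed filter using the physical name (SPARK-24716-style)."""
    col = physical_name(meta_col, file_fields)
    if col is None:
        return [None for _ in rows]  # unmatched column reads back as NULL
    return [r[col] for r in rows if predicate(r[col])]

# Metastore schema says ID; the parquet file physically stores id.
rows = [{"id": i} for i in range(10)]
assert scan(rows, ["id"], "ID", lambda v: v > 0) == list(range(1, 10))
```

Without the resolution step, the same query returns an empty result (wrong-case pushdown) or all-NULL values (wrong-case read), which is the silent-corruption behavior this issue reports.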
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when Hive metastore schema and parquet schema have different letter case (was: data issue when ) > data issue when Hive metastore schema and parquet schema have different > letter case > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter case > between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. 
Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport has been tracked in its own JIRA. > We use this JIRA to track the backport of SPARK-24716. > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue when Hive metastore schema and parquet schema have different letter case
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. 
Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- -SPARK-25132-'s backport has been track in its jira. Use this Jira to track the backport of SPARK-24716, [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue when Hive metastore schema and parquet schema have different > letter case > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to different letter cases > between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) to parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark
[jira] [Updated] (SPARK-25206) data issue when
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue when (was: data issue because wrong column is pushdown for parquet) > data issue when > > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter case > between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even spark.sql.caseSensitive > set to false. 
> SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport has been tracked in its own JIRA. > We use this JIRA to track the backport of SPARK-24716. > > [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, even spark.sql.caseSensitive set to false. SPARK-25132 solved this issue. To make the above query work, we need both SPARK-25132 and -SPARK-24716.- -SPARK-25132-'s backport has been track in its jira. Use this Jira to track the backport of SPARK-24716, [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. 
{code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After a deep dive, we found two issues, both related to different letter cases > between the Hive metastore schema and the parquet schema. > 1. The wrong column is pushed down. > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) to parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. Spark SQL returns NULL for a column whose Hive metastore schema and > Parquet schema are in different letter cases, even when spark.sql.caseSensitive > is set to false. > SPARK-25132 solved this issue. > > To make the above query work, we need both SPARK-25132 and -SPARK-24716.- > > -SPARK-25132-'s backport has been tracked in i
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* After deep dive, it has two issues, both are related to different letter case between Hive metastore schema and parquet schema. 1. Wrong column is pushdown. Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. 2. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* It has two issues. 1. Wrong column Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. 
In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > After deep dive, it has two issues, both are related to different letter case > between Hive metastore schema and parquet schema. > 1. Wrong column is pushdown. > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. 
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > 2. > > > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
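The root-cause discussion above boils down to resolving a filter's column name against the physical Parquet schema case-insensitively before building the pushed-down predicate, so that "id" (the name that really exists in the file) is pushed instead of "ID". The sketch below is illustrative only, not Spark's actual implementation; the function name and shapes are invented for the example. It also shows why an ambiguous match (both "a" and "A" in the file) should not be pushed down:

```python
# Illustrative sketch (not Spark's code): resolve a filter's column name
# against the physical Parquet field names, case-insensitively, and refuse
# to push down when the name is absent or ambiguous.

def resolve_field(name, parquet_fields, case_sensitive=False):
    """Return the physical field name to push down, or None if absent/ambiguous."""
    if case_sensitive:
        return name if name in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == name.lower()]
    if len(matches) == 1:
        return matches[0]
    return None  # absent, or ambiguous (e.g. both 'a' and 'A' exist)

# The metastore says the column is 'ID', but the Parquet file stores 'id'.
print(repr(resolve_field("ID", ["id"])))        # push down 'id', not 'ID'
print(repr(resolve_field("ID", ["id"], True)))  # case-sensitive mode: no match
print(repr(resolve_field("a", ["a", "A"])))     # ambiguous: don't push down
```

With such a resolution step, the pushed predicate would reference the file's own spelling of the column, avoiding the silently empty result described above.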
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* It has two issues. 1. Wrong column Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. 
Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > It has two issues. > 1. Wrong column > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593126#comment-16593126 ] yucai edited comment on SPARK-25206 at 8/27/18 2:27 AM: [~dongjoon], because of the root cause below {quote}Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). {quote} I changed the title to emphasize that the wrong column is pushed down: "id" should be pushed down instead of "ID". Feel free to let me know if you have any concerns. This issue exists only in 2.3; master is different. was (Author: yucai): [~dongjoon], because of the root cause below {quote}Spark pushes FilterApi.gt(intColumn("ID"), 0: Integer) down into Parquet, but ID does not exist in /tmp/data (Parquet is case sensitive; it actually has id). {quote} I changed the title to emphasize that the wrong column is pushed down: "id" should be pushed down instead of "ID". Feel free to let me know if you have any concerns. > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In the current Spark 2.3.1, the query below returns wrong data silently.
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) down into Parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593128#comment-16593128 ] Dongjoon Hyun commented on SPARK-25207: --- [~yucai]. My bad. Please ignore that. It was based on the old one. With the latest master branch, I found that the issue is a more general regression. Please see [the above comment|https://issues.apache.org/jira/browse/SPARK-25207?focusedCommentId=16593108&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16593108] and [the GitHub comment|https://github.com/apache/spark/pull/22197#issuecomment-416085556], and update both the GitHub PR and the Apache JIRA description as you want. > Case-insensitive field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > Attachments: image.png > > > Currently, filter pushdown will not work if the Parquet schema and the Hive metastore > schema are in different letter cases, even when spark.sql.caseSensitive is false. > Like the case below: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show// Parquet returns NULL for `ID` > because it has `id`.
> ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
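The `NULL` behaviour quoted above follows from SQL's three-valued logic: any comparison involving NULL yields NULL (unknown), and a WHERE clause keeps only rows whose predicate is strictly TRUE. So once the mismatched column reads back as all NULLs, `id > 0` filters out every row. A minimal model of this, using Python's None to stand in for SQL NULL (the helper below is a hypothetical illustration, not a Spark API):

```python
# Model SQL's three-valued '>' comparison: NULL compared with anything is
# NULL (unknown), never True or False.

def sql_gt(value, threshold):
    if value is None:
        return None          # NULL > x  ->  NULL, not False
    return value > threshold

# The all-NULL 'ID' column read back from the Parquet file.
rows = [None] * 10

# WHERE keeps a row only when the predicate is strictly TRUE.
kept = [r for r in rows if sql_gt(r, 0) is True]
print(len(kept))             # every 'NULL > 0' row is filtered out
```

This is why the digest shows ten `null` rows for `select * from t` but an empty result for `select * from t where id > 0`.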
[jira] [Commented] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593126#comment-16593126 ] yucai commented on SPARK-25206: --- [~dongjoon], because of the below root cause {quote}Spark pushdowns FilterApi.gt(intColumn("ID"), 0: Integer) into parquet, but ID does not exist in /tmp/data (parquet is case sensitive, it has id actually). {quote} I changed the title to emphasize wrong column is pushdown: "id" should be pushdown instead of "ID". Feel free to let me know if you have any concern. > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. 
> > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Description: In current Spark 2.3.1, below query returns wrong data silently. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? was: In current Spark 2.3.1, below query returns wrong data silently. 
{code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t").show ++ | ID| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ scala> sql("set spark.sql.parquet.filterPushdown").show ++-+ | key|value| ++-+ |spark.sql.parquet...| true| ++-+ scala> sql("set spark.sql.parquet.filterPushdown=false").show ++-+ | key|value| ++-+ |spark.sql.parquet...|false| ++-+ scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} *Root Cause* Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: Integer) into parquet, but {color:#ff}ID{color} does not exist in /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} actually). So no records are returned. In Spark 2.1, the user will get Exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} But in Spark 2.3, they will get the wrong results sliently. Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema to do the pushdown, perfect for this issue. [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) down into Parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values
[ https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-25221: Target Version/s: (was: 2.3.2, 2.4.0) > [DEPLOY] Consistent trailing whitespace treatment of conf values > > > Key: SPARK-25221 > URL: https://issues.apache.org/jira/browse/SPARK-25221 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: Gera Shegalov >Priority: Major > > When you use a custom line delimiter > {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a > trailing whitespace character it's only possible when specified via > {{--conf}} . Our pipeline consists of a highly customized generated jobs. > Storing all the config in a properities file is not only better for > readability but even necessary to avoid dealing with {{ARGS_MAX}} on > different OS. Spark should uniformly avoid trimming conf values in both > cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values
[ https://issues.apache.org/jira/browse/SPARK-25221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593124#comment-16593124 ] Saisai Shao commented on SPARK-25221: - I'm going to remove the target version. I don't think this is a critical/blocker issue; committers will set the proper fix version when it is merged. > [DEPLOY] Consistent trailing whitespace treatment of conf values > > > Key: SPARK-25221 > URL: https://issues.apache.org/jira/browse/SPARK-25221 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.1 >Reporter: Gera Shegalov >Priority: Major > > A custom line delimiter > {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or a > trailing whitespace character can currently only be specified via > {{--conf}}. Our pipeline consists of highly customized generated jobs. > Storing all the config in a properties file is not only better for > readability but even necessary to avoid dealing with {{ARGS_MAX}} on > different OSes. Spark should uniformly avoid trimming conf values in both > cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
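For context on the asymmetry reported in SPARK-25221: Java-style properties loaders commonly trim whitespace around values, so a record delimiter with leading or trailing whitespace survives only when passed programmatically (as with --conf). The toy parser below is a deliberately naive illustration of that failure mode, not Spark's or Java's actual properties loader:

```python
# A naive properties parser that strips both key and value, silently losing
# edge whitespace that is significant for a record delimiter.

def load_properties_naive(text):
    conf = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            conf[key.strip()] = value.strip()   # <- trailing space lost here
    return conf

props = "spark.hadoop.textinputformat.record.delimiter=|end| \n"
from_file = load_properties_naive(props)["spark.hadoop.textinputformat.record.delimiter"]
from_conf = "|end| "   # as if the same value were passed via --conf

print(repr(from_file))  # trailing space silently dropped
print(repr(from_conf))  # preserved
```

The issue's ask is exactly that both paths treat the value identically, so a delimiter like `"|end| "` means the same thing whether it comes from a properties file or from the command line.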
[jira] [Updated] (SPARK-25206) data issue because wrong column is pushdown for parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-25206: -- Summary: data issue because wrong column is pushdown for parquet (was: Wrong data may be returned for Parquet) > data issue because wrong column is pushdown for parquet > --- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. 
> In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593117#comment-16593117 ] Hyukjin Kwon commented on SPARK-25206: -- [~yucai], mind fixing the JIRA title? > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. 
> In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they silently get wrong results. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593110#comment-16593110 ] yucai commented on SPARK-25207: --- [~dongjoon], sorry if I am confusing you. This bug was created for the master branch, because it already has SPARK-25132 and -SPARK-24716-. So it does not actually have the issue below. {code:java} scala> sql("select * from t").show// Parquet returns NULL for `ID` because it has `id`. ++ | ID| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. +---+ | ID| +---+ +---+ {code} > Case-insensitive field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > Attachments: image.png > > > Currently, filter pushdown will not work if the Parquet schema and the Hive metastore > schema are in different letter cases, even when spark.sql.caseSensitive is false. > Like the case below: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show// Parquet returns NULL for `ID` > because it has `id`.
> ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593108#comment-16593108 ] Dongjoon Hyun commented on SPARK-25207: --- According to the PR, this seems to be a new regression introduced in Spark 2.4. It's not specific to the schema-mismatch case. For example, in the following schema-matched case, the input size is less than or equal to 8.0 MB in Spark 2.3.1, but the current master seems to show the following. {code} spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024) spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t") sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'") // It should be less than or equal to 8 MB. sql("select * from t where id < 100L").show() // It's already less than or equal to 8 MB sql("select * from t where id < 100L").write.mode("overwrite").csv("/tmp/id") {code} !image.png! > Case-insensitive field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > Attachments: image.png > > > Currently, filter pushdown will not work if the Parquet schema and the Hive metastore > schema are in different letter cases, even when spark.sql.caseSensitive is false.
> Like the below case: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show// Parquet returns NULL for `ID` > because it has `id`. > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
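The input-size observation in the comment above relies on how a pushed-down filter lets a Parquet reader skip whole row groups using their per-group min/max column statistics; when no filter is pushed, every row group must be read. The following is a rough conceptual sketch of that pruning, assuming sorted ids as in the comment's write pattern; it models the idea only and is not the actual Parquet reader (the data layout is invented for the example):

```python
# Model row-group pruning: each row group carries min/max statistics, and a
# reader can skip any group whose value range cannot satisfy 'id < bound'.

def prune_row_groups(groups, upper_bound):
    """Keep only row groups that may contain rows with id < upper_bound."""
    return [g for g in groups if g["min"] < upper_bound]

# Ten row groups, each covering a sorted slice of ids (as with a sorted write
# and a small parquet.block.size).
groups = [{"min": lo, "max": lo + 999} for lo in range(0, 10_000, 1_000)]

to_read = prune_row_groups(groups, upper_bound=100)
print(len(to_read))   # only the first group can match 'id < 100'
```

When the filter is not pushed down (the regression described above), the pruning step never runs, so all groups are scanned and the reported input size grows accordingly.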
[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25207: -- Attachment: image.png > Case-insensitve field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > Attachments: image.png > > > Currently, filter pushdown will not work if Parquet schema and Hive metastore > schema are in different letter cases even spark.sql.caseSensitive is false. > Like the below case: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show// Parquet returns NULL for `ID` > because it has `id`. > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593102#comment-16593102 ] yucai commented on SPARK-25206: --- I am OK with "known correctness bug in 2.3" way, just raise some concern in my previous post. > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. 
> In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which addresses this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
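The root cause above can be illustrated outside Spark: when a pushed-down predicate references a column name that is absent from the file (because of the case mismatch), every value it sees is effectively NULL, and `NULL > 0` is not true, so all rows are silently dropped. A toy simulation of that behavior (the helper name `eval_pushed_filter` is illustrative, not Parquet's API):

```python
import operator

# The Parquet file has a lower-case "id" column with values 0..9.
rows = [{"id": i} for i in range(10)]

def eval_pushed_filter(row, column, op, value):
    # A column that does not exist in the file behaves like an all-NULL column.
    cell = row.get(column)          # "ID" is not found -> None
    if cell is None:
        return False                # three-valued logic: NULL > 0 is not true
    return op(cell, value)

# Metastore-cased "ID" matches nothing: every row is filtered out, silently.
print(len([r for r in rows if eval_pushed_filter(r, "ID", operator.gt, 0)]))  # -> 0

# File-cased "id" behaves as expected: rows 1..9 survive.
kept = [r for r in rows if eval_pushed_filter(r, "id", operator.gt, 0)]
print(len(kept))  # -> 9
```

This is why disabling `spark.sql.parquet.filterPushdown` does not help in the reproduction above: the all-NULL column is still read back for `ID`, and the filter then rejects every NULL at evaluation time.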
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593100#comment-16593100 ] yucai commented on SPARK-25206: --- [~smilegator], sure, I will add tests. If we don't backport SPARK-25132 and SPARK-24716, users will hit the issue below. {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") scala> sql("select * from t where id > 0").show +---+ | ID| +---+ +---+ {code} The biggest difference is that in Spark 2.1 they get an exception: {code:java} Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in schema!{code} so they will know about the issue and fix the query. But in Spark 2.3 they get the wrong results silently, which might go unnoticed. Could that be risky for the user? > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("ID"), 0: > Integer) into Parquet, but ID does not exist in > /tmp/data (Parquet is case sensitive; the file actually has id). > So no records are returned. > In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which addresses this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print
[ https://issues.apache.org/jira/browse/SPARK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593098#comment-16593098 ] holdenk commented on SPARK-25236: - Probably. The only thing is that we would probably want to pass the log-level config from the driver to the executors, but that could be a V2 feature. > Investigate using a logging library inside of PySpark on the workers instead > of print > - > > Key: SPARK-25236 > URL: https://issues.apache.org/jira/browse/SPARK-25236 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We don't have a logging library to use on the workers, which means that it's > difficult for folks to tune the log level on the workers. On the driver > processes we _could_ just call the JVM logging, but on the workers that won't > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25236) Investigate using a logging library inside of PySpark on the workers instead of print
[ https://issues.apache.org/jira/browse/SPARK-25236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593097#comment-16593097 ] Liang-Chi Hsieh commented on SPARK-25236: - hmm, maybe a dumb question: can't we use {{logging}} to do that? > Investigate using a logging library inside of PySpark on the workers instead > of print > - > > Key: SPARK-25236 > URL: https://issues.apache.org/jira/browse/SPARK-25236 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > We don't have a logging library to use on the workers, which means that it's > difficult for folks to tune the log level on the workers. On the driver > processes we _could_ just call the JVM logging, but on the workers that won't > work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
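The {{logging}} mentioned above is Python's standard library module, which would work inside worker functions today. A minimal sketch of what that could look like; note that the configuration step runs per worker process, and there is no built-in way to propagate the driver's log level (the function name `configure_worker_logging` and the level choice are illustrative assumptions, not an existing PySpark API):

```python
import logging

def configure_worker_logging(level=logging.INFO):
    # Each Python worker process configures its own logger; propagating the
    # driver's log level to executors is the open question in this ticket.
    logging.basicConfig(level=level,
                        format="%(levelname)s %(name)s: %(message)s")
    return logging.getLogger("pyspark.worker")

log = configure_worker_logging()

def process_partition(rows):
    log.debug("processing %d rows", len(rows))  # suppressed at INFO level
    out = [r * 2 for r in rows]
    log.info("emitted %d rows", len(out))       # shown at INFO level
    return out

print(process_partition([1, 2, 3]))  # -> [2, 4, 6]
```

Compared with bare `print`, this lets users silence or redirect worker diagnostics by adjusting the logger level and handlers instead of editing code.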
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593096#comment-16593096 ] Wenchen Fan commented on SPARK-25206: - I'm fine with marking it as a known correctness bug in Spark 2.2 and 2.3. Shall we put it in the release notes of Spark 2.3.2? cc [~jerryshao] > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("ID"), 0: > Integer) into Parquet, but ID does not exist in > /tmp/data (Parquet is case sensitive; the file actually has id). > So no records are returned. 
> In Spark 2.1, the user gets an exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which addresses this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593085#comment-16593085 ] Michail Giannakopoulos edited comment on SPARK-24826 at 8/27/18 12:53 AM: -- [~dongjoon] I will try to repro and let you know... was (Author: miccagiann): [~dongjoon] I will and let you know... > Self-Join not working in Apache Spark 2.2.2 > --- > > Key: SPARK-24826 > URL: https://issues.apache.org/jira/browse/SPARK-24826 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.2 >Reporter: Michail Giannakopoulos >Priority: Major > Attachments: > part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet > > > Running a self-join against a table derived from a parquet file with many > columns fails during the planning phase with the following stack-trace: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), > coordinator[target post-shuffle partition size: 67108864] > +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, > funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, > emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, > verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, > desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, > member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, > int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, > emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, > issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, > title#22, zip_code#23, ... 
92 more fields] > +- Filter isnotnull(_row_id#0L) > +- FileScan parquet > [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more > fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., > PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: > struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... 
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.e
[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593085#comment-16593085 ] Michail Giannakopoulos commented on SPARK-24826: [~dongjoon] I will and let you know... > Self-Join not working in Apache Spark 2.2.2 > --- > > Key: SPARK-24826 > URL: https://issues.apache.org/jira/browse/SPARK-24826 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.2 >Reporter: Michail Giannakopoulos >Priority: Major > Attachments: > part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet > > > Running a self-join against a table derived from a parquet file with many > columns fails during the planning phase with the following stack-trace: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), > coordinator[target post-shuffle partition size: 67108864] > +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, > funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, > emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, > verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, > desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, > member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, > int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, > emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, > issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, > title#22, zip_code#23, ... 
92 more fields] > +- Filter isnotnull(_row_id#0L) > +- FileScan parquet > [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more > fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., > PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: > struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... 
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73) > at >
[jira] [Commented] (SPARK-19355) Use map output statistics to improve global limit's parallelism
[ https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593080#comment-16593080 ] Apache Spark commented on SPARK-19355: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22239 > Use map output statistics to improve global limit's parallelism > > > Key: SPARK-19355 > URL: https://issues.apache.org/jira/browse/SPARK-19355 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > A logical Limit is actually performed by two physical operations, LocalLimit > and GlobalLimit. > Most of the time, before GlobalLimit, we perform a shuffle exchange to > move all data to a single partition. When the limit is very large, we > shuffle a lot of data to a single partition and significantly reduce > parallelism, on top of the cost of the shuffle itself. > This change tries to perform GlobalLimit without shuffling data to a single > partition. Instead, we perform only the map stage of the shuffle and collect > statistics on the number of rows in each partition. Shuffled data are > all retrieved locally, not from remote executors. > Once we know the number of output rows in each partition, we take only the > required number of rows from the locally shuffled data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
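The final step of the scheme described above reduces to simple arithmetic: given the per-partition row counts collected from the map stage, decide how many rows each partition should contribute so the total equals the limit. A simplified sketch (this sequential fill is an illustration of the idea, not Spark's actual implementation):

```python
def rows_to_take(partition_counts, limit):
    """How many rows each partition contributes to a global LIMIT,
    scanning partitions in order until the limit is satisfied."""
    takes = []
    remaining = limit
    for count in partition_counts:
        take = min(count, remaining)   # never take more than the partition has
        takes.append(take)
        remaining -= take
    return takes

# 4 partitions holding 30/0/50/20 rows, with LIMIT 70:
print(rows_to_take([30, 0, 50, 20], 70))   # -> [30, 0, 40, 0]
```

Because each partition already holds its shuffled rows locally, every partition can then slice out its own quota in parallel, instead of funneling all rows through a single partition.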
[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25207: -- Description: Currently, filter pushdown will not work if Parquet schema and Hive metastore schema are in different letter cases even spark.sql.caseSensitive is false. Like the below case: {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") sql("select * from t where id > 0").show{code} -No filter will be pushed down.- {code} scala> sql("select * from t where id > 0").explain // Filters are pushed with `ID` == Physical Plan == *(1) Project [ID#90L] +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) +- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: struct scala> sql("select * from t").show// Parquet returns NULL for `ID` because it has `id`. ++ | ID| ++ |null| |null| |null| |null| |null| |null| |null| |null| |null| |null| ++ scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. +---+ | ID| +---+ +---+ {code} was: Currently, filter pushdown will not work if Parquet schema and Hive metastore schema are in different letter cases even spark.sql.caseSensitive is false. Like the below case: {code:java} spark.range(10).write.parquet("/tmp/data") sql("DROP TABLE t") sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") sql("select * from t where id > 0").show{code} No filter will be pushed down. 
> Case-insensitve field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > > Currently, filter pushdown will not work if Parquet schema and Hive metastore > schema are in different letter cases even spark.sql.caseSensitive is false. > Like the below case: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > -No filter will be pushed down.- > {code} > scala> sql("select * from t where id > 0").explain // Filters are pushed > with `ID` > == Physical Plan == > *(1) Project [ID#90L] > +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0)) >+- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], > PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: > struct > scala> sql("select * from t").show// Parquet returns NULL for `ID` > because it has `id`. > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show // `NULL > 0` is `false`. > +---+ > | ID| > +---+ > +---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24766) CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column stats in parquet
[ https://issues.apache.org/jira/browse/SPARK-24766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24766: -- Labels: Parquet (was: ) > CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column > stats in parquet > > > Key: SPARK-24766 > URL: https://issues.apache.org/jira/browse/SPARK-24766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > Labels: Parquet > > How to reproduce: > {code:java} > INSERT OVERWRITE LOCAL DIRECTORY '/tmp/spark/parquet/dir' STORED AS parquet > select cast(1 as decimal) as decimal1; > {code} > {code:java} > create table test_parquet stored as parquet as select cast(1 as decimal) as > decimal1; > {code} > {noformat} > $ java -jar ./parquet-tools/target/parquet-tools-1.10.1-SNAPSHOT.jar meta > file:/tmp/spark/parquet/dir/part-0-cb96a617-4759-4b21-a222-2153ca0e8951-c000 > file: > file:/tmp/spark/parquet/dir/part-0-cb96a617-4759-4b21-a222-2153ca0e8951-c000 > creator: parquet-mr version 1.6.0 (build > 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) > file schema: hive_schema > > decimal1: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:1 TS:46 OFFSET:4 > > decimal1: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:48/46/0.96 VC:1 > ENC:BIT_PACKED,PLAIN,RLE ST:[no stats for this column] > {noformat} > This happens because Spark still uses com.twitter.parquet-hadoop-bundle.1.6.0. > Maybe we should refactor {{CreateHiveTableAsSelectCommand}} and > {{InsertIntoHiveDirCommand}}, or [upgrade the built-in > Hive|https://issues.apache.org/jira/browse/SPARK-23710]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2
[ https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593067#comment-16593067 ] Dongjoon Hyun commented on SPARK-24826: --- Hi, [~miccagiann]. Could you try that in Apache Spark 2.3.1? > Self-Join not working in Apache Spark 2.2.2 > --- > > Key: SPARK-24826 > URL: https://issues.apache.org/jira/browse/SPARK-24826 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.2 >Reporter: Michail Giannakopoulos >Priority: Major > Attachments: > part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet > > > Running a self-join against a table derived from a parquet file with many > columns fails during the planning phase with the following stack-trace: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), > coordinator[target post-shuffle partition size: 67108864] > +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, > funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, > emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, > verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, > desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, > member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, > int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, > emp_length#12, home_ownership#13, annual_inc#14, verification_status#15, > issue_d#16, loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, > title#22, zip_code#23, ... 
92 more fields] > +- Filter isnotnull(_row_id#0L) > +- FileScan parquet > [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more > fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,... > 92 more fields] Batched: false, Format: Parquet, Location: > InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-..., > PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: > struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_... 
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) > at > 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73) > a
[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25132: -- Labels: Parquet (was: ) > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25135: -- Labels: Parquet correctness (was: correctness) > insert datasource table may all null when select from view on parquet > - > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: Parquet, correctness > > This happens on parquet. > How to reproduce in parquet. > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code} > FYI, the following is orc. > {code} > scala> val path = "/tmp/spark/orc" > scala> val cnt = 30 > scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as > bigint) as col2").write.mode("overwrite").orc(path) > scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc > location '$path'") > scala> spark.sql("create view view1 as select col1, col2 from table1 where > col1 > -20") > scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") > scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > scala> spark.table("table2").show > +++ > |COL1|COL2| > +++ > | 15| 15| > | 16| 16| > | 17| 17| > ... 
> {code}
[jira] [Updated] (SPARK-25207) Case-insensitive field resolution for filter pushdown when reading Parquet
[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25207: -- Labels: Parquet (was: ) > Case-insensitive field resolution for filter pushdown when reading Parquet > - > > Key: SPARK-25207 > URL: https://issues.apache.org/jira/browse/SPARK-25207 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: yucai >Priority: Major > Labels: Parquet > > Currently, filter pushdown will not work if the Parquet schema and the Hive metastore > schema are in different letter cases, even when spark.sql.caseSensitive is false. > For example: > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > sql("select * from t where id > 0").show{code} > No filter will be pushed down.
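The fix amounts to translating the filter's column name (taken from the case-preserving metastore schema) into the Parquet file's own field name before building the pushed-down predicate. A hedged sketch of that mapping step — a hypothetical helper, not the actual patch:

```python
# Hypothetical sketch: map a pushed-down filter's column name onto the
# physical Parquet field name case-insensitively; returning None means
# "skip the pushdown" (field missing or ambiguous in the file schema).
def pushdown_column(parquet_fields, filter_column):
    matches = [f for f in parquet_fields if f.lower() == filter_column.lower()]
    return matches[0] if len(matches) == 1 else None
```

For the example above, `pushdown_column(["id"], "ID")` yields `id`, so a predicate on the physical `id` field can be pushed into the Parquet reader instead of being dropped.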
[jira] [Updated] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25206: -- Labels: Parquet correctness (was: correctness) > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: Parquet, correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. 
> In Spark 2.1, the user will get an Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which fixes this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593066#comment-16593066 ] Dongjoon Hyun commented on SPARK-25135: --- [~yumwang]. Could you update your PR according to this JIRA title? We need to be specific. > insert datasource table may all null when select from view on parquet > - > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: correctness > > This happens on parquet. > How to reproduce in parquet. > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code} > FYI, the following is orc. > {code} > scala> val path = "/tmp/spark/orc" > scala> val cnt = 30 > scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as > bigint) as col2").write.mode("overwrite").orc(path) > scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc > location '$path'") > scala> spark.sql("create view view1 as select col1, col2 from table1 where > col1 > -20") > scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") > scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > scala> spark.table("table2").show > +++ > |COL1|COL2| > +++ > | 15| 15| > | 16| 16| > | 17| 17| > ... 
> {code}
[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25135: -- Description: This happens on parquet. How to reproduce in parquet. {code:scala} val path = "/tmp/spark/parquet" val cnt = 30 spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path) spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'") spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") spark.sql("insert overwrite table table2 select COL1, COL2 from view1") spark.table("table2").show {code} FYI, the following is orc. {code} scala> val path = "/tmp/spark/orc" scala> val cnt = 30 scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path) scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'") scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") scala> spark.table("table2").show +++ |COL1|COL2| +++ | 15| 15| | 16| 16| | 17| 17| ... {code} was: How to reproduce: {code:scala} val path = "/tmp/spark/parquet" val cnt = 30 spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path) spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'") spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") spark.sql("insert overwrite table table2 select COL1, COL2 from view1") spark.table("table2").show {code} This happens on parquet. 
{code} scala> val path = "/tmp/spark/orc" scala> val cnt = 30 scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path) scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'") scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") scala> spark.table("table2").show +++ |COL1|COL2| +++ | 15| 15| | 16| 16| | 17| 17| ... {code} > insert datasource table may all null when select from view on parquet > - > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: correctness > > This happens on parquet. > How to reproduce in parquet. > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code} > FYI, the following is orc. 
> {code} > scala> val path = "/tmp/spark/orc" > scala> val cnt = 30 > scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as > bigint) as col2").write.mode("overwrite").orc(path) > scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc > location '$path'") > scala> spark.sql("create view view1 as select col1, col2 from table1 where > col1 > -20") > scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") > scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > scala> spark.table("table2").show > +++ > |COL1|COL2| > +++ > | 15| 15| > | 16| 16| > | 17| 17| > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25135: -- Description: How to reproduce: {code:scala} val path = "/tmp/spark/parquet" val cnt = 30 spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path) spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'") spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") spark.sql("insert overwrite table table2 select COL1, COL2 from view1") spark.table("table2").show {code} This happens on parquet. {code} scala> val path = "/tmp/spark/orc" scala> val cnt = 30 scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path) scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'") scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") scala> spark.table("table2").show +++ |COL1|COL2| +++ | 15| 15| | 16| 16| | 17| 17| ... 
{code} was: How to reproduce: {code:scala} val path = "/tmp/spark/parquet" val cnt = 30 spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path) spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'") spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20") spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") spark.sql("insert overwrite table table2 select COL1, COL2 from view1") spark.table("table2").show {code} > insert datasource table may all null when select from view on parquet > - > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: correctness > > How to reproduce: > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code} > This happens on parquet. 
> {code} > scala> val path = "/tmp/spark/orc" > scala> val cnt = 30 > scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as > bigint) as col2").write.mode("overwrite").orc(path) > scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc > location '$path'") > scala> spark.sql("create view view1 as select col1, col2 from table1 where > col1 > -20") > scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc") > scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > scala> spark.table("table2").show > +++ > |COL1|COL2| > +++ > | 15| 15| > | 16| 16| > | 17| 17| > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25135) insert datasource table may all null when select from view on parquet
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25135: -- Summary: insert datasource table may all null when select from view on parquet (was: insert datasource table may all null when select from view) > insert datasource table may all null when select from view on parquet > - > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: correctness > > How to reproduce: > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593061#comment-16593061 ] Dongjoon Hyun commented on SPARK-25091: --- Hi, [~Chao Fang]. Could you remove `Spark Thrift Server: ` from the title if you see this in the `pyspark` shell as you reported? bq. Similar behavior when using pyspark df.unpersist(). > Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor > memory > > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > > UNCACHE TABLE and CLEAR CACHE do not clean up executor memory. > In the Spark UI, the Storage tab shows the cached table removed, but the > executors continue to hold the RDD and the memory is not cleared. This wastes a > large amount of executor memory. When we then call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist().
[jira] [Comment Edited] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593041#comment-16593041 ] Xiao Li edited comment on SPARK-25206 at 8/26/18 10:45 PM: --- Currently, we do not have a good test coverage when the physical schema and logical schema use difference cases. Thus, any new change could introduce new behavior changes or bugs. Thus, the first step is to add the tests first. [~yucai] Could you help this effort? Merging Parquet filter refactoring is kind of breaking our backport rule. Maybe we do not need to claim we support this scenario before Spark 2.4? was (Author: smilegator): Previously, we do not have a good test coverage when the physical schema and logical schema use difference cases. Thus, any new change could introduce new behavior changes or bugs. Thus, the first step is to add the tests first. [~yucai] Could you help this effort? Merging Parquet filter refactoring is kind of breaking our backport rule. Maybe we do not need to claim we support this scenario before Spark 2.4? > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593041#comment-16593041 ] Xiao Li commented on SPARK-25206: - Previously, we did not have good test coverage for the case where the physical schema and logical schema use different cases, so any new change could introduce behavior changes or bugs. The first step, then, is to add tests. [~yucai] Could you help with this effort? Merging the Parquet filter refactoring would somewhat break our backport rule. Maybe we do not need to claim we support this scenario before Spark 2.4? > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently.
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushdowns FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results sliently. > > Since SPARK-24716, Spark uses Parquet schema instead of Hive metastore schema > to do the pushdown, perfect for this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
[ https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593040#comment-16593040 ] shahid commented on SPARK-25246: I am working on it :) > When the spark.eventLog.compress is enabled, the Application is not showing > in the History server UI ('incomplete application' page), initially. > > > Key: SPARK-25246 > URL: https://issues.apache.org/jira/browse/SPARK-25246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: shahid >Priority: Major > > 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true" > 2) hdfs dfs -ls /spark-logs > {code:java} > -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 > /spark-logs/application_1535313809919_0005.lz4.inprogress > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.
shahid created SPARK-25246: -- Summary: When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially. Key: SPARK-25246 URL: https://issues.apache.org/jira/browse/SPARK-25246 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: shahid 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true" 2) hdfs dfs -ls /spark-logs {code:java} -rwxrwx--- 1 root supergroup *0* 2018-08-27 03:26 /spark-logs/application_1535313809919_0005.lz4.inprogress {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
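One plausible reason the `.lz4.inprogress` file shows 0 bytes is that a compressing output stream buffers data internally until it is flushed, so the History Server's size check sees nothing for a while. A small illustration using Python's gzip module standing in for LZ4 (which needs a third-party binding); the buffering effect is the point, not the codec:

```python
import gzip
import os
import tempfile

# A compressing stream holds data in internal buffers, so the on-disk file
# can lag far behind what the writer has already logged until flush/close.
path = os.path.join(tempfile.mkdtemp(), "eventlog.gz")
writer = gzip.open(path, "wb")
writer.write(b'{"Event":"SparkListenerApplicationStart"}\n')
size_while_writing = os.path.getsize(path)  # likely still (near) zero
writer.close()
size_after_close = os.path.getsize(path)    # header, data, and trailer flushed
```

If this is indeed the cause, the fix would involve flushing the compressed stream (or reporting progress another way) rather than relying on the raw file size.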
[jira] [Assigned] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25245: Assignee: Apache Spark > Explain regarding limiting modification on "spark.sql.shuffle.partitions" for > structured streaming > -- > > Key: SPARK-25245 > URL: https://issues.apache.org/jira/browse/SPARK-25245 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Couple of users wondered why "spark.sql.shuffle.partitions" keeps unchanged > when they changed the config value after running the query. Some of them even > submitted the patch as this behavior as a bug. But it is based on how state > is partitioned and the behavior is intentional. > Looks like it's worth to explain it to guide doc so that no more users will > be wondered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593034#comment-16593034 ] Apache Spark commented on SPARK-25245: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/22238 > Explain regarding limiting modification on "spark.sql.shuffle.partitions" for > structured streaming > -- > > Key: SPARK-25245 > URL: https://issues.apache.org/jira/browse/SPARK-25245 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Couple of users wondered why "spark.sql.shuffle.partitions" keeps unchanged > when they changed the config value after running the query. Some of them even > submitted the patch as this behavior as a bug. But it is based on how state > is partitioned and the behavior is intentional. > Looks like it's worth to explain it to guide doc so that no more users will > be wondered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming
[ https://issues.apache.org/jira/browse/SPARK-25245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25245: Assignee: (was: Apache Spark) > Explain regarding limiting modification on "spark.sql.shuffle.partitions" for > structured streaming > -- > > Key: SPARK-25245 > URL: https://issues.apache.org/jira/browse/SPARK-25245 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Couple of users wondered why "spark.sql.shuffle.partitions" keeps unchanged > when they changed the config value after running the query. Some of them even > submitted the patch as this behavior as a bug. But it is based on how state > is partitioned and the behavior is intentional. > Looks like it's worth to explain it to guide doc so that no more users will > be wondered. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25245) Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming
Jungtaek Lim created SPARK-25245: Summary: Explain regarding limiting modification on "spark.sql.shuffle.partitions" for structured streaming Key: SPARK-25245 URL: https://issues.apache.org/jira/browse/SPARK-25245 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jungtaek Lim A couple of users have wondered why "spark.sql.shuffle.partitions" stays unchanged when they change the config value after running the query. Some even submitted patches treating this behavior as a bug, but it follows from how state is partitioned, and the behavior is intentional. It is worth explaining this in the guide doc so that users are no longer surprised.
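The reason the value is pinned: stateful operators hash-partition their state by key, so the key-to-partition mapping is a function of the partition count, and changing the count between runs would make a restarted query look for existing state in the wrong partitions. A toy illustration with a deterministic hash (not Spark's actual partitioner):

```python
# State is assigned to a partition by hashing the key modulo the partition
# count, so the mapping silently changes if the count changes between runs.
def _string_hash(key: str) -> int:
    # Simple deterministic 32-bit string hash (Java-style), for illustration.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def partition_for(key: str, num_partitions: int) -> int:
    return _string_hash(key) % num_partitions
```

Under this toy hash, a key such as `"spark"` lands in different partitions with 199 versus 200 partitions, which is why a streaming query must keep the setting it started with.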
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593024#comment-16593024 ] Dongjoon Hyun commented on SPARK-25206: --- Hi, [~yucai], [~cloud_fan], [~smilegator], [~hyukjin.kwon]. In Spark 2.4, we are still trying to fix long-lasting Parquet case-sensitivity issues (Spark 2.1.x raises Exceptions and Spark 2.2.x is the same with Spark 2.3.x). Unfortunately, this effort is incomplete and unstable even in Spark 2.4 because we have unmerged one (SPARK-25207) and we may have more future unknown patches. In this case, we had better consider any backporting to `branch-2.3` after Spark 2.4 becomes stable first. We may land them together, not one by one. How do you think about this? Are the current three Spark-2.4-only Parquet patches(SPARK-25132, SPARK-24716, SPARK-25207) considered as a complete set of patches for this? > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, below query returns wrong data silently. 
> {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (parquet is case sensitive, it has {color:#ff}id{color} > actually). > So no records are returned. > In Spark 2.1, the user will get an Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which is perfect for this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
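The silent empty result described in the root cause above can be mimicked in a few lines of plain Python. This is a hedged sketch; `eval_pushed_filter` is a made-up helper, not a Parquet API:

```python
# The pushed-down filter references "ID" from the metastore schema, but the
# file's case-sensitive schema only contains "id", so the predicate never
# matches and every row is dropped without any error being raised.

rows = [{"id": i} for i in range(10)]  # the data as written to /tmp/data
file_schema = {"id"}                   # Parquet field names are case-sensitive

def eval_pushed_filter(row, column, predicate):
    # A column missing from the file schema yields no value, so the
    # predicate cannot be satisfied; nothing is raised.
    if column not in file_schema:
        return False
    return predicate(row[column])

survivors = [r for r in rows if eval_pushed_filter(r, "ID", lambda v: v > 0)]
print(len(survivors))  # 0: a wrong (empty) result, returned silently
```

With the matching-case column name ("id") the same predicate would keep nine of the ten rows, which is why the bug only bites when the metastore and file schemas disagree on case.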
[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC
[ https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593013#comment-16593013 ] Dongjoon Hyun commented on SPARK-25175: --- [~seancxmao]. If there is no example, we can not help you. In that case, we usually close this as `Cannot Reproduce`. > Case-insensitive field resolution when reading from ORC > --- > > Key: SPARK-25175 > URL: https://issues.apache.org/jira/browse/SPARK-25175 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Priority: Major > > SPARK-25132 adds support for case-insensitive field resolution when reading > from Parquet files. We found ORC files have similar issues. Since Spark has 2 > OrcFileFormat, we should add support for both. > * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive > dependency. This hive OrcFileFormat doesn't support case-insensitive field > resolution at all. > * SPARK-20682 adds a new ORC data source inside sql/core. This native > OrcFileFormat supports case-insensitive field resolution, however it cannot > handle duplicate fields. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
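The resolution rule the ticket asks for (match ignoring case, fail on duplicates) can be sketched as follows; `resolve_field` is a hypothetical helper, not Spark's OrcFileFormat code:

```python
# Case-insensitive field resolution with explicit ambiguity handling:
# match a requested column against the file schema ignoring case, and fail
# on duplicates instead of silently picking one candidate.

def resolve_field(requested, file_fields, case_sensitive):
    if case_sensitive:
        return requested if requested in file_fields else None
    matches = [f for f in file_fields if f.lower() == requested.lower()]
    if len(matches) > 1:
        raise ValueError(f"Ambiguous reference '{requested}': matches {matches}")
    return matches[0] if matches else None

# An ORC file schema with duplicate fields differing only in case:
fields = ["a", "B", "c", "C"]
print(resolve_field("b", fields, case_sensitive=False))  # B

try:
    resolve_field("c", fields, case_sensitive=False)
except ValueError as e:
    print(e)  # Ambiguous reference 'c': matches ['c', 'C']
```

The native OrcFileFormat already does the case-insensitive match but, per the description, lacks the duplicate-field check sketched here.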
[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Daitche updated SPARK-25244: -- Description: The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it's ignored and the system's timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would try to come up with a patch. was: The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. 
This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would be happy to contribute a patch. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would try to
[jira] [Created] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
Anton Daitche created SPARK-25244: - Summary: [Python] Setting `spark.sql.session.timeZone` only partially respected Key: SPARK-25244 URL: https://issues.apache.org/jira/browse/SPARK-25244 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: Anton Daitche The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it's ignored and the system's timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would be happy to contribute a patch. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected
[ https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Daitche updated SPARK-25244: -- Description: The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. However, when timestamps are converted directly to Python's `datetime` objects, it's ignored and the system's timezone is used. This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my system's timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySpark's `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would be happy to contribute a patch. was: The setting `spark.sql.session.timeZone` is respected by PySpark when converting from and to Pandas, as described [here|[http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].] However, when timestamps are converted directly to Pythons `datetime` objects, its ignored and the systems timezone is used. 
This can be checked by the following code snippet {code:java} import pyspark.sql spark = (pyspark .sql .SparkSession .builder .master('local[1]') .config("spark.sql.session.timeZone", "UTC") .getOrCreate() ) df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) df = df.withColumn("ts", df["ts"].astype("timestamp")) print(df.toPandas().iloc[0,0]) print(df.collect()[0][0]) {code} Which for me prints (the exact result depends on the timezone of your system, mine is Europe/Berlin) {code:java} 2018-06-01 01:00:00 2018-06-01 03:00:00 {code} Hence, the method `toPandas` respected the timezone setting (UTC), but the method `collect` ignored it and converted the timestamp to my systems timezone. The cause for this behaviour is that the methods `toInternal` and `fromInternal` of PySparks `TimestampType` class don't take into account the setting `spark.sql.session.timeZone` and use the system timezone. If the maintainers agree that this should be fixed, I would be happy to contribute a patch. > [Python] Setting `spark.sql.session.timeZone` only partially respected > -- > > Key: SPARK-25244 > URL: https://issues.apache.org/jira/browse/SPARK-25244 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Anton Daitche >Priority: Major > > The setting `spark.sql.session.timeZone` is respected by PySpark when > converting from and to Pandas, as described > [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics]. > However, when timestamps are converted directly to Pythons `datetime` > objects, its ignored and the systems timezone is used. 
> This can be checked by the following code snippet > {code:java} > import pyspark.sql > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .getOrCreate() > ) > df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"]) > df = df.withColumn("ts", df["ts"].astype("timestamp")) > print(df.toPandas().iloc[0,0]) > print(df.collect()[0][0]) > {code} > Which for me prints (the exact result depends on the timezone of your system, > mine is Europe/Berlin) > {code:java} > 2018-06-01 01:00:00 > 2018-06-01 03:00:00 > {code} > Hence, the method `toPandas` respected the timezone setting (UTC), but the > method `collect` ignored it and converted the timestamp to my systems > timezone. > The cause for this behaviour is that the methods `toInternal` and > `fromInternal` of PySparks `TimestampType` class don't take into account the > setting `spark.sql.session.timeZone` and use the system timezone. > If the maintainers agree that this should be fixed, I would b
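The `toPandas` vs. `collect` discrepancy described above reduces to two stdlib conversion paths. This is an illustration only, not PySpark code; the epoch value is chosen so that 1527814800 is 2018-06-01 01:00:00 UTC:

```python
# Converting an epoch timestamp with the system's local zone vs. an explicit
# session zone yields different wall-clock values. PySpark's
# TimestampType.fromInternal takes the local-zone path, which is why
# `collect` ignores spark.sql.session.timeZone.

from datetime import datetime, timezone

epoch_seconds = 1527814800  # 2018-06-01 01:00:00 UTC

# What `collect` effectively does: local zone, varies by machine.
local_dt = datetime.fromtimestamp(epoch_seconds)

# What respecting the session zone would look like:
session_dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(session_dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-06-01 01:00:00
```

On a machine set to Europe/Berlin, `local_dt` would print 03:00:00, matching the report above.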
[jira] [Assigned] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25243: Assignee: Apache Spark > Use FailureSafeParser in from_json > -- > > Key: SPARK-25243 > URL: https://issues.apache.org/jira/browse/SPARK-25243 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The > [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] > is used in parsing JSON, CSV files and dataset of strings. It supports the > [PERMISSIVE, DROPMALFORMED and > FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] > modes. The ticket aims to make the from_json function compatible to regular > parsing via FailureSafeParser and support above modes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592940#comment-16592940 ] Apache Spark commented on SPARK-25243: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/22237 > Use FailureSafeParser in from_json > -- > > Key: SPARK-25243 > URL: https://issues.apache.org/jira/browse/SPARK-25243 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The > [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] > is used in parsing JSON, CSV files and dataset of strings. It supports the > [PERMISSIVE, DROPMALFORMED and > FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] > modes. The ticket aims to make the from_json function compatible to regular > parsing via FailureSafeParser and support above modes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25243) Use FailureSafeParser in from_json
[ https://issues.apache.org/jira/browse/SPARK-25243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25243: Assignee: (was: Apache Spark) > Use FailureSafeParser in from_json > -- > > Key: SPARK-25243 > URL: https://issues.apache.org/jira/browse/SPARK-25243 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > > The > [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] > is used in parsing JSON, CSV files and dataset of strings. It supports the > [PERMISSIVE, DROPMALFORMED and > FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] > modes. The ticket aims to make the from_json function compatible to regular > parsing via FailureSafeParser and support above modes -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25243) Use FailureSafeParser in from_json
Maxim Gekk created SPARK-25243: -- Summary: Use FailureSafeParser in from_json Key: SPARK-25243 URL: https://issues.apache.org/jira/browse/SPARK-25243 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maxim Gekk The [FailureSafeParser|https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FailureSafeParser.scala#L28] is used in parsing JSON files, CSV files, and datasets of strings. It supports the [PERMISSIVE, DROPMALFORMED and FAILFAST|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ParseMode.scala#L31-L44] modes. This ticket aims to make the from_json function compatible with regular parsing via FailureSafeParser and to support the above modes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
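The three mode names come from ParseMode; their semantics can be sketched with a toy parser (`parse_with_mode` is a stand-in for illustration, not FailureSafeParser itself):

```python
# PERMISSIVE keeps a placeholder for malformed records, DROPMALFORMED skips
# them, and FAILFAST re-raises the parse error immediately.

import json

def parse_with_mode(records, mode="PERMISSIVE"):
    out = []
    for rec in records:
        try:
            out.append(json.loads(rec))
        except json.JSONDecodeError:
            if mode == "FAILFAST":
                raise
            if mode == "DROPMALFORMED":
                continue
            out.append(None)  # PERMISSIVE: keep a null placeholder row
    return out

records = ['{"a": 1}', 'not-json', '{"a": 2}']
print(parse_with_mode(records, "PERMISSIVE"))     # [{'a': 1}, None, {'a': 2}]
print(parse_with_mode(records, "DROPMALFORMED"))  # [{'a': 1}, {'a': 2}]
```

The ticket's goal is that `from_json` honor the same three behaviors instead of having its own error handling.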
[jira] [Resolved] (SPARK-23707) Don't need shuffle exchange with single partition for 'spark.range'
[ https://issues.apache.org/jira/browse/SPARK-23707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23707. -- Resolution: Cannot Reproduce > Don't need shuffle exchange with single partition for 'spark.range' > > > Key: SPARK-23707 > URL: https://issues.apache.org/jira/browse/SPARK-23707 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xianyang Liu >Priority: Major > > Just like #20726. There is no need for an 'Exchange' when `spark.range` produces only > one partition. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25013) JDBC urls with jdbc:mariadb don't work as expected
[ https://issues.apache.org/jira/browse/SPARK-25013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25013. -- Resolution: Won't Fix I wouldn't add this to Spark for now unless there's a strong request from the community. > JDBC urls with jdbc:mariadb don't work as expected > -- > > Key: SPARK-25013 > URL: https://issues.apache.org/jira/browse/SPARK-25013 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dieter Vekeman >Priority: Minor > > When using the MariaDB JDBC driver, the JDBC connection url should be > {code:java} > jdbc:mariadb://localhost:3306/DB?user=someuser&password=somepassword > {code} > https://mariadb.com/kb/en/library/about-mariadb-connector-j/ > However this does not work well in Spark (see below) > *Workaround* > The MariaDB driver also supports using mysql which does work. > The problem seems to have been described and identified in: > https://jira.mariadb.org/browse/CONJ-421 > All works well with spark using connection string with {{"jdbc:mysql:..."}}, > but not using {{"jdbc:mariadb:..."}} because the MySQL dialect is then not used. > When the dialect is not used, the default quote is {{"}}, not {{`}} > So, some internal query generated by spark like {{SELECT `i`,`ip` FROM tmp}} > will then be executed as {{SELECT "i","ip" FROM tmp}} with dataType > previously retrieved, causing the exception > The author of the comment says > {quote}I'll make a pull request to spark so "jdbc:mariadb:" connection string > can be handled{quote} > Did the pull request get lost or should a new one be made? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
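Why the dialect matters can be shown with a toy identifier quoter (`quote_identifier` is a hypothetical helper, not Spark's JdbcDialect API):

```python
# Spark quotes identifiers according to the matched JDBC dialect. Without a
# dialect match for "jdbc:mariadb:", it falls back to the ANSI double quote,
# and MariaDB by default (without ANSI_QUOTES) treats double-quoted tokens
# as string literals rather than identifiers.

def quote_identifier(name: str, dialect: str) -> str:
    # Backtick for the MySQL-family dialect; ANSI double quote otherwise.
    return f"`{name}`" if dialect == "mysql" else f'"{name}"'

print(quote_identifier("ip", "mysql"))    # `ip`  -- what MariaDB expects
print(quote_identifier("ip", "default"))  # "ip"  -- read as a string literal
```

This is why `SELECT `i`,`ip` FROM tmp` degrades to `SELECT "i","ip" FROM tmp` when the URL prefix doesn't match the registered MySQL dialect.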
[jira] [Assigned] (SPARK-10697) Lift Calculation in Association Rule mining
[ https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10697: Assignee: Apache Spark > Lift Calculation in Association Rule mining > --- > > Key: SPARK-10697 > URL: https://issues.apache.org/jira/browse/SPARK-10697 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yashwanth Kumar >Assignee: Apache Spark >Priority: Minor > > Lift is to be calculated for Association rule mining in > AssociationRules.scala under FPM. > Lift is a measure of the performance of a Association rules. > Adding lift will help to compare the model efficiency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10697) Lift Calculation in Association Rule mining
[ https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10697: Assignee: (was: Apache Spark) > Lift Calculation in Association Rule mining > --- > > Key: SPARK-10697 > URL: https://issues.apache.org/jira/browse/SPARK-10697 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yashwanth Kumar >Priority: Minor > > Lift is to be calculated for Association rule mining in > AssociationRules.scala under FPM. > Lift is a measure of the performance of a Association rules. > Adding lift will help to compare the model efficiency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining
[ https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592899#comment-16592899 ] Apache Spark commented on SPARK-10697: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22236 > Lift Calculation in Association Rule mining > --- > > Key: SPARK-10697 > URL: https://issues.apache.org/jira/browse/SPARK-10697 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Yashwanth Kumar >Priority: Minor > > Lift is to be calculated for Association rule mining in > AssociationRules.scala under FPM. > Lift is a measure of the performance of a Association rules. > Adding lift will help to compare the model efficiency. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
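For reference, lift as typically defined for association rules can be computed directly from itemset supports. This is a sketch of the measure itself, not the AssociationRules.scala implementation:

```python
# lift(A => B) = confidence(A => B) / support(B)
#              = support(A and B) / (support(A) * support(B))
# Lift > 1 means A and B co-occur more often than if they were independent.

def lift(support_ab: float, support_a: float, support_b: float) -> float:
    confidence = support_ab / support_a
    return confidence / support_b

# Example: A and B together in 20% of baskets, A in 40%, B in 25%.
print(lift(0.20, 0.40, 0.25))  # 2.0
```

A lift of 2.0 says the rule's consequent appears twice as often alongside the antecedent as independence would predict, which is the model-comparison signal the ticket wants exposed.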
[jira] [Resolved] (SPARK-23792) Documentation improvements for datetime functions
[ https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23792. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20901 [https://github.com/apache/spark/pull/20901] > Documentation improvements for datetime functions > - > > Key: SPARK-23792 > URL: https://issues.apache.org/jira/browse/SPARK-23792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: A Bradbury >Assignee: A Bradbury >Priority: Minor > Fix For: 2.4.0 > > > Added details about the supported column input types, the column return type, > behaviour on invalid input, supporting examples and clarifications to the > datetime functions in `org.apache.spark.sql.functions` for Java/Scala. > These changes stemmed from confusion over behaviour of the `date_add` method. > On first use I thought it would add the specified days to the input > timestamp, but it also truncated (cast) the input timestamp to a date, > losing the time part. > Some examples: > * Noted that the week definition for `dayofweek` method starts on a Sunday > * Corrected documentation for methods such as `last_day` that only listed > one type of input i.e. 
"date column" changed to "date, timestamp or string" > * Renamed the parameters of the `months_between` method to match those of > the `datediff` method and to indicate which parameter is expected to be > before the other chronologically > * `from_unixtime` documentation referenced the "given format" when there was > no format parameter > * Documentation for `to_timestamp` methods detailed that a unix timestamp in > seconds would be returned (implying 1521926327) when they would actually > return the input cast to a timestamp type > Some observations: > * The first day of the week by the `dayofweek` method is a Sunday, but by > the `weekofyear` method it is a Monday > * The `datediff` method returns an integer value, even with timestamp input, > whereas the `months_between` method returns a double, which seems inconsistent > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23792) Documentation improvements for datetime functions
[ https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23792: - Assignee: A Bradbury > Documentation improvements for datetime functions > - > > Key: SPARK-23792 > URL: https://issues.apache.org/jira/browse/SPARK-23792 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.3.0 >Reporter: A Bradbury >Assignee: A Bradbury >Priority: Minor > Fix For: 2.4.0 > > > Added details about the supported column input types, the column return type, > behaviour on invalid input, supporting examples and clarifications to the > datetime functions in `org.apache.spark.sql.functions` for Java/Scala. > These changes stemmed from confusion over behaviour of the `date_add` method. > On first use I thought it would add the specified days to the input > timestamp, but it also truncated (cast) the input timestamp to a date, > losing the time part. > Some examples: > * Noted that the week definition for `dayofweek` method starts on a Sunday > * Corrected documentation for methods such as `last_day` that only listed > one type of input i.e. 
"date column" changed to "date, timestamp or string" > * Renamed the parameters of the `months_between` method to match those of > the `datediff` method and to indicate which parameter is expected to be > before the other chronologically > * `from_unixtime` documentation referenced the "given format" when there was > no format parameter > * Documentation for `to_timestamp` methods detailed that a unix timestamp in > seconds would be returned (implying 1521926327) when they would actually > return the input cast to a timestamp type > Some observations: > * The first day of the week by the `dayofweek` method is a Sunday, but by > the `weekofyear` method it is a Monday > * The `datediff` method returns an integer value, even with timestamp input, > whereas the `months_between` method returns a double, which seems inconsistent > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25080) NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110)
[ https://issues.apache.org/jira/browse/SPARK-25080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25080. -- Resolution: Cannot Reproduce > NPE in HiveShim$.toCatalystDecimal(HiveShim.scala:110) > -- > > Key: SPARK-25080 > URL: https://issues.apache.org/jira/browse/SPARK-25080 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.1 > Environment: AWS EMR >Reporter: Andrew K Long >Priority: Minor > > NPE while reading hive table. > > ``` > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1190 in stage 392.0 failed 4 times, most recent failure: Lost task > 1190.3 in stage 392.0 (TID 122055, ip-172-31-32-196.ec2.internal, executor > 487): java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:217) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:294) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$2.apply(ShuffleExchangeExec.scala:265) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1753) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1741) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1740) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1740) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:871) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:871) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1974) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1923) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1912) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:682) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) > ... 67 more > Caused by: java.lang.NullPointerException > at org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:110) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$
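The NPE above originates in HiveShim.toCatalystDecimal when the decimal writable read from the Hive table is null. The real code is Scala inside Spark's Hive shim; as a minimal, hypothetical Python sketch of the defensive shape of a fix (the function and parameter names here are illustrative, not Spark's API):

```python
from decimal import Decimal

def to_catalyst_decimal(hive_decimal, precision, scale):
    """Null-safe conversion sketch: toCatalystDecimal can NPE when the
    underlying decimal value is null (e.g. a row leaves a declared decimal
    column unset). Guarding for None up front propagates SQL NULL instead."""
    if hive_decimal is None:
        return None
    d = Decimal(hive_decimal)
    # Emulate scale enforcement: round to the declared scale.
    return d.quantize(Decimal(1).scaleb(-scale))
```

The point of the sketch is only the ordering: check for null before dereferencing the value, rather than letting the conversion throw.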
[jira] [Commented] (SPARK-25206) Wrong data may be returned for Parquet
[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592893#comment-16592893 ] Hyukjin Kwon commented on SPARK-25206: -- Please fix the JIRA title to describe the problem more precisely rather than just "wrong results", since this one is a blocker and should be clarified. > Wrong data may be returned for Parquet > -- > > Key: SPARK-25206 > URL: https://issues.apache.org/jira/browse/SPARK-25206 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: yucai >Priority: Blocker > Labels: correctness > Attachments: image-2018-08-24-18-05-23-485.png, > image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, > image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, > image-2018-08-25-10-04-21-901.png, pr22183.png > > > In current Spark 2.3.1, the query below returns wrong data silently. > {code:java} > spark.range(10).write.parquet("/tmp/data") > sql("DROP TABLE t") > sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'") > scala> sql("select * from t").show > ++ > | ID| > ++ > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > |null| > ++ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > scala> sql("set spark.sql.parquet.filterPushdown").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...| true| > ++-+ > scala> sql("set spark.sql.parquet.filterPushdown=false").show > ++-+ > | key|value| > ++-+ > |spark.sql.parquet...|false| > ++-+ > scala> sql("select * from t where id > 0").show > +---+ > | ID| > +---+ > +---+ > {code} > > *Root Cause* > Spark pushes down FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: > Integer) into Parquet, but {color:#ff}ID{color} does not exist in > /tmp/data (Parquet is case-sensitive; the file actually has > {color:#ff}id{color}). > So no records are returned.
> In Spark 2.1, the user will get an Exception: > {code:java} > Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in > schema!{code} > But in Spark 2.3, they will get the wrong results silently. > > Since SPARK-24716, Spark uses the Parquet schema instead of the Hive metastore schema > to do the pushdown, which resolves this issue. > [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
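The root cause described above is a case-sensitive lookup against a case-differing file schema: the pushed-down filter references ID while the Parquet file only contains id. A hedged, standalone sketch of case-insensitive field resolution (illustrative names, not Spark's internal API) shows both the desired behavior and the ambiguity hazard raised in the discussion:

```python
def resolve_field(name, file_schema, case_sensitive):
    """Resolve a query column name against the physical file schema.

    With case_sensitive=True, 'ID' does not match a file field 'id' (the
    silent-empty-result situation above). With case_sensitive=False the
    match succeeds, but fields differing only by case become ambiguous."""
    if case_sensitive:
        return name if name in file_schema else None
    matches = [f for f in file_schema if f.lower() == name.lower()]
    if len(matches) > 1:
        # Failing loudly is arguably better than picking one arbitrarily.
        raise ValueError(f"ambiguous reference {name!r}: matches {matches}")
    return matches[0] if matches else None

# Spark 2.3 pushed FilterApi.gt on 'ID' down verbatim; the file only has
# 'id', so the filter matched nothing and the query silently returned
# no rows -- the case_sensitive=True branch below.
```

Under this sketch, `resolve_field("ID", ["id"], True)` returns None (no match, hence the empty result), while `resolve_field("ID", ["id"], False)` resolves to "id".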
[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view
[ https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592890#comment-16592890 ] Yuming Wang commented on SPARK-25135: - Another serious case: {code:scala} withTempDir { dir => val path = dir.getCanonicalPath val cnt = 30 val table1Path = s"$path/table1" val table3Path = s"$path/table3" spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id % 3 as bigint) as col2") .write.mode(SaveMode.Overwrite).parquet(table1Path) withTable("table1", "table3") { spark.sql( s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$table1Path/'") spark.sql("CREATE TABLE table3(COL1 bigint, COL2 bigint) using parquet " + "PARTITIONED BY (COL2) " + s"CLUSTERED BY (COL1) INTO 2 BUCKETS location '$table3Path/'") withView("view1") { spark.sql("CREATE VIEW view1 as select col1, col2 from table1 where col1 > -20") spark.sql("INSERT OVERWRITE TABLE table3 select COL1, COL2 from view1 CLUSTER BY COL1") spark.table("table3").show } } } } {code} Exception: {noformat} None.get java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) at scala.None$.get(Option.scala:345) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4$$anonfun$5.apply(FileFormatWriter.scala:126) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4$$anonfun$5.apply(FileFormatWriter.scala:126) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4.apply(FileFormatWriter.scala:126) 
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$4.apply(FileFormatWriter.scala:125) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:125) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:151) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:101) at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:117) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:186) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:186) at org.apache.spark.sql.Dataset$$anonfun$51.apply(Dataset.scala:3243) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3242) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:71) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638) {noformat} > insert datasource table may all null when select from view > -- > > Key: SPARK-25135 > URL: https://issues.apache.org/jira/browse/SPARK-25135 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Yuming Wang >Priority: Blocker > Labels: correctness > > How to reproduce: > {code:scala} > val path = "/tmp/spark/parquet" > val cnt = 30 > spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) > as col2").write.mode("overwrite").parquet(path) > spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet > location '$path'") > spark.sql("create view view1 as select col1, col2 from table1 where col1 > > -20") > spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet") > 
spark.sql("insert overwrite table table2 select COL1, COL2 from view1") > spark.table("table2").show > {code}
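The None.get in FileFormatWriter above has the same case-sensitivity flavor: the query's output attributes are named col1/col2 while the target table's schema says COL1/COL2, and an exact-name Option lookup comes back empty. A hypothetical Python sketch of the column-mapping step (names are illustrative; the real code is Scala in FileFormatWriter):

```python
def map_output_columns(query_output, table_schema, case_sensitive=False):
    """Map the query's output column names onto the target table's schema.

    An exact-name lookup (the Option.get pattern) fails when the only
    difference is case, e.g. col1 vs COL1; matching case-insensitively
    is a sketch of the more forgiving behavior."""
    mapping = {}
    for col in table_schema:
        if case_sensitive:
            match = col if col in query_output else None
        else:
            match = next(
                (q for q in query_output if q.lower() == col.lower()), None
            )
        if match is None:
            # Raise a descriptive error instead of an opaque None.get.
            raise LookupError(f"no query output column for table column {col!r}")
        mapping[col] = match
    return mapping
```

For example, mapping ["col1", "col2"] onto ["COL1", "COL2"] succeeds case-insensitively but raises in case-sensitive mode, mirroring the failure in the reproduction above.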
[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16592848#comment-16592848 ] Apache Spark commented on SPARK-23698: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22235 > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Assignee: cclauss >Priority: Minor > Fix For: 2.4.0 > > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 > $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source > --statistics* > ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input' > result = raw_input("\n%s (y/n): " % prompt) > ^ > ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input' > primary_author = raw_input( > ^ > ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input' > pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) >^ > ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input' > jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) > ^ > ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input' > fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % > default_fix_versions) >^ > ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input' > raw_assignee = raw_input( >^ > ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input' > pr_num = raw_input("Which pull request would you like to merge? (e.g. > 34): ") > ^ > ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input' > result = raw_input("Would you like to use the modified title? 
(y/n): > ") > ^ > ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input' > while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": > ^ > ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input' > response = raw_input("%s [y/n]: " % msg) >^ > ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode' > author = unidecode.unidecode(unicode(author, "UTF-8")).strip() > ^ > ./python/setup.py:37:11: F821 undefined name '__version__' > VERSION = __version__ > ^ > ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer' > dispatch[buffer] = save_buffer > ^ > ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file' > dispatch[file] = save_file > ^ > ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode' > if not isinstance(obj, str) and not isinstance(obj, unicode): > ^ > ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long' > intlike = (int, long) > ^ > ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long' > return self._sc._jvm.Time(long(timestamp * 1000)) > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 > undefined name 'xrange' > for i in xrange(50): > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 > undefined name 'xrange' > for j in xrange(5): > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 > undefined name 'xrange' > for k in xrange(20022): > ^ > 20 F821 undefined name 'raw_input' > 20
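Every name flake8 flags above (raw_input, xrange, unicode, long, buffer, file) is a Python 2 builtin that no longer exists on Python 3. A generic compatibility-shim sketch for the commonly shimmable ones is below; this is not the exact patch from the linked pull request, just the usual porting pattern:

```python
import sys

# Python 2/3 compatibility shims for names that are undefined on Python 3.
# (buffer and file have no direct Python 3 equivalent and need real porting.)
if sys.version_info[0] >= 3:
    raw_input = input  # raw_input was renamed to input
    xrange = range     # xrange was renamed to range
    unicode = str      # str is always unicode on Python 3
    long = int         # int subsumes long on Python 3
```

With these aliases in place, code like `for i in xrange(50):` or `intlike = (int, long)` runs unchanged on both major versions.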