[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094651#comment-17094651 ]

Costas Piliotis edited comment on SPARK-31583 at 4/28/20, 4:16 PM:
-------------------------------------------------------------------

[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses the flipped bits. This ticket is specifically about how Spark decides where to allocate each grouping_id bit: by ordinal position in the grouping sets rather than by ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the grouping_id bits would be determined in the order cdba instead of dcba. If we look at most RDBMS implementations that support grouping sets, my only suggestion is that the grouping_id would be more predictable if its bit order were determined by ordinal position in the select. The flipped bits are a separate ticket; I do believe the implementation should predictably match established RDBMS SQL implementations, where 1=included and 0=excluded, but that matter is closed to discussion.
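To make the proposed ordering concrete, here is a minimal sketch (a hypothetical helper, not Spark API) of the gid the reporter expects: bit weights follow the SELECT clause (a=8, b=4, c=2, d=1), and per the exclusion convention from SPARK-21858 a set bit means the column was aggregated away.

{code:scala}
// Hypothetical illustration only: expected grouping_id under select-ordinal
// bit assignment. Weights follow the SELECT clause: a=8, b=4, c=2, d=1.
val weights = Map("a" -> 8, "b" -> 4, "c" -> 2, "d" -> 1)

// gid = sum of the weights of the columns EXCLUDED from the grouping set
// (exclusion convention per SPARK-21858).
def expectedGid(groupingSet: Set[String]): Int =
  weights.collect { case (col, w) if !groupingSet.contains(col) => w }.sum

expectedGid(Set("a", "b", "d")) // excludes c          => 2
expectedGid(Set("a", "d"))      // excludes b and c    => 6
expectedGid(Set.empty)          // excludes everything => 15
{code}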
was (Author: cpiliotis):
[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses the flipped bits. This ticket is specifically about how Spark decides where to allocate each grouping_id bit: by ordinal position in the grouping sets rather than by ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the grouping_id would be abdc instead of abcd. If we look at most RDBMS implementations that support grouping sets, my only suggestion is that the grouping_id would be more predictable if its bit order were determined by ordinal position in the select.

> grouping_id calculation should be improved
> -------------------------------------------
>
>                 Key: SPARK-31583
>                 URL: https://issues.apache.org/jira/browse/SPARK-31583
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Costas Piliotis
>            Priority: Minor

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094651#comment-17094651 ]

Costas Piliotis commented on SPARK-31583:
------------------------------------------

[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses the flipped bits. This ticket is specifically about how Spark decides where to allocate each grouping_id bit: by ordinal position in the grouping sets rather than by ordinal position in the select clause. Does that make sense? So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the grouping_id would be abdc instead of abcd. If we look at most RDBMS implementations that support grouping sets, my only suggestion is that the grouping_id would be more predictable if its bit order were determined by ordinal position in the select.

> grouping_id calculation should be improved
> -------------------------------------------
>
>                 Key: SPARK-31583
>                 URL: https://issues.apache.org/jira/browse/SPARK-31583
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Costas Piliotis
>            Priority: Minor
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31583) grouping_id calculation should be improved
Costas Piliotis created SPARK-31583:
---------------------------------------

             Summary: grouping_id calculation should be improved
                 Key: SPARK-31583
                 URL: https://issues.apache.org/jira/browse/SPARK-31583
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.5
            Reporter: Costas Piliotis


Unrelated to SPARK-21858, which identifies that grouping_id is determined by exclusion from a grouping set rather than inclusion: when performing complex grouping sets that are not in the order of the base select statement, the bit in the grouping_id seems to be flipped when the grouping set is identified rather than when the columns are selected in the SQL. I will of course use the exclusion strategy identified in SPARK-21858 as the baseline for this.

{code:scala}
import spark.implicits._

val df = Seq(
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d"),
  ("a", "b", "c", "d")
).toDF("a", "b", "c", "d").createOrReplaceTempView("abc")
{code}

I expected these columns to map to these bits in the grouping_id:
d=1
c=2
b=4
a=8

{code:scala}
spark.sql("""
select a, b, c, d, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by GROUPING SETS (
  (),
  (a,b,d),
  (a,c),
  (a,d)
)
""").show(false)
{code}

This returns:

{noformat}
+----+----+----+----+--------+---+-------+
|a   |b   |c   |d   |count(1)|gid|gid_bin|
+----+----+----+----+--------+---+-------+
|a   |null|c   |null|4       |6  |110    |
|null|null|null|null|4       |15 |1111   |
|a   |null|null|d   |4       |5  |101    |
|a   |b   |null|d   |4       |1  |1      |
+----+----+----+----+--------+---+-------+
{noformat}

In other words, I expected the excluded values to follow the select-clause order, but instead they are excluded in the order they are first seen in the specified grouping sets:

a,b,d included = excludes c=2; expected gid=2, received gid=1
a,d included = excludes b=4, c=2; expected gid=6, received gid=5

The grouping_id that actually matches the received values is grouping_id(a,b,d,c):

{code:scala}
spark.sql("""
select a, b, c, d, count(*), grouping_id(a,b,d,c) as gid,
       bin(grouping_id(a,b,d,c)) as gid_bin
from abc
group by GROUPING SETS (
  (),
  (a,b,d),
  (a,c),
  (a,d)
)
""").show(false)
{code}

The columns forming the grouping_id seem to be assigned as the grouping sets are identified, rather than by ordinal position in the parent query.

I'd like to at least point out that grouping_id is documented in many other RDBMS, and I believe the Spark project should adopt a policy of flipping the bits so that 1=inclusion and 0=exclusion in the grouping set. However, many RDBMS that do have a grouping_id feature implement it by the ordinal position of the fields recognized in the select clause, rather than allocating bits as the columns are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
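As a reading aid for the output above, a small hypothetical helper (not Spark API) can decode a gid under the bit order Spark actually used here: columns in first-seen grouping-set order (a,b,d,c), most significant bit first, with a set bit meaning the column was aggregated away.

{code:scala}
// Hypothetical helper (not Spark API): decode a grouping_id under a given
// bit order (most significant bit first). A set bit = column excluded.
def excludedColumns(gid: Int, bitOrder: Seq[String]): Seq[String] =
  bitOrder.zipWithIndex.collect {
    case (col, i) if ((gid >> (bitOrder.length - 1 - i)) & 1) == 1 => col
  }

// First-seen order for this query is a,b,d,c; this reproduces the table:
excludedColumns(1, Seq("a", "b", "d", "c")) // 0001 => Seq("c")      row (a,b,d)
excludedColumns(5, Seq("a", "b", "d", "c")) // 0101 => Seq("b", "c") row (a,d)
{code}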
[jira] [Created] (SPARK-22482) Unreadable Parquet array columns
Costas Piliotis created SPARK-22482:
---------------------------------------

             Summary: Unreadable Parquet array columns
                 Key: SPARK-22482
                 URL: https://issues.apache.org/jira/browse/SPARK-22482
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
         Environment: Spark 2.1.0
Parquet 1.8.1
Hive 1.2
Hive 2.1.0
Presto 0.157
Presto 0.180
            Reporter: Costas Piliotis


We have seen an issue with writing out parquet data from Spark: int and bool arrays seem to throw exceptions when the parquet files are read from Hive and Presto. I've logged a ticket with the Parquet project, PARQUET-1157, but I'm not sure if it's an issue within their project or an issue with Spark itself.

Spark is reading parquet-avro data which is output by a MapReduce job and writing it out to parquet. The inbound parquet format has the column defined as:

{code}
optional group playerpositions_ai (LIST) {
  repeated int32 array;
}
{code}

Spark is redefining this data as this:

{code}
optional group playerpositions_ai (LIST) {
  repeated group list {
    optional int32 element;
  }
}
{code}

and with legacy format:

{code}
optional group playerpositions_ai (LIST) {
  repeated group bag {
    optional int32 array;
  }
}
{code}

The parquet data was tested in Hive 1.2, Hive 2.1, Presto 0.157, Presto 0.180, and Spark 2.1, as well as Amazon Athena (which is some form of Presto implementation). I believe that the above schema is valid for writing out parquet.

The Spark command writing it out is simple:

{code}
data.repartition(((data.count() / 1000) + 1).toInt).write.format("parquet")
  .mode("append")
  .partitionBy(partitionColumns: _*)
  .save(path)
{code}

We initially wrote this out with legacy format turned off, but later turned legacy format on, and have seen this error occur the same way with legacy off and on.
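For reference, the legacy/standard switch mentioned above is controlled by the spark.sql.parquet.writeLegacyFormat setting. A minimal sketch of toggling the two write paths follows; the "data" DataFrame and the output paths are illustrative stand-ins, not from the original job.

{code:scala}
// Standard (parquet-format spec) list layout: repeated group list / element
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
data.write.format("parquet").mode("append").save("s3://bucket/standard/")

// Legacy (Spark 1.x compatible) list layout: repeated group bag / array
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
data.write.format("parquet").mode("append").save("s3://bucket/legacy/")
{code}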
Spark's stack trace from reading this is:

{code}
java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
	at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
	at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
	at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
	at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
	at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
	at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}

Also do note that our data is stored on S3, if that matters.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org