[jira] [Comment Edited] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Costas Piliotis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094651#comment-17094651
 ] 

Costas Piliotis edited comment on SPARK-31583 at 4/28/20, 4:16 PM:
---

[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses 
the flipped bits. This is specifically about how Spark decides where to allocate 
the grouping_id bits: based on the ordinal position in the grouping sets rather 
than the ordinal position in the select clause. Does that make sense?

So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the 
grouping_id bits would be determined as cdba instead of dcba. Looking at most 
RDBMSs that implement grouping sets, my only suggestion is that it would be more 
predictable if the bit order in the grouping_id were determined by the ordinal 
position in the select.

The flipped bits are a separate ticket, and I do believe the implementation 
should predictably match other established RDBMS SQL implementations, where 
1=included and 0=excluded, but that matter is closed to discussion.
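
To make the difference concrete, here is a minimal plain-Scala sketch (not Spark 
internals; the column ordering and the exclusionGid helper are illustrative 
assumptions only) that computes the exclusion-based grouping_id with bits ordered 
by the select clause versus by first appearance in the grouping sets:

{code:scala}
// Illustrative sketch only, not Spark's implementation.
val selectOrder  = Seq("a", "b", "c", "d")                      // ordinal position in SELECT
val groupingSets = Seq(Seq("a", "b", "d"), Seq("a", "b", "c"))
val firstSeen    = groupingSets.flatten.distinct                // a, b, d, c

// Exclusion-based grouping_id: a column's bit is 1 when it is NOT in the grouping set.
def exclusionGid(bitOrder: Seq[String], set: Seq[String]): Int =
  bitOrder.zipWithIndex.map { case (col, i) =>
    if (set.contains(col)) 0 else 1 << (bitOrder.length - 1 - i)
  }.sum

groupingSets.foreach { gs =>
  println(s"$gs  by select order: ${exclusionGid(selectOrder, gs)}  " +
          s"by grouping-set order: ${exclusionGid(firstSeen, gs)}")
}
{code}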


was (Author: cpiliotis):
[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses 
the flipped bits. Specifically this is about how Spark decides where to allocate 
the grouping_id based on the ordinal position in the grouping sets rather than 
the ordinal position in the select clause. Does that make sense?

So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the 
grouping_id would be abdc instead of abcd. Looking at most RDBMSs that implement 
grouping sets, my only suggestion is that it would be more predictable if the 
bit order in the grouping_id were determined by the ordinal position in the 
select.

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 (which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion): when performing complex 
> grouping sets that are not in the order of the base select statement, the bit 
> in the grouping_id appears to be assigned when the grouping set is identified 
> rather than when the columns are selected in the SQL. I will of course use the 
> exclusion strategy identified in SPARK-21858 as the baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> Seq(
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> I expected these bit assignments in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I expected the excluded columns to be encoded one way, but I 
> received them encoded in the order in which they were first seen in the 
> specified grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that Spark actually computes corresponds to grouping_id(a,b,d,c):
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be assigned as the grouping sets 
> are identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do provide a grouping_id implement it by the ordinal 
> position of the fields in the select clause, rather than allocating bits as 
> they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-31583) grouping_id calculation should be improved

2020-04-28 Thread Costas Piliotis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094651#comment-17094651
 ] 

Costas Piliotis commented on SPARK-31583:
-

[~maropu] I'm trying to avoid referencing SPARK-21858, which already addresses 
the flipped bits. Specifically this is about how Spark decides where to allocate 
the grouping_id based on the ordinal position in the grouping sets rather than 
the ordinal position in the select clause. Does that make sense?

So if I have SELECT a,b,c,d FROM ... GROUPING SETS ( (a,b,d), (a,b,c) ), the 
grouping_id would be abdc instead of abcd. Looking at most RDBMSs that implement 
grouping sets, my only suggestion is that it would be more predictable if the 
bit order in the grouping_id were determined by the ordinal position in the 
select.

> grouping_id calculation should be improved
> --
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Costas Piliotis
>Priority: Minor
>
> Unrelated to SPARK-21858 (which identifies that grouping_id is determined by 
> exclusion from a grouping set rather than inclusion): when performing complex 
> grouping sets that are not in the order of the base select statement, the bit 
> in the grouping_id appears to be assigned when the grouping set is identified 
> rather than when the columns are selected in the SQL. I will of course use the 
> exclusion strategy identified in SPARK-21858 as the baseline for this.  
>  
> {code:scala}
> import spark.implicits._
> Seq(
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> I expected these bit assignments in the grouping_id:
>  d=1
>  c=2
>  b=4
>  a=8
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a   |b   |c   |d   |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a   |null|c   |null|4       |6  |110    |
> |null|null|null|null|4       |15 |1111   |
> |a   |null|null|d   |4       |5  |101    |
> |a   |b   |null|d   |4       |1  |1      |
> +----+----+----+----+--------+---+-------+
> {noformat}
>  
>  In other words, I expected the excluded columns to be encoded one way, but I 
> received them encoded in the order in which they were first seen in the 
> specified grouping sets.
>  a,b,d included = excludes c=2; expected gid=2, received gid=1
>  a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that Spark actually computes corresponds to grouping_id(a,b,d,c):
> {code:scala}
> spark.sql("""
>  select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
> bin(grouping_id(a,b,d,c)) as gid_bin
>  from abc
>  group by GROUPING SETS (
>  (),
>  (a,b,d),
>  (a,c),
>  (a,d)
>  )
>  """).show(false)
> {code}
>  The columns forming the grouping_id seem to be assigned as the grouping sets 
> are identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other 
> RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
> bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do provide a grouping_id implement it by the ordinal 
> position of the fields in the select clause, rather than allocating bits as 
> they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (SPARK-31583) grouping_id calculation should be improved

2020-04-27 Thread Costas Piliotis (Jira)
Costas Piliotis created SPARK-31583:
---

 Summary: grouping_id calculation should be improved
 Key: SPARK-31583
 URL: https://issues.apache.org/jira/browse/SPARK-31583
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.5
Reporter: Costas Piliotis


Unrelated to SPARK-21858 (which identifies that grouping_id is determined by 
exclusion from a grouping set rather than inclusion): when performing complex 
grouping sets that are not in the order of the base select statement, the bit in 
the grouping_id appears to be assigned when the grouping set is identified rather 
than when the columns are selected in the SQL. I will of course use the exclusion 
strategy identified in SPARK-21858 as the baseline for this.  

 
{code:scala}
import spark.implicits._
Seq(
  ("a","b","c","d"),
  ("a","b","c","d"),
  ("a","b","c","d"),
  ("a","b","c","d")
).toDF("a","b","c","d").createOrReplaceTempView("abc")
{code}
I expected these bit assignments in the grouping_id:
 d=1
 c=2
 b=4
 a=8
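
(As a sketch of that assumption, the exclusion bit for each column follows from 
its ordinal position in the select clause; the snippet below is illustrative only.)
{code:scala}
// Illustrative: expected exclusion-bit values derived from SELECT order a,b,c,d.
val cols = Seq("a", "b", "c", "d")
val expectedBits = cols.zipWithIndex.map { case (c, i) =>
  c -> (1 << (cols.length - 1 - i))
}.toMap
// expectedBits == Map("a" -> 8, "b" -> 4, "c" -> 2, "d" -> 1)
{code}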
{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}
This returns:
{noformat}
+----+----+----+----+--------+---+-------+
|a   |b   |c   |d   |count(1)|gid|gid_bin|
+----+----+----+----+--------+---+-------+
|a   |null|c   |null|4       |6  |110    |
|null|null|null|null|4       |15 |1111   |
|a   |null|null|d   |4       |5  |101    |
|a   |b   |null|d   |4       |1  |1      |
+----+----+----+----+--------+---+-------+
{noformat}
 

In other words, I expected the excluded columns to be encoded one way, but I 
received them encoded in the order in which they were first seen in the 
specified grouping sets.

 a,b,d included = excludes c=2; expected gid=2, received gid=1
 a,d included = excludes b=4, c=2; expected gid=6, received gid=5
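
A small plain-Scala sketch of that arithmetic (illustrative only, assuming 
exclusion-based bits as in SPARK-21858 and comparing the two candidate bit 
orderings):
{code:scala}
// Bit values by SELECT order: a=8, b=4, c=2, d=1.
// Bit values by first appearance in the grouping sets (a, b, d, c): a=8, b=4, d=2, c=1.
val bySelectOrder  = Map("a" -> 8, "b" -> 4, "c" -> 2, "d" -> 1)
val byGroupingSets = Map("a" -> 8, "b" -> 4, "d" -> 2, "c" -> 1)

// Exclusion-based grouping_id: sum the bits of the columns NOT in the grouping set.
def gid(bits: Map[String, Int], included: Set[String]): Int =
  bits.collect { case (col, bit) if !included(col) => bit }.sum

gid(bySelectOrder,  Set("a", "b", "d"))  // 2  (expected)
gid(byGroupingSets, Set("a", "b", "d"))  // 1  (received)
gid(bySelectOrder,  Set("a", "d"))       // 6  (expected)
gid(byGroupingSets, Set("a", "d"))       // 5  (received)
{code}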

The grouping_id that Spark actually computes corresponds to grouping_id(a,b,d,c):

{code:scala}
spark.sql("""
 select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, 
bin(grouping_id(a,b,d,c)) as gid_bin
 from abc
 group by GROUPING SETS (
 (),
 (a,b,d),
 (a,c),
 (a,d)
 )
 """).show(false)
{code}


 The columns forming the grouping_id seem to be assigned as the grouping sets 
are identified, rather than by ordinal position in the parent query.

I'd like to at least point out that grouping_id is documented in many other 
RDBMSs, and I believe the Spark project should adopt a policy of flipping the 
bits so that 1=inclusion and 0=exclusion in the grouping set.

However, many RDBMSs that do provide a grouping_id implement it by the ordinal 
position of the fields in the select clause, rather than allocating bits as 
they are observed in the grouping sets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Created] (SPARK-22482) Unreadable Parquet array columns

2017-11-09 Thread Costas Piliotis (JIRA)
Costas Piliotis created SPARK-22482:
---

 Summary: Unreadable Parquet array columns
 Key: SPARK-22482
 URL: https://issues.apache.org/jira/browse/SPARK-22482
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
 Environment: Spark 2.1.0
Parquet 1.8.1
Hive 1.2
Hive 2.1.0
presto 0.157
presto 0.180
Reporter: Costas Piliotis


We have seen an issue with writing out Parquet data from Spark. int and boolean 
arrays seem to throw exceptions when the Parquet files are read from Hive and 
Presto.

I've logged a ticket with the Parquet project (PARQUET-1157), but I'm not sure 
whether it's an issue within their project or an issue with Spark itself.

Spark is reading parquet-avro data that is output by a MapReduce job and writing 
it back out as Parquet.

The inbound parquet format has the column defined as:

{code}
  optional group playerpositions_ai (LIST) {
repeated int32 array;
  }
{code}

Spark is redefining this data as this:

{code}
  optional group playerpositions_ai (LIST) {
repeated group list {
  optional int32 element;
}
  }
{code}

and with legacy format:
{code}
  optional group playerpositions_ai (LIST) {
repeated group bag {
  optional int32 array;
}
  }
{code}

The Parquet data was tested in Hive 1.2, Hive 2.1, Presto 0.157, Presto 0.180, 
and Spark 2.1, as well as Amazon Athena (which is some form of Presto 
implementation).

I believe that the above schema is valid for writing out Parquet.

The Spark command writing it out is simple:
{code}
data.repartition(((data.count() / 1000) + 1).toInt)
  .write.format("parquet")
  .mode("append")
  .partitionBy(partitionColumns: _*)
  .save(path)
{code}

We initially wrote this out with legacy format turned off, later turned legacy 
format on, and have seen this error occur the same way with legacy format both 
off and on.
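
(For reference, the legacy-format switch referred to above is, assuming the 
standard Spark 2.x configuration key, toggled like this before the write; shown 
only for context.)
{code}
// Assumed config key: spark.sql.parquet.writeLegacyFormat controls whether Spark
// writes Parquet LISTs in the older Hive/Impala-compatible layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")   // legacy on
// spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false") // legacy off (default)
{code}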

Spark's stack trace from reading this is:

{code}
java.lang.IllegalArgumentException: Reading past RLE/BitPacking stream.
at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readNext(RunLengthBitPackingHybridDecoder.java:82)
at org.apache.parquet.column.values.rle.RunLengthBitPackingHybridDecoder.readInt(RunLengthBitPackingHybridDecoder.java:64)
at org.apache.parquet.column.values.dictionary.DictionaryValuesReader.readInteger(DictionaryValuesReader.java:112)
at org.apache.parquet.column.impl.ColumnReaderImpl$2$3.read(ColumnReaderImpl.java:243)
at org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:464)
at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:370)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Also note that our data is stored on S3, if that matters.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
