[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2018-05-13 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473584#comment-16473584
 ] 

Fernando Pereira commented on SPARK-19618:
--

[~cloud_fan] I have created the JIRA and an implementation that lifts the limit 
via a configuration option. Internally we are forced to run our own patched 
build, and it would be nice to get back in sync with upstream at some point. It 
is a very small patch in the end. Thanks.
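
A rough sketch of the intended usage, for reference. The config key follows 
the SPARK-23997 proposal (spark.sql.sources.bucketing.maxBuckets) and should 
be treated as illustrative until the patch is merged; the numbers are made up:

{code}
// Illustrative only: config key as proposed in SPARK-23997; not available
// in any release until the patch lands.
spark.conf.set("spark.sql.sources.bucketing.maxBuckets", "500000")

// With the ceiling raised, a bucket count above the old 100000 limit
// should pass DataFrameWriter's validation.
df.write
  .format("parquet")
  .bucketBy(200000, "col1")
  .sortBy("col1")
  .saveAsTable("bucketed_table")
{code}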

> Inconsistency wrt max. buckets allowed from Dataframe API vs SQL
> ----------------------------------------------------------------
>
> Key: SPARK-19618
> URL: https://issues.apache.org/jira/browse/SPARK-19618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Major
> Fix For: 2.2.0
>
>
> High number of buckets is allowed while creating a table via SQL query:
> {code}
> sparkSession.sql("""
> CREATE TABLE bucketed_table(col1 INT) USING parquet 
> CLUSTERED BY (col1) SORTED BY (col1) INTO 147483647 BUCKETS
> """)
> sparkSession.sql("DESC FORMATTED bucketed_table").collect.foreach(println)
> 
> [Num Buckets:,147483647,]
> [Bucket Columns:,[col1],]
> [Sort Columns:,[col1],]
> 
> {code}
> Trying the same via dataframe API does not work:
> {code}
> > df.write.format("orc").bucketBy(147483647, 
> > "j","k").sortBy("j","k").saveAsTable("bucketed_table")
> java.lang.IllegalArgumentException: requirement failed: Bucket number must be 
> greater than 0 and less than 100000.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:293)
>   at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$getBucketSpec$2.apply(DataFrameWriter.scala:291)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.DataFrameWriter.getBucketSpec(DataFrameWriter.scala:291)
>   at 
> org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:429)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:410)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:365)
>   ... 50 elided
> {code}
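
For context on why the two paths disagree: the stack trace points at a 
validation inside DataFrameWriter.getBucketSpec, which the SQL CREATE TABLE 
path never runs. A paraphrased sketch of that check (not the verbatim Spark 
source; details vary by version):

{code}
// Paraphrase of the validation behind the stack trace above. The SQL DDL
// path builds its bucket spec without this require, hence the inconsistency.
private def getBucketSpec: Option[BucketSpec] = {
  numBuckets.map { n =>
    require(n > 0 && n < 100000,
      "Bucket number must be greater than 0 and less than 100000.")
    BucketSpec(n, bucketColumnNames.get, sortColumnNames.getOrElse(Nil))
  }
}
{code}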





[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2018-04-16 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440085#comment-16440085
 ] 

Fernando Pereira commented on SPARK-19618:
--

Opened [SPARK-23997|https://issues.apache.org/jira/browse/SPARK-23997].

Thanks!



[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2018-04-15 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438912#comment-16438912
 ] 

Wenchen Fan commented on SPARK-19618:
-

Making it configurable sounds like a good idea. Can you open a JIRA for it? 
Thanks!



[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2018-04-15 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438822#comment-16438822
 ] 

Fernando Pereira commented on SPARK-19618:
--

Is there any technical obstacle to using more than 100k buckets? If not, what 
about making the limit configurable?

We have an 80 TB workload, and to keep partitions "manageable" we really do 
need a large number of buckets. While that might seem like a lot today, 
workloads can only be expected to keep growing in size...
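
To put rough numbers on it: 80 TB spread over the current 100000-bucket 
ceiling already averages about 800 MB per bucket (8 × 10^13 bytes / 10^5 
buckets = 8 × 10^8 bytes), before accounting for partitioning or skew, so 
raising the cap is what keeps individual bucket files at a manageable size.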



[jira] [Commented] (SPARK-19618) Inconsistency wrt max. buckets allowed from Dataframe API vs SQL

2017-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868963#comment-15868963
 ] 

Apache Spark commented on SPARK-19618:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/16948
