[jira] [Commented] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2019-03-21 Thread Kris Geusebroek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797988#comment-16797988
 ] 

Kris Geusebroek commented on SPARK-20712:
-

[~yumwang] In Scala it also fails, but only after issuing an overwrite:

{code:java}
withTable("t1") {
  val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.saveAsTable("t1")
  spark.table("t1").show
  // all good still
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.mode("overwrite").saveAsTable("t1")
  spark.table("t1").show
  // fails
}
{code}

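To make the "4000 bytes" threshold concrete, here is a small spark-shell sketch (mine, not from the thread) that measures the serialized type string of the nested column from the repro above; `catalogString` is used here as a convenient stand-in for the Hive type string the client fails to parse back:

{code:java}
// Sketch: reuse the repro's column list and measure how long the nested
// column's type string gets. With ~2000 fields it lands far past the
// 4000-byte limit of the Hive metastore column-type field discussed here.
val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
val df = spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested")

val typeString = df.schema("nested").dataType.catalogString
println(s"nested type string length: ${typeString.length}")
{code}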
> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
> URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting an exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> 

[jira] [Comment Edited] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2019-03-21 Thread Kris Geusebroek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797988#comment-16797988
 ] 

Kris Geusebroek edited comment on SPARK-20712 at 3/21/19 10:53 AM:
---

[~yumwang] In Scala it also fails, but only after issuing an overwrite:

{code:java}
withTable("t1") {
  val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.saveAsTable("t1")
  spark.table("t1").show
  // all good
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.mode("overwrite").saveAsTable("t1")
  spark.table("t1").show
  // fails
}
{code}


was (Author: krisgeus):
[~yumwang] In Scala it also fails, but only after issuing an overwrite:

```
withTable("t1") {
  val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.saveAsTable("t1")
  spark.table("t1").show
  // all good
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.mode("overwrite").saveAsTable("t1")
  spark.table("t1").show
  // fails
}
```


[jira] [Comment Edited] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2019-03-21 Thread Kris Geusebroek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797988#comment-16797988
 ] 

Kris Geusebroek edited comment on SPARK-20712 at 3/21/19 10:51 AM:
---

[~yumwang] In Scala it also fails, but only after issuing an overwrite:

```
withTable("t1") {
  val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.saveAsTable("t1")
  spark.table("t1").show
  // all good
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.mode("overwrite").saveAsTable("t1")
  spark.table("t1").show
  // fails
}
```


was (Author: krisgeus):
[~yumwang] In Scala it also fails, but only after issuing an overwrite:

withTable("t1") {
  val cols = Range(1, 2000).map(i => s"id as very_long_column_name_id$i")
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.saveAsTable("t1")
  spark.table("t1").show
  // all good still
  spark.range(1).selectExpr(cols: _*).selectExpr("struct(*) as nested").write.mode("overwrite").saveAsTable("t1")
  spark.table("t1").show
  // fails
}


[jira] [Commented] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-29 Thread Kris Geusebroek (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561278#comment-16561278
 ] 

Kris Geusebroek commented on SPARK-24965:
-

PR: https://github.com/apache/spark/pull/21893

> Spark SQL fails when reading a partitioned hive table with different formats 
> per partition
> --
>
> Key: SPARK-24965
> URL: https://issues.apache.org/jira/browse/SPARK-24965
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Kris Geusebroek
>Priority: Major
>  Labels: pull-request-available
>
> When a Hive Parquet partitioned table contains a partition in a different 
> format (Avro, for example), a SELECT * fails with a read exception (the Avro 
> file is not a Parquet file).
> Selecting in Hive works as expected.
> To support this, a new SQL syntax also needed to be supported:
>  * ALTER TABLE <table> SET FILEFORMAT <fileformat>
> This is included in the same PR since the unit test needs it to set up the 
> test data.






[jira] [Created] (SPARK-24965) Spark SQL fails when reading a partitioned hive table with different formats per partition

2018-07-29 Thread Kris Geusebroek (JIRA)
Kris Geusebroek created SPARK-24965:
---

 Summary: Spark SQL fails when reading a partitioned hive table 
with different formats per partition
 Key: SPARK-24965
 URL: https://issues.apache.org/jira/browse/SPARK-24965
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Kris Geusebroek


When a Hive Parquet partitioned table contains a partition in a different 
format (Avro, for example), a SELECT * fails with a read exception (the Avro 
file is not a Parquet file).

Selecting in Hive works as expected.

To support this, a new SQL syntax also needed to be supported:
 * ALTER TABLE <table> SET FILEFORMAT <fileformat>

This is included in the same PR since the unit test needs it to set up the 
test data.
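
For context, a minimal sketch of the scenario (illustrative names, not from the ticket; assumes a Hive-enabled SparkSession). ALTER TABLE ... PARTITION ... SET FILEFORMAT is existing Hive DDL; the PR is about Spark SQL accepting it too:

{code:java}
// Illustrative sketch: a Parquet-partitioned Hive table where one
// partition is switched to Avro. Table and partition names are made up.
spark.sql("CREATE TABLE mixed_fmt (id INT) PARTITIONED BY (p INT) STORED AS PARQUET")
spark.sql("INSERT INTO mixed_fmt PARTITION (p = 1) VALUES (1)")

// Register a second partition and mark it as Avro. This ALTER statement is
// Hive DDL; per this ticket, Spark SQL only learns it with the linked PR.
spark.sql("ALTER TABLE mixed_fmt ADD PARTITION (p = 2)")
spark.sql("ALTER TABLE mixed_fmt PARTITION (p = 2) SET FILEFORMAT AVRO")
// ... populate p = 2 with Avro data, e.g. from Hive ...

// Reading in Hive behaves as expected; before the fix, this Spark read
// fails once p = 2 holds Avro files, because the whole scan assumes Parquet.
spark.sql("SELECT * FROM mixed_fmt").show()
{code}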


