[jira] [Created] (SPARK-9206) ClassCastException using HiveContext with GoogleHadoopFileSystem as fs.defaultFS

2015-07-20 Thread Dennis Huo (JIRA)
Dennis Huo created SPARK-9206:
-

 Summary: ClassCastException using HiveContext with 
GoogleHadoopFileSystem as fs.defaultFS
 Key: SPARK-9206
 URL: https://issues.apache.org/jira/browse/SPARK-9206
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Dennis Huo


Originally reported on StackOverflow: 
http://stackoverflow.com/questions/31478955/googlehadoopfilesystem-cannot-be-cast-to-hadoop-filesystem

Google's "bdutil" command-line tool 
(https://github.com/GoogleCloudPlatform/bdutil) is one of the main supported 
ways of deploying Hadoop and Spark cluster on Google Cloud Platform, and has 
default settings which configure fs.defaultFS to use the Google Cloud Storage 
connector for Hadoop (and performs installation of the connector jarfile on top 
of tarball-based Hadoop and Spark distributions).
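
As context for the repro below, one way to confirm which FileSystem implementation backs fs.defaultFS on such a deployment is to inspect the Hadoop configuration from spark-shell (an illustrative check, not part of the original report; output values are examples):

{code}
// Run in spark-shell on a bdutil-deployed cluster.
val hadoopConf = sc.hadoopConfiguration
println(hadoopConf.get("fs.defaultFS"))
// e.g. gs://<configured-bucket>/
println(org.apache.hadoop.fs.FileSystem.get(hadoopConf).getClass)
// e.g. class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
{code}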

Starting in Spark 1.4.1, on a default bdutil-based Spark deployment, running 
"spark-shell" and then trying to read a file with sqlContext, such as:

{code}
sqlContext.parquetFile("gs://my-bucket/my-file.parquet")
{code}

results in the following:

{noformat}
15/07/20 20:59:14 DEBUG IsolatedClientLoader: shared class: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
java.lang.RuntimeException: java.lang.ClassCastException: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem cannot be cast to org.apache.hadoop.fs.FileSystem
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:116)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:172)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:168)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:213)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:176)
    at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:371)
    at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:371)
    at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:370)
    at org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:383)
    at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:383)
    at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:382)
    at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
    at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
    at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:1099)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
    at $iwC$$iwC$$iwC.<init>(<console>:32)
    at $iwC$$iwC.<init>(<console>:34)
    at $iwC.<init>(<console>:36)
    at <init>(<console>:38)
    at .<init>(<console>:42)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at
{noformat}
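
This failure is the standard JVM symptom of one class name (here org.apache.hadoop.fs.FileSystem) being visible through two different classloaders: class identity is (name, defining loader), so a cast across loaders throws even when the names match. A minimal standalone illustration, unrelated to Spark's IsolatedClientLoader (jar path and class name are hypothetical):

{code}
import java.net.URLClassLoader

// Two isolated loaders (parent = null) each define their own copy of the class.
val jarUrl = new java.io.File("/path/to/some.jar").toURI.toURL
val loaderA = new URLClassLoader(Array(jarUrl), null)
val loaderB = new URLClassLoader(Array(jarUrl), null)
val classA = loaderA.loadClass("some.pkg.Foo")
val classB = loaderB.loadClass("some.pkg.Foo")

println(classA.getName == classB.getName)        // true: same fully-qualified name
println(classA == classB)                        // false: different defining loaders
println(classB.isInstance(classA.newInstance())) // false: casting would throw ClassCastException
{code}

The setting that controls which class-name prefixes bypass Spark's Hive-client isolation is spark.sql.hive.metastore.sharedPrefixes; whether listing the GCS connector's classes there works around the problem will depend on the deployment.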

[jira] [Created] (SPARK-9782) Add support for YARN application tags running Spark on YARN

2015-08-10 Thread Dennis Huo (JIRA)
Dennis Huo created SPARK-9782:
-

 Summary: Add support for YARN application tags running Spark on 
YARN
 Key: SPARK-9782
 URL: https://issues.apache.org/jira/browse/SPARK-9782
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.4.1
Reporter: Dennis Huo


https://issues.apache.org/jira/browse/YARN-1390 originally added the new 
“Application Tags” feature to YARN to help track the sources of applications 
among many possible YARN clients. 
https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a set 
of tags to be applied, and for comparison, 
https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for 
MapReduce to easily propagate tags through to YARN via Configuration settings.

Since the ApplicationSubmissionContext.setApplicationTags method was only added 
in Hadoop 2.4+, Spark support will invoke the method via reflection, the same 
way other such version-specific methods are called elsewhere in the YARN 
client. Since the usage of tags is generally not critical to the functionality 
of older YARN setups, it should be safe to handle NoSuchMethodException with 
just a logWarning.
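
A minimal sketch of what that reflective call could look like (illustrative wiring only; the names follow the YARN API, but this is not Spark's actual implementation):

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

def setApplicationTagsIfSupported(
    appContext: ApplicationSubmissionContext, tags: Set[String]): Unit = {
  try {
    // setApplicationTags(java.util.Set[String]) only exists in Hadoop 2.4+,
    // so look it up reflectively instead of linking against it directly.
    val method = appContext.getClass.getMethod(
      "setApplicationTags", classOf[java.util.Set[String]])
    method.invoke(appContext, new java.util.HashSet[String](tags.asJava))
  } catch {
    case _: NoSuchMethodException =>
      // Stand-in for Spark's logWarning: tags are best-effort on older YARN versions.
      println(s"WARN: Ignoring application tags $tags; this YARN version does not support them.")
  }
}
{code}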





[jira] [Commented] (SPARK-9782) Add support for YARN application tags running Spark on YARN

2015-08-10 Thread Dennis Huo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680431#comment-14680431
 ] 

Dennis Huo commented on SPARK-9782:
---

Correct, from what I understand, the "node labels" JIRA is a more heavyweight, 
behavior-changing feature for controlling how requested containers are packed 
onto machines based on "node labels".

"YARN application tags" are distinct from node labels; they are only used as 
metadata by workflow orchestrators on top of YARN, without affecting how YARN 
packs containers at all.
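
For a concrete sense of the intended usage, here is a hypothetical submission with tags attached (the property name spark.yarn.tags is the one this work ultimately introduced; the application and tag names are made up):

{code}
spark-submit \
  --master yarn \
  --conf spark.yarn.tags=nightly-etl,owner-dataplatform \
  --class com.example.MyJob my-job.jar
{code}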

> Add support for YARN application tags running Spark on YARN
> ---
>
> Key: SPARK-9782
> URL: https://issues.apache.org/jira/browse/SPARK-9782
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.4.1
>Reporter: Dennis Huo
>
> https://issues.apache.org/jira/browse/YARN-1390 originally added the new 
> “Application Tags” feature to YARN to help track the sources of applications 
> among many possible YARN clients. 
> https://issues.apache.org/jira/browse/YARN-1399 improved on this to allow a 
> set of tags to be applied, and for comparison, 
> https://issues.apache.org/jira/browse/MAPREDUCE-5699 added support for 
> MapReduce to easily propagate tags through to YARN via Configuration settings.
> Since the ApplicationSubmissionContext.setApplicationTags method was only 
> added in Hadoop 2.4+, Spark support will invoke the method via reflection the 
> same way other such version-specific methods are called elsewhere in the 
> YARN client. Since the usage of tags is generally not critical to the 
> functionality of older YARN setups, it should be safe to handle 
> NoSuchMethodException with just a logWarning.





[jira] [Created] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)
Dennis Huo created SPARK-40128:
--

 Summary: Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone 
encoding in VectorizedColumnReader
 Key: SPARK-40128
 URL: https://issues.apache.org/jira/browse/SPARK-40128
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Dennis Huo


Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.
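
As a sketch of why the standalone encoding is straightforward to support: per the spec above, all value lengths come first (themselves DELTA_BINARY_PACKED), followed by the concatenated value bytes, so decoding reduces to slicing (hypothetical helper, assuming the lengths have already been decoded):

{code}
// Slices the concatenated data buffer of a DELTA_LENGTH_BYTE_ARRAY page back into values.
def sliceDeltaLengthByteArray(lengths: Array[Int], data: Array[Byte]): Array[Array[Byte]] = {
  var offset = 0
  lengths.map { len =>
    val value = java.util.Arrays.copyOfRange(data, offset, offset + len)
    offset += len
    value
  }
}

// Example: lengths Array(5, 5) with data "HelloWorld".getBytes yields "Hello" and "World".
{code}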

The problem can be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).
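
A minimal repro sketch in spark-shell (the path inside the parquet-testing checkout is an assumption):

{code}
// With the vectorized Parquet reader enabled (the default), reading this file
// exercises the standalone DELTA_LENGTH_BYTE_ARRAY decode path.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.read.parquet("/path/to/parquet-testing/data/delta_length_byte_array.parquet").show()
{code}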





[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Attachment: delta_length_byte_array.parquet

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).





[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Description: 
Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
[https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.

The problem can be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).

  was:
Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).

Even though there apparently aren't many writers of the standalone 
DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array–6
 and could be more efficient for types of binary/string data that don't take 
good advantage of sharing common prefixes for incremental encoding.

The problem and be reproduced by trying to load one of the 
[https://github.com/apache/parquet-testing] files 
(delta_length_byte_array.parquet).


> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).





[jira] [Updated] (SPARK-40128) Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in VectorizedColumnReader

2022-08-17 Thread Dennis Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Huo updated SPARK-40128:
---
Docs Text: Added support for keeping vectorized reads enabled for Parquet 
files using the DELTA_LENGTH_BYTE_ARRAY encoding as a standalone column 
encoding. Previously, the related DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY 
encodings were accepted as column encodings, but DELTA_LENGTH_BYTE_ARRAY would 
still be rejected as "unsupported".

> Add DELTA_LENGTH_BYTE_ARRAY as a recognized standalone encoding in 
> VectorizedColumnReader
> -
>
> Key: SPARK-40128
> URL: https://issues.apache.org/jira/browse/SPARK-40128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dennis Huo
>Priority: Major
> Attachments: delta_length_byte_array.parquet
>
>
> Even though https://issues.apache.org/jira/browse/SPARK-36879 added 
> implementations for DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY, and 
> DELTA_LENGTH_BYTE_ARRAY encodings, only DELTA_BINARY_PACKED and 
> DELTA_BYTE_ARRAY were added as top-level standalone column encodings, with 
> DELTA_LENGTH_BYTE_ARRAY only being used in its capacity as a subcomponent of 
> DELTA_BYTE_ARRAY (for the non-shared string/binary suffixes).
> Even though there apparently aren't many writers of the standalone 
> DELTA_LENGTH_BYTE_ARRAY encoding, it's part of the core Parquet spec: 
> [https://parquet.apache.org/docs/file-format/data-pages/encodings/#delta-length-byte-array-delta_length_byte_array]–6
>  and could be more efficient for types of binary/string data that don't take 
> good advantage of sharing common prefixes for incremental encoding.
> The problem can be reproduced by trying to load one of the 
> [https://github.com/apache/parquet-testing] files 
> (delta_length_byte_array.parquet).


