[jira] [Updated] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-22 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26939:
-
Description: 
Some comments about task schedulers are outdated. They should be fixed.

* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.

  was:
Some comments about task schedulers are outdated. They should be fixed.

* TaskScheduler comments: currently implemented exclusively by 
  org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.


> Fix some outdated comments about task schedulers
> 
>
> Key: SPARK-26939
> URL: https://issues.apache.org/jira/browse/SPARK-26939
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Some comments about task schedulers are outdated. They should be fixed.
> * YarnClusterScheduler comments: reference to ClusterScheduler which is not 
> used anymore.
> * TaskSetManager comments: method statusUpdate does not exist as of now.






[jira] [Created] (SPARK-26939) Fix some outdated comments about task schedulers

2019-02-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26939:


 Summary: Fix some outdated comments about task schedulers
 Key: SPARK-26939
 URL: https://issues.apache.org/jira/browse/SPARK-26939
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Some comments about task schedulers are outdated. They should be fixed.

* TaskScheduler comments: currently implemented exclusively by 
  org.apache.spark.scheduler.TaskSchedulerImpl. This is not true as of now.
* YarnClusterScheduler comments: reference to ClusterScheduler which is not 
used anymore.
* TaskSetManager comments: method statusUpdate does not exist as of now.






[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools

2019-02-01 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26813:
-
Description: 
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.

  was:
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.

 


> Consolidate java version across language compilers and build tools
> --
>
> Key: SPARK-26813
> URL: https://issues.apache.org/jira/browse/SPARK-26813
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> The java version here means versions of javac source, javac target, scalac 
> target. They could be consolidated as a single version (currently 1.8)
> || ||javac||scalac||
> |source|1.8|2.12/2.11|
> |target|1.8|1.8|
> The current issues are as follows
>  * Maven defines a single property to specify java version (java.version) 
> while SBT build defines different properties for javac (javacJVMVersion) and 
> scalac (scalacJVMVersion). SBT should use a single property as Maven does.
>  * Furthermore, it's even better for SBT to refer to java.version defined by 
> Maven. This is possible since we've already been using sbt-pom-reader.






[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools

2019-02-01 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26813:
-
Description: 
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * Furthermore, it's better for SBT to refer to java.version defined by Maven. 
This is possible since we've already been using sbt-pom-reader.

  was:
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.


> Consolidate java version across language compilers and build tools
> --
>
> Key: SPARK-26813
> URL: https://issues.apache.org/jira/browse/SPARK-26813
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> The java version here means versions of javac source, javac target, scalac 
> target. They could be consolidated as a single version (currently 1.8)
> || ||javac||scalac||
> |source|1.8|2.12/2.11|
> |target|1.8|1.8|
> The current issues are as follows
>  * Maven defines a single property to specify java version (java.version) 
> while SBT build defines different properties for javac (javacJVMVersion) and 
> scalac (scalacJVMVersion). SBT should use a single property as Maven does.
>  * Furthermore, it's better for SBT to refer to java.version defined by 
> Maven. This is possible since we've already been using sbt-pom-reader.






[jira] [Updated] (SPARK-26813) Consolidate java version across language compilers and build tools

2019-02-01 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26813:
-
Description: 
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.

 

  was:
The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * For SBT build, both javac options and scalac options related to java version 
are provided. For Maven build, scala-maven-plugin compiles both Java and Scala 
code. However, javac options related to java version (-source, -target) are 
provided while scalac options related to java version (-target:TARGET) are not 
provided, which means scalac will depend on the default options (jvm-1.8). It's 
better for Maven build to explicitly provide scalac options as well.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.

 


> Consolidate java version across language compilers and build tools
> --
>
> Key: SPARK-26813
> URL: https://issues.apache.org/jira/browse/SPARK-26813
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Minor
>
> The java version here means versions of javac source, javac target, scalac 
> target. They could be consolidated as a single version (currently 1.8)
> || ||javac||scalac||
> |source|1.8|2.12/2.11|
> |target|1.8|1.8|
> The current issues are as follows
>  * Maven defines a single property to specify java version (java.version) 
> while SBT build defines different properties for javac (javacJVMVersion) and 
> scalac (scalacJVMVersion). SBT should use a single property as Maven does.
>  * Furthermore, it's even better for SBT to refer to java.version defined by 
> Maven. This is possible since we've already been using sbt-pom-reader.
>  






[jira] [Created] (SPARK-26813) Consolidate java version across language compilers and build tools

2019-02-01 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26813:


 Summary: Consolidate java version across language compilers and 
build tools
 Key: SPARK-26813
 URL: https://issues.apache.org/jira/browse/SPARK-26813
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


The java version here means versions of javac source, javac target, scalac 
target. They could be consolidated as a single version (currently 1.8)
|| ||javac||scalac||
|source|1.8|2.12/2.11|
|target|1.8|1.8|

The current issues are as follows
 * Maven defines a single property to specify java version (java.version) while 
SBT build defines different properties for javac (javacJVMVersion) and scalac 
(scalacJVMVersion). SBT should use a single property as Maven does.
 * For SBT build, both javac options and scalac options related to java version 
are provided. For Maven build, scala-maven-plugin compiles both Java and Scala 
code. However, javac options related to java version (-source, -target) are 
provided while scalac options related to java version (-target:TARGET) are not 
provided, which means scalac will depend on the default options (jvm-1.8). It's 
better for Maven build to explicitly provide scalac options as well.
 * Furthermore, it's even better for SBT to refer to java.version defined by 
Maven. This is possible since we've already been using sbt-pom-reader.
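To make the proposal concrete, here is a minimal, hypothetical sbt sketch (the setting name javaVersion is illustrative, not Spark's actual SparkBuild code): a single property drives both the javac -source/-target flags and the scalac JVM target.

{code:scala}
// build.sbt fragment (sketch only): one shared version value, analogous to
// Maven's java.version property, feeds both javac and scalac options.
val javaVersion = settingKey[String]("Java source/target version shared by javac and scalac")

lazy val sharedJavaVersionSettings = Seq(
  javaVersion := "1.8",
  // javac: -source/-target come from the single property
  javacOptions ++= Seq("-source", javaVersion.value, "-target", javaVersion.value),
  // scalac: emit bytecode for the same JVM target (-target:jvm-1.8)
  scalacOptions += s"-target:jvm-${javaVersion.value}"
)
{code}

With sbt-pom-reader already in use, the value could in principle be read from Maven's java.version instead of being hard-coded, which is the consolidation the ticket suggests.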

 






[jira] [Created] (SPARK-26799) Make ANTLR v4 version consistent between Maven and SBT

2019-01-31 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26799:


 Summary: Make ANTLR v4 version consistent between Maven and SBT
 Key: SPARK-26799
 URL: https://issues.apache.org/jira/browse/SPARK-26799
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Currently ANTLR v4 versions used by Maven and SBT are slightly different. Maven 
uses 4.7.1 while SBT uses 4.7.
 * Maven(pom.xml): 4.7.1
 * SBT(project/SparkBuild): antlr4Version in Antlr4 := "4.7"

We should make Maven and SBT use a single version.
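As a rough sketch (assuming Spark's SparkBuild keeps using the sbt-antlr4 plugin setting quoted above), aligning SBT with Maven would be a one-line change:

{code:scala}
// project/SparkBuild.scala sketch: pin the sbt-antlr4 plugin to the same
// ANTLR version that pom.xml declares (4.7.1 at the time of this report).
antlr4Version in Antlr4 := "4.7.1"
{code}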






[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26444:
-
Attachment: failed.png
complete.png
active.png

> Stage color doesn't change with its status
> ---
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
> Attachments: active.png, complete.png, failed.png
>
>
> On the job page, in the event timeline section, the stage color doesn't change
> according to its status. See the attachments for screenshots.
>  






[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26444:
-
Description: 
On the job page, in the event timeline section, the stage color doesn't change
according to its status. See the attachments for screenshots.

 

  was:
On the job page, in the event timeline section, the stage color doesn't change
according to its status. Below are some screenshots.

active:

!image-2018-12-26-16-14-38-958.png!

complete:

!image-2018-12-26-16-15-55-957.png!

failed:

!image-2018-12-26-16-16-11-697.png!

 

 


> Stage color doesn't change with its status
> ---
>
> Key: SPARK-26444
> URL: https://issues.apache.org/jira/browse/SPARK-26444
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> On the job page, in the event timeline section, the stage color doesn't change
> according to its status. See the attachments for screenshots.
>  






[jira] [Created] (SPARK-26444) Stage color doesn't change with its status

2018-12-26 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26444:


 Summary: Stage color doesn't change with its status
 Key: SPARK-26444
 URL: https://issues.apache.org/jira/browse/SPARK-26444
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


On the job page, in the event timeline section, the stage color doesn't change
according to its status. Below are some screenshots.

active:

!image-2018-12-26-16-14-38-958.png!

complete:

!image-2018-12-26-16-15-55-957.png!

failed:

!image-2018-12-26-16-16-11-697.png!

 

 






[jira] [Updated] (SPARK-26440) Show total CPU time across all tasks on stage pages

2018-12-25 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26440:
-
Summary: Show total CPU time across all tasks on stage pages  (was: Display 
total CPU time across all tasks on stage pages)

> Show total CPU time across all tasks on stage pages
> ---
>
> Key: SPARK-26440
> URL: https://issues.apache.org/jira/browse/SPARK-26440
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> Task CPU time was added in 
> [SPARK-12221|https://issues.apache.org/jira/browse/SPARK-12221]. However, 
> total CPU time across all tasks is not displayed on stage pages. This could 
> be used to check whether a stage is CPU intensive or not.






[jira] [Created] (SPARK-26440) Display total CPU time across all tasks on stage pages

2018-12-25 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26440:


 Summary: Display total CPU time across all tasks on stage pages
 Key: SPARK-26440
 URL: https://issues.apache.org/jira/browse/SPARK-26440
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Task CPU time was added in 
[SPARK-12221|https://issues.apache.org/jira/browse/SPARK-12221]. However, total 
CPU time across all tasks is not displayed on stage pages. This could be used 
to check whether a stage is CPU intensive or not.






[jira] [Updated] (SPARK-26279) Remove unused method in Logging

2018-12-05 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26279:
-
Summary: Remove unused method in Logging  (was: Remove unused methods in 
Logging)

> Remove unused method in Logging
> ---
>
> Key: SPARK-26279
> URL: https://issues.apache.org/jira/browse/SPARK-26279
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> The method isTraceEnabled is not used anywhere. We should remove it to avoid 
> confusion.






[jira] [Created] (SPARK-26279) Remove unused methods in Logging

2018-12-05 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26279:


 Summary: Remove unused methods in Logging
 Key: SPARK-26279
 URL: https://issues.apache.org/jira/browse/SPARK-26279
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


The method isTraceEnabled is not used anywhere. We should remove it to avoid 
confusion.
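For context, a minimal, self-contained sketch of the pattern involved (hypothetical code, not Spark's actual Logging trait): a guard helper such as isTraceEnabled only earns its keep if call sites use it, and a by-name logTrace already avoids building messages eagerly.

{code:scala}
import org.slf4j.{Logger, LoggerFactory}

// Hypothetical stand-in for the trait discussed above.
trait SimpleLogging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass)

  // The unused helper in question: nothing calls it, so it only adds confusion.
  protected def isTraceEnabled: Boolean = log.isTraceEnabled

  // The by-name parameter means `msg` is only evaluated when trace is enabled,
  // so callers never need the explicit guard anyway.
  protected def logTrace(msg: => String): Unit = {
    if (log.isTraceEnabled) log.trace(msg)
  }
}
{code}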






[jira] [Created] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26277:


 Summary: WholeStageCodegen metrics should be tested with 
whole-stage codegen enabled
 Key: SPARK-26277
 URL: https://issues.apache.org/jira/browse/SPARK-26277
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
case named "WholeStageCodegen metrics". However, it is executed with 
whole-stage codegen disabled.
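A minimal sketch of the intended fix, assuming the suite's usual helpers (a SparkSession named spark and SQLTestUtils.withSQLConf) are in scope; the metric assertion itself is elided:

{code:scala}
import org.apache.spark.sql.internal.SQLConf

// Sketch only: run the metrics check with whole-stage codegen explicitly ON.
test("WholeStageCodegen metrics (codegen enabled)") {
  withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "true") {
    val df = spark.range(10).filter("id < 5")
    df.collect()
    // ... assert on the WholeStageCodegen node's SQL metrics here ...
  }
}
{code}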






[jira] [Updated] (SPARK-25833) Update migration guide for Hive view compatibility

2018-10-29 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25833:
-
Priority: Minor  (was: Major)

> Update migration guide for Hive view compatibility
> --
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Minor
>
> Views without column names created by Hive are not readable by Spark.
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Updated] (SPARK-25833) Update migration guide for Hive view compatibility

2018-10-29 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25833:
-
Issue Type: Documentation  (was: Bug)

> Update migration guide for Hive view compatibility
> --
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> Views without column names created by Hive are not readable by Spark.
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Updated] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-29 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25833:
-
Description: 
Views without column names created by Hive are not readable by Spark.

A simple example to reproduce this issue.
 create a view via Hive CLI:
{code:sql}
hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
{code}
query that view via Spark
{code:sql}
spark-sql> select * from v1;
Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 pos 
7;
'Project [*]
+- 'SubqueryAlias v1, `default`.`v1`
   +- 'Project ['t1._c0]
  +- SubqueryAlias t1
 +- Project [1 AS 1#41]
+- OneRowRelation$
{code}
Check the view definition:
{code:sql}
hive> desc extended v1;
OK
_c0 int
...
viewOriginalText:SELECT * FROM (SELECT 1) t1, 
viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
...
{code}
_c0 in the above view definition is automatically generated by Hive and is not 
recognizable by Spark.
 see [Hive 
LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
 for more details:
{quote}If no column names are supplied, the names of the view's columns will be 
derived automatically from the defining SELECT expression. (If the SELECT 
contains unaliased scalar expressions such as x+y, the resulting view column 
names will be generated in the form _C0, _C1, etc.)
{quote}

  was:
A simple example to reproduce this issue.
 create a view via Hive CLI:
{code:sql}
hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
{code}
query that view via Spark
{code:sql}
spark-sql> select * from v1;
Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 pos 
7;
'Project [*]
+- 'SubqueryAlias v1, `default`.`v1`
   +- 'Project ['t1._c0]
  +- SubqueryAlias t1
 +- Project [1 AS 1#41]
+- OneRowRelation$
{code}
Check the view definition:
{code:sql}
hive> desc extended v1;
OK
_c0 int
...
viewOriginalText:SELECT * FROM (SELECT 1) t1, 
viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
...
{code}
_c0 in the above view definition is automatically generated by Hive and is not 
recognizable by Spark.
 see [Hive 
LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
 for more details:
{quote}If no column names are supplied, the names of the view's columns will be 
derived automatically from the defining SELECT expression. (If the SELECT 
contains unaliased scalar expressions such as x+y, the resulting view column 
names will be generated in the form _C0, _C1, etc.)
{quote}


> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> Views without column names created by Hive are not readable by Spark.
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Updated] (SPARK-25833) Update migration guide for Hive view compatibility

2018-10-29 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25833:
-
Summary: Update migration guide for Hive view compatibility  (was: Views 
without column names created by Hive are not readable by Spark)

> Update migration guide for Hive view compatibility
> --
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> Views without column names created by Hive are not readable by Spark.
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Commented] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1291#comment-1291
 ] 

Chenxiao Mao commented on SPARK-25833:
--

[~dkbiswal] Thanks for your comments. I think you are right that this is a 
duplicate.

Does it make sense to describe this compatibility issue explicitly in the user 
guide to help users troubleshoot this issue?

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Updated] (SPARK-25833) Views without column names created by Hive are not readable by Spark

2018-10-25 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25833:
-
Summary: Views without column names created by Hive are not readable by 
Spark  (was: Views without column names created by Hive is not readable by 
Spark)

> Views without column names created by Hive are not readable by Spark
> 
>
> Key: SPARK-25833
> URL: https://issues.apache.org/jira/browse/SPARK-25833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> A simple example to reproduce this issue.
>  create a view via Hive CLI:
> {code:sql}
> hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
> {code}
> query that view via Spark
> {code:sql}
> spark-sql> select * from v1;
> Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 
> pos 7;
> 'Project [*]
> +- 'SubqueryAlias v1, `default`.`v1`
>+- 'Project ['t1._c0]
>   +- SubqueryAlias t1
>  +- Project [1 AS 1#41]
> +- OneRowRelation$
> {code}
> Check the view definition:
> {code:sql}
> hive> desc extended v1;
> OK
> _c0   int
> ...
> viewOriginalText:SELECT * FROM (SELECT 1) t1, 
> viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
> ...
> {code}
> _c0 in the above view definition is automatically generated by Hive and is not 
> recognizable by Spark.
>  see [Hive 
> LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
>  for more details:
> {quote}If no column names are supplied, the names of the view's columns will 
> be derived automatically from the defining SELECT expression. (If the SELECT 
> contains unaliased scalar expressions such as x+y, the resulting view column 
> names will be generated in the form _C0, _C1, etc.)
> {quote}






[jira] [Created] (SPARK-25833) Views without column names created by Hive is not readable by Spark

2018-10-25 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25833:


 Summary: Views without column names created by Hive is not 
readable by Spark
 Key: SPARK-25833
 URL: https://issues.apache.org/jira/browse/SPARK-25833
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Chenxiao Mao


A simple example to reproduce this issue.
 create a view via Hive CLI:
{code:sql}
hive> CREATE VIEW v1 AS SELECT * FROM (SELECT 1) t1
{code}
query that view via Spark
{code:sql}
spark-sql> select * from v1;
Error in query: cannot resolve '`t1._c0`' given input columns: [1]; line 1 pos 
7;
'Project [*]
+- 'SubqueryAlias v1, `default`.`v1`
   +- 'Project ['t1._c0]
  +- SubqueryAlias t1
 +- Project [1 AS 1#41]
+- OneRowRelation$
{code}
Check the view definition:
{code:sql}
hive> desc extended v1;
OK
_c0 int
...
viewOriginalText:SELECT * FROM (SELECT 1) t1, 
viewExpandedText:SELECT `t1`.`_c0` FROM (SELECT 1) `t1`
...
{code}
_c0 in the above view definition is automatically generated by Hive and is not 
recognizable by Spark.
 see [Hive 
LanguageManual|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30746446&navigatingVersions=true#LanguageManualDDL-CreateView]
 for more details:
{quote}If no column names are supplied, the names of the view's columns will be 
derived automatically from the defining SELECT expression. (If the SELECT 
contains unaliased scalar expressions such as x+y, the resulting view column 
names will be generated in the form _C0, _C1, etc.)
{quote}
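Per the Hive manual text quoted above, the problem can be avoided at view-creation time by aliasing the scalar expressions in the defining SELECT so Hive never generates _c0-style names. A minimal sketch, assuming a Hive-enabled SparkSession named spark (the equivalent DDL can also be issued from the Hive CLI); this is an illustration of the workaround, not a tested fix:

{code:scala}
// Sketch only: alias the inner expression so the view definition references
// a real column name (c1) instead of a generated _c0.
spark.sql("CREATE VIEW v1_named AS SELECT * FROM (SELECT 1 AS c1) t1")
spark.sql("SELECT * FROM v1_named").show()
{code}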






[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-23 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25797:
-
Description: 
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
{code:sql}
create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
{code}

Query views via Spark 2.3
{code:sql}
select * from v1;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
{code}

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
{code:sql}
spark-sql> desc extended v1;
c1 decimal(19,0) NULL
Detailed Table Information
Database default
Table v1
Type VIEW
View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0
{code}

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
{code:sql}
create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
create view v2 as
select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
create view v3 as
select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
create view v4 as
select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
create view v5 as
select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
create view v6 as
select cast(1 as decimal(18,0)) c1
union
select cast(1 as decimal(19,0)) c1;
{code}

Query views via Spark 2.3
{code:sql}
select * from v1;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
select * from v2;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
select * from v3;
1
select * from v4;
1
select * from v5;
0
select * from v6;
1
{code}

Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not 
generate expanded text for views 
(https://issues.apache.org/jira/browse/SPARK-18209).
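For reference, the widening described above can be reproduced directly (a minimal sketch, assuming a SparkSession named spark): adding two decimal(19,0) values yields decimal(20,0), which is exactly what the buggy expanded view text ends up selecting.

{code:scala}
// Sketch only: Spark's decimal addition rule widens precision by one, so the
// expanded text's DECIMAL(19,0) + DECIMAL(19,0) produces DECIMAL(20,0).
val df = spark.sql(
  "SELECT CAST(1 AS DECIMAL(19,0)) + CAST(1 AS DECIMAL(19,0)) AS c1")
df.printSchema()  // c1: decimal(20,0) -- the up-cast to decimal(19,0) is what fails
{code}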

  was:
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
 Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not 
generate expanded text for views 
(https://issues.apache.org/jira/browse/SPARK-18209).


> Views created via 2.1 cannot be read via 2.2+
> --

[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-23 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25797:
-
Description: 
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
 Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 Views created via Spark 2.2+ don't have this issue because Spark 2.2+ does not 
generate expanded text for views 
(https://issues.apache.org/jira/browse/SPARK-18209).

  was:
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 


> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug

[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-22 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25797:
-
Description: 
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 

  was:
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
 # Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 


> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> We

[jira] [Updated] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-22 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25797:
-
Description: 
We ran into this issue when we update our Spark from 2.1 to 2.3. Below's a 
simple example to reproduce the issue.

Create views via Spark 2.1
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

Query views via Spark 2.3
|{{select * from v1;}}
 {{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
 c1 decimal(19,0) NULL
 # Detailed Table Information
 Database default
 Table v1
 Type VIEW
 View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0), however in the expanded text there is 
decimal(19,0) + decimal(19,0) which results in decimal(20,0). Since Spark 2.2, 
decimal(20,0) in query is not allowed to cast to view definition column 
decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
 select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
 create view v2 as
 select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
 create view v3 as
 select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
 create view v4 as
 select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
 create view v5 as
 select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
 create view v6 as
 select cast(1 as decimal(18,0)) c1
 union
 select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
 select * from v2;
 Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
 select * from v3;
 1
 select * from v4;
 1
 select * from v5;
 0
 select * from v6;
 1|

 

  was:
We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1

 
|create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

 

Query views via Spark 2.3
|{{select * from v1;}}
{{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

 

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
c1 decimal(19,0) NULL
# Detailed Table Information
Database default
Table v1
Type VIEW
View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0); however, the expanded text contains 
decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, 
a decimal(20,0) in the query is not allowed to be cast to the view definition 
column decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
create view v2 as
select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
create view v3 as
select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
create view v4 as
select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
create view v5 as
select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
create view v6 as
select cast(1 as decimal(18,0)) c1
union
select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
select * from v2;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
select * from v3;
1
select * from v4;
1
select * from v5;
0
select * from v6;
1|

 


> Views created via 2.1 cannot be read via 2.2+
> -
>
> Key: SPARK-25797
> URL: https://issues.apache.org/jira/browse/SPARK-25797
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Chenxiao Mao
>Priority: Major
>
> We ran into this issue

[jira] [Created] (SPARK-25797) Views created via 2.1 cannot be read via 2.2+

2018-10-22 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25797:


 Summary: Views created via 2.1 cannot be read via 2.2+
 Key: SPARK-25797
 URL: https://issues.apache.org/jira/browse/SPARK-25797
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0
Reporter: Chenxiao Mao


We ran into this issue when we updated our Spark from 2.1 to 2.3. Below is a 
simple example to reproduce the issue.

Create views via Spark 2.1

 
|create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;|

 

Query views via Spark 2.3
|{{select * from v1;}}
{{Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate}}|

 

After investigation, we found that this is because when a view is created via 
Spark 2.1, the expanded text is saved instead of the original text. 
Unfortunately, the expanded text below is buggy.
|spark-sql> desc extended v1;
c1 decimal(19,0) NULL
# Detailed Table Information
Database default
Table v1
Type VIEW
View Text SELECT `gen_attr_0` AS `c1` FROM (SELECT (CAST(CAST(1 AS 
DECIMAL(18,0)) AS DECIMAL(19,0)) + CAST(CAST(1 AS DECIMAL(18,0)) AS 
DECIMAL(19,0))) AS `gen_attr_0`) AS gen_subquery_0|

We can see that c1 is decimal(19,0); however, the expanded text contains 
decimal(19,0) + decimal(19,0), which results in decimal(20,0). Since Spark 2.2, 
a decimal(20,0) in the query is not allowed to be cast to the view definition 
column decimal(19,0). ([https://github.com/apache/spark/pull/16561])

I further tested other decimal calculations. Only add/subtract has this issue.

Create views via 2.1:
|create view v1 as
select (cast(1 as decimal(18,0)) + cast(1 as decimal(18,0))) c1;
create view v2 as
select (cast(1 as decimal(18,0)) - cast(1 as decimal(18,0))) c1;
create view v3 as
select (cast(1 as decimal(18,0)) * cast(1 as decimal(18,0))) c1;
create view v4 as
select (cast(1 as decimal(18,0)) / cast(1 as decimal(18,0))) c1;
create view v5 as
select (cast(1 as decimal(18,0)) % cast(1 as decimal(18,0))) c1;
create view v6 as
select cast(1 as decimal(18,0)) c1
union
select cast(1 as decimal(19,0)) c1;|

Query views via Spark 2.3
|select * from v1;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3906: 
decimal(19,0) as it may truncate
select * from v2;
Error in query: Cannot up cast `c1` from decimal(20,0) to c1#3909: 
decimal(19,0) as it may truncate
select * from v3;
1
select * from v4;
1
select * from v5;
0
select * from v6;
1|

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20885) JDBC predicate pushdown uses hardcoded date format

2018-09-21 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16623308#comment-16623308
 ] 

Chenxiao Mao commented on SPARK-20885:
--

[~phalverson] To avoid this issue, you may try this option 
(sessionInitStatement) for Oracle.

{{spark.read.format("jdbc").option("sessionInitStatement", "ALTER SESSION SET 
NLS_DATE_FORMAT = '-MM-DD'")}}
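A slightly fuller sketch (Scala, spark-shell) of how that option fits into a JDBC read, in case it helps; the URL, table name, and credentials below are placeholders rather than values from this report:

{code:java}
// Sketch only: connection details are hypothetical placeholders.
// sessionInitStatement is executed once per JDBC session before data is read,
// so Oracle will parse the date literals (e.g. '2016-06-03') that Spark's
// pushed-down filter embeds in the generated SQL.
import spark.implicits._

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")  // placeholder
  .option("dbtable", "MYSCHEMA.POSTINGS")                    // placeholder
  .option("user", "username")                                // placeholder
  .option("password", "password")                            // placeholder
  .option("sessionInitStatement",
    "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'")
  .load()

val postingDate = java.sql.Date.valueOf("2016-06-03")
val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count
{code}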

> JDBC predicate pushdown uses hardcoded date format
> --
>
> Key: SPARK-20885
> URL: https://issues.apache.org/jira/browse/SPARK-20885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Peter Halverson
>Priority: Minor
>
> If a date literal is used in a pushed-down filter expression, e.g.
> {code}
> val postingDate = java.sql.Date.valueOf("2016-06-03")
> val count = jdbcDF.filter($"POSTINGDATE" === postingDate).count
> {code}
> where the {{POSTINGDATE}} column is of JDBC type Date, the resulting 
> pushed-down SQL query looks like the following:
> {code}
> SELECT .. <columns> ... FROM <table> WHERE POSTINGDATE = '2016-06-03'
> {code}
> Specifically, the date is compiled into a string literal using the hardcoded 
> yyyy-MM-dd format that {{java.sql.Date.toString}} emits. Note the implied 
> string conversion for date (and timestamp) values in {{JDBCRDD.compileValue}}
> {code}
>   /**
>* Converts value to SQL expression.
>*/
>   private def compileValue(value: Any): Any = value match {
> case stringValue: String => s"'${escapeSql(stringValue)}'"
> case timestampValue: Timestamp => "'" + timestampValue + "'"
> case dateValue: Date => "'" + dateValue + "'"
> case arrayValue: Array[Any] => arrayValue.map(compileValue).mkString(", ")
> case _ => value
>   }
> {code}
> The resulting query fails if the database is expecting a different format for 
> date string literals. For example, the default format for Oracle is 
> 'dd-MMM-yy', so when the relation query is executed, it fails with a syntax 
> error.
> {code}
> ORA-01861: literal does not match format string
> 01861. 0 -  "literal does not match format string"
> {code}
> In some situations it may be possible to change the database's expected date 
> format to match the Java format, but this is not always possible (e.g. 
> reading from an external database server)
> Shouldn't this kind of conversion be going through some kind of vendor 
> specific translation (e.g. through a {{JDBCDialect}})?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25489) Refactor UDTSerializationBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622325#comment-16622325
 ] 

Chenxiao Mao commented on SPARK-25489:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22499

> Refactor UDTSerializationBenchmark
> --
>
> Key: SPARK-25489
> URL: https://issues.apache.org/jira/browse/SPARK-25489
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> Refactor UDTSerializationBenchmark to use main method and print the output as 
> a separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25487) Refactor PrimitiveArrayBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622319#comment-16622319
 ] 

Chenxiao Mao commented on SPARK-25487:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22497

> Refactor PrimitiveArrayBenchmark
> 
>
> Key: SPARK-25487
> URL: https://issues.apache.org/jira/browse/SPARK-25487
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> Refactor PrimitiveArrayBenchmark to use main method and print the output as a 
> separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25490) Refactor KryoBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25490:


 Summary: Refactor KryoBenchmark
 Key: SPARK-25490
 URL: https://issues.apache.org/jira/browse/SPARK-25490
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Refactor KryoBenchmark to use main method and print the output as a separate 
file.
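For context, a rough sketch of the main-method pattern these benchmark refactors move to; the object name, output file, and timing logic below are illustrative only, not the actual Spark benchmark framework:

{code:java}
// Illustrative sketch only: the real refactor uses Spark's benchmark utilities;
// this just shows the shape of "runnable main() + results written to a separate file".
import java.io.{File, FileOutputStream, PrintStream}

object KryoBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical output file; the actual name/location follows the project convention.
    val out = new PrintStream(new FileOutputStream(new File("KryoBenchmark-results.txt")))
    try {
      out.println("Kryo serialization benchmark")
      val start = System.nanoTime()
      // ... run the serialization workload here ...
      val elapsedMs = (System.nanoTime() - start) / 1e6
      out.println(f"basic types: $elapsedMs%.1f ms")
    } finally {
      out.close()
    }
  }
}
{code}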



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25489) Refactor UDTSerializationBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25489:


 Summary: Refactor UDTSerializationBenchmark
 Key: SPARK-25489
 URL: https://issues.apache.org/jira/browse/SPARK-25489
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Refactor UDTSerializationBenchmark to use main method and print the output as a 
separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25487) Refactor PrimitiveArrayBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25487:


 Summary: Refactor PrimitiveArrayBenchmark
 Key: SPARK-25487
 URL: https://issues.apache.org/jira/browse/SPARK-25487
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Refactor PrimitiveArrayBenchmark to use main method and print the output as a 
separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25484) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark

2018-09-20 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25484:


 Summary: Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark
 Key: SPARK-25484
 URL: https://issues.apache.org/jira/browse/SPARK-25484
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to print the output as a 
separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

2018-09-18 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620058#comment-16620058
 ] 

Chenxiao Mao commented on SPARK-25453:
--

User 'seancxmao' has created a pull request for this issue:
[https://github.com/apache/spark/pull/22461]

> OracleIntegrationSuite IllegalArgumentException: Timestamp format must be 
> yyyy-mm-dd hh:mm:ss[.fffffffff]
> -
>
> Key: SPARK-25453
> URL: https://issues.apache.org/jira/browse/SPARK-25453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED ***
>   java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>   at java.sql.Timestamp.valueOf(Timestamp.java:204)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427)
>   ...{noformat}
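For background, a minimal illustration (not taken from the suite) of why {{java.sql.Timestamp.valueOf}} rejects such inputs:

{code:java}
import java.sql.Timestamp

// Accepted: the JDBC timestamp escape format yyyy-mm-dd hh:mm:ss[.fffffffff]
val ok = Timestamp.valueOf("2018-07-06 05:50:00")

// Rejected: a date-only string (illustrative value, not the suite's actual input)
try {
  Timestamp.valueOf("2018-07-06")
} catch {
  case e: IllegalArgumentException => println(e.getMessage)
  // Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
}
{code}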



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25453) OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

2018-09-18 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619165#comment-16619165
 ] 

Chenxiao Mao commented on SPARK-25453:
--

I'm working on this. cc [~maropu] [~yumwang]

> OracleIntegrationSuite IllegalArgumentException: Timestamp format must be 
> yyyy-mm-dd hh:mm:ss[.fffffffff]
> -
>
> Key: SPARK-25453
> URL: https://issues.apache.org/jira/browse/SPARK-25453
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> - SPARK-22814 support date/timestamp types in partitionColumn *** FAILED ***
>   java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>   at java.sql.Timestamp.valueOf(Timestamp.java:204)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:183)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:445)
>   at 
> org.apache.spark.sql.jdbc.OracleIntegrationSuite$$anonfun$18.apply(OracleIntegrationSuite.scala:427)
>   ...{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-16 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao closed SPARK-25391.


> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> parquet data source tables and hive parquet tables have different behaviors 
> about parquet field resolution. So, when 
> {{spark.sql.hive.convertMetastoreParquet}} is true, users might face 
> inconsistent behaviors. The differences are:
>  * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both 
> data source tables and hive tables do NOT respect 
> {{spark.sql.caseSensitive}}. However data source tables always do 
> case-sensitive parquet field resolution, while hive tables always do 
> case-insensitive parquet field resolution no matter whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data 
> source tables respect {{spark.sql.caseSensitive}} while hive serde table 
> behavior is not changed.
>  * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, 
> data source tables do case-sensitive resolution and return columns with the 
> corresponding letter cases, while hive tables always return the first matched 
> column ignoring cases. SPARK-25132 let data source tables throw exception 
> when there is ambiguity while hive table behavior is not changed.
> This ticket aims to make behaviors consistent when converting hive table to 
> data source table.
>  * The behavior must be consistent to do the conversion, so we skip the 
> conversion in case-sensitive mode because hive parquet table always do 
> case-insensitive field resolution.
>  * In case-insensitive mode, when converting hive parquet table to parquet 
> data source, we switch the duplicated fields resolution mode to ask parquet 
> data source to pick the first matched field - the same behavior as hive 
> parquet table - to keep behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-15 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao resolved SPARK-25391.
--
Resolution: Won't Do

> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> parquet data source tables and hive parquet tables have different behaviors 
> about parquet field resolution. So, when 
> {{spark.sql.hive.convertMetastoreParquet}} is true, users might face 
> inconsistent behaviors. The differences are:
>  * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both 
> data source tables and hive tables do NOT respect 
> {{spark.sql.caseSensitive}}. However data source tables always do 
> case-sensitive parquet field resolution, while hive tables always do 
> case-insensitive parquet field resolution no matter whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data 
> source tables respect {{spark.sql.caseSensitive}} while hive serde table 
> behavior is not changed.
>  * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, 
> data source tables do case-sensitive resolution and return columns with the 
> corresponding letter cases, while hive tables always return the first matched 
> column ignoring cases. SPARK-25132 let data source tables throw exception 
> when there is ambiguity while hive table behavior is not changed.
> This ticket aims to make behaviors consistent when converting hive table to 
> data source table.
>  * The behavior must be consistent to do the conversion, so we skip the 
> conversion in case-sensitive mode because hive parquet table always do 
> case-insensitive field resolution.
>  * In case-insensitive mode, when converting hive parquet table to parquet 
> data source, we switch the duplicated fields resolution mode to ask parquet 
> data source to pick the first matched field - the same behavior as hive 
> parquet table - to keep behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-09 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25391:


 Summary: Make behaviors consistent when converting parquet hive 
table to parquet data source
 Key: SPARK-25391
 URL: https://issues.apache.org/jira/browse/SPARK-25391
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


Parquet data source tables and hive parquet tables have different behaviors 
regarding parquet field resolution. So, when 
{{spark.sql.hive.convertMetastoreParquet}} is true, users might face 
inconsistent behaviors (a reproduction sketch follows below). The differences 
are:
 * Whether to respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both 
data source tables and hive tables do NOT respect {{spark.sql.caseSensitive}}. 
However, data source tables always do case-sensitive parquet field resolution, 
while hive tables always do case-insensitive parquet field resolution, no 
matter whether {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 
lets data source tables respect {{spark.sql.caseSensitive}}, while hive serde 
table behavior is not changed.
 * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, data 
source tables do case-sensitive resolution and return columns with the 
corresponding letter cases, while hive tables always return the first matched 
column, ignoring cases. SPARK-25132 lets data source tables throw an exception 
when there is ambiguity, while hive table behavior is not changed.

This ticket aims to make behaviors consistent when converting a hive table to a 
data source table.
 * The behavior must be consistent to do the conversion, so we skip the 
conversion in case-sensitive mode because hive parquet tables always do 
case-insensitive field resolution.
 * In case-insensitive mode, when converting a hive parquet table to the 
parquet data source, we switch the duplicated-fields resolution mode to ask the 
parquet data source to pick the first matched field - the same behavior as a 
hive parquet table - to keep behaviors consistent.
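A reproduction sketch (Scala, spark-shell); the path and table names are illustrative, mirroring the ORC investigation elsewhere in this thread:

{code:java}
// Write a parquet file whose schema has two fields differing only by case,
// then expose it both as a data source table and as a hive parquet table.
// Path and table names are placeholders.
val data = spark.range(5).selectExpr("id as a", "id * 2 as A")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_dup")

sql("CREATE TABLE parquet_ds (a LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_dup'")
sql("CREATE TABLE parquet_hive (a LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_dup'")

spark.conf.set("spark.sql.caseSensitive", false)
// Per the description above: after SPARK-25132 the data source table fails on the
// ambiguous field, while with spark.sql.hive.convertMetastoreParquet=false the hive
// serde path resolves case-insensitively and returns the first matched column.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
sql("SELECT a FROM parquet_ds").show()
sql("SELECT a FROM parquet_hive").show()
{code}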



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there's ambiguity for ORC native 
reader  (was: Field resolution should fail if there is ambiguity for ORC data 
source native implementation)

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595240#comment-16595240
 ] 

Chenxiao Mao commented on SPARK-25175:
--

[~cloud_fan] Does it make sense?

> Field resolution should fail if there is ambiguity for ORC data source native 
> implementation
> 
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity for ORC data source native implementation

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there is ambiguity for ORC data 
source native implementation  (was: Field resolution should fail if there is 
ambiguity in case-insensitive mode when reading from ORC)

> Field resolution should fail if there is ambiguity for ORC data source native 
> implementation
> 
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595225#comment-16595225
 ] 

Chenxiao Mao commented on SPARK-25175:
--

After a deep dive into the ORC file read paths (data source native, data 
source hive, hive serde), I realized that this is a little complicated. I'm not 
sure whether it's technically possible to make all three read paths consistent 
with respect to case sensitivity, because we rely on the hive InputFormat/SerDe, 
which we might not be able to change.

Please also see [~cloud_fan]'s comment on Parquet: 
[https://github.com/apache/spark/pull/22184/files#r212849852]

So I changed the title of this JIRA to reduce the scope. This ticket aims to 
make the ORC data source native impl consistent with the Parquet data source. 
The gap is that field resolution should fail if there is ambiguity in 
case-insensitive mode when reading from ORC. Does that make sense?

As for duplicate fields with different letter cases, we don't have real use 
cases; they are just for testing purposes.

 

> Field resolution should fail if there is ambiguity in case-insensitive mode 
> when reading from ORC
> -
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there is ambiguity in 
case-insensitive mode when reading from ORC  (was: Case-insensitive field 
resolution when reading from ORC)

> Field resolution should fail if there is ambiguity in case-insensitive mode 
> when reading from ORC
> -
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 4:00 PM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables with hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter whether the 
metastore table schema is in lower case or upper case. They always do 
case-insensitive field resolution, and if there is ambiguity they return the 
first matched one. Given that the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather 
than always returning the lower case one?

 * The data source tables with native impl, compared to the hive impl, handle 
scenarios 2, 10, 12 in a more reasonable way. However, they handle ambiguity in 
the same way as the hive impl, which is not consistent with the Parquet data 
source.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
the ORC file schema has more fields than the table schema. If the ORC schema 
does NOT have more fields, hive serde tables do field resolution by ordinal 
rather than by name.
 * Since in case-sensitive mode analysis should fail if a column name in the 
query and the metastore schema are in different cases, all AnalysisException(s) 
are reasonable.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief s

[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues, though not 
identical to Parquet's. Spark has two OrcFileFormat implementations.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field 
resolution regardless of the case sensitivity mode. When there is ambiguity, 
hive OrcFileFormat always returns the first matched field, rather than failing 
the reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution; however, it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. If the ORC data 
file has more fields than the table schema, we simply can't read hive serde 
tables. If the ORC data file does not have more fields, hive serde tables 
always do field resolution by ordinal, rather than by name.

Both the ORC data source hive impl and hive serde tables rely on the hive orc 
InputFormat/SerDe to read tables. I'm not sure whether we can change the 
underlying hive classes to make all orc read behaviors consistent.

This ticket aims to make the read behavior of the ORC data source native impl 
consistent with the Parquet data source.
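For reference, a small sketch of how the read paths discussed above are selected; the table name is a placeholder:

{code:java}
// Data source tables: choose between the native (sql/core, SPARK-20682) and
// hive (sql/hive, SPARK-2883) OrcFileFormat implementations.
spark.conf.set("spark.sql.orc.impl", "native")   // or "hive"

// Hive serde ORC tables: control whether reads are converted to the data source
// path or stay on the hive orc InputFormat/SerDe path.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)

sql("SELECT * FROM some_orc_table").show()       // hypothetical table
{code}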

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always do case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there are 
duplicate fields (e.g. c, C), we just can't read hive serde tables. If there 
are no duplicate fields, hive serde tables always do case-insensitive field 
resolution regardless of case sensitivity mode.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:27 PM:
---

Investigation of ORC tables with duplicate fields (c and C), where the data 
file also has more fields than the table schema.
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
 (select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |

Followup tests that use ORC files with no duplicate fields (only a,B).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct<a:bigint,B:bigint>

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'

DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column
 (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException |
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|

Tests show that for hive serde tables field resolution is by ordinal, not by 
name.

{code}
spark.conf.set("spark.sql.caseSensitive", true)
spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
val data = spark.range(1).selectExpr("id + 1 as x", "id + 2 as y", "id + 3 as 
z")
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_xyz")
sql("CREATE TABLE orc_table_ABC (A LONG, B LONG, C LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_xyz'")
sql("select B from orc_table_ABC").show
+---+
|  B|
+---+
|  2|
+---+
{code}



was (Author: seancxmao):
Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
v

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-28 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/28/18 3:22 PM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables with hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter whether the 
metastore table schema is in lower case or upper case. They always do 
case-insensitive field resolution, and if there is ambiguity they return the 
first matched one. Given that the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, 24, rather 
than always returning the lower case one?

 * The data source tables with native impl, compared to the hive impl, handle 
scenarios 2, 10, 12 in a more reasonable way. However, they handle ambiguity in 
the same way as the hive impl.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
the ORC schema has more fields than the table schema. If the ORC schema does 
NOT have more fields, hive serde tables do field resolution by ordinal rather 
than by name.
 * Since in case-sensitive mode analysis should fail if a column name in the 
query and the metastore schema are in different cases, all AnalysisException(s) 
are reasonable.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.<init>(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.<init>(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.<init>(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables with h

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593200#comment-16593200
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 3:53 PM:
---

Also here is similar investigation I did for parquet tables. Just for your 
information: [https://github.com/apache/spark/pull/22184/files#r212405373]


was (Author: seancxmao):
Also here is the similar investigation I did for parquet tables. Just for your 
information: https://github.com/apache/spark/pull/22184/files#r212405373

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there 
> are duplicate fields (e.g. c, C), we just can't read hive serde tables. If 
> there are no duplicate fields, hive serde tables always do case-insensitive 
> field resolution regardless of case sensitivity mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has two 
OrcFileFormat implementations, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside the sql/hive module with a Hive 
dependency. This hive OrcFileFormat always does case-insensitive field 
resolution regardless of the case sensitivity mode. When there is ambiguity, 
hive OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution; however, it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there are 
duplicate fields (e.g. c, C), we just can't read hive serde tables. If there 
are no duplicate fields, hive serde tables always do case-insensitive field 
resolution regardless of the case sensitivity mode.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always do case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there is 
ambiguity, we just can't read hive serde tables.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there 
> are duplicate fields (e.g. c, C), we just can't read hive serde tables. If 
> there are no duplicate fields, hive serde tables always do case-insensitive 
> field resolution regardless of case sensitivity mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:50 AM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that:
 * The data source tables with hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter whether the 
metastore table schema is in lower case or upper case. It seems they always do 
case-insensitive field resolution, and if there is ambiguity they return the 
lower case ones. Given that the ORC file schema is (a,B,c,C):
 ** Is it better to return null in scenarios 2 and 10?
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenarios 15, 18, 21, and 24, 
rather than always returning the lower case one? (see the sketch below)

 * The data source tables with native impl, compared to hive impl, handle 
scenarios 2, 10, and 12 in a more reasonable way. However, they handle 
ambiguity in the same way as the hive impl.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime when 
there are duplicate fields (e.g. c, C). If there are no duplicate fields, it 
seems hive serde tables always do case-insensitive field resolution, just like 
the hive implementation of OrcFileFormat.
 * Since, in case-sensitive mode, analysis should fail if a column name in the 
query and the metastore schema are in different cases, all AnalysisException(s) 
meet our expectation.
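
To make the resolution rule suggested by those questions concrete, here is a 
minimal sketch of matching a requested column against the physical ORC schema 
case-insensitively while failing on ambiguity, instead of silently picking the 
lower case field. This is illustrative only; the helper name and signature are 
hypothetical and this is not Spark's actual implementation.
{code:java}
// Sketch only (hypothetical helper, not Spark internals).
def resolveField(
    requested: String,
    physicalFields: Seq[String],
    caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    // Exact match only; a case mismatch simply does not resolve.
    physicalFields.find(_ == requested)
  } else {
    physicalFields.filter(_.equalsIgnoreCase(requested)) match {
      case Seq(unique) => Some(unique) // unambiguous case-insensitive match
      case Seq()       => None         // no such field in the file
      case duplicates  =>
        // e.g. requesting "c" against (a,B,c,C) would fail here instead of
        // always returning the lower case "c" (scenarios 15, 18, 21, 24).
        throw new RuntimeException(
          s"Found duplicate field(s) $requested: ${duplicates.mkString(", ")} " +
            "in case-insensitive mode")
    }
  }
}
{code}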

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:33 AM:
---

A thorough investigation of ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct<a:bigint,B:bigint,c:bigint,C:bigint>

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
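
The scenario table below was produced by selecting single columns from the four 
tables defined above. The exact commands are not recorded in this comment, so 
the following is only a sketch of the kind of queries involved; in particular, 
switching between the hive and native readers via spark.sql.orc.impl is an 
assumption made here for illustration.
{code:java}
// Illustrative queries only (not the exact commands behind the table below).
spark.conf.set("spark.sql.caseSensitive", false)

// Data source tables: toggle the ORC reader implementation.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.sql("SELECT b FROM orc_data_source_lower").show()   // "hive impl" column

spark.conf.set("spark.sql.orc.impl", "native")
spark.sql("SELECT b FROM orc_data_source_lower").show()   // "native impl" column

// Hive serde tables: convertMetastoreOrc=false (set above) forces the serde read path.
spark.sql("SELECT b FROM orc_hive_serde_lower").show()
{code}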
 
||no.||caseSensitive||table columns||select column||orc column (select via data source table, hive impl)||orc column (select via data source table, native impl)||orc column (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |

Follow-up tests that use ORC files with no duplicate fields (only a, B).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data_nodup")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data_nodup/part-1-4befd318-9ed5-4d77-b51b-09848d71d9cd-c000.snappy.orc
Type: struct<a:bigint,B:bigint>

CREATE TABLE orc_nodup_hive_serde_lower (a LONG, b LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'
CREATE TABLE orc_nodup_hive_serde_upper (A LONG, B LONG) STORED AS orc LOCATION 
'/user/hive/warehouse/orc_data_nodup'

DESC EXTENDED orc_nodup_hive_serde_lower;
DESC EXTENDED orc_nodup_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
||no.||caseSensitive||table columns||select column||orc column (select via hive serde table)||
|1|true|a, b|a|a|
|2| | |b|B|
|4| | |A|AnalysisException|
|5| | |B|AnalysisException|
|7| |A, B|a|AnalysisException|
|8| | |b|AnalysisException |
|10| | |A|a|
|11| | |B|B|
|13|false|a, b|a|a|
|14| | |b|B|
|16| | |A|a|
|17| | |B|B|
|19| |A, B|a|a|
|20| | |b|B|
|22| | |A|a|
|23| | |B|B|


was (Author: seancxmao):
Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 7:24 AM:
---

Thorough investigation about ORC tables with duplicate fields (c and C).
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
 (select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |


was (Author: seancxmao):
Thorough investigation about ORC tables
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
(select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfB

[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always do case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there is 
ambiguity, we just can't read hive serde tables.

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always do case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there is 
> ambiguity, we just can't read hive serde tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-27 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Description: 
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
 * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat always do case-insensitive field resolution 
regardless of case sensitivity mode. When there is ambiguity, hive 
OrcFileFormat always returns the lower case field, rather than failing the 
reading operation.
 * SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.

Besides data source tables, hive serde tables also have issues. When there i

  was:
SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat doesn't support case-insensitive field 
resolution at all.
* SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.


> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always do case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the lower case field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. When there i



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 6:56 AM:
---

[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables with hive impl always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter metastore table 
schema is in lower case or upper case. Given ORC file schema is (a,B,c,C)
 ** Is it better to return null in scenario 2 and 10? 
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenario 15, 18, 21, 24, rather 
than always return lower case one?

 * The data source tables with native impl, compared to hive impl, handle 
scenario 2, 10, 12 in a more reasonable way. However, they handles ambiguity in 
the same way as hive impl.
 * The hive serde tables always throw IndexOutOfBoundsException at runtime.
 * Since in case-sensitive mode analysis should fail if a column name in query 
and metastore schema are in different cases, all AnalysisException(s) meet our 
expectation.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 


was (Author: seancxmao):
[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter metastore table 
schema is in lower case or upper case. Since ORC file schema is (a,B,c,C),
 ** Is it better to return null in scenario 2 and 10? 
 ** Is it better to return 

[jira] [Comment Edited] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao edited comment on SPARK-25175 at 8/27/18 6:45 AM:
---

Thorough investigation about ORC tables
{code:java}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table, hive impl)||orc column
(select via data source table, native impl)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |a|IndexOutOfBoundsException |
|2| | |b|B |null|IndexOutOfBoundsException |
|3| | |c|c |c|IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|AnalysisException|
|8| | |b|AnalysisException |AnalysisException|AnalysisException |
|9| | |c|AnalysisException |AnalysisException|AnalysisException |
|10| | |A|a |null|IndexOutOfBoundsException |
|11| | |B|B |B|IndexOutOfBoundsException |
|12| | |C|c |C|IndexOutOfBoundsException |
|13|false|a, b, c|a|a |a|IndexOutOfBoundsException |
|14| | |b|B |B|IndexOutOfBoundsException |
|15| | |c|c |c|IndexOutOfBoundsException |
|16| | |A|a |a|IndexOutOfBoundsException |
|17| | |B|B |B|IndexOutOfBoundsException |
|18| | |C|c |c|IndexOutOfBoundsException |
|19| |A, B, C|a|a |a|IndexOutOfBoundsException |
|20| | |b|B |B|IndexOutOfBoundsException |
|21| | |c|c |c|IndexOutOfBoundsException |
|22| | |A|a |a|IndexOutOfBoundsException |
|23| | |B|B |B|IndexOutOfBoundsException |
|24| | |C|c |c|IndexOutOfBoundsException |


was (Author: seancxmao):
Thorough investigation about ORC tables
{code}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |IndexOutOfBoundsException |
|2| | |b|B |IndexOutOfBoundsException |
|3| | |c|c |IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|
|8| | |b|AnalysisException |AnalysisException |
|9| | |c|AnalysisException |AnalysisException |
|10| | |A|a |IndexOutOfBoundsException |
|11| | |B|B |IndexOutOfBoundsException |
|12| | |C|c |IndexOutOfBoundsException |
|13|false|a, b, c|a|a |IndexOutOfBoundsException |
|14| | |b|B |IndexOutOfBoundsException |
|15| | |c|c |IndexOutOfBoundsException |
|16| | |A|a |IndexOutOfBoundsException |
|17| | |B|B |IndexOutOfBoundsException |

[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593200#comment-16593200
 ] 

Chenxiao Mao commented on SPARK-25175:
--

Also here is the similar investigation I did for parquet tables. Just for your 
information: https://github.com/apache/spark/pull/22184/files#r212405373

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao reopened SPARK-25175:
--

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593194#comment-16593194
 ] 

Chenxiao Mao commented on SPARK-25175:
--

[~dongjoon] [~yucai] Here is a brief summary. We can see that
 * The data source tables always return a,B,c, no matter whether 
spark.sql.caseSensitive is set to true or false and no matter metastore table 
schema is in lower case or upper case. Since ORC file schema is (a,B,c,C),
 ** Is it better to return null in scenario 2 and 10? 
 ** Is it better to return C in scenario 12?
 ** Is it better to fail due to ambiguity in scenario 15, 18, 21, 24, rather 
than always return lower case one?
 * The hive serde tables always throw IndexOutOfBoundsException at runtime.
 * Since in case-sensitive mode analysis should fail if a column name in query 
and metastore schema are in different cases, all AnalysisException(s) meet our 
expectation.

Stacktrace of IndexOutOfBoundsException:
{code:java}
java.lang.IndexOutOfBoundsException: toIndex = 4
at java.util.ArrayList.subListRangeCheck(ArrayList.java:1004)
at java.util.ArrayList.subList(ArrayList.java:996)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
at 
org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
 

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, S

[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-26 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593185#comment-16593185
 ] 

Chenxiao Mao commented on SPARK-25175:
--

Thorough investigation about ORC tables
{code}
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", 
"id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)
data.write.format("orc").mode("overwrite").save("/user/hive/warehouse/orc_data")

$> hive --orcfiledump 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Structure for 
/user/hive/warehouse/orc_data/part-1-9716d241-9ad9-4d56-8de3-7bc482067614-c000.snappy.orc
Type: struct

CREATE TABLE orc_data_source_lower (a LONG, b LONG, c LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_data_source_upper (A LONG, B LONG, C LONG) USING orc LOCATION 
'/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'
CREATE TABLE orc_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS orc 
LOCATION '/user/hive/warehouse/orc_data'

DESC EXTENDED orc_data_source_lower;
DESC EXTENDED orc_data_source_upper;
DESC EXTENDED orc_hive_serde_lower;
DESC EXTENDED orc_hive_serde_upper;

spark.conf.set("spark.sql.hive.convertMetastoreOrc", false)
{code}
 
||no.||caseSensitive||table columns||select column||orc column
 (select via data source table)||orc column
 (select via hive serde table)||
|1|true|a, b, c|a|a |IndexOutOfBoundsException |
|2| | |b|B |IndexOutOfBoundsException |
|3| | |c|c |IndexOutOfBoundsException |
|4| | |A|AnalysisException|AnalysisException|
|5| | |B|AnalysisException|AnalysisException|
|6| | |C|AnalysisException|AnalysisException|
|7| |A, B, C|a|AnalysisException |AnalysisException|
|8| | |b|AnalysisException |AnalysisException |
|9| | |c|AnalysisException |AnalysisException |
|10| | |A|a |IndexOutOfBoundsException |
|11| | |B|B |IndexOutOfBoundsException |
|12| | |C|c |IndexOutOfBoundsException |
|13|false|a, b, c|a|a |IndexOutOfBoundsException |
|14| | |b|B |IndexOutOfBoundsException |
|15| | |c|c |IndexOutOfBoundsException |
|16| | |A|a |IndexOutOfBoundsException |
|17| | |B|B |IndexOutOfBoundsException |
|18| | |C|c |IndexOutOfBoundsException |
|19| |A, B, C|a|a |IndexOutOfBoundsException |
|20| | |b|B |IndexOutOfBoundsException |
|21| | |c|c |IndexOutOfBoundsException |
|22| | |A|a |IndexOutOfBoundsException |
|23| | |B|B |IndexOutOfBoundsException |
|24| | |C|c |IndexOutOfBoundsException |

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-21 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587206#comment-16587206
 ] 

Chenxiao Mao commented on SPARK-25175:
--

This is a follow-up of SPARK-25132. I'm working on this.

> Case-insensitive field resolution when reading from ORC
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues. Since Spark has 2 
> OrcFileFormat, we should add support for both.
> * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat doesn't support case-insensitive field 
> resolution at all.
> * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25175) Case-insensitive field resolution when reading from ORC

2018-08-21 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25175:


 Summary: Case-insensitive field resolution when reading from ORC
 Key: SPARK-25175
 URL: https://issues.apache.org/jira/browse/SPARK-25175
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Chenxiao Mao


SPARK-25132 adds support for case-insensitive field resolution when reading 
from Parquet files. We found ORC files have similar issues. Since Spark has 2 
OrcFileFormat, we should add support for both.
* Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
dependency. This hive OrcFileFormat doesn't support case-insensitive field 
resolution at all.
* SPARK-20682 adds a new ORC data source inside sql/core. This native 
OrcFileFormat supports case-insensitive field resolution, however it cannot 
handle duplicate fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-19 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25132:
-
Summary: Case-insensitive field resolution when reading from Parquet/ORC  
(was: Spark returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases)

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases

2018-08-16 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583305#comment-16583305
 ] 

Chenxiao Mao commented on SPARK-25132:
--

I'm working on this.

> Spark returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases
> 
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases

2018-08-16 Thread Chenxiao Mao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582606#comment-16582606
 ] 

Chenxiao Mao commented on SPARK-25132:
--

We found this PR ([https://github.com/apache/spark/pull/15799]) can solve this 
issue, but we have no idea why it was removed.

> Spark returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases
> 
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25132) Spark returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases

2018-08-16 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-25132:


 Summary: Spark returns NULL for a column whose Hive metastore 
schema and Parquet schema are in different letter cases
 Key: SPARK-25132
 URL: https://issues.apache.org/jira/browse/SPARK-25132
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Chenxiao Mao


Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, regardless of whether 
spark.sql.caseSensitive is set to true or false.

Here is a simple example to reproduce this issue:

scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")

spark-sql> show create table t1;
CREATE TABLE `t1` (`id` BIGINT)
USING parquet
OPTIONS (
 `serialization.format` '1'
)

spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
 > USING parquet
 > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';

spark-sql> select * from t1;
0
1
2
3
4

spark-sql> select * from t2;
NULL
NULL
NULL
NULL
NULL
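
As a quick way to see where the NULLs come from, one can compare the metastore 
schema of t2 with the schema stored in the Parquet files themselves (a sketch; 
the path below simply mirrors the LOCATION used above):
{code:java}
// Sketch only: t2's metastore column is `ID`, while the Parquet footers written
// through t1 contain `id`; the reader does not match the two case-insensitively.
spark.sql("DESCRIBE TABLE t2").show(false)

spark.read
  .parquet("hdfs://localhost/user/hive/warehouse/t1") // read the files directly, bypassing the metastore
  .printSchema()
{code}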

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org