[jira] [Created] (SPARK-46714) Overwrite partitions with custom location should reset partition locations
Adrian Wang created SPARK-46714: --- Summary: Overwrite partitions with custom location should reset partition locations Key: SPARK-46714 URL: https://issues.apache.org/jira/browse/SPARK-46714 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Adrian Wang In the Hive metastore we support partitions located outside the corresponding table location. When overwriting such a partition with Hive, the overwritten partition should be recreated under the table location. Also, currently if a partition is on a different filesystem from the table, Spark will throw an exception when overwriting. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
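The scenario in the report can be sketched in Spark SQL; the table, column, and path names below are invented for illustration only:

```sql
-- Hypothetical repro sketch (names and paths are invented).
CREATE TABLE sales (amount INT) PARTITIONED BY (dt STRING) STORED AS PARQUET;

-- Partition placed outside the table location, possibly on another filesystem:
ALTER TABLE sales ADD PARTITION (dt='2024-01-01')
LOCATION 'oss://other-bucket/custom/dt=2024-01-01';

-- Per the report, overwriting should recreate the partition under the table
-- location, and should not fail merely because the custom location is on a
-- different filesystem from the table:
INSERT OVERWRITE TABLE sales PARTITION (dt='2024-01-01') SELECT 42;
```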
[jira] [Resolved] (SPARK-41816) Spark ThriftServer should not close file system when log out
[ https://issues.apache.org/jira/browse/SPARK-41816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang resolved SPARK-41816. - Resolution: Invalid > Spark ThriftServer should not close file system when log out > > > Key: SPARK-41816 > URL: https://issues.apache.org/jira/browse/SPARK-41816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Adrian Wang >Priority: Major > > Currently when enabled impersonation, Spark Thriftserver will close > filesystem instance for the user when logout. If there are two sessions with > the same user, the remaining session will become corrupted.
[jira] [Updated] (SPARK-41816) Spark ThriftServer should not close file system when log out
[ https://issues.apache.org/jira/browse/SPARK-41816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-41816: Description: Currently when enable impersonation, Spark Thriftserver will close filesystem instance for the user when logout. If there are two sessions with the same user, the remaining session will become corrupted. (was: Currently when enable impersonation, Spark Thriftserver will close filesystem instance for the user. If there are two sessions with the same user, the remaining session will become corrupted.) > Spark ThriftServer should not close file system when log out > > > Key: SPARK-41816 > URL: https://issues.apache.org/jira/browse/SPARK-41816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Adrian Wang >Priority: Major > > Currently when enable impersonation, Spark Thriftserver will close filesystem > instance for the user when logout. If there are two sessions with the same > user, the remaining session will become corrupted.
[jira] [Updated] (SPARK-41816) Spark ThriftServer should not close file system when log out
[ https://issues.apache.org/jira/browse/SPARK-41816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-41816: Description: Currently when enabled impersonation, Spark Thriftserver will close filesystem instance for the user when logout. If there are two sessions with the same user, the remaining session will become corrupted. (was: Currently when enable impersonation, Spark Thriftserver will close filesystem instance for the user when logout. If there are two sessions with the same user, the remaining session will become corrupted.) > Spark ThriftServer should not close file system when log out > > > Key: SPARK-41816 > URL: https://issues.apache.org/jira/browse/SPARK-41816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.1 >Reporter: Adrian Wang >Priority: Major > > Currently when enabled impersonation, Spark Thriftserver will close > filesystem instance for the user when logout. If there are two sessions with > the same user, the remaining session will become corrupted.
[jira] [Created] (SPARK-41816) Spark ThriftServer should not close file system when log out
Adrian Wang created SPARK-41816: --- Summary: Spark ThriftServer should not close file system when log out Key: SPARK-41816 URL: https://issues.apache.org/jira/browse/SPARK-41816 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.1 Reporter: Adrian Wang Currently, when impersonation is enabled, Spark Thriftserver will close the filesystem instance for the user. If there are two sessions with the same user, the remaining session will become corrupted.
[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache
[ https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370993#comment-17370993 ] Adrian Wang commented on SPARK-26764: - [~zshao] Thanks for the interest. We created an open-source plugin: [https://github.com/alibaba/SparkCube], to demonstrate the basic ideas. > [SPIP] Spark Relational Cache > - > > Key: SPARK-26764 > URL: https://issues.apache.org/jira/browse/SPARK-26764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Adrian Wang >Priority: Major > Attachments: Relational+Cache+SPIP.pdf > > > In modern database systems, relational cache is a common technology to boost > ad-hoc queries. While Spark provides cache natively, Spark SQL should be able > to utilize the relationship between relations to boost all possible queries. > In this SPIP, we will make Spark be able to utilize all defined cached > relations if possible, without explicit substitution in user query, as well > as keep some user defined cache available in different sessions. Materialized > views in many database systems provide similar function.
[jira] [Commented] (SPARK-30130) Hardcoded numeric values in common table expressions which utilize GROUP BY are interpreted as ordinal positions
[ https://issues.apache.org/jira/browse/SPARK-30130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345730#comment-17345730 ] Adrian Wang commented on SPARK-30130: - I also met this on 2.4.7, and this has been fixed on master/3.1. > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions > > > Key: SPARK-30130 > URL: https://issues.apache.org/jira/browse/SPARK-30130 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Matt Boegner >Priority: Minor > > Hardcoded numeric values in common table expressions which utilize GROUP BY > are interpreted as ordinal positions. > {code:java} > val df = spark.sql(""" > with a as (select 0 as test, count(*) group by test) > select * from a > """) > df.show(){code} > This results in an error message like {color:#e01e5a}GROUP BY position 0 is > not in select list (valid range is [1, 2]){color} . > > However, this error does not appear in a traditional subselect format. For > example, this query executes correctly: > {code:java} > val df = spark.sql(""" > select * from (select 0 as test, count(*) group by test) a > """) > df.show(){code}
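For anyone still on an affected 2.4.x release, one possible workaround (a sketch, not an official fix) is to disable ordinal resolution in GROUP BY, since the aliased literal is being resolved to a position; note this changes GROUP BY semantics session-wide:

```sql
-- spark.sql.groupByOrdinal controls whether integer literals in GROUP BY are
-- interpreted as column positions; disabling it avoids the ordinal error above.
SET spark.sql.groupByOrdinal=false;
with a as (select 0 as test, count(*) group by test)
select * from a;
```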
[jira] [Created] (SPARK-35238) Add JindoFS SDK in cloud integration documents
Adrian Wang created SPARK-35238: --- Summary: Add JindoFS SDK in cloud integration documents Key: SPARK-35238 URL: https://issues.apache.org/jira/browse/SPARK-35238 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.1.1, 3.0.2, 2.4.7, 2.3.4 Reporter: Adrian Wang As an important cloud provider, Alibaba Cloud presents JindoFS SDK to maximize the performance for workloads interacting with Alibaba Cloud OSS.
[jira] [Commented] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
[ https://issues.apache.org/jira/browse/SPARK-31595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094980#comment-17094980 ] Adrian Wang commented on SPARK-31595: - [~Ankitraj] Thanks, I have already created a pull request on this. > Spark sql cli should allow unescaped quote mark in quoted string > > > Key: SPARK-31595 > URL: https://issues.apache.org/jira/browse/SPARK-31595 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Adrian Wang >Priority: Major > > spark-sql> select "'"; > spark-sql> select '"'; > In Spark parser if we pass a text of `select "'";`, there will be > ParserCancellationException, which will be handled by PredictionMode.LL. By > dropping `;` correctly we can avoid that retry.
[jira] [Created] (SPARK-31595) Spark sql cli should allow unescaped quote mark in quoted string
Adrian Wang created SPARK-31595: --- Summary: Spark sql cli should allow unescaped quote mark in quoted string Key: SPARK-31595 URL: https://issues.apache.org/jira/browse/SPARK-31595 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Adrian Wang spark-sql> select "'"; spark-sql> select '"'; In the Spark parser, if we pass the text `select "'";`, there will be a ParserCancellationException, which will be handled by PredictionMode.LL. By dropping the `;` correctly we can avoid that retry.
[jira] [Updated] (SPARK-29177) Zombie tasks prevents executor from releasing when task exceeds maxResultSize
[ https://issues.apache.org/jira/browse/SPARK-29177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-29177: Description: When we fetch results from executors and found the total size has exceeded the maxResultSize configured, Spark will simply abort the stage and all dependent jobs. But the task triggered this is actually successful, but never post out `TaskEnd` event, as a result it will never be removed from `CoarseGrainedSchedulerBackend`. If dynamic allocation is enabled, there will be zombie executor(s) remaining in resource manager, it will never die until application ends. (was: When we fetch results from executors and found the total size has exceeded the maxResultSize configured, Spark will simply abort the stage and all dependent jobs. But the task triggered this is actually successful, but never posted `CompletionEvent` out, as a result it will never be removed from `CoarseGrainedSchedulerBackend`. If dynamic allocation is enabled, there will be zombie executor(s) remaining in resource manager, it will never die until application ends.) > Zombie tasks prevents executor from releasing when task exceeds maxResultSize > - > > Key: SPARK-29177 > URL: https://issues.apache.org/jira/browse/SPARK-29177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.4 >Reporter: Adrian Wang >Priority: Major > > When we fetch results from executors and found the total size has exceeded > the maxResultSize configured, Spark will simply abort the stage and all > dependent jobs. But the task triggered this is actually successful, but never > post out `TaskEnd` event, as a result it will never be removed from > `CoarseGrainedSchedulerBackend`. If dynamic allocation is enabled, there will > be zombie executor(s) remaining in resource manager, it will never die until > application ends. 
[jira] [Created] (SPARK-29177) Zombie tasks prevents executor from releasing when task exceeds maxResultSize
Adrian Wang created SPARK-29177: --- Summary: Zombie tasks prevents executor from releasing when task exceeds maxResultSize Key: SPARK-29177 URL: https://issues.apache.org/jira/browse/SPARK-29177 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4, 2.3.4 Reporter: Adrian Wang When we fetch results from executors and find the total size has exceeded the configured maxResultSize, Spark will simply abort the stage and all dependent jobs. But the task that triggered this is actually successful, yet it never posts a `CompletionEvent`; as a result it will never be removed from `CoarseGrainedSchedulerBackend`. If dynamic allocation is enabled, there will be zombie executor(s) remaining in the resource manager, which will never die until the application ends.
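Until the leak itself is fixed, the trigger can be avoided by raising (or disabling) the result-size cap; the command, script name, and value below are illustrative only:

```shell
# spark.driver.maxResultSize caps the total serialized size of results fetched
# to the driver; 0 disables the check entirely. Raising it avoids the abort
# path described above that strands executors. my_job.py is a placeholder.
spark-submit --conf spark.driver.maxResultSize=4g my_job.py
```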
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931057#comment-16931057 ] Adrian Wang commented on SPARK-13446: - or you can just apply the patch from SPARK-27349 and recompile your spark. Hope it works! > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat}
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931055#comment-16931055 ] Adrian Wang commented on SPARK-13446: - [~jpbordi][~headcra6] I am using mysql as the hive metastore backend, leaving the 1.2.1 hive jars in my spark/jars directory, without putting any additional hive jars in there, and reading from a hive 2.x metastore service; it just works fine. ``` hive-beeline-1.2.1.spark2.jar hive-cli-1.2.1.spark2.jar hive-exec-1.2.1.spark2.jar hive-jdbc-1.2.1.spark2.jar hive-metastore-1.2.1.spark2.jar ``` That is what `ls $SPARK_HOME/jars/hive-*` returns. > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat}
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930298#comment-16930298 ] Adrian Wang commented on SPARK-13446: - [~elgalu][~toopt4][~headcra6][~jpbordi][~F7753] The reference to this variable has been removed in SPARK-27349 , which will be included in SPARK 3.0. For spark 2.x, you should exclude hive-exec.jar of hive 2.x or above from your spark extra class path, so you can avoid this exception. > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provided HIveContext class to read data from hive metastore directly. > While it only supports hive 1.2.1 version and older. Since hive 2.0.0 has > released, it's better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. 
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat}
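An alternative to pruning Hive 2.x jars from the extra classpath, sketched here with illustrative values, is to point Spark at a matching metastore client through its built-in isolation support:

```shell
# spark.sql.hive.metastore.version / .jars let Spark load a Hive client that
# matches the metastore version, isolated from the built-in Hive 1.2.1 classes.
# The version value is an example; pick the one matching your metastore.
spark-sql \
  --conf spark.sql.hive.metastore.version=2.3.3 \
  --conf spark.sql.hive.metastore.jars=maven
```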
[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View
[ https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928100#comment-16928100 ] Adrian Wang commented on SPARK-29038: - This seems to duplicate our proposal in SPARK-26764. We have implemented similar features and already have them running in our customer's production environment. > SPIP: Support Spark Materialized View > - > > Key: SPARK-29038 > URL: https://issues.apache.org/jira/browse/SPARK-29038 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Lantao Jin >Priority: Major > > Materialized view is an important approach in DBMS to cache data to > accelerate queries. By creating a materialized view through SQL, the data > that can be cached is very flexible, and needs to be configured arbitrarily > according to specific usage scenarios. The Materialization Manager > automatically updates the cache data according to changes in detail source > tables, simplifying user work. When user submit query, Spark optimizer > rewrites the execution plan based on the available materialized view to > determine the optimal execution plan. > Details in [design > doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]
[jira] [Created] (SPARK-27279) Reuse subquery should compare child node only
Adrian Wang created SPARK-27279: --- Summary: Reuse subquery should compare child node only Key: SPARK-27279 URL: https://issues.apache.org/jira/browse/SPARK-27279 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Adrian Wang For now, `ReuseSubquery` in Spark compares two subqueries at `SubqueryExec` level, which invalidates the `ReuseSubquery` rule.
[jira] [Updated] (SPARK-22601) Data load is getting displayed successful on providing non existing hdfs file path
[ https://issues.apache.org/jira/browse/SPARK-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-22601: Fix Version/s: (was: 2.2.1) 2.2.2 > Data load is getting displayed successful on providing non existing hdfs file > path > -- > > Key: SPARK-22601 > URL: https://issues.apache.org/jira/browse/SPARK-22601 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Sujith Chacko >Assignee: Sujith Chacko >Priority: Minor > Fix For: 2.2.2 > > > Data load is getting displayed successful on providing non existing hdfs file > path where as in local path proper error message is getting displayed > create table tb2 (a string, b int); > load data inpath 'hdfs://hacluster/data1.csv' into table tb2 > Note: data1.csv does not exist in HDFS > when local non existing file path is given below error message will be > displayed > "LOAD DATA input path does not exist". attached snapshots of behaviour in > spark 2.1 and spark 2.2 version
[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache
[ https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779170#comment-16779170 ] Adrian Wang commented on SPARK-26764: - Hi [~Tagar] , the idea has something in common with materialized views, but we would also make query rewriting available for Spark's cached queries, and the data materialization process will be more configurable. > [SPIP] Spark Relational Cache > - > > Key: SPARK-26764 > URL: https://issues.apache.org/jira/browse/SPARK-26764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Priority: Major > Attachments: Relational+Cache+SPIP.pdf > > > In modern database systems, relational cache is a common technology to boost > ad-hoc queries. While Spark provides cache natively, Spark SQL should be able > to utilize the relationship between relations to boost all possible queries. > In this SPIP, we will make Spark be able to utilize all defined cached > relations if possible, without explicit substitution in user query, as well > as keep some user defined cache available in different sessions. Materialized > views in many database systems provide similar function.
[jira] [Created] (SPARK-26764) [SPIP] Spark Relational Cache
Adrian Wang created SPARK-26764: --- Summary: [SPIP] Spark Relational Cache Key: SPARK-26764 URL: https://issues.apache.org/jira/browse/SPARK-26764 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Adrian Wang Attachments: Relational+Cache+SPIP.pdf In modern database systems, relational cache is a common technology to boost ad-hoc queries. While Spark provides caching natively, Spark SQL should be able to utilize the relationships between relations to boost all possible queries. In this SPIP, we will make Spark able to utilize all defined cached relations where possible, without explicit substitution in the user query, as well as keep some user-defined caches available across different sessions. Materialized views in many database systems provide a similar function.
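As a toy illustration of the proposal (the statements use stock Spark SQL syntax with invented names; the automatic rewriting described in the SPIP is not stock Spark behavior):

```sql
-- Define a cached relation (a materialized aggregate) once:
CACHE TABLE daily_totals AS
SELECT dt, SUM(amount) AS total FROM sales GROUP BY dt;

-- The SPIP proposes that a later query like this could be answered from the
-- cached relation automatically, without the user referencing daily_totals:
SELECT dt, SUM(amount) FROM sales GROUP BY dt;
```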
[jira] [Updated] (SPARK-26764) [SPIP] Spark Relational Cache
[ https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-26764: Attachment: Relational+Cache+SPIP.pdf > [SPIP] Spark Relational Cache > - > > Key: SPARK-26764 > URL: https://issues.apache.org/jira/browse/SPARK-26764 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Adrian Wang >Priority: Major > Attachments: Relational+Cache+SPIP.pdf > > > In modern database systems, relational cache is a common technology to boost > ad-hoc queries. While Spark provides cache natively, Spark SQL should be able > to utilize the relationship between relations to boost all possible queries. > In this SPIP, we will make Spark be able to utilize all defined cached > relations if possible, without explicit substitution in user query, as well > as keep some user defined cache available in different sessions. Materialized > views in many database systems provide similar function.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700836#comment-16700836 ] Adrian Wang commented on SPARK-26155: - [~Jk_Self] can you also test this on Spark 2.4? > Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS > in 3TB scale > -- > > Key: SPARK-26155 > URL: https://issues.apache.org/jira/browse/SPARK-26155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis > in Spark2.3 without L486 & 487.pdf, q19.sql > > > In our test environment, we found a serious performance degradation issue in > Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious > performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark > 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated > this problem and figured out the root cause is in community patch SPARK-21052 > which add metrics to hash join process. And the impact code is > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > . Q19 costs about 30 seconds without these two lines code and 126 seconds > with these code.
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700253#comment-16700253 ] Adrian Wang commented on SPARK-26155: - [~viirya] , thanks for your reply. [~Jk_Self] initially found this when comparing Spark 2.1 and Spark 2.3, and after a binary search against the commit tree, she found the difference was caused by SPARK-21052. Finally she removed the two lines from the Spark 2.3 source code and recompiled, and the performance regression was gone. > Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS > in 3TB scale > -- > > Key: SPARK-26155 > URL: https://issues.apache.org/jira/browse/SPARK-26155 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > Attachments: Q19 analysis in Spark2.3 with L486&487.pdf, Q19 analysis > in Spark2.3 without L486 & 487.pdf, q19.sql > > > In our test environment, we found a serious performance degradation issue in > Spark2.3 when running TPC-DS on SKX 8180. Several queries have serious > performance degradation. For example, TPC-DS Q19 needs 126 seconds with Spark > 2.3 while it needs only 29 seconds with Spark2.1 on 3TB data. We investigated > this problem and figured out the root cause is in community patch SPARK-21052 > which add metrics to hash join process. And the impact code is > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > . Q19 costs about 30 seconds without these two lines code and 126 seconds > with these code.
[jira] [Created] (SPARK-26181) the `hasMinMaxStats` method of `ColumnStatsMap` is not correct
Adrian Wang created SPARK-26181: --- Summary: the `hasMinMaxStats` method of `ColumnStatsMap` is not correct Key: SPARK-26181 URL: https://issues.apache.org/jira/browse/SPARK-26181 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Adrian Wang
[jira] [Commented] (SPARK-26155) Spark SQL performance degradation after apply SPARK-21052 with Q19 of TPC-DS in 3TB scale
[ https://issues.apache.org/jira/browse/SPARK-26155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696541#comment-16696541 ] Adrian Wang commented on SPARK-26155: - It seems the performance degradation is related to the CPU cache; the metrics collection happens to break that locality...
[jira] [Closed] (SPARK-14631) "drop database cascade" needs to unregister functions for HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-14631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang closed SPARK-14631. --- Resolution: Not A Problem > "drop database cascade" needs to unregister functions for HiveExternalCatalog > - > > Key: SPARK-14631 > URL: https://issues.apache.org/jira/browse/SPARK-14631 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > > As in HIVE-12304, Hive's drop database cascade did not drop functions as well. > We need to fix this when calling `dropDatabase` in HiveExternalCatalog.
[jira] [Created] (SPARK-17427) function SIZE should return -1 when parameter is null
Adrian Wang created SPARK-17427: --- Summary: function SIZE should return -1 when parameter is null Key: SPARK-17427 URL: https://issues.apache.org/jira/browse/SPARK-17427 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor `select size(null)` returns -1 in Hive. In order to be compatible, we need to return -1 as well.
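The Hive-compatible semantics described above can be sketched in a few lines (a hedged model of the expected behavior, not Spark's actual implementation):

```python
def hive_size(collection):
    """Hive-compatible size(): return -1 for NULL input instead of NULL or an error."""
    return -1 if collection is None else len(collection)

print(hive_size(None))       # -1, matching Hive's `select size(null)`
print(hive_size([1, 2, 3]))  # 3
```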
[jira] [Commented] (SPARK-4003) Add {Big Decimal, Timestamp, Date} types to Java SqlContext
[ https://issues.apache.org/jira/browse/SPARK-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388922#comment-15388922 ] Adrian Wang commented on SPARK-4003: DataTypes.TimestampType does not use java.sql.Timestamp internally; you should only use the exposed API. > Add {Big Decimal, Timestamp, Date} types to Java SqlContext > --- > > Key: SPARK-4003 > URL: https://issues.apache.org/jira/browse/SPARK-4003 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang > Fix For: 1.2.0 > > > In JavaSqlContext, we need to let Java programs use big decimal, timestamp, > and date types.
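The point about internal representation can be illustrated with a small sketch: Catalyst stores TimestampType values internally as a primitive count of microseconds since the epoch, and a timestamp object is only materialized at the external API boundary. The helper names below are illustrative, not Spark's actual code:

```python
from datetime import datetime, timezone

def to_internal_micros(ts: datetime) -> int:
    # Internal form: a plain 64-bit count of microseconds since the epoch.
    return int(ts.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

def from_internal_micros(us: int) -> datetime:
    # External form: a timestamp object, built only at the API boundary.
    return datetime.fromtimestamp(us / 1_000_000, tz=timezone.utc)

ts = datetime(2014, 10, 1, 12, 30, 45)
assert from_internal_micros(to_internal_micros(ts)).replace(tzinfo=None) == ts
```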
[jira] [Commented] (SPARK-16515) [SPARK][SQL] transformation script got failure for python script
[ https://issues.apache.org/jira/browse/SPARK-16515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15374241#comment-15374241 ] Adrian Wang commented on SPARK-16515: - The problem is that Spark does not pick the right record writer from its conf when it has to write records to the transform script, so when the Python script reads the data from standard input, it crashes. > [SPARK][SQL] transformation script got failure for python script > > > Key: SPARK-16515 > URL: https://issues.apache.org/jira/browse/SPARK-16515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yi Zhou >Priority: Critical > > Run below SQL and get transformation script error for python script like > below error message. > Query SQL: > {code} > CREATE VIEW q02_spark_sql_engine_validation_power_test_0_temp AS > SELECT DISTINCT > sessionid, > wcs_item_sk > FROM > ( > FROM > ( > SELECT > wcs_user_sk, > wcs_item_sk, > (wcs_click_date_sk * 24 * 60 * 60 + wcs_click_time_sk) AS tstamp_inSec > FROM web_clickstreams > WHERE wcs_item_sk IS NOT NULL > AND wcs_user_sk IS NOT NULL > DISTRIBUTE BY wcs_user_sk > SORT BY > wcs_user_sk, > tstamp_inSec -- "sessionize" reducer script requires the cluster by uid > and sort by tstamp > ) clicksAnWebPageType > REDUCE > wcs_user_sk, > tstamp_inSec, > wcs_item_sk > USING 'python q2-sessionize.py 3600' > AS ( > wcs_item_sk BIGINT, > sessionid STRING) > ) q02_tmp_sessionize > CLUSTER BY sessionid > {code} > Error Message: > {code} > 16/07/06 16:59:02 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 157.0 > (TID 171, hw-node5): org.apache.spark.SparkException: Subprocess exited with > status 1. 
Error: Traceback (most recent call last): > File "q2-sessionize.py", line 49, in > user_sk, tstamp_str, item_sk = line.strip().split("\t") > ValueError: too many values to unpack > at > org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144) > at > org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:192) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Subprocess exited with status 1. 
> Error: Traceback (most recent call last): > File "q2-sessionize.py", line 49, in > user_sk, tstamp_str, item_sk = line.strip().split("\t") > ValueError: too many values to unpack > at > org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.checkFailureAndPropagate(ScriptTransformation.scala:144) > at > org.apache.spark.sql.hive.execution.ScriptTransformation$$anon$1.hasNext(ScriptTransformation.scala:181) > ... 14 more > 16/07/06 16:59:02 INFO scheduler.TaskSetManager: Lost task 7.0 in stage 157.0 > (TID 173) on executor hw-node5: org.apache.spark.SparkException (Subprocess > exited with status 1. Error: Traceback (most recent call last): > File "q2-sessionize.py", line 49, in > user_sk, tstamp_str, item_sk = line.strip().split("\t") > ValueError: too many values to unpack > ) [duplicate 1] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
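One failure mode consistent with the traceback above (a hedged illustration; the exact mismatch depends on which record writer Spark actually selected): if the serialized row carries more fields than the script expects, the three-way tab split in q2-sessionize.py fails with exactly this ValueError.

```python
# What q2-sessionize.py expects: exactly three tab-separated fields.
good = "1\t1442793602\t42"
user_sk, tstamp_str, item_sk = good.strip().split("\t")
assert (user_sk, item_sk) == ("1", "42")

# A row serialized with an extra field splits into four parts, and the
# three-way unpack raises "too many values to unpack".
bad = "1\t1442793602\t42\textra"
try:
    user_sk, tstamp_str, item_sk = bad.strip().split("\t")
    raised = False
except ValueError:
    raised = True
assert raised
```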
[jira] [Resolved] (SPARK-15397) 'locate' UDF got different result with boundary value case compared to Hive engine
[ https://issues.apache.org/jira/browse/SPARK-15397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang resolved SPARK-15397. - Resolution: Fixed > 'locate' UDF got different result with boundary value case compared to Hive > engine > -- > > Key: SPARK-15397 > URL: https://issues.apache.org/jira/browse/SPARK-15397 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 2.0.0 >Reporter: Yi Zhou >Assignee: Adrian Wang > > Spark SQL: > select locate("abc", "abc", 1); > 0 > Hive: > select locate("abc", "abc", 1); > 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
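Hive's 1-based `locate` semantics, which the fix aligns Spark with, can be sketched as follows (a hedged model of the expected behavior, not the actual UDF code):

```python
def hive_locate(substr: str, s: str, pos: int = 1) -> int:
    # 1-based index of the first occurrence of substr at or after pos;
    # 0 when not found (str.find returns -1, so -1 + 1 == 0).
    return s.find(substr, pos - 1) + 1

print(hive_locate("abc", "abc", 1))  # 1, matching Hive
print(hive_locate("z", "abc"))       # 0 (not found)
```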
[jira] [Commented] (SPARK-14126) [Table related commands] Truncate table
[ https://issues.apache.org/jira/browse/SPARK-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247161#comment-15247161 ] Adrian Wang commented on SPARK-14126: - Yes, still working. > [Table related commands] Truncate table > --- > > Key: SPARK-14126 > URL: https://issues.apache.org/jira/browse/SPARK-14126 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > TOK_TRUNCATETABLE > We also need to check the behavior of Hive when we call truncate table on a > partitioned table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14631) "drop database cascade" needs to unregister functions for HiveExternalCatalog
Adrian Wang created SPARK-14631: --- Summary: "drop database cascade" needs to unregister functions for HiveExternalCatalog Key: SPARK-14631 URL: https://issues.apache.org/jira/browse/SPARK-14631 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang As in HIVE-12304, Hive's drop database cascade did not drop functions as well. We need to fix this when calling `dropDatabase` in HiveExternalCatalog.
[jira] [Commented] (SPARK-14126) [Table related commands] Truncate table
[ https://issues.apache.org/jira/browse/SPARK-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236353#comment-15236353 ] Adrian Wang commented on SPARK-14126: - I'm working on this.
[jira] [Updated] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv
[ https://issues.apache.org/jira/browse/SPARK-14021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-14021: Description: This is to create a custom context for the commands bin/spark-sql and sbin/start-thriftserver. Any context that is derived from HiveContext is acceptable. Users need to configure the class name of the custom context via the config spark.sql.context.class, and make sure the class is on the classpath. This is to provide a more elegant way for infrastructure teams to apply custom configurations and changes.
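The pattern described above can be sketched roughly as follows (a hedged illustration using Python stand-ins; the class names are hypothetical, and Spark would resolve the class on the JVM classpath rather than in a module-level registry):

```python
class HiveContextBase:                 # stands in for HiveContext
    pass

class CustomContext(HiveContextBase):  # an infrastructure team's subclass
    pass

def load_context(conf: dict):
    # Resolve the context class by the configured name and require it to
    # derive from the known base context before instantiating it.
    name = conf.get("spark.sql.context.class", "HiveContextBase")
    cls = globals()[name]
    if not issubclass(cls, HiveContextBase):
        raise TypeError(f"{name} must derive from HiveContext")
    return cls()

ctx = load_context({"spark.sql.context.class": "CustomContext"})
assert isinstance(ctx, CustomContext)
```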
[jira] [Created] (SPARK-14021) Support custom context derived from HiveContext for SparkSQLEnv
Adrian Wang created SPARK-14021: --- Summary: Support custom context derived from HiveContext for SparkSQLEnv Key: SPARK-14021 URL: https://issues.apache.org/jira/browse/SPARK-14021 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13819) using a regexp_replace in a group by clause raises a nullpointerexception
[ https://issues.apache.org/jira/browse/SPARK-13819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194902#comment-15194902 ] Adrian Wang commented on SPARK-13819: - I'll take a look at this. > using a tegexp_replace in a gropu by clause raises a nullpointerexception > - > > Key: SPARK-13819 > URL: https://issues.apache.org/jira/browse/SPARK-13819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Javier Pérez > > 1. Start start-thriftserver.sh > 2. connect with beeline > 3. Perform the following query over a table: > SELECT t0.textsample > FROM test t0 > ORDER BY regexp_replace( > t0.code, > concat('\\Q', 'a', '\\E'), > regexp_replace( >regexp_replace('zz', '', ''), > '\\$', > '\\$')) DESC; > Problem: NullPointerException > Trace: > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.RegExpReplace.nullSafeEval(regexpExpressions.scala:224) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:458) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:36) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:27) > at scala.math.Ordering$class.gt(Ordering.scala:97) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.gt(ordering.scala:27) > at org.apache.spark.RangePartitioner.getPartition(Partitioner.scala:168) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at > org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13837) SQL Context function to_date() returns wrong date
[ https://issues.apache.org/jira/browse/SPARK-13837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15194892#comment-15194892 ] Adrian Wang commented on SPARK-13837: - Which timezone is your system in? > SQL Context function to_date() returns wrong date > - > > Key: SPARK-13837 > URL: https://issues.apache.org/jira/browse/SPARK-13837 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: Python version: > 2.7.6 (default, Mar 22 2014, 22:59:56) > [GCC 4.8.2] >Reporter: Arnaud Caruso > > When using the SQL Context function to_date on a timestamp, it sometimes > returns the wrong date. > Here's how to reproduce the bug in Python: > data = [[datetime.datetime(2015, 2, 20, 0, 0, 2)],[datetime.datetime(2015, > 10, 9, 0, 0, 2)]] > rddData = sc.parallelize(data) > fields=[StructField('timestamp', TimestampType(), True)] > schema=StructType(fields) > data_table=sqlCtx.createDataFrame(data,schema) > sqlCtx.registerDataFrameAsTable(data_table,"data") > query="SELECT timestamp, TO_DATE(timestamp) FROM data " > df=sqlCtx.sql(query) > df.collect() > Here are the results I get: > [Row(timestamp=datetime.datetime(2015, 2, 20, 0, 0, 2), > _c1=datetime.date(2015, 2, 20)), > Row(timestamp=datetime.datetime(2015, 10, 9, 0, 0, 2), > _c1=datetime.date(2015, 10, 8))] > The first date is right but the second date is wrong, it returns October 8th > instead of returning October 9th.
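The timezone question matters because the symptom is consistent with a timezone or DST effect (a hedged, pure-Python illustration, not Spark's code): an instant stored just after local midnight, when rendered with an offset one hour further west of the one it was written in, lands on the previous calendar day.

```python
from datetime import datetime, timezone, timedelta

# An instant written as midnight-plus-two-seconds under a UTC-7 offset...
instant = datetime(2015, 10, 9, 0, 0, 2, tzinfo=timezone(timedelta(hours=-7)))

# ...rendered with a fixed UTC-8 offset falls on the previous calendar day.
shifted = instant.astimezone(timezone(timedelta(hours=-8)))
print(shifted.date())  # 2015-10-08
```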
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186674#comment-15186674 ] Adrian Wang commented on SPARK-13393: - That's a case where we should throw an exception. > Column mismatch issue in left_outer join using Spark DataFrame > -- > > Key: SPARK-13393 > URL: https://issues.apache.org/jira/browse/SPARK-13393 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Varadharajan > > Consider the below snippet: > {code:title=test.scala|borderStyle=solid} > case class Person(id: Int, name: String) > val df = sc.parallelize(List( > Person(1, "varadha"), > Person(2, "nagaraj") > )).toDF > val varadha = df.filter("id = 1") > val errorDF = df.join(varadha, df("id") === varadha("id"), > "left_outer").select(df("id"), varadha("id") as "varadha_id") > val nagaraj = df.filter("id = 2").select(df("id") as "n_id") > val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), > "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id") > {code} > The `errorDF` dataframe, after the left join is messed up and shows as below: > | id|varadha_id| > | 1| 1| > | 2| 2 (*This should've been null*)| > whereas correctDF has the correct output after the left join: > | id|nagaraj_id| > | 1| null| > | 2| 2|
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186673#comment-15186673 ] Adrian Wang commented on SPARK-13393: - See my updated comment. That's not reasonable.
[jira] [Comment Edited] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186660#comment-15186660 ] Adrian Wang edited comment on SPARK-13393 at 3/9/16 7:31 AM: - How do you resolve it? Both sides are `df`, so we can resolve df("key") to a single side, which leads to a Cartesian join (4 output rows); or we can resolve it to both sides (2 output rows). We are not able to tell what the user meant. The current design does not throw any exception because we assume identical columns in a condition come from different sides, as I stated. I don't think that's a decent way. was (Author: adrian-wang): How do you resolve it? Both sides are `df`, so we can resolve df("key") to a single side, which leads to a Cartesian join (4 output rows); or we can resolve it to both sides (2 output rows). We are not able to tell what the user meant.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186661#comment-15186661 ] Adrian Wang commented on SPARK-13393: - How do you resolve it? Both sides are `df`, so we can resolve df("key") to a single side, which leads to a Cartesian join (4 output rows); or we can resolve it to both sides (2 output rows). We are not able to tell what the user meant.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186660#comment-15186660 ] Adrian Wang commented on SPARK-13393: - How do you resolve it? Both sides are `df`, so we can resolve df("key") to a single side, which leads to a Cartesian join (4 output rows); or we can resolve it to both sides (2 output rows). We are not able to tell what the user meant.
[jira] [Issue Comment Deleted] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-13393: Comment: was deleted (was: How do you resolve it? Both sides are `df`, so we can resolve df("key") to single side, which leads to a Cartesian join (4 output rows); or we can resolve to both sides (2 output rows). We are not able to tell what the user meant to.)
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186652#comment-15186652 ] Adrian Wang commented on SPARK-13393: - In your example, df1("name") and df2("name") are exactly the same as each other, so it's easy to throw an exception explicitly to tell the user not to join two identical dataframes without aliases. We can do the same for this issue too.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186642#comment-15186642 ] Adrian Wang commented on SPARK-13393: - This is another issue; here we are talking about `varadha` and `df`, which are obviously different dataframes. For exactly the same dataframe, I think aliasing is still necessary.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186605#comment-15186605 ] Adrian Wang commented on SPARK-13393:
-------------------------------------
That is why I need to introduce a `JoinedData` layer that keeps the left and right DataFrame instances; with the DataFrame information recorded in each Column instance (when present), we can then trace which side the user wants to project from.
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186580#comment-15186580 ] Adrian Wang commented on SPARK-13393:
-------------------------------------
Hi [~srinathsmn]
In `errorDF`, both `df("id")` and `varadha("id")` have the same `exprId` (they both come from `df`), so we cannot disambiguate between them under the current design. As a workaround, you should write code like `correctDF` and assign aliases to the columns first, or register `df` as a table and then use a complete SQL query to get your data.
I think this is a bug under the current design. We should record the DataFrame information in `Column` instances and use an internal representation, `JoinedData`, as the return value of `def join()`, in order to resolve the ambiguity caused by self-joins. For now, even if I write something like
val errorDF = df.join(varadha, df("id") === df("id"), "left_outer").select(df("id"), varadha("id") as "varadha_id")
the result would still be the same, since we currently assume that an ambiguous condition should always be resolved to both sides. I can draft a design doc for this if you are interested.
cc [~smilegator] [~rxin] [~marmbrus]
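The exprId collision described in the comment above can be sketched with a toy model in plain Python. This is not Spark's actual Catalyst code; `Attribute` and `ToyFrame` are hypothetical stand-ins meant only to show why a filtered child exposes the very same attribute IDs as its parent:

```python
import itertools

_ids = itertools.count()

class Attribute:
    """Toy stand-in for Catalyst's AttributeReference: a name plus an exprId."""
    def __init__(self, name):
        self.name = name
        self.expr_id = next(_ids)

class ToyFrame:
    """Toy DataFrame: filter() adds a node but reuses the parent's attributes."""
    def __init__(self, attrs):
        self.attrs = {a.name: a for a in attrs}

    def filter(self, _condition):
        # A real plan adds a Filter node on top but keeps the same attribute
        # IDs, which is exactly why df("id") and varadha("id") collide.
        return ToyFrame(self.attrs.values())

    def __call__(self, name):
        return self.attrs[name]

df = ToyFrame([Attribute("id"), Attribute("name")])
varadha = df.filter("id = 1")

# Same exprId on both sides: a resolver keyed on exprId cannot tell them apart.
print(df("id").expr_id == varadha("id").expr_id)  # True
```

Because the IDs are equal, any resolver that keys on exprId alone cannot distinguish df("id") from varadha("id"); that is the gap the proposed `JoinedData` wrapper would close.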
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177356#comment-15177356 ] Adrian Wang commented on SPARK-13446:
-------------------------------------
That's not enough. We still need some code changes.
> Spark need to support reading data from Hive 2.0.0 metastore
> ------------------------------------------------------------
>
>                 Key: SPARK-13446
>                 URL: https://issues.apache.org/jira/browse/SPARK-13446
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Lifeng Wang
>
> Spark provides the HiveContext class to read data from the Hive metastore directly, but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been released, it would be better to upgrade and support Hive 2.0.0 as well.
> {noformat}
> 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI thrift://hsw-node13:9083
> 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current connections: 1
> 16/02/23 02:35:02 INFO metastore: Connected to metastore.
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
>         at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473)
>         at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192)
>         at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
>         at org.apache.spark.sql.hive.HiveContext$$anon$1.(HiveContext.scala:422)
>         at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422)
>         at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421)
>         at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72)
>         at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739)
>         at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735)
> {noformat}
[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175034#comment-15175034 ] Adrian Wang commented on SPARK-13393:
-------------------------------------
[~srinathsmn] I have identified the issue and am working on it.
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158516#comment-15158516 ] Adrian Wang commented on SPARK-13446:
-------------------------------------
Hive 2.0 uses HIVE_ZOOKEEPER_SESSION_TIMEOUT instead of HIVE_STATS_JDBC_TIMEOUT; see HIVE-12164. We will look into this.
[jira] [Commented] (SPARK-12930) NullPointerException running hive query with array dereference in select and where clause
[ https://issues.apache.org/jira/browse/SPARK-12930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156600#comment-15156600 ] Adrian Wang commented on SPARK-12930:
-------------------------------------
Could you try SPARK-13056?
> NullPointerException running hive query with array dereference in select and where clause
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-12930
>                 URL: https://issues.apache.org/jira/browse/SPARK-12930
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>            Reporter: Thomas Graves
>
> I had a user running a Hive query from Spark with an array dereference in both the select clause and the where clause; it gave the user a NullPointerException when the where clause should have filtered the row out. It's as if Spark evaluates the select part before running the where clause. The info['pos'] below is what caused the issue:
> Query looked like:
> SELECT foo,
>        info['pos'] AS pos
> FROM db.table
> WHERE date >= '$initialDate' AND
>       date <= '$finalDate' AND
>       info is not null AND
>       info['pos'] is not null
> LIMIT 10
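The suspected evaluation-order problem can be mimicked in plain Python (toy rows, not Spark's execution engine): dereferencing the map before filtering raises, while filtering first never touches the null value:

```python
rows = [
    {"foo": 1, "info": {"pos": "a"}},
    {"foo": 2, "info": None},  # the WHERE clause should filter this row out
]

def project_then_filter(rows):
    # Mimics evaluating the SELECT list before the WHERE clause:
    # the null map is dereferenced and blows up.
    projected = [(r["foo"], r["info"]["pos"]) for r in rows]
    return [p for p in projected if p[1] is not None]

def filter_then_project(rows):
    # Mimics the correct order: the null map never reaches the dereference.
    kept = [r for r in rows
            if r["info"] is not None and r["info"].get("pos") is not None]
    return [(r["foo"], r["info"]["pos"]) for r in kept]

print(filter_then_project(rows))  # [(1, 'a')]
try:
    project_then_filter(rows)
except TypeError as e:
    print("projection hit the null value first:", e)
```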
[jira] [Commented] (SPARK-13301) PySpark Dataframe return wrong results with custom UDF
[ https://issues.apache.org/jira/browse/SPARK-13301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151913#comment-15151913 ] Adrian Wang commented on SPARK-13301:
-------------------------------------
Hi Simone, I tried your code on the master branch and the result is correct.
> PySpark Dataframe return wrong results with custom UDF
> ------------------------------------------------------
>
>                 Key: SPARK-13301
>                 URL: https://issues.apache.org/jira/browse/SPARK-13301
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>         Environment: PySpark in yarn-client mode - CDH 5.5.1
>            Reporter: Simone
>            Priority: Critical
>
> Using a User Defined Function in PySpark inside the withColumn() method of a DataFrame gives wrong results.
> Here is an example:
> from pyspark.sql import functions
> import string
> myFunc = functions.udf(lambda s: string.lower(s))
> myDF.select("col1", "col2").withColumn("col3", myFunc(myDF["col1"])).show()
> |col1| col2|col3|
> |1265AB4F65C05740E...|Ivo|4f00ae514e7c015be...|
> |1D94AB4F75C83B51E...| Raffaele|4f00dcf6422100c0e...|
> |4F008903600A0133E...| Cristina|4f008903600a0133e...|
> The results are wrong and seem to be random: some records are OK (for example the third), some others are not (for example the first two).
> The problem does not seem to occur with Spark built-in functions:
> from pyspark.sql.functions import *
> myDF.select("col1", "col2").withColumn("col3", lower(myDF["col1"])).show()
> Without the withColumn() method, the results seem to be always correct:
> myDF.select("col1", "col2", myFunc(myDF["col1"])).show()
> This can be considered only in part a workaround, because you have to list all the columns of your DataFrame each time.
> Also, the problem does not seem to occur in Scala/Java.
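One side note on the reproduction above, independent of the withColumn() bug: `string.lower` is the legacy Python 2 module function and was removed in Python 3, so a forward-compatible UDF body would call the `str.lower` method instead:

```python
# Python 3 equivalent of the UDF body string.lower(s)
lower = lambda s: s.lower()

print(lower("1D94AB4F75C83B51E"))  # 1d94ab4f75c83b51e
```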
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148196#comment-15148196 ] Adrian Wang commented on SPARK-13283:
-------------------------------------
So the problem here is that "from" is a reserved word in MySQL, but we failed to keep the backticks around it, right?
> Spark doesn't escape column names when creating table on JDBC
> -------------------------------------------------------------
>
>                 Key: SPARK-13283
>                 URL: https://issues.apache.org/jira/browse/SPARK-13283
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Maciej Bryński
>
> Hi,
> I have the following problem: I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'from DECIMAL(20,0) , ' at line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign when creating the table:
> {code}
> `from`
> {code}
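A minimal sketch of the missing quoting (a hypothetical helper, not Spark's actual JDBC code): wrap every identifier in the dialect's quote character when building the CREATE TABLE statement, so reserved words like `from` survive:

```python
def quote_ident(name, quote="`"):
    # Double any embedded quote char, then wrap (MySQL backtick style)
    return quote + name.replace(quote, quote * 2) + quote

def create_table_ddl(table, columns):
    # columns is a list of (name, sql_type) pairs
    cols = ", ".join(f"{quote_ident(n)} {t}" for n, t in columns)
    return f"CREATE TABLE {quote_ident(table)} ({cols})"

print(create_table_ddl("t", [("from", "DECIMAL(20,0)")]))
# CREATE TABLE `t` (`from` DECIMAL(20,0))
```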
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148171#comment-15148171 ] Adrian Wang commented on SPARK-13283:
-------------------------------------
See the comments on SPARK-13297; this has been fixed in the master branch.
[jira] [Commented] (SPARK-12985) Spark Hive thrift server big decimal data issue
[ https://issues.apache.org/jira/browse/SPARK-12985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130013#comment-15130013 ] Adrian Wang commented on SPARK-12985:
-------------------------------------
I think this is a problem in Simba. JDBC never requires a `Decimal` to be a `HiveDecimal`.
> Spark Hive thrift server big decimal data issue
> -----------------------------------------------
>
>                 Key: SPARK-12985
>                 URL: https://issues.apache.org/jira/browse/SPARK-12985
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Alex Liu
>            Priority: Minor
>
> I tested the trial version of the JDBC driver from Simba; it works for simple queries, but there is an issue with data mapping, e.g.
> {code}
> java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching data rows: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal;
>         at com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown Source)
>         at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
>         at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
>         at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown Source)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> Caused by: com.simba.spark.support.exceptions.GeneralException: [Simba][SparkJDBCDriver](500312) Error in fetching data rows: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal;
> ... 8 more
> {code}
> To fix it, apply
> {code}
>        case DecimalType() =>
> -        to += from.getDecimal(ordinal)
> +        to += HiveDecimal.create(from.getDecimal(ordinal))
> {code}
> to https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87
[jira] [Created] (SPARK-13056) Map column would throw NPE if value is null
Adrian Wang created SPARK-13056:
-----------------------------------
             Summary: Map column would throw NPE if value is null
                 Key: SPARK-13056
                 URL: https://issues.apache.org/jira/browse/SPARK-13056
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Adrian Wang

Create a map like {"a": "somestring", "b": null}. A query like
SELECT col["b"] FROM t1;
would throw an NPE.
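In plain Python terms (an analogy, not Spark's code path), the hazard is a key that exists but maps to null; the lookup itself is fine, and the NPE comes from using the value without a null check:

```python
m = {"a": "somestring", "b": None}

print(m["b"])  # None: the key is present, the value is null

# m["b"].upper() would raise AttributeError, the Python analogue of the NPE;
# a null-safe access checks the value first:
safe = m["b"].upper() if m["b"] is not None else None
print(safe)  # None
```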
[jira] [Created] (SPARK-12828) support natural join
Adrian Wang created SPARK-12828: --- Summary: support natural join Key: SPARK-12828 URL: https://issues.apache.org/jira/browse/SPARK-12828 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang support queries like: select * from t1 natural join t2; select * from t1 natural left join t2; select * from t1 natural right join t2; select * from t1 natural full outer join t2; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
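For reference, the inner case of natural-join semantics (an equi-join on all same-named columns, with each shared column emitted once) can be sketched in plain Python over lists of dicts. This illustrates the requested behavior, not Spark's implementation; the outer variants additionally null-pad unmatched rows:

```python
def natural_join(left, right):
    # Columns shared by name drive the join condition
    shared = set(left[0]) & set(right[0]) if left and right else set()
    out = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in shared):
                row = dict(l)
                # Shared columns appear once; right-only columns are appended
                row.update({k: v for k, v in r.items() if k not in shared})
                out.append(row)
    return out

t1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
t2 = [{"id": 1, "b": "p"}]
print(natural_join(t1, t2))  # [{'id': 1, 'a': 'x', 'b': 'p'}]
```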
[jira] [Created] (SPARK-11983) remove all unused codegen fallback traits
Adrian Wang created SPARK-11983: --- Summary: remove all unused codegen fallback traits Key: SPARK-11983 URL: https://issues.apache.org/jira/browse/SPARK-11983 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11983) remove all unused codegen fallback traits
[ https://issues.apache.org/jira/browse/SPARK-11983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-11983:
--------------------------------
    Description: We use the trait `CodegenFallback` to generate default codegen code; if an expression has implemented genCode, there is no need to derive from this trait.
[jira] [Commented] (SPARK-11972) [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session
[ https://issues.apache.org/jira/browse/SPARK-11972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026093#comment-15026093 ] Adrian Wang commented on SPARK-11972:
-------------------------------------
SPARK-11624 would resolve this, too. That's because we created a new SessionState that hasn't taken the command-line options into account.
> [Spark SQL] the value of 'hiveconf' parameter in CLI can't be got after enter spark-sql session
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11972
>                 URL: https://issues.apache.org/jira/browse/SPARK-11972
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Yi Zhou
>            Priority: Critical
>
> Reproduce Steps:
> /usr/lib/spark/bin/spark-sql -v --driver-memory 4g --executor-memory 7g --executor-cores 5 --num-executors 31 --master yarn-client --conf spark.yarn.executor.memoryOverhead=1024 --hiveconf RESULT_TABLE=test_result01
> {code}
> >use test;
> >DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE};
> 15/11/24 13:45:12 INFO parse.ParseDriver: Parsing command: DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}
> NoViableAltException(16@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
>         at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
>         at org.antlr.runtime.DFA.predict(DFA.java:144)
>         at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
>         at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
>         at org.apache.hadoop.hive.ql.parse.HiveParser.dropTableStatement(HiveParser.java:7133)
>         at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2655)
>         at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
>         at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
>         at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
>         at
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166) > at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276) > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34) > at 
org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295) > at > org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65) > at > org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:65) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279) > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225) > at >
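What the reproduction shows is the substitution step being skipped: a helper along these lines (hypothetical, not Hive's actual variable-substitution code) is expected to expand ${hiveconf:NAME} from the --hiveconf definitions before the statement reaches the parser, which is why the raw ${hiveconf:RESULT_TABLE} trips the tableName grammar rule above:

```python
import re

def substitute_hiveconf(sql, hiveconf):
    # Expand ${hiveconf:NAME} using the --hiveconf definitions (toy version);
    # a missing name raises KeyError rather than passing garbage to the parser
    return re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: hiveconf[m.group(1)], sql)

sql = "DROP TABLE IF EXISTS ${hiveconf:RESULT_TABLE}"
print(substitute_hiveconf(sql, {"RESULT_TABLE": "test_result01"}))
# DROP TABLE IF EXISTS test_result01
```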
[jira] [Created] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word
Adrian Wang created SPARK-11916: --- Summary: Expression TRIM/LTRIM/RTRIM to support specific trim word Key: SPARK-11916 URL: https://issues.apache.org/jira/browse/SPARK-11916 Project: Spark Issue Type: Improvement Reporter: Adrian Wang Priority: Minor supports expressions like `trim('xxxabcxxx', 'x')` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
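Python's str.strip family already has exactly this shape, which makes a handy reference point for the proposed two-argument semantics:

```python
s = "xxxabcxxx"

print(s.strip("x"))   # abc    (like TRIM('xxxabcxxx', 'x'))
print(s.lstrip("x"))  # abcxxx (like LTRIM with a trim character)
print(s.rstrip("x"))  # xxxabc (like RTRIM with a trim character)
```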
[jira] [Created] (SPARK-11624) Spark SQL CLI will set sessionstate twice
Adrian Wang created SPARK-11624: --- Summary: Spark SQL CLI will set sessionstate twice Key: SPARK-11624 URL: https://issues.apache.org/jira/browse/SPARK-11624 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11624) Spark SQL CLI will set sessionstate twice
[ https://issues.apache.org/jira/browse/SPARK-11624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-11624:
--------------------------------
    Description:
spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[jira] [Created] (SPARK-11591) flush spark-sql command line history to history file
Adrian Wang created SPARK-11591:
-----------------------------------
             Summary: flush spark-sql command line history to history file
                 Key: SPARK-11591
                 URL: https://issues.apache.org/jira/browse/SPARK-11591
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Adrian Wang

Currently, spark-sql does not flush command history when exiting.
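The requested behavior has a direct analogue in Python's readline module (shown here only as an illustration of the idea; spark-sql itself goes through JLine): append the in-memory history to a file before the process exits:

```python
import os
import readline
import tempfile

# Hypothetical history location for this sketch
history_file = os.path.join(tempfile.gettempdir(), "toy_sql_cli_history")

# Record a command and flush the history to disk, the step the CLI is missing
readline.add_history("select 1;")
readline.write_history_file(history_file)

print(os.path.exists(history_file))  # True
```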
[jira] [Created] (SPARK-11592) flush spark-sql command line history to history file
Adrian Wang created SPARK-11592:
-----------------------------------
             Summary: flush spark-sql command line history to history file
                 Key: SPARK-11592
                 URL: https://issues.apache.org/jira/browse/SPARK-11592
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Adrian Wang

Currently, spark-sql does not flush command history when exiting.
[jira] [Closed] (SPARK-11591) flush spark-sql command line history to history file
[ https://issues.apache.org/jira/browse/SPARK-11591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang closed SPARK-11591. --- Resolution: Fixed > flush spark-sql command line history to history file > > > Key: SPARK-11591 > URL: https://issues.apache.org/jira/browse/SPARK-11591 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang > > currently, spark-sql would not flush command history when exiting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11396) datetime function: to_unix_timestamp
Adrian Wang created SPARK-11396: --- Summary: datetime function: to_unix_timestamp Key: SPARK-11396 URL: https://issues.apache.org/jira/browse/SPARK-11396 Project: Spark Issue Type: Sub-task Reporter: Adrian Wang `to_unix_timestamp` is the deterministic version of `unix_timestamp`: it requires at least one parameter, so it never falls back to the current time.
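The distinction can be sketched in Python (simplified, UTC-only; the function names mirror the SQL functions, but the implementation is illustrative):

```python
import time
from datetime import datetime, timezone

def to_unix_timestamp(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Deterministic: always parses its argument, so the same input
    yields the same output (like Hive's to_unix_timestamp)."""
    dt = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

def unix_timestamp(value=None, fmt="%Y-%m-%d %H:%M:%S"):
    """Non-deterministic when called with no arguments: it returns the
    current time, so two calls in one query may disagree."""
    if value is None:
        return int(time.time())
    return to_unix_timestamp(value, fmt)

# With an argument, both functions agree and are repeatable:
assert unix_timestamp("1970-01-02 00:00:00") == to_unix_timestamp("1970-01-02 00:00:00")
```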
[jira] [Created] (SPARK-11312) Cannot drop temporary function
Adrian Wang created SPARK-11312: --- Summary: Cannot drop temporary function Key: SPARK-11312 URL: https://issues.apache.org/jira/browse/SPARK-11312 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang CREATE TEMPORARY FUNCTION is handled by executionHive, while DROP TEMPORARY FUNCTION is handled by metadataHive, so the function registered by the CREATE cannot be found by the DROP.
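A toy sketch of the mismatch (plain dictionaries stand in for the two Hive client registries; all names are illustrative, not Spark internals):

```python
# CREATE goes to one registry, DROP is routed to the other,
# so the drop never sees the function that was just created.
execution_hive = {}   # handles CREATE TEMPORARY FUNCTION
metadata_hive = {}    # handles DROP TEMPORARY FUNCTION

def create_temporary_function(name, impl):
    execution_hive[name] = impl

def drop_temporary_function(name, registry):
    if name not in registry:
        raise KeyError(f"temporary function {name} not found")
    del registry[name]

create_temporary_function("my_udf", lambda x: x)

# Routing the drop to metadataHive fails even though the function exists:
try:
    drop_temporary_function("my_udf", metadata_hive)
except KeyError:
    pass  # this is the reported bug

# Routing it to the registry that performed the CREATE succeeds:
drop_temporary_function("my_udf", execution_hive)
```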
[jira] [Resolved] (SPARK-11312) Cannot drop temporary function
[ https://issues.apache.org/jira/browse/SPARK-11312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang resolved SPARK-11312. - Resolution: Duplicate > Cannot drop temporary function > -- > > Key: SPARK-11312 > URL: https://issues.apache.org/jira/browse/SPARK-11312 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Adrian Wang > > CREATE TEMPORARY FUNCTION is handled by executionHive, while DROP TEMPORARY > FUNCTION is handled by metadataHive, so the function cannot be found when dropped.
[jira] [Commented] (SPARK-10507) reject temporal expressions such as timestamp - timestamp at parse time
[ https://issues.apache.org/jira/browse/SPARK-10507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952747#comment-14952747 ] Adrian Wang commented on SPARK-10507: - It seems this bug was fixed long ago. I just checked 1.5.0 and there's no such problem. > reject temporal expressions such as timestamp - timestamp at parse time > > > Key: SPARK-10507 > URL: https://issues.apache.org/jira/browse/SPARK-10507 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 >Reporter: N Campbell >Priority: Minor > > TIMESTAMP - TIMESTAMP in ISO-SQL should return an interval type, which Spark > does not support. > A similar expression in Hive 0.13 fails with Error: Could not create > ResultSet: Required field 'type' is unset! > Struct:TPrimitiveTypeEntry(type:null), and Spark has similar "challenges". > While Hive 1.2.1 has added some interval type support, it is far from complete > with respect to ISO-SQL. > The ability to compute the period of time (years, days, weeks, hours, ...) > between timestamps or add/subtract intervals to/from a timestamp is extremely > common in business applications. > Currently, a value expression such as select timestampcol - timestampcol from > t fails during execution rather than at parse time. While the error thrown > states that fact, it would be better for those value expressions to be rejected > at parse time, along with an indication of the expression that is causing the > parser error. 
> Operation: execute > Errors: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 6214.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 6214.0 (TID 21208, sandbox.hortonworks.com): java.lang.RuntimeException: Type > TimestampType does not support numeric operations > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric$lzycompute(arithmetic.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Subtract.numeric(arithmetic.scala:136) > at > org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:150) > at > org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:113) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at 
org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:813) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498) > {code} > {code} > create table if not exists TTS ( RNUM int , CTS timestamp )TERMINATED BY > '\n' > STORED AS orc ; > {code}
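The behavior the ticket asks for — rejecting `timestamp - timestamp` during analysis instead of failing with "does not support numeric operations" at execution time — can be sketched as a simple operand-type check (type names and error wording below are illustrative, not Catalyst's):

```python
# Validate operand types while building/analyzing the expression tree,
# so the bad expression never reaches execution.
NUMERIC_TYPES = {"int", "long", "float", "double", "decimal"}

def check_subtract(left_type, right_type):
    """Reject non-numeric operands of '-' at analysis time."""
    for t in (left_type, right_type):
        if t not in NUMERIC_TYPES:
            raise ValueError(
                f"cannot resolve '-': type {t} does not support numeric operations"
            )

try:
    check_subtract("timestamp", "timestamp")  # rejected up front
except ValueError as e:
    message = str(e)

assert "timestamp" in message
check_subtract("int", "long")  # numeric operands pass the check
```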
[jira] [Created] (SPARK-10463) remove PromotePrecision during optimization
Adrian Wang created SPARK-10463: --- Summary: remove PromotePrecision during optimization Key: SPARK-10463 URL: https://issues.apache.org/jira/browse/SPARK-10463 Project: Spark Issue Type: Improvement Reporter: Adrian Wang Priority: Trivial This node is not necessary after HiveTypeCoercion.
[jira] [Commented] (SPARK-8360) Streaming DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712377#comment-14712377 ] Adrian Wang commented on SPARK-8360: https://github.com/intel-bigdata/spark-streamingsql Our streaming SQL project is highly related to this jira ticket. Streaming DataFrames Key: SPARK-8360 URL: https://issues.apache.org/jira/browse/SPARK-8360 Project: Spark Issue Type: Umbrella Components: SQL, Streaming Reporter: Reynold Xin Umbrella ticket to track what's needed to make streaming DataFrame a reality.
[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-10130: Priority: Blocker (was: Major) type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Blocker SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-10130: Fix Version/s: (was: 1.5.0) type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-10130: Target Version/s: 1.5.0 type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Blocker SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Created] (SPARK-10130) type coercion for IF should have children resolved first
Adrian Wang created SPARK-10130: --- Summary: type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-10130: Fix Version/s: 1.5.0 type coercion for IF should have children resolved first Key: SPARK-10130 URL: https://issues.apache.org/jira/browse/SPARK-10130 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Fix For: 1.5.0 SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;
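The fix for SPARK-10130 can be sketched as a guard in the type-coercion rule: skip the rule until both branches of the IF are resolved, so coercion never runs against an unknown type. The expression encoding below is illustrative, not Catalyst's:

```python
# Each branch is a (resolved, type) pair in this sketch.
def coerce_if(condition, true_branch, false_branch):
    """Pick a common type for the IF branches, but only once
    both children are resolved."""
    if not (true_branch[0] and false_branch[0]):
        return None  # skip the rule; children are not resolved yet
    t1, t2 = true_branch[1], false_branch[1]
    # Simplified widening: int vs double widens to double.
    widened = "double" if {t1, t2} == {"int", "double"} else t1
    return widened

# Unresolved child (e.g. an attribute from a not-yet-analyzed subquery):
assert coerce_if("a > 0", (False, None), (True, "int")) is None
# Once both children resolve, coercion can pick the wider type:
assert coerce_if("a > 0", (True, "int"), (True, "double")) == "double"
```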
[jira] [Created] (SPARK-10083) CaseWhen should support type coercion of DecimalType and FractionalType
Adrian Wang created SPARK-10083: --- Summary: CaseWhen should support type coercion of DecimalType and FractionalType Key: SPARK-10083 URL: https://issues.apache.org/jira/browse/SPARK-10083 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang create table t1 (a decimal(7, 2), b long); select case when 1=1 then a else 1.0 end from t1; select case when 1=1 then a else b end from t1;
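A toy sketch of the desired widening when CASE WHEN branches mix a decimal type with fractional or integral types (the type lattice here is deliberately simplified; Spark's real decimal precision rules are more involved):

```python
# Widen two branch types to the first common type in a simple ordering:
# integral < decimal < fractional.
def widen(t1, t2):
    order = ["long", "decimal(7,2)", "double"]
    return max(t1, t2, key=order.index)

# decimal vs fractional literal (the `else 1.0` case) widens to double:
assert widen("decimal(7,2)", "double") == "double"
# decimal vs integral column (the `else b` case) stays decimal:
assert widen("decimal(7,2)", "long") == "decimal(7,2)"
```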
[jira] [Commented] (SPARK-9374) [Spark SQL] Throw out erorr of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase
[ https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643761#comment-14643761 ] Adrian Wang commented on SPARK-9374: [~chenghao][~cloud_fan][~jameszhouyi] UnixTimestamp is a non-deterministic expression, because when we pass zero arguments to this function it means the same as current_timestamp. There is a deterministic version of this function in Hive, namely to_unix_timestamp; we could use that temporarily. Once SPARK-8174 is resolved, we will be able to tell whether a given use of unix_timestamp is deterministic or not. [Spark SQL] Throw out erorr of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase --- Key: SPARK-9374 URL: https://issues.apache.org/jira/browse/SPARK-9374 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yi Zhou Priority: Blocker #Spark SQL Query INSERT INTO TABLE TEST_QUERY_0_result SELECT w_state, i_item_id, SUM( CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') < unix_timestamp('2001-03-16','yyyy-MM-dd')) THEN ws_sales_price - COALESCE(wr_refunded_cash,0) ELSE 0.0 END ) AS sales_before, SUM( CASE WHEN (unix_timestamp(d_date,'yyyy-MM-dd') >= unix_timestamp('2001-03-16','yyyy-MM-dd')) THEN ws_sales_price - coalesce(wr_refunded_cash,0) ELSE 0.0 END ) AS sales_after FROM ( SELECT * FROM web_sales ws LEFT OUTER JOIN web_returns wr ON (ws.ws_order_number = wr.wr_order_number AND ws.ws_item_sk = wr.wr_item_sk) ) a1 JOIN item i ON a1.ws_item_sk = i.i_item_sk JOIN warehouse w ON a1.ws_warehouse_sk = w.w_warehouse_sk JOIN date_dim d ON a1.ws_sold_date_sk = d.d_date_sk AND unix_timestamp(d.d_date, 'yyyy-MM-dd') >= unix_timestamp('2001-03-16', 'yyyy-MM-dd') - 30*24*60*60 --subtract 30 days in seconds AND unix_timestamp(d.d_date, 'yyyy-MM-dd') <= unix_timestamp('2001-03-16', 'yyyy-MM-dd') + 30*24*60*60 --add 30 days in seconds GROUP BY w_state,i_item_id CLUSTER BY w_state,i_item_id Error Message## 
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in Project or Filter, found: (((ws_sold_date_sk = d_date_sk) (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,-MM-dd) = (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,-MM-dd) - CAST30 * 24) * 60) * 60), LongType (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date,-MM-dd) = (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,-MM-dd) + CAST30 * 24) * 60) * 60), LongType in operator Join Inner, Somews_sold_date_sk#289L = d_date_sk#383L) (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,-MM-dd) = (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,-MM-dd) - CAST30 * 24) * 60) * 60), LongType (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(d_date#385,-MM-dd) = (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFUnixTimeStamp(2001-03-16,-MM-dd) + CAST30 * 24) * 60) * 60), LongType) ; at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:148) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) at
[jira] [Commented] (SPARK-9196) DatetimeExpressionsSuite: function current_timestamp is flaky
[ https://issues.apache.org/jira/browse/SPARK-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633733#comment-14633733 ] Adrian Wang commented on SPARK-9196: Thanks, I'll fix it asap DatetimeExpressionsSuite: function current_timestamp is flaky - Key: SPARK-9196 URL: https://issues.apache.org/jira/browse/SPARK-9196 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical {code} - function current_timestamp *** FAILED *** (77 milliseconds) [info] Results do not match for query: [info] == Parsed Logical Plan == [info] 'Project [unresolvedalias(('CURRENT_TIMESTAMP() = 'CURRENT_TIMESTAMP()))] [info]OneRowRelation$ [info] [info] == Analyzed Logical Plan == [info] _c0: boolean [info] Project [(currenttimestamp() = currenttimestamp()) AS _c0#11436] [info]OneRowRelation$ [info] [info] == Optimized Logical Plan == [info] Project [false AS _c0#11436] [info]OneRowRelation$ [info] [info] == Physical Plan == [info] Project [false AS _c0#11436] [info]PhysicalRDD ParallelCollectionRDD[650] at apply at Transformer.scala:22 [info] [info] Code Generation: true [info] == RDD == [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![true] [false] (QueryTest.scala:61) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:61) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:67) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply$mcV$sp(DatetimeExpressionsSuite.scala:42) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) 
[info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuite.run(FunSuite.scala:1555) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at
[jira] [Commented] (SPARK-9196) DatetimeExpressionsSuite: function current_timestamp is flaky
[ https://issues.apache.org/jira/browse/SPARK-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633740#comment-14633740 ] Adrian Wang commented on SPARK-9196: if this is very often, we can ignore this test for now, I'll fix it tomorrow. DatetimeExpressionsSuite: function current_timestamp is flaky - Key: SPARK-9196 URL: https://issues.apache.org/jira/browse/SPARK-9196 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical {code} - function current_timestamp *** FAILED *** (77 milliseconds) [info] Results do not match for query: [info] == Parsed Logical Plan == [info] 'Project [unresolvedalias(('CURRENT_TIMESTAMP() = 'CURRENT_TIMESTAMP()))] [info]OneRowRelation$ [info] [info] == Analyzed Logical Plan == [info] _c0: boolean [info] Project [(currenttimestamp() = currenttimestamp()) AS _c0#11436] [info]OneRowRelation$ [info] [info] == Optimized Logical Plan == [info] Project [false AS _c0#11436] [info]OneRowRelation$ [info] [info] == Physical Plan == [info] Project [false AS _c0#11436] [info]PhysicalRDD ParallelCollectionRDD[650] at apply at Transformer.scala:22 [info] [info] Code Generation: true [info] == RDD == [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![true] [false] (QueryTest.scala:61) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:61) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:67) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply$mcV$sp(DatetimeExpressionsSuite.scala:42) [info] at 
org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at 
org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuite.run(FunSuite.scala:1555) [info] at
[jira] [Commented] (SPARK-9196) DatetimeExpressionsSuite: function current_timestamp is flaky
[ https://issues.apache.org/jira/browse/SPARK-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634420#comment-14634420 ] Adrian Wang commented on SPARK-9196: [~davies] I have two solutions for this problem: 1. Back this function with something like a lazy val, but we need to put a flag on each query indicating whether it has been evaluated. 2. Substitute this function with a constant at the analysis phase. This would be a little different from Hive, since Hive gets the current timestamp at the beginning of evaluation. We could also find a way to mark multiple appearances as a single object at the analysis phase. cc [~marmbrus] DatetimeExpressionsSuite: function current_timestamp is flaky - Key: SPARK-9196 URL: https://issues.apache.org/jira/browse/SPARK-9196 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical {code} - function current_timestamp *** FAILED *** (77 milliseconds) [info] Results do not match for query: [info] == Parsed Logical Plan == [info] 'Project [unresolvedalias(('CURRENT_TIMESTAMP() = 'CURRENT_TIMESTAMP()))] [info]OneRowRelation$ [info] [info] == Analyzed Logical Plan == [info] _c0: boolean [info] Project [(currenttimestamp() = currenttimestamp()) AS _c0#11436] [info]OneRowRelation$ [info] [info] == Optimized Logical Plan == [info] Project [false AS _c0#11436] [info]OneRowRelation$ [info] [info] == Physical Plan == [info] Project [false AS _c0#11436] [info]PhysicalRDD ParallelCollectionRDD[650] at apply at Transformer.scala:22 [info] [info] Code Generation: true [info] == RDD == [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![true] [false] (QueryTest.scala:61) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at 
org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:61) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:67) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply$mcV$sp(DatetimeExpressionsSuite.scala:42) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at
[jira] [Commented] (SPARK-9196) DatetimeExpressionsSuite: function current_timestamp is flaky
[ https://issues.apache.org/jira/browse/SPARK-9196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634452#comment-14634452 ] Adrian Wang commented on SPARK-9196: [~marmbrus] We have a test for that case. The flaky test is to prove that multiple entries of the same function within one query return the same value. Actually, as this function is always foldable, at the optimization phase we will compute all values, so the gap would not be too large. P.S.: Hive's definition of current_timestamp(): Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. DatetimeExpressionsSuite: function current_timestamp is flaky - Key: SPARK-9196 URL: https://issues.apache.org/jira/browse/SPARK-9196 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical {code} - function current_timestamp *** FAILED *** (77 milliseconds) [info] Results do not match for query: [info] == Parsed Logical Plan == [info] 'Project [unresolvedalias(('CURRENT_TIMESTAMP() = 'CURRENT_TIMESTAMP()))] [info]OneRowRelation$ [info] [info] == Analyzed Logical Plan == [info] _c0: boolean [info] Project [(currenttimestamp() = currenttimestamp()) AS _c0#11436] [info]OneRowRelation$ [info] [info] == Optimized Logical Plan == [info] Project [false AS _c0#11436] [info]OneRowRelation$ [info] [info] == Physical Plan == [info] Project [false AS _c0#11436] [info]PhysicalRDD ParallelCollectionRDD[650] at apply at Transformer.scala:22 [info] [info] Code Generation: true [info] == RDD == [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 1 == [info] ![true] [false] (QueryTest.scala:61) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) [info] at 
org.scalatest.Assertions$class.fail(Assertions.scala:1328) [info] at org.scalatest.FunSuite.fail(FunSuite.scala:1555) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:61) [info] at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:67) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply$mcV$sp(DatetimeExpressionsSuite.scala:42) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.apache.spark.sql.DatetimeExpressionsSuite$$anonfun$2.apply(DatetimeExpressionsSuite.scala:39) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at
[jira] [Commented] (SPARK-9051) SortMergeCompatibilitySuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629104#comment-14629104 ] Adrian Wang commented on SPARK-9051: This might have something to do with SPARK-9027, though I'm not quite sure. I'll take a closer look. SortMergeCompatibilitySuite is flaky - Key: SPARK-9051 URL: https://issues.apache.org/jira/browse/SPARK-9051 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=centos/2951/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9051) SortMergeCompatibilitySuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629105#comment-14629105 ] Adrian Wang commented on SPARK-9051: Just found that Michael has reverted it along with SPARK-6910, and things are OK now. So that was the cause. SortMergeCompatibilitySuite is flaky - Key: SPARK-9051 URL: https://issues.apache.org/jira/browse/SPARK-9051 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Adrian Wang Priority: Critical For example: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.3,label=centos/2951/testReport/junit/org.apache.spark.sql.hive.execution/SortMergeCompatibilitySuite/auto_sortmerge_join_16/
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616301#comment-14616301 ] Adrian Wang commented on SPARK-8864: just providing a précis of the current design for your information. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616299#comment-14616299 ] Adrian Wang commented on SPARK-8864: No, that's not enough. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
[jira] [Comment Edited] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293 ] Adrian Wang edited comment on SPARK-8864 at 7/7/15 7:34 AM: Then we are using a Long for µs. A Long can be up to 9.2E18, which is more than 1E8 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for the day-time interval is fine. was (Author: adrian-wang): Then we are using a Long for µs. A Long can be up to 9.2E18, which is more than 1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for the day-time interval is fine. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
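The corrected range claim above ("more than 1E8 days") can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the Long counts microseconds, which is the precision the 1E8 figure corresponds to:

```python
# Capacity of a signed 64-bit Long at microsecond precision, in days.
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000   # 86_400_000_000
max_long = 2**63 - 1                        # about 9.2E18, as in the comment

days = max_long // MICROS_PER_DAY
print(days)        # roughly 1.07E8 days, i.e. "more than 1E8 days"
assert days > 10**8
```

At millisecond precision the same arithmetic gives about 1E11 days, which may explain the original (pre-edit) figure in the comment.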
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616268#comment-14616268 ] Adrian Wang commented on SPARK-8864: Thanks for the design. Two comments: 1. If an IntervalType value is in year-month format, we cannot use 100ns units to represent it. Hive uses two internal types to handle year-month and day-time separately. 2. When casting TimestampType to StringType, or casting from StringType (unless it is an ISO 8601 time string which contains timezone info), we should also consider the timezone. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs.pdf Please see the attached design doc.
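The point about two internal representations can be illustrated with a minimal sketch (names here are illustrative, not Spark's actual types): a year-month interval has no fixed length in time units, since months vary in length, so it must be stored as a month count separate from the fixed-duration day-time part.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    # Year-month component: a month count, which cannot be converted
    # to a fixed number of microseconds (months vary in length).
    months: int
    # Day-time component: a fixed duration in microseconds.
    micros: int

    def __add__(self, other: "Interval") -> "Interval":
        # The two components stay separate under arithmetic.
        return Interval(self.months + other.months, self.micros + other.micros)

one_year = Interval(months=12, micros=0)
one_hour = Interval(months=0, micros=3_600_000_000)
total = one_year + one_hour
assert (total.months, total.micros) == (12, 3_600_000_000)
```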
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293 ] Adrian Wang commented on SPARK-8864: Then we are using a Long for µs. A Long can be up to 9.2E18, which is more than 1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for the day-time interval is fine. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
[jira] [Resolved] (SPARK-5215) concat support in sqlcontext
[ https://issues.apache.org/jira/browse/SPARK-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang resolved SPARK-5215. Resolution: Duplicate concat support in sqlcontext Key: SPARK-5215 URL: https://issues.apache.org/jira/browse/SPARK-5215 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang define concat following the rules in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578513#comment-14578513 ] Adrian Wang commented on SPARK-8174: I'll deal with this. date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin 3 variants:
{code}
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string|date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801

unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
{code}
See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
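A rough sketch of the three variants' semantics, not Spark's implementation: patterns here use Python's strptime syntax rather than Java's SimpleDateFormat (so 'yyyy-MM-dd' becomes '%Y-%m-%d'), and the timezone is pinned to UTC, whereas the ticket's example value 1237532400 assumes a Pacific-time default.

```python
from datetime import datetime, timezone
import time

def unix_timestamp(s=None, pattern="%Y-%m-%d %H:%M:%S"):
    if s is None:
        # Variant 1: current Unix timestamp in seconds.
        return int(time.time())
    try:
        # Variants 2 and 3: parse with the (default or given) pattern;
        # Hive's documented behavior is to return 0 on failure.
        dt = datetime.strptime(s, pattern).replace(tzinfo=timezone.utc)
        return int(dt.timestamp())
    except ValueError:
        return 0

assert unix_timestamp("2009-03-20", "%Y-%m-%d") == 1237507200  # midnight UTC
assert unix_timestamp("not a date") == 0
```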
[jira] [Commented] (SPARK-8182) date/time function: minute
[ https://issues.apache.org/jira/browse/SPARK-8182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578523#comment-14578523 ] Adrian Wang commented on SPARK-8182: I'll deal with this. date/time function: minute -- Key: SPARK-8182 URL: https://issues.apache.org/jira/browse/SPARK-8182 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin minute(string|date|timestamp): int Returns the minute of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[jira] [Commented] (SPARK-8183) date/time function: second
[ https://issues.apache.org/jira/browse/SPARK-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578524#comment-14578524 ] Adrian Wang commented on SPARK-8183: I'll deal with this. date/time function: second -- Key: SPARK-8183 URL: https://issues.apache.org/jira/browse/SPARK-8183 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin second(string|date|timestamp): int Returns the second of the timestamp. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
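The field-extraction sub-tasks (minute above, second here, and siblings such as hour and year) all share the same shape: accept string|date|timestamp, extract one field. A conceptual Python sketch of that semantics, not Spark's Catalyst expressions:

```python
from datetime import datetime

def _to_datetime(value):
    # Accept an already-parsed datetime, a 'yyyy-MM-dd HH:mm:ss' string,
    # or a bare 'yyyy-MM-dd' date string (the string|date|timestamp forms).
    if isinstance(value, datetime):
        return value
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    raise ValueError(f"unparseable datetime: {value!r}")

def minute(value): return _to_datetime(value).minute
def second(value): return _to_datetime(value).second

assert minute("2009-03-20 11:30:01") == 30
assert second("2009-03-20 11:30:01") == 1
```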
[jira] [Commented] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578525#comment-14578525 ] Adrian Wang commented on SPARK-8184: I'll deal with this. date/time function: weekofyear -- Key: SPARK-8184 URL: https://issues.apache.org/jira/browse/SPARK-8184 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin weekofyear(string|date|timestamp): int Returns the week number of a timestamp string: weekofyear(1970-11-01 00:00:00) = 44, weekofyear(1970-11-01) = 44.
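The ticket's example values match the ISO-8601 week number, which can be sketched with Python's stdlib (illustrative only, not Spark's implementation):

```python
from datetime import date

def weekofyear(s: str) -> int:
    # Take the date prefix so both 'yyyy-MM-dd' and
    # 'yyyy-MM-dd HH:mm:ss' inputs are accepted.
    y, m, d = map(int, s[:10].split("-"))
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    return date(y, m, d).isocalendar()[1]

assert weekofyear("1970-11-01 00:00:00") == 44
assert weekofyear("1970-11-01") == 44
```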
[jira] [Commented] (SPARK-8193) date/time function: current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578601#comment-14578601 ] Adrian Wang commented on SPARK-8193: I'll deal with this. date/time function: current_timestamp - Key: SPARK-8193 URL: https://issues.apache.org/jira/browse/SPARK-8193 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin current_timestamp(): timestamp Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value. We should just replace this with a timestamp literal in the optimizer.
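The "replace this with a timestamp literal in the optimizer" idea can be sketched as a tree rewrite that evaluates the clock once per query and substitutes the same literal at every occurrence, which is exactly why all calls within one query return the same value. Illustrative Python on a toy expression tree, not Spark's Catalyst rule:

```python
import time

# Toy expression trees: ("current_timestamp",) leaves, ("lit", v) literals,
# and ("eq", a, b) comparison nodes.
def fold_current_timestamp(expr, now=None):
    """Replace every current_timestamp() with one shared literal."""
    if now is None:
        now = time.time()   # evaluated exactly once per query
    if expr == ("current_timestamp",):
        return ("lit", now)
    if expr[0] == "eq":
        return ("eq", fold_current_timestamp(expr[1], now),
                      fold_current_timestamp(expr[2], now))
    return expr

def evaluate(expr):
    if expr[0] == "lit":
        return expr[1]
    if expr[0] == "eq":
        return evaluate(expr[1]) == evaluate(expr[2])

# The flaky-test query: current_timestamp() = current_timestamp().
query = ("eq", ("current_timestamp",), ("current_timestamp",))
assert evaluate(fold_current_timestamp(query)) is True
```

Without the fold, each leaf would read the clock independently and the comparison could occasionally evaluate to false, which is the failure mode reported in SPARK-9196.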
[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578509#comment-14578509 ] Adrian Wang commented on SPARK-8159: Are we missing xpath functions? Improve SQL/DataFrame expression coverage - Key: SPARK-8159 URL: https://issues.apache.org/jira/browse/SPARK-8159 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is an umbrella ticket to track new expressions we are adding to SQL/DataFrame. For each new expression, we should: 1. Add a new Expression implementation in org.apache.spark.sql.catalyst.expressions 2. If applicable, implement the code generated version (by implementing genCode). 3. Add comprehensive unit tests (for all the data types the expressions support). 4. If applicable, add a new function for DataFrame in org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for Python. For date/time functions, put them in expressions/datetime.scala, and create a DateTimeFunctionSuite.scala for testing.
[jira] [Issue Comment Deleted] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-8159: --- Comment: was deleted (was: Are we missing xpath functions?) Improve SQL/DataFrame expression coverage - Key: SPARK-8159 URL: https://issues.apache.org/jira/browse/SPARK-8159 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is an umbrella ticket to track new expressions we are adding to SQL/DataFrame. For each new expression, we should: 1. Add a new Expression implementation in org.apache.spark.sql.catalyst.expressions 2. If applicable, implement the code generated version (by implementing genCode). 3. Add comprehensive unit tests (for all the data types the expressions support). 4. If applicable, add a new function for DataFrame in org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for Python. For date/time functions, put them in expressions/datetime.scala, and create a DateTimeFunctionSuite.scala for testing.
[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578508#comment-14578508 ] Adrian Wang commented on SPARK-8159: Are we missing xpath functions? Improve SQL/DataFrame expression coverage - Key: SPARK-8159 URL: https://issues.apache.org/jira/browse/SPARK-8159 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin This is an umbrella ticket to track new expressions we are adding to SQL/DataFrame. For each new expression, we should: 1. Add a new Expression implementation in org.apache.spark.sql.catalyst.expressions 2. If applicable, implement the code generated version (by implementing genCode). 3. Add comprehensive unit tests (for all the data types the expressions support). 4. If applicable, add a new function for DataFrame in org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for Python. For date/time functions, put them in expressions/datetime.scala, and create a DateTimeFunctionSuite.scala for testing.
[jira] [Commented] (SPARK-8177) date/time function: year
[ https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578518#comment-14578518 ] Adrian Wang commented on SPARK-8177: I'll deal with this. date/time function: year Key: SPARK-8177 URL: https://issues.apache.org/jira/browse/SPARK-8177 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin year(string|date|timestamp): int Returns the year part of a date or a timestamp string: year(1970-01-01 00:00:00) = 1970, year(1970-01-01) = 1970. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF