[jira] [Updated] (SPARK-28510) Implement Spark's own GetFunctionsOperation

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28510:

Description: We should implement Spark's own {{GetFunctionsOperation}} because 
Spark SQL functions and Hive UDFs differ in many ways.

> Implement Spark's own GetFunctionsOperation
> ---
>
> Key: SPARK-28510
> URL: https://issues.apache.org/jira/browse/SPARK-28510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> We should implement Spark's own {{GetFunctionsOperation}} because Spark SQL 
> functions and Hive UDFs differ in many ways.
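As a rough illustration of the direction (a hypothetical sketch, not the actual Thrift-server operation), a Spark-specific implementation would answer the JDBC function-metadata call from Spark's own catalog rather than Hive's UDF registry; the public {{spark.catalog.listFunctions()}} API already exposes the fields such a result set needs:

{code:scala}
// Hypothetical sketch: enumerate functions from Spark's own catalog, which is the
// data a Spark-specific GetFunctionsOperation would return instead of Hive's UDFs.
import org.apache.spark.sql.SparkSession

object ListSparkFunctions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("list-spark-functions")
      .getOrCreate()

    // Each Function entry carries name, database, description, className and
    // isTemporary -- the columns a GetFunctions result set is built from.
    spark.catalog.listFunctions().collect().foreach { fn =>
      println(s"${Option(fn.database).getOrElse("")}.${fn.name} -> ${fn.className}")
    }

    spark.stop()
  }
}
{code}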






[jira] [Created] (SPARK-28511) Get REV from RELEASE_VERSION instead of VERSION

2019-07-25 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28511:
-

 Summary: Get REV from RELEASE_VERSION instead of VERSION
 Key: SPARK-28511
 URL: https://issues.apache.org/jira/browse/SPARK-28511
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Unlike the other patch versions, a branch version of `x.x.0-SNAPSHOT` produces a 
default release of `x.x.-1`, because the revision is taken from `VERSION` (which 
still carries the `-SNAPSHOT` suffix) rather than from `RELEASE_VERSION`. Although 
this will not happen for the tags (they have no `-SNAPSHOT` suffix), we should 
still fix it.
{code}
$ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
Output directory already exists. Overwrite and continue? [y/n] y
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.-1]:
{code}

The following is the expected behavior.
{code}
$ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
Branch [branch-2.4]: master
Current branch version is 3.0.0-SNAPSHOT.
Release [3.0.0]:
{code}






[jira] [Created] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-28512:
--

 Summary: New optional mode: throw runtime exceptions on casting 
failures
 Key: SPARK-28512
 URL: https://issues.apache.org/jira/browse/SPARK-28512
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In popular DBMSs such as MySQL, PostgreSQL and Oracle, an invalid cast such as 
cast('abc' as int) throws a runtime exception. In Spark, the result is silently 
converted to null. This is by design, since we don't want a long-running job to be 
aborted by a single casting failure. But there are scenarios where users want to be 
sure that all data conversions are correct, the way they are in 
MySQL/PostgreSQL/Oracle.

If the changes touch too much code, we can limit the new optional mode to table 
insertion first. By default the new behavior is disabled.
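For reference, a minimal sketch of the behavior described above under the current defaults (my example, not from the ticket); the proposed optional mode would raise an exception here instead of producing null:

{code:scala}
// Demonstrates the default behavior this issue discusses: an invalid string-to-int
// cast yields NULL rather than failing the query.
import org.apache.spark.sql.SparkSession

object CastNullDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cast-null-demo").getOrCreate()

    // cast('abc' as int) is silently converted to NULL today
    spark.sql("SELECT CAST('abc' AS INT) AS i").show()
    // +----+
    // |   i|
    // +----+
    // |null|
    // +----+

    spark.stop()
  }
}
{code}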






[jira] [Updated] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26995:
--
Fix Version/s: 2.4.4

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows: `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The error occurs because libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 






[jira] [Commented] (SPARK-26995) Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy

2019-07-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892559#comment-16892559
 ] 

Dongjoon Hyun commented on SPARK-26995:
---

This is backported to `branch-2.4` via 
https://github.com/apache/spark/pull/25255 .

> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy
> -
>
> Key: SPARK-26995
> URL: https://issues.apache.org/jira/browse/SPARK-26995
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when 
> using snappy.  
> The issue can be reproduced for example as follows: 
> `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`  
> The key part of the error stack is as follows: `Caused by: 
> java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by 
> /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`  
> The error occurs because libsnappyjava.so 
> needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 
> 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in 
> /lib64.
> Note: this issue is not present with Alpine Linux 3.8 and libc6-compat 
> version 1.1.19-r10 






[jira] [Updated] (SPARK-28156) Join plan sometimes does not use cached query

2019-07-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28156:

Fix Version/s: 2.4.4

> Join plan sometimes does not use cached query
> -
>
> Key: SPARK-28156
> URL: https://issues.apache.org/jira/browse/SPARK-28156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: Bruce Robbins
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> I came across a case where a cached query is referenced on both sides of a 
> join, but the InMemoryRelation is inserted on only one side. This case occurs 
> only when the cached query uses a (Hive-style) view.
> Consider this example:
> {noformat}
> // create the data
> val df1 = Seq.tabulate(10) { x => (x, x + 1, x + 2, x + 3) }.toDF("a", "b", 
> "c", "d")
> df1.write.mode("overwrite").format("orc").saveAsTable("table1")
> sql("drop view if exists table1_vw")
> sql("create view table1_vw as select * from table1")
> // create the cached query
> val cacheddataDf = sql("""
> select a, b, c, d
> from table1_vw
> """)
> import org.apache.spark.storage.StorageLevel.DISK_ONLY
> cacheddataDf.createOrReplaceTempView("cacheddata")
> cacheddataDf.persist(DISK_ONLY)
> // main query
> val queryDf = sql(s"""
> select leftside.a, leftside.b
> from cacheddata leftside
> join cacheddata rightside
> on leftside.a = rightside.a
> """)
> queryDf.explain(true)
> {noformat}
> Note that the optimized plan does not use an InMemoryRelation for the right 
> side, but instead just uses a Relation:
> {noformat}
> Project [a#45, b#46]
> +- Join Inner, (a#45 = a#37)
>:- Project [a#45, b#46]
>:  +- Filter isnotnull(a#45)
>: +- InMemoryRelation [a#45, b#46, c#47, d#48], StorageLevel(disk, 1 
> replicas)
>:   +- *(1) FileScan orc default.table1[a#37,b#38,c#39,d#40] 
> Batched: true, DataFilters: [], Format: ORC, Location: 
> InMemoryFileIndex[file:/Users/brobbins/github/spark_upstream/spark-warehouse/table1],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>+- Project [a#37]
>   +- Filter isnotnull(a#37)
>  +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> The fragment does not match the cached query because AliasViewChild adds an 
> extra projection under the View on the right side (see #2 below).
> AliasViewChild adds the extra projection because the exprIds in the View's 
> output appear to have been renamed by Analyzer$ResolveReferences (#1 below). 
> I have not yet looked at why.
> {noformat}
> -
> -
> -
>+- SubqueryAlias `rightside`
>   +- SubqueryAlias `cacheddata`
>  +- Project [a#73, b#74, c#75, d#76]
> +- SubqueryAlias `default`.`table1_vw`
> (#1) ->+- View (`default`.`table1_vw`, [a#73,b#74,c#75,d#76])
> (#2) ->   +- Project [cast(a#45 as int) AS a#73, cast(b#46 as int) AS 
> b#74, cast(c#47 as int) AS c#75, cast(d#48 as int) AS d#76]
>  +- Project [cast(a#37 as int) AS a#45, cast(b#38 as int) 
> AS b#46, cast(c#39 as int) AS c#47, cast(d#40 as int) AS d#48]
> +- Project [a#37, b#38, c#39, d#40]
>+- SubqueryAlias `default`.`table1`
>   +- Relation[a#37,b#38,c#39,d#40] orc
> {noformat}
> In a larger query (where cacheddata may be referenced on either side only 
> indirectly), this phenomenon can create certain oddities, as the fragment is 
> not replaced with InMemoryRelation and is still present when the plan 
> is optimized as a whole.
> In Spark 2.1.3, Spark uses InMemoryRelation on both sides.
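As a quick way to observe the symptom after running the repro above, the following spark-shell snippet (mine, not from the ticket) counts the InMemoryRelation nodes in the optimized plan; with this bug it reports 1 even though cacheddata appears on both sides of the join:

{code:scala}
// Count cached scans in queryDf's optimized plan (run after the repro above).
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryRelation

def countCachedScans(df: DataFrame): Int =
  df.queryExecution.optimizedPlan.collect { case r: InMemoryRelation => r }.size

// println(countCachedScans(queryDf))   // expected 2, observed 1 with this bug
{code}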






[jira] [Commented] (SPARK-27688) Beeline should show database in the prompt

2019-07-25 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892580#comment-16892580
 ] 

Yuming Wang commented on SPARK-27688:
-

{code:sh}
build/sbt clean package -Phive -Phive-thriftserver -Phadoop-3.2
export SPARK_PREPEND_CLASSES=true
sbin/stop-thriftserver.sh
{code}

{noformat}
[root@spark-3267648 apache-spark]# bin/beeline -u 
jdbc:hive2://localhost:1/default  --showDbInPrompt
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connecting to jdbc:hive2://localhost:1/default
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.5 by Apache Hive
0: jdbc:hive2://localhost:1/default (default)> use db2;
+-+
| Result  |
+-+
+-+
No rows selected (0.168 seconds)
0: jdbc:hive2://localhost:1/default (db2)>
0: jdbc:hive2://localhost:1/default (db2)> use db1;
+-+
| Result  |
+-+
+-+
No rows selected (0.091 seconds)
0: jdbc:hive2://localhost:1/default (db1)>
0: jdbc:hive2://localhost:1/default (db1)>

{noformat}

> Beeline should show database in the prompt
> --
>
> Key: SPARK-27688
> URL: https://issues.apache.org/jira/browse/SPARK-27688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Sandeep Katta
>Priority: Minor
>
> Since [HIVE-14123|https://issues.apache.org/jira/browse/HIVE-14123] added support 
> for displaying the current database in Beeline, Spark should support this as well.






[jira] [Updated] (SPARK-27688) Beeline should show database in the prompt

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27688:

Affects Version/s: (was: 2.4.3)
   3.0.0

> Beeline should show database in the prompt
> --
>
> Key: SPARK-27688
> URL: https://issues.apache.org/jira/browse/SPARK-27688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Minor
>
> Since [HIVE-14123|https://issues.apache.org/jira/browse/HIVE-14123] added support 
> for displaying the current database in Beeline, Spark should support this as well.






[jira] [Resolved] (SPARK-27688) Beeline should show database in the prompt

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-27688.
-
Resolution: Fixed

This is supported now that the built-in Hive has been upgraded to 2.3.5.

> Beeline should show database in the prompt
> --
>
> Key: SPARK-27688
> URL: https://issues.apache.org/jira/browse/SPARK-27688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Minor
>
> Since [HIVE-14123|https://issues.apache.org/jira/browse/HIVE-14123] added support 
> for displaying the current database in Beeline, Spark should support this as well.






[jira] [Comment Edited] (SPARK-27688) Beeline should show database in the prompt

2019-07-25 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892580#comment-16892580
 ] 

Yuming Wang edited comment on SPARK-27688 at 7/25/19 9:24 AM:
--

{code:sh}
build/sbt clean package -Phive -Phive-thriftserver -Phadoop-3.2
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
{code}

{noformat}
[root@spark-3267648 apache-spark]# bin/beeline -u 
jdbc:hive2://localhost:1/default  --showDbInPrompt
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connecting to jdbc:hive2://localhost:1/default
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.5 by Apache Hive
0: jdbc:hive2://localhost:1/default (default)> use db2;
+-+
| Result  |
+-+
+-+
No rows selected (0.168 seconds)
0: jdbc:hive2://localhost:1/default (db2)>
0: jdbc:hive2://localhost:1/default (db2)> use db1;
+-+
| Result  |
+-+
+-+
No rows selected (0.091 seconds)
0: jdbc:hive2://localhost:1/default (db1)>
0: jdbc:hive2://localhost:1/default (db1)>

{noformat}


was (Author: q79969786):
{code:sh}
build/sbt clean package -Phive -Phive-thriftserver -Phadoop-3.2
export SPARK_PREPEND_CLASSES=true
sbin/stop-thriftserver.sh
{code}

{noformat}
[root@spark-3267648 apache-spark]# bin/beeline -u 
jdbc:hive2://localhost:1/default  --showDbInPrompt
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connecting to jdbc:hive2://localhost:1/default
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.3.5 by Apache Hive
0: jdbc:hive2://localhost:1/default (default)> use db2;
+-+
| Result  |
+-+
+-+
No rows selected (0.168 seconds)
0: jdbc:hive2://localhost:1/default (db2)>
0: jdbc:hive2://localhost:1/default (db2)> use db1;
+-+
| Result  |
+-+
+-+
No rows selected (0.091 seconds)
0: jdbc:hive2://localhost:1/default (db1)>
0: jdbc:hive2://localhost:1/default (db1)>

{noformat}

> Beeline should show database in the prompt
> --
>
> Key: SPARK-27688
> URL: https://issues.apache.org/jira/browse/SPARK-27688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Minor
>
> Since [HIVE-14123|https://issues.apache.org/jira/browse/HIVE-14123] added support 
> for displaying the current database in Beeline, Spark should support this as well.






[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2019-07-25 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892595#comment-16892595
 ] 

Manu Zhang commented on SPARK-22063:


Is there any update on this thread? Which lintr version is used in the build now?

I also found that upgrading lintr would also upgrade testthat to the latest version, 
while [SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest commit, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught from time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functio

[jira] [Comment Edited] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2019-07-25 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892595#comment-16892595
 ] 

Manu Zhang edited comment on SPARK-22063 at 7/25/19 9:41 AM:
-

Is there any update on this thread? Which lintr version is used in the build now?

I also found that upgrading lintr would upgrade testthat to the latest version, while 
[SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]


was (Author: mauzhang):
Is there any update on this thread? Which lintr version is used in the build now?

I also found that upgrading lintr would also upgrade testthat to the latest version, 
while [SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade to the latest commit, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught from time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^~~

[jira] [Created] (SPARK-28513) Compute distinct label sets instead of subsets

2019-07-25 Thread Martin Junghanns (JIRA)
Martin Junghanns created SPARK-28513:


 Summary: Compute distinct label sets instead of subsets
 Key: SPARK-28513
 URL: https://issues.apache.org/jira/browse/SPARK-28513
 Project: Spark
  Issue Type: Improvement
  Components: Graph
Affects Versions: 3.0.0
Reporter: Martin Junghanns


{code:scala}
CypherSession::createDataFrame(nodes: DataFrame, rels: DataFrame)
{code}
 currently computes NodeFrames by filtering label columns, computing all 
possible subsets and creating one NodeFrame for each subset. This results in 
2^n sets / NodeFrames.

Instead, we should compute the distinct label sets that actually occur in the 
nodes DataFrame.
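A minimal sketch of the proposed direction (hypothetical column names and layout, not the actual spark-graph code): derive the label combinations that are actually present with a distinct over the label columns, so the number of NodeFrames is bounded by the data rather than by 2^n:

{code:scala}
// Hypothetical sketch: compute the distinct label sets that occur in a nodes
// DataFrame with boolean label columns, instead of enumerating all 2^n subsets.
import org.apache.spark.sql.{DataFrame, SparkSession}

object DistinctLabelSets {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("label-sets").getOrCreate()
    import spark.implicits._

    // Example node table; "label_" columns are assumed to be boolean label flags.
    val nodes: DataFrame = Seq(
      (0L, true, false),
      (1L, true, true),
      (2L, true, false)
    ).toDF("id", "label_Person", "label_Employee")

    val labelColumns = nodes.columns.filter(_.startsWith("label_"))

    // Only the combinations present in the data: 2 sets here, not 2^2 = 4 subsets.
    val distinctLabelSets: Array[Set[String]] = nodes
      .select(labelColumns.map(nodes(_)): _*)
      .distinct()
      .collect()
      .map(row => labelColumns.zipWithIndex.collect {
        case (col, i) if row.getBoolean(i) => col.stripPrefix("label_")
      }.toSet)

    distinctLabelSets.foreach(println)   // e.g. Set(Person), Set(Person, Employee)
    spark.stop()
  }
}
{code}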






[jira] [Created] (SPARK-28514) Remove the redundant transformImpl method in RF & GBT

2019-07-25 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28514:


 Summary: Remove the redundant transformImpl method in RF & GBT
 Key: SPARK-28514
 URL: https://issues.apache.org/jira/browse/SPARK-28514
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


1. In GBTClassifier & RandomForestClassifier, the real transform methods are 
inherited from ProbabilisticClassificationModel, which can deal with multiple 
output columns.

The transformImpl method, which deals with only one column (predictionCol), 
effectively does nothing. This is quite confusing.

2. In GBTRegressor & RandomForestRegressor, transformImpl does exactly what 
the superclass PredictionModel does (except for model broadcasting), so it can 
be removed.






[jira] [Created] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)
Andreas Költringer created SPARK-28515:
--

 Summary: to_timestamp returns null for summer time switch dates
 Key: SPARK-28515
 URL: https://issues.apache.org/jira/browse/SPARK-28515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
 Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
Reporter: Andreas Költringer


I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
{{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
{{+------------+-------------------+}}
{{|    date_str|          timestamp|}}
{{+------------+-------------------+}}
{{|201503290159|2015-03-29 01:59:00|}}
{{|201503290200|               null|}}
{{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{ SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm"); Date 
parsedDate = dateFormat.parse("201503290201"); Timestamp timestamp = new 
java.sql.Timestamp(parsedDate.getTime());}}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.
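For context, a small JVM-level check (my example; it assumes a Central European session time zone such as Europe/Vienna) shows why this particular local time is special: it falls in the spring-forward gap and has no valid offset, which is what {{to_timestamp}} appears to trip over:

{code:scala}
// Shows that 2015-03-29 02:00 does not exist as a local time in a CET/CEST zone:
// the list of valid offsets for that wall-clock time is empty (DST gap).
import java.time.{LocalDateTime, ZoneId}

object DstGapCheck {
  def main(args: Array[String]): Unit = {
    val zone = ZoneId.of("Europe/Vienna")
    val justBefore = LocalDateTime.of(2015, 3, 29, 1, 59)
    val inGap      = LocalDateTime.of(2015, 3, 29, 2, 0)

    println(zone.getRules.getValidOffsets(justBefore)) // [+01:00]
    println(zone.getRules.getValidOffsets(inGap))      // [] -- this local time never happened
  }
}
{code}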






[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Költringer updated SPARK-28515:
---
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm"); }}

{{Date parsedDate = dateFormat.parse("201503290201"); }}

{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
{{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'MMddhhmm')).show()}}
{{++---+
 }}
{{|    date_str|  timestamp|}}
{{++---+}}
{{|201503290159|2015-03-29 01:59:00|}}
{{|201503290200|   null|}}
{{++---+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{ SimpleDateFormat dateFormat = new SimpleDateFormat("MMddhhmm"); Date 
parsedDate = dateFormat.parse("201503290201"); Timestamp timestamp = new 
java.sql.Timestamp(parsedDate.getTime());}}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> ['date_str'])}}
>  {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
>  {{+------------+-------------------+}}
>  {{|    date_str|          timestamp|}}
>  {{+------------+-------------------+}}
>  {{|201503290159|2015-03-29 01:59:00|}}
>  {{|201503290200|               null|}}
>  {{+------------+-------------------+}}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
>  
> Plain Java does not do this, e.g. this works as expected:
> {{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm"); }}
> {{Date parsedDate = dateFormat.parse("201503290201"); }}
> {{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
>  
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.





[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Költringer updated SPARK-28515:
---
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm"); }}
{{Date parsedDate = dateFormat.parse("201503290201"); }}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime()); }}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'MMddhhmm')).show()}}
 {{+-+--+   
  }}
 {{|    date_str|  timestamp|}}
 {{+-+--+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|   null|}}
 {{+-+--+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{SimpleDateFormat dateFormat = new SimpleDateFormat("MMddhhmm"); }}

{{Date parsedDate = dateFormat.parse("201503290201"); }}

{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> ['date_str'])}}
>  {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
>  {{+------------+-------------------+}}
>  {{|    date_str|          timestamp|}}
>  {{+------------+-------------------+}}
>  {{|201503290159|2015-03-29 01:59:00|}}
>  {{|201503290200|               null|}}
>  {{+------------+-------------------+}}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
>  
> Plain Java does not do this, e.g. this works as expected:
> {{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm"); }}
> {{Date parsedDate = dateFormat.parse("201503290201"); }}
> {{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime()); }}
>  
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.





[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Költringer updated SPARK-28515:
---
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
Date parsedDate = dateFormat.parse("201503290201");
Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());
}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'MMddhhmm')).show()}}
 {{+-+--+   
  }}
 {{|    date_str|  timestamp|}}
 {{+-+--+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|   null|}}
 {{+-+--+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:

{{SimpleDateFormat dateFormat = new SimpleDateFormat("MMddhhmm"); }}
{{Date parsedDate = dateFormat.parse("201503290201"); }}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime()); }}

 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> ['date_str'])}}
>  {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
>  {{+------------+-------------------+}}
>  {{|    date_str|          timestamp|}}
>  {{+------------+-------------------+}}
>  {{|201503290159|2015-03-29 01:59:00|}}
>  {{|201503290200|               null|}}
>  {{+------------+-------------------+}}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
>  
> Plain Java does not do this, e.g. this works as expected:
> {{
> SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
> Date parsedDate = dateFormat.parse("201503290201");
> Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());
> }}
>  
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.




[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Költringer updated SPARK-28515:
---
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");}}
{{Date parsedDate = dateFormat.parse("201503290201");}}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'MMddhhmm')).show()}}
 {{+-+--+   
  }}
 {{|    date_str|  timestamp|}}
 {{+-+--+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|   null|}}
 {{+-+--+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{
SimpleDateFormat dateFormat = new SimpleDateFormat("MMddhhmm");
Date parsedDate = dateFormat.parse("201503290201");
Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());
}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> ['date_str'])}}
>  {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
>  {{+------------+-------------------+}}
>  {{|    date_str|          timestamp|}}
>  {{+------------+-------------------+}}
>  {{|201503290159|2015-03-29 01:59:00|}}
>  {{|201503290200|               null|}}
>  {{+------------+-------------------+}}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
>  
> Plain Java does not do this, e.g. this works as expected:
> {{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");}}
> {{Date parsedDate = dateFormat.parse("201503290201");}}
> {{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
>  
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.





[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Andreas Költringer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Költringer updated SPARK-28515:
---
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");}}
{{Date parsedDate = dateFormat.parse("201503290201");}}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'MMddhhmm')).show()}}
 {{+-+--+   
  }}
 {{|    date_str|  timestamp|}}
 {{+-+--+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|   null|}}
 {{+-+--+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{SimpleDateFormat dateFormat = new SimpleDateFormat("MMddhhmm");}}
{{Date parsedDate = dateFormat.parse("201503290201");}}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> ['date_str'])}}
>  {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 'yyyyMMddhhmm')).show()}}
>  {{+------------+-------------------+}}
>  {{|    date_str|          timestamp|}}
>  {{+------------+-------------------+}}
>  {{|201503290159|2015-03-29 01:59:00|}}
>  {{|201503290200|               null|}}
>  {{+------------+-------------------+}}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
>  
> Plain Java does not do this, e.g. this works as expected:
> {{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");}}
> {{Date parsedDate = dateFormat.parse("201503290201");}}
> {{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
>  
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.




[jira] [Resolved] (SPARK-28497) Disallow upcasting complex data types to string type

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28497.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25242
[https://github.com/apache/spark/pull/25242]

> Disallow upcasting complex data types to string type
> 
>
> Key: SPARK-28497
> URL: https://issues.apache.org/jira/browse/SPARK-28497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the current implementation, complex types like Array/Map/StructType are 
> allowed to be upcast to StringType.
> This is not a safe cast. We should disallow it.
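For illustration (my example, not from the ticket), the kind of query this affects is encoder resolution, which relies on up-casts; with complex-to-string upcasts allowed, {{.as[String]}} on a struct column silently stringifies it, whereas after this change analysis should fail with a "cannot up cast" error:

{code:scala}
// Sketch of a query that needs an up-cast from a complex type to string.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

object UpcastComplexToString {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("upcast-demo").getOrCreate()
    import spark.implicits._

    val ds = Seq((1, "a"), (2, "b")).toDF("id", "name")
      .select(struct($"id", $"name").as("s"))
      .as[String]   // requires up-casting struct<id:int,name:string> to string

    ds.show()       // silently stringified before the fix; AnalysisException after
    spark.stop()
  }
}
{code}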






[jira] [Assigned] (SPARK-28497) Disallow upcasting complex data types to string type

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28497:


Assignee: Gengliang Wang

> Disallow upcasting complex data types to string type
> 
>
> Key: SPARK-28497
> URL: https://issues.apache.org/jira/browse/SPARK-28497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In the current implementation, complex types like Array/Map/StructType are 
> allowed to be upcast to StringType.
> This is not a safe cast. We should disallow it.






[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892683#comment-16892683
 ] 

Takeshi Yamamuro commented on SPARK-28512:
--

Is this related to https://issues.apache.org/jira/browse/SPARK-28470?

cc: [~mgaido]

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs such as MySQL, PostgreSQL and Oracle, an invalid cast such as 
> cast('abc' as int) throws a runtime exception. In Spark, the result is silently 
> converted to null. This is by design, since we don't want a long-running job to 
> be aborted by a single casting failure. But there are scenarios where users want 
> to be sure that all data conversions are correct, the way they are in 
> MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.






[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892692#comment-16892692
 ] 

Marco Gaido commented on SPARK-28512:
-

Thanks for pinging me [~maropu]. I don't think it is the same issue: in 
SPARK-28470 we are dealing only with cases where there is an overflow. Here 
there is no overflow; the value is simply not valid for the target type.

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs such as MySQL, PostgreSQL and Oracle, an invalid cast such as 
> cast('abc' as int) throws a runtime exception. In Spark, the result is silently 
> converted to null. This is by design, since we don't want a long-running job to 
> be aborted by a single casting failure. But there are scenarios where users want 
> to be sure that all data conversions are correct, the way they are in 
> MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.






[jira] [Updated] (SPARK-28421) SparseVector.apply performance optimization

2019-07-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28421:
--
Affects Version/s: 2.4.3
 Priority: Minor  (was: Major)
Fix Version/s: 2.4.4

> SparseVector.apply performance optimization
> ---
>
> Key: SPARK-28421
> URL: https://issues.apache.org/jira/browse/SPARK-28421
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 2.4.3
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> The current impl of SparseVector.apply is inefficient:
> on each call, a breeze.linalg.SparseVector and a 
> breeze.collection.mutable.SparseArray are created internally, and then a 
> binary search is used to look up the input position.
>  
> This should be optimized like .ml.SparseMatrix, which uses binary search 
> directly, without converting to breeze.linalg.Matrix.
>  
> I tested the performance and found that if we avoid the internal conversions, 
> a 2.5-5x speedup can be obtained.
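
A rough sketch of the direct binary-search approach the description suggests (my own outline, not the actual patch; names are illustrative):

{code:scala}
// Search the indices array directly with java.util.Arrays.binarySearch instead
// of converting to a breeze.linalg.SparseVector on every call.
def applyDirect(size: Int, indices: Array[Int], values: Array[Double], i: Int): Double = {
  require(i >= 0 && i < size, s"Index $i out of range [0, $size)")
  val pos = java.util.Arrays.binarySearch(indices, i)
  if (pos >= 0) values(pos) else 0.0   // absent entries of a sparse vector are 0
}

// applyDirect(5, Array(1, 3), Array(2.0, 4.0), 3)   // => 4.0
{code}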



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28288:


Assignee: YoungGyu Chun

> Convert and port 'window.sql' into UDF test base
> 
>
> Key: SPARK-28288
> URL: https://issues.apache.org/jira/browse/SPARK-28288
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: YoungGyu Chun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28288.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25195
[https://github.com/apache/spark/pull/25195]

> Convert and port 'window.sql' into UDF test base
> 
>
> Key: SPARK-28288
> URL: https://issues.apache.org/jira/browse/SPARK-28288
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: YoungGyu Chun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-07-25 Thread Tomas Bartalos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892875#comment-16892875
 ] 

Tomas Bartalos commented on SPARK-28025:


I'm also affected by the performance issue caused by the .crc file leak in the 
checkpoint directory. [~skonto], thank you for the workaround; it works.

Would it be possible to implement cleanup of the .crc files for the case where 
one still needs the checksums?

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-25 Thread Michael Spector (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892888#comment-16892888
 ] 

Michael Spector commented on SPARK-28415:
-

In this case I think it's a regression / broken API bug.



> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in 
> the new Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means that the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28511) Get REV from RELEASE_VERSION instead of VERSION

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28511.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25254
[https://github.com/apache/spark/pull/25254]

> Get REV from RELEASE_VERSION instead of VERSION
> ---
>
> Key: SPARK-28511
> URL: https://issues.apache.org/jira/browse/SPARK-28511
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Unlike the other versions, `x.x.0-SNAPSHOT` causes `x.x.-1`. Although this 
> will not happen in the tags (there is no `SNAPSHOT` postfix), we had better 
> fix this.
> {code}
> $ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
> Output directory already exists. Overwrite and continue? [y/n] y
> Branch [branch-2.4]: master
> Current branch version is 3.0.0-SNAPSHOT.
> Release [3.0.-1]:
> {code}
> The following is the expected behavior.
> {code}
> $ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
> Branch [branch-2.4]: master
> Current branch version is 3.0.0-SNAPSHOT.
> Release [3.0.0]:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28511) Get REV from RELEASE_VERSION instead of VERSION

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28511:
-

Assignee: Dongjoon Hyun

> Get REV from RELEASE_VERSION instead of VERSION
> ---
>
> Key: SPARK-28511
> URL: https://issues.apache.org/jira/browse/SPARK-28511
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Unlike the other versions, `x.x.0-SNAPSHOT` causes `x.x.-1`. Although this 
> will not happen in the tags (there is no `SNAPSHOT` postfix), we had better 
> fix this.
> {code}
> $ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
> Output directory already exists. Overwrite and continue? [y/n] y
> Branch [branch-2.4]: master
> Current branch version is 3.0.0-SNAPSHOT.
> Release [3.0.-1]:
> {code}
> The following is the expected behavior.
> {code}
> $ dev/create-release/do-release-docker.sh -d /tmp/spark-3.0.0 -n
> Branch [branch-2.4]: master
> Current branch version is 3.0.0-SNAPSHOT.
> Release [3.0.0]:
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27734) Add memory based thresholds for shuffle spill

2019-07-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893018#comment-16893018
 ] 

Paweł Wiejacha commented on SPARK-27734:


We also encountered this problem. We've reduced the problem to shuffling 60 GiB 
of data divided into 5 partitions using *repartitionAndSortWithinPartitions*() 
and processing (*foreachPartition*()) all of them using a single executor that 
has 2 GiB of memory assigned. Processing each partition takes ~70 minutes (52 
min GC time) and CPU usage is very high (due to GC).

Setting *spark.shuffle.spill.numElementsForceSpillThreshold* is very 
inconvenient, so it would be nice to accept Adrian's pull request.
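
For reference, a hedged sketch of wiring in that existing element-count threshold as a stopgap (the value is purely illustrative and must be tuned per workload):

{code:scala}
// Stopgap using the existing element-count threshold discussed above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-spill-stopgap")
  .config("spark.shuffle.spill.numElementsForceSpillThreshold", "5000000")
  .getOrCreate()
{code}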

> Add memory based thresholds for shuffle spill
> -
>
> Key: SPARK-27734
> URL: https://issues.apache.org/jira/browse/SPARK-27734
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Adrian Muraru
>Priority: Minor
>
> When running large shuffles (700TB input data, 200k map tasks, 50k reducers 
> on a 300-node cluster) the job regularly OOMs in both the map and reduce phases.
> IIUC, ShuffleExternalSorter (map side) and ExternalAppendOnlyMap and 
> ExternalSorter (reduce side) try to max out the available execution 
> memory. This in turn doesn't play nice with the garbage collector, and 
> executors fail with OutOfMemoryError when the memory allocation from 
> these in-memory structures maxes out the available heap size (in our case 
> we are running with 9 cores/executor, 32G per executor).
> To mitigate this, I set 
> {{spark.shuffle.spill.numElementsForceSpillThreshold}} to force the spill on 
> disk. While this config works, it is not flexible enough as it's expressed in 
> number of elements, and in our case we run multiple shuffles in a single job 
> and element size is different from one stage to another.
> We have an internal patch to extend this behaviour and add two new parameters 
> to control the spill based on memory usage:
> - spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold
> - spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28509) K8S integration tests are failing

2019-07-25 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893031#comment-16893031
 ] 

shane knapp commented on SPARK-28509:
-

I checked that worker today, and all k8s builds are running successfully:
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/13265/console

> K8S integration tests are failing
> -
>
> Key: SPARK-28509
> URL: https://issues.apache.org/jira/browse/SPARK-28509
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: shane knapp
>Priority: Major
>
> I've been seeing lots of failures in master. e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/13180/console
> {noformat}
> - Start pod creation from template *** FAILED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: 404 page not found
>   at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201)
>   at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571)
>   at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:198)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   ...
> - PVs with local storage *** FAILED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://192.168.39.112:8443/api/v1/persistentvolumes. Message: 
> PersistentVolume "test-local-pv" is invalid: [spec.local: Forbidden: Local 
> volumes are disabled by feature-gate, metadata.annotations: Required value: 
> Local volume requires node affinity]. Received status: Status(apiVersion=v1, 
> code=422, details=StatusDetails(causes=[StatusCause(field=spec.local, 
> message=Forbidden: Local volumes are disabled by feature-gate, 
> reason=FieldValueForbidden, additionalProperties={}), 
> StatusCause(field=metadata.annotations, message=Required value: Local volume 
> requires node affinity, reason=FieldValueRequired, additionalProperties={})], 
> group=null, kind=PersistentVolume, name=test-local-pv, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=PersistentVolume "test-local-pv" is invalid: [spec.local: Forbidden: 
> Local volumes are disabled by feature-gate, metadata.annotations: Required 
> value: Local volume requires node affinity], 
> metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:478)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:417)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:381)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:227)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:787)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:357)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.setupLocalStorage(PVTestsSuite.scala:87)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.$anonfun$$init$$1(PVTestsSuite.scala:137)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   ...
> - Launcher client dependencies *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 1 times 
> over 6.67390320003 minutes. Last failure message: assertion failed: 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28509) K8S integration tests are failing

2019-07-25 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-28509.
---

> K8S integration tests are failing
> -
>
> Key: SPARK-28509
> URL: https://issues.apache.org/jira/browse/SPARK-28509
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: shane knapp
>Priority: Major
>
> I've been seeing lots of failures in master. e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/13180/console
> {noformat}
> - Start pod creation from template *** FAILED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: 404 page not found
>   at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:201)
>   at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:571)
>   at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:198)
>   at okhttp3.RealCall$AsyncCall.execute(RealCall.java:206)
>   at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
>   ...
> - PVs with local storage *** FAILED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://192.168.39.112:8443/api/v1/persistentvolumes. Message: 
> PersistentVolume "test-local-pv" is invalid: [spec.local: Forbidden: Local 
> volumes are disabled by feature-gate, metadata.annotations: Required value: 
> Local volume requires node affinity]. Received status: Status(apiVersion=v1, 
> code=422, details=StatusDetails(causes=[StatusCause(field=spec.local, 
> message=Forbidden: Local volumes are disabled by feature-gate, 
> reason=FieldValueForbidden, additionalProperties={}), 
> StatusCause(field=metadata.annotations, message=Required value: Local volume 
> requires node affinity, reason=FieldValueRequired, additionalProperties={})], 
> group=null, kind=PersistentVolume, name=test-local-pv, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=PersistentVolume "test-local-pv" is invalid: [spec.local: Forbidden: 
> Local volumes are disabled by feature-gate, metadata.annotations: Required 
> value: Local volume requires node affinity], 
> metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:478)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:417)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:381)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:344)
>   at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:227)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:787)
>   at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:357)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.setupLocalStorage(PVTestsSuite.scala:87)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.$anonfun$$init$$1(PVTestsSuite.scala:137)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   ...
> - Launcher client dependencies *** FAILED ***
>   The code passed to eventually never returned normally. Attempted 1 times 
> over 6.67390320003 minutes. Last failure message: assertion failed: 
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25914) Separate projection from grouping and aggregate in logical Aggregate

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-25914:
---

Assignee: (was: Dilip Biswal)

> Separate projection from grouping and aggregate in logical Aggregate
> 
>
> Key: SPARK-25914
> URL: https://issues.apache.org/jira/browse/SPARK-25914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Major
>
> Currently the Spark SQL logical Aggregate has two expression fields: 
> {{groupingExpressions}} and {{aggregateExpressions}}, in which 
> {{aggregateExpressions}} is actually the result expressions, or in other 
> words, the project list in the SELECT clause.
>   
>  This would cause an exception while processing the following query:
> {code:java}
> SELECT concat('x', concat(a, 's'))
> FROM testData2
> GROUP BY concat(a, 's'){code}
>  After optimization, the query becomes:
> {code:java}
> SELECT concat('x', a, 's')
> FROM testData2
> GROUP BY concat(a, 's'){code}
> The optimization rule {{CombineConcats}} optimizes the expressions by 
> flattening "concat" and causes the query to fail since the expression 
> {{concat('x', a, 's')}} in the SELECT clause is neither referencing a 
> grouping expression nor an aggregate expression.
>   
>  The problem is that we try to mix two operations in one operator, and worse, 
> in one field: the group-and-aggregate operation and the project operation. 
> There are two ways to solve this problem:
>  1. Break the two operations into two logical operators, which means a 
> group-by query can usually be mapped into a Project-over-Aggregate pattern.
>  2. Break the two operations into multiple fields in the Aggregate operator, 
> the same way we do for physical aggregate classes (e.g., 
> {{HashAggregateExec}}, or {{SortAggregateExec}}). Thus, 
> {{groupingExpressions}} would still be the expressions from the GROUP BY 
> clause (as before), but {{aggregateExpressions}} would contain aggregate 
> functions only, and {{resultExpressions}} would be the project list in the 
> SELECT clause holding references to either {{groupingExpressions}} or 
> {{aggregateExpressions}}.
>   
>  I would say option 1 is even clearer, but it would be more likely to break 
> the pattern matching in existing optimization rules and thus require more 
> changes in the compiler. So we'd probably wanna go with option 2. That said, 
> I suggest we achieve this goal through two iterative steps:
>   
>  Phase 1: Keep the current fields of logical Aggregate as 
> {{groupingExpressions}} and {{aggregateExpressions}}, but change the 
> semantics of {{aggregateExpressions}} by replacing the grouping expressions 
> with corresponding references to expressions in {{groupingExpressions}}. The 
> aggregate expressions in  {{aggregateExpressions}} will remain the same.
>   
>  Phase 2: Add {{resultExpressions}} for the project list, and keep only 
> aggregate expressions in {{aggregateExpressions}}.
>   
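
To make option 2 concrete, a hypothetical shorthand for the proposed field split (my own sketch, not Spark's actual class definition):

{code:scala}
// Hypothetical sketch of option 2: the SELECT list moves into resultExpressions,
// which may only reference grouping keys or aggregate functions, so a rule like
// CombineConcats can no longer rewrite it into something unresolvable.
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

case class AggregateSketch(
    groupingExpressions: Seq[Expression],             // GROUP BY concat(a, 's')
    aggregateExpressions: Seq[AggregateExpression],    // aggregate functions only
    resultExpressions: Seq[NamedExpression],           // SELECT list, referencing the two above
    child: LogicalPlan)
{code}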



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25787) [K8S] Spark can't use data locality information

2019-07-25 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893043#comment-16893043
 ] 

Paweł Wiejacha commented on SPARK-25787:


I *can* reproduce this issue. In Spark UI, Locality Level is always ANY instead 
of NODE_LOCAL when reading data from HDFS.

As Yinan Li said, it seems that:

> Support for data locality on k8s has not been ported to the upstream Spark 
> repo yet.

I think that at least the pull request below should be ported and merged to 
support HDFS data locality in Spark on Kubernetes.

https://github.com/apache-spark-on-k8s/spark/pull/216

Could you please reopen this issue?

> [K8S] Spark can't use data locality information
> ---
>
> Key: SPARK-25787
> URL: https://issues.apache.org/jira/browse/SPARK-25787
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Maciej Bryński
>Priority: Major
>
> I started experimenting with Spark based on this presentation:
> https://www.slideshare.net/databricks/hdfs-on-kuberneteslessons-learned-with-kimoon-kim
> I'm using the excellent https://github.com/apache-spark-on-k8s/kubernetes-HDFS
> charts to deploy HDFS.
> Unfortunately, reading from HDFS gives ANY locality for every task.
> Is data locality working on a Kubernetes cluster?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28516) adds `to_char`

2019-07-25 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28516:


 Summary: adds `to_char`
 Key: SPARK-28516
 URL: https://issues.apache.org/jira/browse/SPARK-28516
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dylan Guedes


Currently, Spark does not have support for `to_char`. PgSQL, however, 
[does|https://www.postgresql.org/docs/9.6/functions-formatting.html]:

Query example: 

 SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
FOLLOWING),'9D9')



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16754) NPE when defining case class and searching Encoder in the same line

2019-07-25 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893115#comment-16893115
 ] 

Josh Rosen commented on SPARK-16754:


This is still a problem as of Spark 2.4.x; I just encountered this in a 
Zeppelin notebook. I was able to work around the problem by putting my case 
class definition and usage in separate Zeppelin paragraphs / cells, but it's a 
confusing experience for new users.

> NPE when defining case class and searching Encoder in the same line
> ---
>
> Key: SPARK-16754
> URL: https://issues.apache.org/jira/browse/SPARK-16754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.3.0
> Environment: Spark Shell for Scala 2.11
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class TestCaseClass(value: Int)
> import spark.implicits._
> Seq(TestCaseClass(1)).toDS().collect()
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: baseClassName: $line14.$read
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:821)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection$lzycompute(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:289)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2568)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2217)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.sc

[jira] [Updated] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-07-25 Thread Nasir Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-28502:
--
Issue Type: Bug  (was: Question)

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day 
> window) and perform some operations on the dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and the error message I am getting.
>  
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16754) NPE when defining case class and searching Encoder in the same line

2019-07-25 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893125#comment-16893125
 ] 

Shixiong Zhu commented on SPARK-16754:
--

I think prepending 
`org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)` to 
every command should fix the problem. I've already forgotten why we didn't do 
this for Scala 2.11.
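
A small sketch of that workaround for the :paste / single-cell case (my own illustration based on the comment above):

{code:scala}
// Register the enclosing REPL/notebook wrapper as an outer scope before the
// encoder for the freshly defined case class is resolved.
org.apache.spark.sql.catalyst.encoders.OuterScopes.addOuterScope(this)

case class TestCaseClass(value: Int)
import spark.implicits._
Seq(TestCaseClass(1)).toDS().collect()
{code}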

> NPE when defining case class and searching Encoder in the same line
> ---
>
> Key: SPARK-16754
> URL: https://issues.apache.org/jira/browse/SPARK-16754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.3.0
> Environment: Spark Shell for Scala 2.11
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Reproducer:
> {code}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> case class TestCaseClass(value: Int)
> import spark.implicits._
> Seq(TestCaseClass(1)).toDS().collect()
> // Exiting paste mode, now interpreting.
> java.lang.RuntimeException: baseClassName: $line14.$read
>   at 
> org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$12.apply(objects.scala:251)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:251)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:142)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:142)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:821)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection$lzycompute(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection(ExpressionEncoder.scala:258)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:289)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2218)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2568)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2217)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:)
>   at org.apache.spark.sql.Dataset.withCallback(Dataset.

[jira] [Assigned] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-27845:
---

Assignee: John Zhuge

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893162#comment-16893162
 ] 

Burak Yavuz commented on SPARK-27845:
-

Resolved with [https://github.com/apache/spark/pull/24832]

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27845) DataSourceV2: InsertTable

2019-07-25 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-27845.
-
   Resolution: Done
Fix Version/s: 3.0.0

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28264) Revisiting Python / pandas UDF

2019-07-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893209#comment-16893209
 ] 

Bryan Cutler commented on SPARK-28264:
--

It's great to be taking another look at this; I think some aspects are really 
confusing. I left some comments in the doc, but to sum it up, anything we can do 
to reduce the number of arguments and options will make it more user friendly. 
My worry is that while replacing the pandas UDF types with other options would 
make things more flexible, it doesn't necessarily make them any easier to 
understand.

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>  
> See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28517) pyspark with --conf spark.jars.packages causes duplicate jars to be uploaded

2019-07-25 Thread Barry (JIRA)
Barry created SPARK-28517:
-

 Summary: pyspark with --conf spark.jars.packages causes duplicate 
jars to be uploaded
 Key: SPARK-28517
 URL: https://issues.apache.org/jira/browse/SPARK-28517
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 2.4.3
 Environment: spark 2.4.3_2.12 without hadoop

yarn 2.6

python 2.7.16

centos 7
Reporter: Barry


h2. Steps to reproduce:

{{spark-submit --master yarn --conf 
"spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" 
${SPARK_HOME}/examples/src/main/python/pi.py 100}}
h2. Undesirable behavior:

Warnings are printed that the package jars have been added to the distributed 
cache multiple times:

{{19/07/25 23:25:07 WARN Client: Same path resource 
file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added 
multiple times to distributed cache.}}
{{19/07/25 23:25:07 WARN Client: Same path resource 
file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added 
multiple times to distributed cache.}}

This does not happen for Scala jobs, only PySpark.

 
h2. Full output of example run.

{{[barryl@hostname ~]$ /opt/spark2/bin/spark-submit --master yarn --conf 
"spark.jars.packages=org.apache.spark:spark-avro_2.12:2.4.3" 
/opt/spark2/examples/src/main/python/pi.py 100}}
{{Ivy Default Cache set to: /home/barryl/.ivy2/cache}}
{{The jars for the packages stored in: /home/barryl/.ivy2/jars}}
{{:: loading settings :: url = 
jar:file:/opt/spark-2.4.3-bin-without-hadoop-scala-2.12/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml}}
{{org.apache.spark#spark-avro_2.12 added as a dependency}}
{{:: resolving dependencies :: 
org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c;1.0}}
{{    confs: [default]}}
{{    found org.apache.spark#spark-avro_2.12;2.4.3 in central}}
{{    found org.spark-project.spark#unused;1.0.0 in central}}
{{:: resolution report :: resolve 457ms :: artifacts dl 5ms}}
{{    :: modules in use:}}
{{    org.apache.spark#spark-avro_2.12;2.4.3 from central in [default]}}
{{    org.spark-project.spark#unused;1.0.0 from central in [default]}}
{{    -}}
{{    |  |    modules    ||   artifacts   |}}
{{    |   conf   | number| search|dwnlded|evicted|| number|dwnlded|}}
{{    -}}
{{    |  default |   2   |   0   |   0   |   0   ||   2   |   0   |}}
{{    -}}
{{:: retrieving :: 
org.apache.spark#spark-submit-parent-2c34ecff-b060-4af9-9b9f-83867672748c}}
{{    confs: [default]}}
{{    0 artifacts copied, 2 already retrieved (0kB/7ms)}}
{{19/07/25 23:25:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.}}
{{19/07/25 23:25:07 WARN Client: Same path resource 
file:///home/barryl/.ivy2/jars/org.apache.spark_spark-avro_2.12-2.4.3.jar added 
multiple times to distributed cache.}}
{{19/07/25 23:25:07 WARN Client: Same path resource 
file:///home/barryl/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar added 
multiple times to distributed cache.}}
{{19/07/25 23:25:28 WARN TaskSetManager: Stage 0 contains a task of very large 
size (365 KB). The maximum recommended task size is 100 KB.}}
{{Pi is roughly 3.142308}}

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28516) Data Type Formatting Functions: `to_char`

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28516:

Summary: Data Type Formatting Functions: `to_char`  (was: adds `to_char`)

> Data Type Formatting Functions: `to_char`
> -
>
> Key: SPARK-28516
> URL: https://issues.apache.org/jira/browse/SPARK-28516
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark does not have support for `to_char`. PgSQL, however, 
> [does|https://www.postgresql.org/docs/9.6/functions-formatting.html]:
> Query example: 
>  SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
> FOLLOWING),'9D9')



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28137) Data Type Formatting Functions: `to_number`

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28137:

Summary: Data Type Formatting Functions: `to_number`  (was: Data Type 
Formatting Functions: `to_char`)

> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
> string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
> |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
> string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
> |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
> string|{{to_char(125, '999')}}|
> |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
> real/double precision to string|{{to_char(125.8::real, '999D9')}}|
> |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
> string|{{to_char(-125.8, '999D99S')}}|
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28137) Data Type Formatting Functions: `to_char`

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28137:

Summary: Data Type Formatting Functions: `to_char`  (was: Data Type 
Formatting Functions)

> Data Type Formatting Functions: `to_char`
> -
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
> string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
> |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
> string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
> |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
> string|{{to_char(125, '999')}}|
> |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
> real/double precision to string|{{to_char(125.8::real, '999D9')}}|
> |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
> string|{{to_char(-125.8, '999D99S')}}|
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28516) Data Type Formatting Functions: `to_char`

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28516:

Description: 
Currently, Spark does not have support for `to_char`. PgSQL, however, 
[does|https://www.postgresql.org/docs/12/functions-formatting.html]:

Query example: 
{code:sql}
SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
FOLLOWING),'9D9')
{code}


||Function||Return Type||Description||Example||
|{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
|{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
|{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
string|{{to_char(125, '999')}}|
|{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
real/double precision to string|{{to_char(125.8::real, '999D9')}}|
|{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
string|{{to_char(-125.8, '999D99S')}}|


  was:
Currently, Spark does not have support for `to_char`. PgSQL, however, 
[does|[https://www.postgresql.org/docs/9.6/functions-formatting.html]]:

Query example: 

 SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
FOLLOWING),'9D9')


> Data Type Formatting Functions: `to_char`
> -
>
> Key: SPARK-28516
> URL: https://issues.apache.org/jira/browse/SPARK-28516
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> Currently, Spark does not have support for `to_char`. PgSQL, however, 
> [does|https://www.postgresql.org/docs/12/functions-formatting.html]:
> Query example: 
> {code:sql}
> SELECT to_char(SUM(n) OVER (ORDER BY i ROWS BETWEEN CURRENT ROW AND 1 
> FOLLOWING),'9D9')
> {code}
> ||Function||Return Type||Description||Example||
> |{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
> string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
> |{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
> string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
> |{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
> string|{{to_char(125, '999')}}|
> |{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
> real/double precision to string|{{to_char(125.8::real, '999D9')}}|
> |{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
> string|{{to_char(-125.8, '999D99S')}}|



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28137) Data Type Formatting Functions: `to_number`

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28137:

Description: 
||Function||Return Type||Description||Example||
|{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
numeric|{{to_number('12,454.8-', '99G999D9S')}}|


https://www.postgresql.org/docs/12/functions-formatting.html

  was:
||Function||Return Type||Description||Example||
|{{to_char(}}{{timestamp}}{{, }}{{text}}{{)}}|{{text}}|convert time stamp to 
string|{{to_char(current_timestamp, 'HH12:MI:SS')}}|
|{{to_char(}}{{interval}}{{, }}{{text}}{{)}}|{{text}}|convert interval to 
string|{{to_char(interval '15h 2m 12s', 'HH24:MI:SS')}}|
|{{to_char(}}{{int}}{{, }}{{text}}{{)}}|{{text}}|convert integer to 
string|{{to_char(125, '999')}}|
|{{to_char}}{{(}}{{double precision}}{{, }}{{text}}{{)}}|{{text}}|convert 
real/double precision to string|{{to_char(125.8::real, '999D9')}}|
|{{to_char(}}{{numeric}}{{, }}{{text}}{{)}}|{{text}}|convert numeric to 
string|{{to_char(-125.8, '999D99S')}}|
|{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
numeric|{{to_number('12,454.8-', '99G999D9S')}}|


https://www.postgresql.org/docs/12/functions-formatting.html


> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23469) HashingTF should use corrected MurmurHash3 implementation

2019-07-25 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893258#comment-16893258
 ] 

Huaxin Gao commented on SPARK-23469:


I will work on this jira once PR https://github.com/apache/spark/pull/25250 
(migrate the implementation of HashingTF from MLlib to ML) is merged. 

> HashingTF should use corrected MurmurHash3 implementation
> -
>
> Key: SPARK-23469
> URL: https://issues.apache.org/jira/browse/SPARK-23469
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> [SPARK-23381] added a corrected MurmurHash3 implementation but left the old 
> implementation alone.  In Spark 2.3 and earlier, HashingTF will use the old 
> implementation.  (We should not backport a fix for HashingTF since it would 
> be a major change of behavior.)  But we should correct HashingTF in Spark 
> 2.4; this JIRA is for tracking this fix.
> * Update HashingTF to use new implementation of MurmurHash3
> * Ensure backwards compatibility for ML persistence by having HashingTF use 
> the old MurmurHash3 when a model from Spark 2.3 or earlier is loaded.  We can 
> add a Param to allow this.
> Also, HashingTF still calls into the old spark.mllib.feature.HashingTF, so I 
> recommend we first migrate the code to spark.ml: [SPARK-21748].  We can leave 
> spark.mllib alone and just fix MurmurHash3 in spark.ml.
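
For readers unfamiliar with the proposed Param, the shape of the change is roughly 
the following. This is only an illustrative sketch, not Spark's actual code: both 
hash functions below are stand-ins, and the parameter name is made up.
{code:scala}
import scala.util.hashing.MurmurHash3

object HashingTFSketch {
  // Stand-ins for Spark's legacy and corrected MurmurHash3 paths.
  private def legacyHash(term: String): Int    = MurmurHash3.stringHash(term)
  private def correctedHash(term: String): Int = MurmurHash3.bytesHash(term.getBytes("UTF-8"))

  // Map a term to a feature index, honouring a hypothetical "useLegacyHash"
  // param that would be switched on when loading a model saved by Spark <= 2.3.
  def indexOf(term: String, numFeatures: Int, useLegacyHash: Boolean): Int = {
    val h = if (useLegacyHash) legacyHash(term) else correctedHash(term)
    java.lang.Math.floorMod(h, numFeatures)
  }
}
{code}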



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28476) Support ALTER DATABASE SET LOCATION

2019-07-25 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893274#comment-16893274
 ] 

Xiao Li commented on SPARK-28476:
-

[~ivy.weichan.xu]

 

> Support ALTER DATABASE SET LOCATION
> ---
>
> Key: SPARK-28476
> URL: https://issues.apache.org/jira/browse/SPARK-28476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> We can support the syntax of ALTER (DATABASE|SCHEMA) database_name SET 
> LOCATION path
> Ref: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL]
>  
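
The proposed statement would look roughly like the following once supported; it is 
not parsed by Spark at the time of writing, and the database name and path below 
are made up for illustration:
{code:scala}
// Proposed DDL, mirroring the Hive syntax referenced above:
spark.sql("ALTER DATABASE my_db SET LOCATION 'hdfs://namenode:8020/warehouse/my_db.db'")

// The new location would then show up in the catalog metadata:
spark.sql("DESCRIBE DATABASE EXTENDED my_db").show(truncate = false)
{code}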



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28476) Support ALTER DATABASE SET LOCATION

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-28476:
---

Assignee: Weichen Xu

> Support ALTER DATABASE SET LOCATION
> ---
>
> Key: SPARK-28476
> URL: https://issues.apache.org/jira/browse/SPARK-28476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Weichen Xu
>Priority: Major
>
> We can support the syntax of ALTER (DATABASE|SCHEMA) database_name SET 
> LOCATION path
> Ref: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL]
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28289) Convert and port 'union.sql' into UDF test base

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28289.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25202
[https://github.com/apache/spark/pull/25202]

> Convert and port 'union.sql' into UDF test base
> ---
>
> Key: SPARK-28289
> URL: https://issues.apache.org/jira/browse/SPARK-28289
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Yiheng Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28289) Convert and port 'union.sql' into UDF test base

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28289:


Assignee: Yiheng Wang

> Convert and port 'union.sql' into UDF test base
> ---
>
> Key: SPARK-28289
> URL: https://issues.apache.org/jira/browse/SPARK-28289
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Yiheng Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28365) Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28365.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25133
[https://github.com/apache/spark/pull/25133]

> Fallback locale to en_US in StopWordsRemover if system default locale isn't 
> in available locales in JVM
> ---
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> Because the system default locale isn't among the available locales in {{Locale}}, 
> the {{StopWordsRemover}}-related Python tests hit errors like the following when I 
> ran some tests locally with Python code:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up the locale just to make the 
> test pass, it is better to fall back to a workable locale if the system default 
> locale can't be found among the available locales in the JVM. Otherwise, users have 
> to manually change the system locale or access the private property _jvm in PySpark.
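
The fallback itself is simple. A minimal sketch of the idea, using the plain Java 
Locale API rather than the actual StopWordsRemover code:
{code:scala}
import java.util.Locale

// Use the JVM default locale when it is available; otherwise fall back to en_US.
def resolveLocale(): Locale = {
  val default = Locale.getDefault
  if (Locale.getAvailableLocales.contains(default)) {
    default
  } else {
    // The real fix would also log a warning that en_US is being used instead.
    Locale.US
  }
}
{code}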



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28365) Fallback locale to en_US in StopWordsRemover if system default locale isn't in available locales in JVM

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28365:


Assignee: Liang-Chi Hsieh

> Fallback locale to en_US in StopWordsRemover if system default locale isn't 
> in available locales in JVM
> ---
>
> Key: SPARK-28365
> URL: https://issues.apache.org/jira/browse/SPARK-28365
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> Because the system default locale isn't among the available locales in {{Locale}}, 
> the {{StopWordsRemover}}-related Python tests hit errors like the following when I 
> ran some tests locally with Python code:
> {code}
> Traceback (most recent call last):
>   File "/spark-1/python/pyspark/ml/tests/test_feature.py", line 87, in 
> test_stopwordsremover
> stopWordRemover = StopWordsRemover(inputCol="input", outputCol="output")
>   File "/spark-1/python/pyspark/__init__.py", line 111, in wrapper
> return func(self, **kwargs)
>   File "/spark-1/python/pyspark/ml/feature.py", line 2646, in __init__
> self.uid)
>   File "/spark-1/python/pyspark/ml/wrapper.py", line 67, in _new_java_obj
> return java_obj(*java_args)
>   File "/spark-1/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 
> 1554, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/spark-1/python/pyspark/sql/utils.py", line 93, in deco
> raise converted
> pyspark.sql.utils.IllegalArgumentException: 'StopWordsRemover_4598673ee802 
> parameter locale given invalid value en_TW.'
> {code}
> As per [~hyukjin.kwon]'s advice, instead of setting up the locale just to make the 
> test pass, it is better to fall back to a workable locale if the system default 
> locale can't be found among the available locales in the JVM. Otherwise, users have 
> to manually change the system locale or access the private property _jvm in PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28515:
-
Description: 
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):
{code:java}
>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
>>> ['date_str'])
>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
>>> 'yyyyMMddhhmm')).show()
+------------+-------------------+
|    date_str|          timestamp|
+------------+-------------------+
|201503290159|2015-03-29 01:59:00|
|201503290200|               null|
+------------+-------------------+ {code}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

Plain Java does not do this, e.g. this works as expected:
 
{code:java}
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
Date parsedDate = dateFormat.parse("201503290201");
Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());{code}

So, is this really the intended behaviour? Is there documentation about this? 
THX.

  was:
I am not sure if this is a bug - but it was a very unexpected behavior, so I'd 
like some clarification.

When parsing datetime-strings, when the date-time in question falls into the 
range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 2am 
the clock was forwarded to 3am), the {{to_timestamp}} method returns {{NULL}}.

Minimal Example (using Python):

{{>>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
['date_str'])}}
 {{>>> df.withColumn('timestamp', F.to_timestamp('date_str', 
'yyyyMMddhhmm')).show()}}
 {{+------------+-------------------+}}
 {{|    date_str|          timestamp|}}
 {{+------------+-------------------+}}
 {{|201503290159|2015-03-29 01:59:00|}}
 {{|201503290200|               null|}}
 {{+------------+-------------------+}}

A solution (or workaround) is to set the time zone for Spark to UTC:

{{spark.conf.set("spark.sql.session.timeZone", "UTC")}}

(see e.g. [https://stackoverflow.com/q/52594762)]

 

Plain Java does not do this, e.g. this works as expected:
{{SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");}}
{{Date parsedDate = dateFormat.parse("201503290201");}}
{{Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());}}
 

So, is this really the intended behaviour? Is there documentation about this? 
THX.


> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {code:java}
> >>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> >>> ['date_str'])
> >>> df.withColumn('timestamp', F.to_timestamp('date_str', 
> >>> 'yyyyMMddhhmm')).show()
> +------------+-------------------+
> |    date_str|          timestamp|
> +------------+-------------------+
> |201503290159|2015-03-29 01:59:00|
> |201503290200|               null|
> +------------+-------------------+ {code}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
> Plain Java does not do this, e.g. this works as expected:
>  
> {code:java}
> SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
> Date parsedDate = dateFormat.parse("201503290201");
> Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());{code}
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.
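
For context, 2015-03-29 02:00 simply does not exist as a local time in the affected 
zones, so there is no instant for Spark to map it to in the session time zone. A 
small java.time sketch shows the gap; Europe/Vienna is just one example zone that 
switched at 02:00 that night, and this is not Spark's internal code path:
{code:scala}
import java.time._

val zone    = ZoneId.of("Europe/Vienna")
val gapTime = LocalDateTime.of(2015, 3, 29, 2, 0)

// A non-null result means the local time falls into a DST transition (here, a gap).
println(zone.getRules.getTransition(gapTime))

// java.time resolves the non-existent local time by shifting past the gap (to 03:00),
// whereas Spark's to_timestamp reportedly yields null for it.
println(gapTime.atZone(zone))
{code}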



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893318#comment-16893318
 ] 

Hyukjin Kwon commented on SPARK-28515:
--

Which Python version do you use? IIRC, Python 3.4 has an issue for the folded 
time in DST.

> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {code:java}
> >>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> >>> ['date_str'])
> >>> df.withColumn('timestamp', F.to_timestamp('date_str', 
> >>> 'yyyyMMddhhmm')).show()
> +------------+-------------------+
> |    date_str|          timestamp|
> +------------+-------------------+
> |201503290159|2015-03-29 01:59:00|
> |201503290200|               null|
> +------------+-------------------+ {code}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
> Plain Java does not do this, e.g. this works as expected:
>  
> {code:java}
> SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
> Date parsedDate = dateFormat.parse("201503290201");
> Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());{code}
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28515) to_timestamp returns null for summer time switch dates

2019-07-25 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893318#comment-16893318
 ] 

Hyukjin Kwon edited comment on SPARK-28515 at 7/26/19 4:29 AM:
---

Which Python version do you use? IIRC, Python 3.4 and Python 3.5 have an issue 
with the folded time in DST.


was (Author: hyukjin.kwon):
Which Python version do you use? IIRC, Python 3.4 has an issue for the folded 
time in DST.

> to_timestamp returns null for summer time switch dates
> --
>
> Key: SPARK-28515
> URL: https://issues.apache.org/jira/browse/SPARK-28515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3 on Linux 64bit, openjdk-8-jre-headless
>Reporter: Andreas Költringer
>Priority: Major
>
> I am not sure if this is a bug - but it was a very unexpected behavior, so 
> I'd like some clarification.
> When parsing datetime-strings, when the date-time in question falls into the 
> range of a "summer time switch" (e.g. in (most of) Europe, on 2015-03-29 at 
> 2am the clock was forwarded to 3am), the {{to_timestamp}} method returns 
> {{NULL}}.
> Minimal Example (using Python):
> {code:java}
> >>> df = spark.createDataFrame([('201503290159',), ('201503290200',)], 
> >>> ['date_str'])
> >>> df.withColumn('timestamp', F.to_timestamp('date_str', 
> >>> 'yyyyMMddhhmm')).show()
> +------------+-------------------+
> |    date_str|          timestamp|
> +------------+-------------------+
> |201503290159|2015-03-29 01:59:00|
> |201503290200|               null|
> +------------+-------------------+ {code}
> A solution (or workaround) is to set the time zone for Spark to UTC:
> {{spark.conf.set("spark.sql.session.timeZone", "UTC")}}
> (see e.g. [https://stackoverflow.com/q/52594762)]
> Plain Java does not do this, e.g. this works as expected:
>  
> {code:java}
> SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMddhhmm");
> Date parsedDate = dateFormat.parse("201503290201");
> Timestamp timestamp = new java.sql.Timestamp(parsedDate.getTime());{code}
> So, is this really the intended behaviour? Is there documentation about this? 
> THX.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28512:
--
Issue Type: Improvement  (was: Bug)

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting failures, e.g. cast('abc' as int). 
> In Spark, by contrast, the result is silently converted to null. This is by 
> design, since we don't want a long-running job aborted by a single casting 
> failure. But there are scenarios where users want to make sure all data 
> conversions are correct, the way they would in MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.
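
To make the current behavior concrete, and to sketch what the opt-in mode could look 
like; the configuration name below is hypothetical, not an agreed-upon API:
{code:scala}
// Today: an invalid cast silently yields NULL.
spark.sql("SELECT CAST('abc' AS INT)").show()   // prints a single row containing null

// Sketch of the proposed opt-in mode (hypothetical config name):
// spark.conf.set("spark.sql.cast.failOnError", "true")
// spark.sql("SELECT CAST('abc' AS INT)").collect()   // would throw a runtime exception
{code}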



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893331#comment-16893331
 ] 

Dongjoon Hyun commented on SPARK-28512:
---

I changed the type from `BUG` to `Improvement` since this is a `New optional 
mode`.

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting failures, e.g. cast('abc' as int). 
> In Spark, by contrast, the result is silently converted to null. This is by 
> design, since we don't want a long-running job aborted by a single casting 
> failure. But there are scenarios where users want to make sure all data 
> conversions are correct, the way they would in MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893334#comment-16893334
 ] 

Gengliang Wang commented on SPARK-28512:


[~dongjoon] Thanks!

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting failures, e.g. cast('abc' as int). 
> In Spark, by contrast, the result is silently converted to null. This is by 
> design, since we don't want a long-running job aborted by a single casting 
> failure. But there are scenarios where users want to make sure all data 
> conversions are correct, the way they would in MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25653) Add tag ExtendedHiveTest for HiveSparkSubmitSuite

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25653:
--
Priority: Minor  (was: Major)

> Add tag ExtendedHiveTest for HiveSparkSubmitSuite
> -
>
> Key: SPARK-25653
> URL: https://issues.apache.org/jira/browse/SPARK-25653
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> The total run time of HiveSparkSubmitSuite is about 10 minutes.
> Since the related code is stable, add the ExtendedHiveTest tag to it.
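
For readers unfamiliar with the test tags: the change amounts to annotating the suite 
with the existing {{org.apache.spark.tags.ExtendedHiveTest}} annotation so CI can 
exclude it from the default run. Illustrative only; the suite name, parent class, and 
body below are placeholders, not the real HiveSparkSubmitSuite:
{code:scala}
import org.apache.spark.tags.ExtendedHiveTest
import org.scalatest.FunSuite

// Placeholder suite showing where the tag annotation goes.
@ExtendedHiveTest
class SomeLongRunningHiveSuite extends FunSuite {
  test("placeholder") { assert(true) }
}
{code}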



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28518) Only filter out the checksum file when enabling fall back to HDFS

2019-07-25 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28518:
---

 Summary: Only filter out the checksum file when enabling fall back 
to HDFS
 Key: SPARK-28518
 URL: https://issues.apache.org/jira/browse/SPARK-28518
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25653) Add tag ExtendedHiveTest for HiveSparkSubmitSuite

2019-07-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25653:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Add tag ExtendedHiveTest for HiveSparkSubmitSuite
> -
>
> Key: SPARK-25653
> URL: https://issues.apache.org/jira/browse/SPARK-25653
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> The total run time of HiveSparkSubmitSuite is about 10 minutes.
> Since the related code is stable, add the ExtendedHiveTest tag to it.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28518) Only filter out the checksum file when enabling fall back to HDFS

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28518:

Description: 
StatisticsCollectionTestBase.getDataSize is incorrect. We should filter out the 
checksum file with reference to ChecksumFileSystem.isChecksumFile.

https://github.com/apache/spark/pull/25014#discussion_r307050435

> Only filter out the checksum file when enabling fall back to HDFS
> -
>
> Key: SPARK-28518
> URL: https://issues.apache.org/jira/browse/SPARK-28518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> StatisticsCollectionTestBase.getDataSize is incorrect. We should filter out 
> the checksum file with reference to ChecksumFileSystem.isChecksumFile.
> https://github.com/apache/spark/pull/25014#discussion_r307050435
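
A minimal sketch of the intended filtering, not the actual test code; 
{{isChecksumFile}} is the existing static helper on Hadoop's ChecksumFileSystem:
{code:scala}
import java.io.File
import org.apache.hadoop.fs.{ChecksumFileSystem, Path}

// Sum the sizes of the data files only, skipping Hadoop checksum files (".name.crc").
def getDataSize(dir: File): Long =
  dir.listFiles()
    .filter(f => f.isFile && !ChecksumFileSystem.isChecksumFile(new Path(f.getName)))
    .map(_.length())
    .sum
{code}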



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28518) Fix StatisticsCollectionTestBase#getDataSize

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28518:

Summary: Fix StatisticsCollectionTestBase#getDataSize  (was: Only filter 
out the checksum file when enabling fall back to HDFS)

> Fix StatisticsCollectionTestBase#getDataSize
> 
>
> Key: SPARK-28518
> URL: https://issues.apache.org/jira/browse/SPARK-28518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> StatisticsCollectionTestBase.getDataSize is incorrect. We should filter out 
> the checksum file with reference to ChecksumFileSystem.isChecksumFile.
> https://github.com/apache/spark/pull/25014#discussion_r307050435



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28518) Fix StatisticsCollectionTestBase#getDataSize refer to ChecksumFileSystem.isChecksumFile

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28518:

Summary: Fix StatisticsCollectionTestBase#getDataSize refer to 
ChecksumFileSystem.isChecksumFile  (was: Fix 
StatisticsCollectionTestBase#getDataSize)

> Fix StatisticsCollectionTestBase#getDataSize refer to 
> ChecksumFileSystem.isChecksumFile
> ---
>
> Key: SPARK-28518
> URL: https://issues.apache.org/jira/browse/SPARK-28518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> StatisticsCollectionTestBase.getDataSize is incorrect. We should filter out 
> the checksum file with reference to ChecksumFileSystem.isChecksumFile.
> https://github.com/apache/spark/pull/25014#discussion_r307050435



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28518) Fix StatisticsCollectionTestBase#getDataSize refer to ChecksumFileSystem.isChecksumFile

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28518:

Description: 
StatisticsCollectionTestBase.getDataSize is incorrect. We should refer to 
ChecksumFileSystem.isChecksumFile.

https://github.com/apache/spark/pull/25014#discussion_r307050435

  was:
StatisticsCollectionTestBase.getDataSize is incorrect. We should filter out the 
checksum file with reference to ChecksumFileSystem.isChecksumFile.

https://github.com/apache/spark/pull/25014#discussion_r307050435


> Fix StatisticsCollectionTestBase#getDataSize refer to 
> ChecksumFileSystem.isChecksumFile
> ---
>
> Key: SPARK-28518
> URL: https://issues.apache.org/jira/browse/SPARK-28518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> StatisticsCollectionTestBase.getDataSize is incorrect. We should refer to 
> ChecksumFileSystem.isChecksumFile.
> https://github.com/apache/spark/pull/25014#discussion_r307050435



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28518) Fix StatisticsCollectionTestBase#getDataSize refer to ChecksumFileSystem#isChecksumFile

2019-07-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28518:

Summary: Fix StatisticsCollectionTestBase#getDataSize refer to 
ChecksumFileSystem#isChecksumFile  (was: Fix 
StatisticsCollectionTestBase#getDataSize refer to 
ChecksumFileSystem.isChecksumFile)

> Fix StatisticsCollectionTestBase#getDataSize refer to 
> ChecksumFileSystem#isChecksumFile
> ---
>
> Key: SPARK-28518
> URL: https://issues.apache.org/jira/browse/SPARK-28518
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> StatisticsCollectionTestBase.getDataSize is incorrect. We should refer to 
> ChecksumFileSystem.isChecksumFile.
> https://github.com/apache/spark/pull/25014#discussion_r307050435



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28237) Idempotence checker for Idempotent batches in RuleExecutors

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28237.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Idempotence checker for Idempotent batches in RuleExecutors
> ---
>
> Key: SPARK-28237
> URL: https://issues.apache.org/jira/browse/SPARK-28237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> The current {{RuleExecutor}} system contains two kinds of strategies: 
> {{Once}} and {{FixedPoint}}. The {{Once}} strategy is supposed to run once. 
> However, particular rules (e.g. PullOutNondeterministic) are designed to be 
> idempotent, but Spark currently lacks a corresponding mechanism to catch this 
> kind of non-idempotent behavior when it happens.
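
The checker itself can be very small. A sketch of the idea, independent of Spark's 
actual RuleExecutor internals; the plan type and batch runner are abstracted away:
{code:scala}
// Run a batch that is declared Once/idempotent a second time and fail loudly
// if the plan keeps changing.
def checkIdempotence[Plan](batchName: String, plan: Plan, runBatch: Plan => Plan): Plan = {
  val once  = runBatch(plan)
  val twice = runBatch(once)
  require(once == twice, s"Batch '$batchName' is declared idempotent but changed the plan on a second run")
  once
}
{code}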



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28237) Idempotence checker for Idempotent batches in RuleExecutors

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-28237:
---

Assignee: Yesheng Ma

> Idempotence checker for Idempotent batches in RuleExecutors
> ---
>
> Key: SPARK-28237
> URL: https://issues.apache.org/jira/browse/SPARK-28237
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Assignee: Yesheng Ma
>Priority: Major
> Fix For: 3.0.0
>
>
> The current {{RuleExecutor}} system contains two kinds of strategies: 
> {{Once}} and {{FixedPoint}}. The {{Once}} strategy is supposed to run once. 
> However, for particular rules (e.g. PullOutNondeterministic), they are 
> designed to be idempotent, but Spark currently lacks corresponding mechanism 
> to prevent such kind of non-idempotent behavior from happening.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-27714:

Target Version/s: 3.0.0

> Support Join Reorder based on Genetic Algorithm when the # of joined tables > 
> 12
> 
>
> Key: SPARK-27714
> URL: https://issues.apache.org/jira/browse/SPARK-27714
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Major
>
> Currently the join reorder logic is based on dynamic programming, which can in 
> theory find the optimal plan, but the search cost grows rapidly as the # of 
> joined tables grows. It would be better to introduce a genetic algorithm (GA) 
> to overcome this problem.
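
To make the GA idea concrete, here is a toy sketch over table-name permutations. It 
is independent of Spark's optimizer: the cost function is a parameter, and the 
operators (swap mutation, order crossover, simple truncation selection) are the 
simplest possible choices, not a proposal for the final design. It assumes at least 
two tables:
{code:scala}
import scala.util.Random

def geneticReorder(
    tables: Seq[String],
    cost: Seq[String] => Double,
    populationSize: Int = 50,
    generations: Int = 100,
    seed: Long = 42L): Seq[String] = {
  val rnd = new Random(seed)

  // Swap two positions in a join order.
  def mutate(order: Seq[String]): Seq[String] = {
    val i = rnd.nextInt(order.size)
    val j = rnd.nextInt(order.size)
    order.updated(i, order(j)).updated(j, order(i))
  }

  // Order crossover: keep a prefix of one parent, fill the rest in the other parent's order.
  def crossover(a: Seq[String], b: Seq[String]): Seq[String] = {
    val prefix = a.take(rnd.nextInt(a.size))
    prefix ++ b.filterNot(prefix.contains)
  }

  var population = Seq.fill(populationSize)(rnd.shuffle(tables))
  for (_ <- 1 to generations) {
    val survivors = population.sortBy(cost).take(populationSize / 2)
    val children = Seq.fill(populationSize - survivors.size) {
      val child = crossover(survivors(rnd.nextInt(survivors.size)),
                            survivors(rnd.nextInt(survivors.size)))
      if (rnd.nextDouble() < 0.1) mutate(child) else child
    }
    population = survivors ++ children
  }
  population.minBy(cost)
}
{code}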



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12

2019-07-25 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-27714:
---

Assignee: Xianyin Xin

> Support Join Reorder based on Genetic Algorithm when the # of joined tables > 
> 12
> 
>
> Key: SPARK-27714
> URL: https://issues.apache.org/jira/browse/SPARK-27714
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
>
> Currently the join reorder logic is based on dynamic programming, which can in 
> theory find the optimal plan, but the search cost grows rapidly as the # of 
> joined tables grows. It would be better to introduce a genetic algorithm (GA) 
> to overcome this problem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org