[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-23 Thread Tobias Hermann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022719#comment-17022719
 ] 

Tobias Hermann commented on SPARK-30421:


[~dongjoon] No, that's different. To make it equivalent, you'd have to change 
your example to the following:
{quote}import pandas as pd

df = pd.DataFrame(data={'foo': [0, 1], 'bar': ["a", "b"]})
df2 = df.drop(columns=["bar"])
df2[df2["bar"] == "a"]
{quote}
And that correctly results in
{quote}KeyError: 'bar'
{quote}
In Spark, however, the following code works without error:
{quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
val df2 = df.drop("bar")
df2.where($"bar" === "a").show
{quote}

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30421) Dropped columns still available for filtering

2020-01-23 Thread Tobias Hermann (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tobias Hermann updated SPARK-30421:
---
Comment: was deleted

(was: [~dongjoon] Thanks, I think that's not good. So I just opened a Pandas 
issue too. :D

[https://github.com/pandas-dev/pandas/issues/31272])

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-23 Thread Tobias Hermann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022715#comment-17022715
 ] 

Tobias Hermann commented on SPARK-30421:


[~dongjoon] Thanks, I think that's not good. So I just opened a Pandas issue 
too. :D

[https://github.com/pandas-dev/pandas/issues/31272]

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30631) Mitigate SQL injections - can't parameterize query parameters for JDBC connectors

2020-01-23 Thread Jorge (Jira)
Jorge created SPARK-30631:
-

 Summary: Mitigate SQL injections - can't parameterize query 
parameters for JDBC connectors
 Key: SPARK-30631
 URL: https://issues.apache.org/jira/browse/SPARK-30631
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Jorge


One of the options for reading from a JDBC connection is a query.

Sometimes this query is parameterized (e.g., column names, values, etc.).

Spark's JDBC source does not support parameterized SQL queries, which puts the 
burden of escaping SQL on the developer. This burden is unnecessary and a 
security risk.

Very often, drivers provide a specific API to securely parameterize SQL 
statements.

This issue proposes allowing developers to pass "query" and "parameters" in 
the JDBC options, so that it is the driver, not the developer, that escapes 
the parameters.
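
For illustration, a minimal Scala sketch of the difference. The "parameters" 
option in the second reader does not exist today and only illustrates what this 
ticket proposes; the URL, table, and column names are made up:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val userInput = "O'Brien"  // untrusted value coming from outside

// Today: the developer interpolates and escapes the value by hand -- injection-prone.
val unsafeReader = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/mydb")
  .option("query", s"SELECT * FROM people WHERE last_name = '$userInput'")
// .load() would run the interpolated SQL as-is

// Proposed (hypothetical): pass a placeholder query plus the values, and let the
// driver bind them the way a PreparedStatement would.
val proposedReader = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/mydb")
  .option("query", "SELECT * FROM people WHERE last_name = ?")
  .option("parameters", userInput)  // hypothetical option from this proposal
// .load() would let the driver bind the value securely
{code}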



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30630) Deprecate numTrees in GBT

2020-01-23 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-30630:
--

 Summary: Deprecate numTrees in GBT
 Key: SPARK-30630
 URL: https://issues.apache.org/jira/browse/SPARK-30630
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.5, 3.0.0
Reporter: Huaxin Gao


Currently, GBT has
{code:java}
/**
 * Number of trees in ensemble
 */
@Since("2.0.0")
val getNumTrees: Int = trees.length{code}
and
{code:java}
/** Number of trees in ensemble */
val numTrees: Int = trees.length{code}
I will deprecate numTrees in 2.4.5 and remove it in 3.0.0.
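
A minimal sketch of what the deprecation might look like, on a toy class rather 
than Spark's actual GBT model classes; the annotation message and version 
strings are illustrative only:
{code:scala}
// Toy stand-in for a GBT model; not Spark's GBTClassificationModel.
class ToyGBTModel(trees: Seq[String]) {

  /** Number of trees in ensemble */
  val getNumTrees: Int = trees.length

  /** Number of trees in ensemble (duplicate accessor slated for removal) */
  @deprecated("Use getNumTrees instead; this will be removed in 3.0.0.", "2.4.5")
  val numTrees: Int = trees.length
}

val model = new ToyGBTModel(Seq("tree1", "tree2", "tree3"))
model.getNumTrees  // preferred accessor
model.numTrees     // still works, but now emits a deprecation warning
{code}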



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30627) Disable all the V2 file sources in Spark 3.0 by default

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30627.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27348
[https://github.com/apache/spark/pull/27348]

> Disable all the V2 file sources in Spark 3.0 by default
> ---
>
> Key: SPARK-30627
> URL: https://issues.apache.org/jira/browse/SPARK-30627
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> There are still some missing parts in the file source V2 framework:
> 1. It doesn't support reporting file scan metrics such as 
> "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. 
> 2. It doesn't support partition pruning with subqueries or dynamic partition 
> pruning.
> As we are approaching code freeze on Jan 31st, I suggest disabling all the V2 
> file sources in Spark 3.0 by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2020-01-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-14643:
-
  Assignee: (was: Josh Rosen)

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.
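
A self-contained sketch of the kind of overload pair the writeup describes (toy 
classes, not Spark's actual API): under Scala 2.12, {{scala.Function2}} compiles 
to a SAM interface, so a Java 8 lambda at a Java call site can satisfy either 
overload.
{code:scala}
// Toy analogue of the Dataset.reduce overloads.
trait ReduceFunction[T] { def call(a: T, b: T): T }

class MiniDataset[T](val data: Seq[T]) {
  def reduce(func: (T, T) => T): T = data.reduce(func)
  def reduce(func: ReduceFunction[T]): T = data.reduce((a, b) => func.call(a, b))
}

// Scala callers can always pick an overload explicitly:
val ds = new MiniDataset(Seq(1, 2, 3))
val sum = ds.reduce(new ReduceFunction[Int] { def call(a: Int, b: Int): Int = a + b })

// From Java compiled against a Scala 2.12 build, `ds.reduce((a, b) -> a + b)`
// matches BOTH overloads (Function2 is now a functional interface too), which is
// the source incompatibility this ticket tracks.
{code}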



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2020-01-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14643:

Priority: Blocker  (was: Major)

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2020-01-23 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14643:

Target Version/s: 3.0.0

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is built against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022694#comment-17022694
 ] 

Dongjoon Hyun commented on SPARK-30421:
---

Technically, Python `pandas` follows the same lazy manner, [~tobias_hermann].
{code}
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.drop(columns=["col1"]).loc[df["col1"] == 1]
   col2
0     3
{code}

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022678#comment-17022678
 ] 

Burak Yavuz commented on SPARK-30612:
-

I prefer SubqueryAlias. I believe we need to support all degrees of the 
user-provided identifier:

SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl

SELECT tbl.foo FROM testcat.ns1.ns2.tbl

should all work.

However I'm not sure if

SELECT spark_catalog.default.tbl.foo FROM tbl

should work. Are my assumptions correct?




> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, it 
> fails to resolve for v2 tables.
> v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022676#comment-17022676
 ] 

Dongjoon Hyun commented on SPARK-30617:
---

Thanks, [~994184...@qq.com]. I added the existing JIRAs. I'd like to recommend 
the following, according to the community guide:
- https://spark.apache.org/contributing.html

1. Please don't set `Fix Versions`. That is set by the committer when the PR 
is finally merged.
2. For `Affected Version`, please set the master branch version number for a 
new feature JIRA. (For now, that's 3.0.0.) Since Apache Spark backports bug 
fixes only, a new feature cannot affect released versions.
3. If possible, please search before creating a JIRA. People usually think in 
similar ways.

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a complex ExternalCatalog which is used for retrieving 
> metadata from multiple heterogeneous databases (such as Elasticsearch and 
> PostgreSQL), so that we can run mixed queries across Hive and our online data.
>  # But because Spark requires the value of spark.sql.catalogImplementation to 
> be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark can remove the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!
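
A minimal Scala sketch of the restriction being described; it only restates what 
the reporter says (the config is validated against in-memory/hive), and the 
rejected custom value is hypothetical:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalogImplementation", "in-memory")  // accepted value
  // .config("spark.sql.catalogImplementation", "my-catalog")  // hypothetical custom
  //   value; per this ticket it is rejected, which is why the reporter has to patch
  //   SparkSession and rebuild Spark to plug in a custom ExternalCatalog.
  .getOrCreate()
{code}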



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-23 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022675#comment-17022675
 ] 

Terry Kim commented on SPARK-30612:
---

Thanks [~brkyvz]

There are two approaches we can take. One is to wrap the v2 table with 
`SubqueryAlias`. The other is to update `DataSourceV2Relation`'s output 
(Seq[AttributeReference]) to carry the qualifier directly (after SPARK-30314). 
Which route should I take?

 

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, it 
> fails to resolve for v2 tables.
> v1 table is fine as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30617:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a complex ExternalCatalog which is used for retrieving 
> metadata from multiple heterogeneous databases (such as Elasticsearch and 
> PostgreSQL), so that we can run mixed queries across Hive and our online data.
>  # But because Spark requires the value of spark.sql.catalogImplementation to 
> be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark can remove the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30617:
--
Fix Version/s: (was: 2.4.6)
   (was: 3.1.0)

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: weiwenda
>Priority: Minor
>
> # We have implemented a complex ExternalCatalog which is used for retrieving 
> metadata from multiple heterogeneous databases (such as Elasticsearch and 
> PostgreSQL), so that we can run mixed queries across Hive and our online data.
>  # But because Spark requires the value of spark.sql.catalogImplementation to 
> be one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark can remove the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30629) cleanClosure on recursive call leads to node stack overflow

2020-01-23 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30629:
--

 Summary: cleanClosure on recursive call leads to node stack 
overflow
 Key: SPARK-30629
 URL: https://issues.apache.org/jira/browse/SPARK-30629
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


This problem surfaced while handling SPARK-22817. In theory there are tests 
covering this problem, but they seem to have been dead for some reason.

Reproducible example

{code:r}
f <- function(x) {
  f(x)
}

newF <- cleanClosure(f)
{code}


Just looking at the {{cleanClosure}} / {{processClosure}} pair, the function 
that is being processed is not added to {{checkedFuncs}} before recursing.
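
A generic sketch of the pattern the fix presumably needs, written in Scala rather 
than SparkR's R code: record a function as seen before descending into its body, 
so a recursive definition terminates instead of overflowing the stack. All names 
here are illustrative.
{code:scala}
import scala.collection.mutable

// Minimal stand-in for a closure/function node.
final class Func(val name: String) {
  val calls = mutable.Buffer.empty[Func]
}

def process(f: Func, checked: mutable.Set[String] = mutable.Set.empty[String]): Unit = {
  // `add` returns false when the name was already present, so a function is never
  // descended into twice -- the bookkeeping step the description says is missing.
  if (checked.add(f.name)) {
    f.calls.foreach(process(_, checked))
  }
}

val f = new Func("f")
f.calls += f        // analogous to `f <- function(x) f(x)`
process(f)          // terminates instead of recursing forever
{code}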



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022631#comment-17022631
 ] 

Dongjoon Hyun commented on SPARK-28921:
---

Thank you for updating, [~thesuperzapper]. What problem did you hit when you 
didn't change the others?
BTW, the 2.4.5 RC2 vote is coming.

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022630#comment-17022630
 ] 

Dongjoon Hyun commented on SPARK-30218:
---

How do you disambiguate them? Could you describe your idea?

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022630#comment-17022630
 ] 

Dongjoon Hyun edited comment on SPARK-30218 at 1/24/20 1:16 AM:


How do you disambiguate them? Could you describe your idea, [~rkins]?


was (Author: dongjoon):
How do you disambiguate them? Could you describe your idea?

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022628#comment-17022628
 ] 

Dongjoon Hyun commented on SPARK-30625:
---

+1

> Add `escapeChar` parameter to the `like` function
> -
>
> Key: SPARK-30625
> URL: https://issues.apache.org/jira/browse/SPARK-30625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> SPARK-28083 supported LIKE ... ESCAPE syntax
> {code:sql}
> spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_';
> true
> {code}
> but the `like` function can accept only 2 parameters. If we pass the third 
> one, it fails with:
> {code:sql}
> spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
> Error in query: Invalid number of arguments for function like. Expected: 2; 
> Found: 3; line 1 pos 7
> {code}
> The ticket aims to support the third parameter in `like` as `escapeChar`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2020-01-23 Thread Mathew Wicks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022625#comment-17022625
 ] 

Mathew Wicks commented on SPARK-28921:
--

It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; 
you must also replace:
* $SPARK_HOME/jars/kubernetes-client-*.jar
* $SPARK_HOME/jars/kubernetes-model-common-*jar
* $SPARK_HOME/jars/kubernetes-model-*.jar 
* $SPARK_HOME/jars/okhttp-*.jar
* $SPARK_HOME/jars/okio-*.jar

With the versions specified in this PR:
https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2020-01-23 Thread Mathew Wicks (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022625#comment-17022625
 ] 

Mathew Wicks edited comment on SPARK-28921 at 1/24/20 1:03 AM:
---

It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; 
you must also replace:
 * $SPARK_HOME/jars/kubernetes-client-*.jar
 * $SPARK_HOME/jars/kubernetes-model-common-*jar
 * $SPARK_HOME/jars/kubernetes-model-*.jar
 * $SPARK_HOME/jars/okhttp-*.jar
 * $SPARK_HOME/jars/okio-*.jar

With the versions specified in this PR:
 
[https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18]


was (Author: thesuperzapper):
It is not enough to replace the kubernetes-client jar in your $SPARK_HOME/jars; 
you must also replace:
* $SPARK_HOME/jars/kubernetes-client-*.jar
* $SPARK_HOME/jars/kubernetes-model-common-*jar
* $SPARK_HOME/jars/kubernetes-model-*.jar 
* $SPARK_HOME/jars/okhttp-*.jar
* $SPARK_HOME/jars/okio-*.jar

With the versions specified in this PR:
https://github.com/apache/spark/commit/65c0a7812b472147c615fb4fe779da9d0a11ff18

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable

2020-01-23 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022617#comment-17022617
 ] 

Terry Kim commented on SPARK-30615:
---

Thanks [~brkyvz] for heads up.

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case-insensitive resolution, the column name in AlterTable may 
> match the table schema without being exactly the same. To ease DS v2 
> implementations, Spark should normalize column names before passing them 
> to v2 catalogs, so that users don't need to care about the case-sensitivity 
> config.
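
For illustration, a rough Scala sketch of what "normalize" would mean here (a 
hypothetical helper, not Spark's actual implementation): map the user-supplied 
name to the exact-cased name stored in the table schema.
{code:scala}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: return the schema's exact-cased column name, if any matches.
def normalizeColumnName(schema: StructType, userCol: String, caseSensitive: Boolean): Option[String] = {
  val matches: String => Boolean =
    if (caseSensitive) _ == userCol else _.equalsIgnoreCase(userCol)
  schema.fieldNames.find(matches)
}

val schema = StructType(Seq(StructField("Id", IntegerType), StructField("Name", StringType)))
normalizeColumnName(schema, "name", caseSensitive = false)  // Some("Name") -- what a v2 catalog would receive
normalizeColumnName(schema, "name", caseSensitive = true)   // None -- would surface as an analysis error
{code}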



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30628) File source V2: Support partition pruning with subqueries

2020-01-23 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-30628:
--

 Summary: File source V2: Support partition pruning with subqueries
 Key: SPARK-30628
 URL: https://issues.apache.org/jira/browse/SPARK-30628
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-23 Thread Rahul Kumar Challapalli (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022596#comment-17022596
 ] 

Rahul Kumar Challapalli commented on SPARK-30218:
-

We are currently detecting that there is a self-join, but the OP seems to be 
asking why Spark doesn't disambiguate the columns. So I am not sure whether we 
can close this issue. Thoughts?

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30627) Disable all the V2 file sources in Spark 3.0 by default

2020-01-23 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-30627:
--

 Summary: Disable all the V2 file sources in Spark 3.0 by default
 Key: SPARK-30627
 URL: https://issues.apache.org/jira/browse/SPARK-30627
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


There are still some missing parts in the file source V2 framework:
1. It doesn't support reporting file scan metrics such as 
"numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. 
2. It doesn't support partition pruning with subqueries or dynamic partition 
pruning.

As we are approaching code freeze on Jan 31st, I suggest disabling all the V2 file 
sources in Spark 3.0 by default.
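
If this lands, users who still want the V2 paths can presumably opt back in 
through the V1 fallback list; the config name and semantics below are assumed 
from the current 3.0 work and should be double-checked against the final release:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// An empty list means no built-in file source falls back to V1, i.e. all of them
// use the V2 implementations again (assumed semantics of this config).
spark.conf.set("spark.sql.sources.useV1SourceList", "")
{code}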



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30298) bucket join cannot work for self-join with views

2020-01-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30298.
--
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/26943

> bucket join cannot work for self-join with views
> 
>
> Key: SPARK-30298
> URL: https://issues.apache.org/jira/browse/SPARK-30298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiaoju Wu
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.0.0
>
>
> This UT may fail at the last line:
> {code:java}
> test("bucket join cannot work for self-join with views") {
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "1") {
>   withTable("t1") {
> val df = (0 until 20).map(i => (i, i)).toDF("i", "j").as("df")
> df.write
>   .format("parquet")
>   .bucketBy(8, "i")
>   .saveAsTable("t1")
> sql(s"create view v1 as select * from t1").collect()
> val plan1 = sql("SELECT * FROM t1 a JOIN t1 b ON a.i = 
> b.i").queryExecution.executedPlan
> assert(plan1.collect { case exchange : ShuffleExchangeExec => 
> exchange }.isEmpty)
> val plan2 = sql("SELECT * FROM t1 a JOIN v1 b ON a.i = 
> b.i").queryExecution.executedPlan
> assert(plan2.collect { case exchange : ShuffleExchangeExec => 
> exchange }.isEmpty)
>   }
> }
>   }
> {code}
> This is because the view adds a Project with an Alias, so the join's required 
> distribution is expressed in terms of the Alias, but ProjectExec propagates its 
> child's output partitioning without the Alias. As a result, the satisfies check 
> in EnsureRequirements cannot be met.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28396) Add PathCatalog for data source V2

2020-01-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-28396.

Resolution: Won't Fix

> Add PathCatalog for data source V2
> --
>
> Key: SPARK-28396
> URL: https://issues.apache.org/jira/browse/SPARK-28396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Add PathCatalog for data source V2, so that:
> 1. We can convert SaveMode in DataFrameWriter into catalog table operations, 
> instead of supporting SaveMode in file source V2.
> 2. Support create-table SQL statements like "CREATE TABLE orc.'path'"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function

2020-01-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022579#comment-17022579
 ] 

Takeshi Yamamuro commented on SPARK-30625:
--

Yea, supporting that looks fine to me.

> Add `escapeChar` parameter to the `like` function
> -
>
> Key: SPARK-30625
> URL: https://issues.apache.org/jira/browse/SPARK-30625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> SPARK-28083 supported LIKE ... ESCAPE syntax
> {code:sql}
> spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_';
> true
> {code}
> but the `like` function can accept only 2 parameters. If we pass the third 
> one, it fails with:
> {code:sql}
> spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
> Error in query: Invalid number of arguments for function like. Expected: 2; 
> Found: 3; line 1 pos 7
> {code}
> The ticket aims to support the third parameter in `like` as `escapeChar`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022577#comment-17022577
 ] 

Burak Yavuz commented on SPARK-30615:
-

I actually had a PR in progress on this. Let me push that

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case-insensitive resolution, the column name in AlterTable may 
> match the table schema without being exactly the same. To ease DS v2 
> implementations, Spark should normalize column names before passing them 
> to v2 catalogs, so that users don't need to care about the case-sensitivity 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022576#comment-17022576
 ] 

Nicholas Chammas commented on SPARK-19248:
--

Thanks for getting to the bottom of the issue, [~jeff.w.evans], and for 
providing a workaround.

Would an appropriate solution be to make {{escapedStringLiterals}} default to 
{{True}}? Or does that cause other problems?
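
For reference, a sketch of the referenced workaround, assuming it is the legacy 
parser flag (available since roughly Spark 2.2); the exact behaviour should be 
verified against the version in use:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Restore 1.6-style (Hive-compatible) handling of backslashes in SQL string
// literals, so patterns like '( |\.)*' reach regexp_replace unmodified.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")

spark.sql("""SELECT regexp_replace('..   5.', '( |\.)*', '') AS col""").show()
// Expected to print "5", matching the 1.6.2 behaviour quoted below.
{code}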

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2's regex execution. Using PySpark in 1.6.2, 
> we get the following expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked both regexes in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I am not able 
> to confirm on 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2020-01-23 Thread Jim Kleckner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022569#comment-17022569
 ] 

Jim Kleckner commented on SPARK-30275:
--

* Builds on my Mac use whatever I have installed on my machine, whereas a 
well-defined remote CI system eliminates that variability.
 * The build process doesn't load my local system.
 * A push is just a git push rather than an image push, which from home can take 
a long time since my ISP has very wimpy upload speeds.

Obviously some CI/CD tooling exists for Spark testing and release on the back 
end, but it isn't available to most people.

> Add gitlab-ci.yml file for reproducible builds
> --
>
> Key: SPARK-30275
> URL: https://issues.apache.org/jira/browse/SPARK-30275
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jim Kleckner
>Priority: Minor
>
> It would be desirable to have public reproducible builds such as provided by 
> gitlab or others.
>  
> Here is a candidate patch set to build spark using gitlab-ci:
> * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> Let me know if there is interest in a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28403) Executor Allocation Manager can add an extra executor when speculative tasks

2020-01-23 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022527#comment-17022527
 ] 

Thomas Graves commented on SPARK-28403:
---

So after looking at the PR for this, this logic may have been an attempt to get 
executors on different hosts. The speculation logic in the scheduler is such 
that it will only run a speculative task on a different host than the one where 
the current task is running.

> Executor Allocation Manager can add an extra executor when speculative tasks
> 
>
> Key: SPARK-28403
> URL: https://issues.apache.org/jira/browse/SPARK-28403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like SPARK-19326 added a bug in the executor allocation manager 
> where it adds an extra executor when it shouldn't: when we have pending 
> speculative tasks but the target number didn't change. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377]
> It doesn't look like this extra executor is necessary since the pending 
> speculative tasks are already accounted for.
> See the questioning of this on the PR at:
> https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30603) Keep the reserved properties of namespaces and tables private

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30603:
-

Assignee: Kent Yao

> Keep the reserved properties of namespaces and tables private
> -
>
> Key: SPARK-30603
> URL: https://issues.apache.org/jira/browse/SPARK-30603
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> the reserved properties of namespaces and tables should be private



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30603) Keep the reserved properties of namespaces and tables private

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30603.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27318
[https://github.com/apache/spark/pull/27318]

> Keep the reserved properties of namespaces and tables private
> -
>
> Key: SPARK-30603
> URL: https://issues.apache.org/jira/browse/SPARK-30603
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> the reserved properties of namespaces and tables should be private



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30570) Update scalafmt to 1.0.3 with onlyChangedFiles feature

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30570.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27279
[https://github.com/apache/spark/pull/27279]

> Update scalafmt to 1.0.3 with onlyChangedFiles feature
> --
>
> Key: SPARK-30570
> URL: https://issues.apache.org/jira/browse/SPARK-30570
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 3.0.0
>
>
> [https://github.com/SimonJPegg/mvn_scalafmt/releases/tag/v1.0.3]
> added an option onlyChangedFiles which was one of the things holding back the 
> upgrade in SPARK-29293



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30570) Update scalafmt to 1.0.3 with onlyChangedFiles feature

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30570:
-

Assignee: Cody Koeninger

> Update scalafmt to 1.0.3 with onlyChangedFiles feature
> --
>
> Key: SPARK-30570
> URL: https://issues.apache.org/jira/browse/SPARK-30570
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Minor
>
> [https://github.com/SimonJPegg/mvn_scalafmt/releases/tag/v1.0.3]
> added an option onlyChangedFiles which was one of the things holding back the 
> upgrade in SPARK-29293



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 8:06 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need to be escaped), so you would need the 
pattern to be {{'( |\\.)*'}}

By the way, this isn't Python-specific behavior.  Even if you use a Scala 
session, and use the {{expr}} expression (which I don't see in the sample 
sessions above), you will notice the same thing happening.

{code}
val df = spark.createDataFrame(Seq((0, "..   5."))).toDF("id","col")

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+-----------------------------+
|regexp_replace(col, ( |.)*, )|
+-----------------------------+
|                             |
+-----------------------------+

spark.conf.set("spark.sql.parser.escapedStringLiterals", true)

df.selectExpr("""regexp_replace(col, "( |\.)*", "")""").show()
+------------------------------+
|regexp_replace(col, ( |\.)*, )|
+------------------------------+
|                             5|
+------------------------------+
{code}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need to be escaped), so you would need the 
pattern to be {{'( |\\.)*'}}

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character, so you would need the pattern to be 
{{'( |\\.)*'}}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans edited comment on SPARK-19248 at 1/23/20 7:53 PM:
-

After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character (and of course, in a string literal, 
the backslashes themselves also need to be escaped), so you would need the 
pattern to be {{'( |\\.)*'}}


was (Author: jeff.w.evans):
After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.  Otherwise, you need to escape the 
literal backslash before the dot character, so you would need the pattern to be 
{{'( |\\.)*'}}

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19248) Regex_replace works in 1.6 but not in 2.0

2020-01-23 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022430#comment-17022430
 ] 

Jeff Evans commented on SPARK-19248:


After some debugging, I figured out what's going on here.  The crux of this is 
the {{spark.sql.parser.escapedStringLiterals}} config setting, introduced under 
SPARK-20399.  This behavior changed in 2.0 (see 
[here|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L483]).
  If you start your PySpark sessions described above with this line:

{{spark.conf.set("spark.sql.parser.escapedStringLiterals", True)}}

then you should see the 1.6 behavior.

> Regex_replace works in 1.6 but not in 2.0
> -
>
> Key: SPARK-19248
> URL: https://issues.apache.org/jira/browse/SPARK-19248
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.4.3
>Reporter: Lucas Tittmann
>Priority: Major
>  Labels: correctness
>
> We found an error in Spark 2.0.2 execution of Regex. Using PySpark In 1.6.2, 
> we get the following, expected behaviour:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'5')]
> {noformat}
> In Spark 2.0.2, with the same code, we get the following:
> {noformat}
> df = sqlContext.createDataFrame([('..   5.',)], ['col'])
> dfout = df.selectExpr(*["regexp_replace(col, '[ \.]*', '') AS col"]).collect()
> z.show(dfout)
> >>> [Row(col=u'5')]
> dfout2 = df.selectExpr(*["regexp_replace(col, '( |\.)*', '') AS 
> col"]).collect()
> z.show(dfout2)
> >>> [Row(col=u'')]
> {noformat}
> As you can see, the second regex shows different behaviour depending on the 
> Spark version. We checked the regex in Java, and both should be correct and 
> work. Therefore, regex execution in 2.0.2 seems to be erroneous. I do not 
> have the possibility to confirm in 2.1 at the moment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env

2020-01-23 Thread Jiaxin Shan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022426#comment-17022426
 ] 

Jiaxin Shan commented on SPARK-30626:
-

I have an improvement change for this; let me create a PR.

> [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Priority: Minor
>
> This should be a minor improvement.
> The use case is that we want to look up this environment variable, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have it, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env

2020-01-23 Thread Jiaxin Shan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaxin Shan updated SPARK-30626:

Summary: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env  
(was: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID)

> [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID env
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Priority: Minor
>
> This should be a minor improvement.
> The use case is that we want to look up this environment variable, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have it, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID

2020-01-23 Thread Jiaxin Shan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaxin Shan updated SPARK-30626:

Description: 
This should be a minor improvement.

The use case is that we want to look up this environment variable, create an 
application folder, and redirect driver logs to that folder. Executors already 
have it, and we want to make the same change for the driver. 

 
{code:java}
Limits:
 cpu: 1024m
 memory: 896Mi
 Requests:
 cpu: 1
 memory: 896Mi
Environment:
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
 SPARK_CONF_DIR: /opt/spark/conf{code}
 

[https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]

We need SPARK_APPLICATION_ID inside the pod to organize logs 

  was:
This should be a minor improvement.

The use case is that we want to look up this environment variable, create an 
application folder, and redirect driver logs to that folder. Executors already 
have it, and we want to make the same change for the driver. 

```
 Limits:
 cpu: 1024m
 memory: 896Mi
 Requests:
 cpu: 1
 memory: 896Mi
 Environment:
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
 SPARK_CONF_DIR: /opt/spark/conf

```

 

https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79

We need SPARK_APPLICATION_ID inside the pod to organize logs 


> [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID
> 
>
> Key: SPARK-30626
> URL: https://issues.apache.org/jira/browse/SPARK-30626
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jiaxin Shan
>Priority: Minor
>
> This should be a minor improvement.
> The use case is that we want to look up this environment variable, create an 
> application folder, and redirect driver logs to that folder. Executors 
> already have it, and we want to make the same change for the driver. 
>  
> {code:java}
> Limits:
>  cpu: 1024m
>  memory: 896Mi
>  Requests:
>  cpu: 1
>  memory: 896Mi
> Environment:
>  SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
>  SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
>  SPARK_CONF_DIR: /opt/spark/conf{code}
>  
> [https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79]
> We need SPARK_APPLICATION_ID inside the pod to organize logs 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30626) [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID

2020-01-23 Thread Jiaxin Shan (Jira)
Jiaxin Shan created SPARK-30626:
---

 Summary: [K8S] Spark driver pod doesn't have SPARK_APPLICATION_ID
 Key: SPARK-30626
 URL: https://issues.apache.org/jira/browse/SPARK-30626
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.4, 3.0.0
Reporter: Jiaxin Shan


This should be a minor improvement.

The use case is that we want to look up this environment variable, create an 
application folder, and redirect driver logs to that folder. Executors already 
have it, and we want to make the same change for the driver. 

```
 Limits:
 cpu: 1024m
 memory: 896Mi
 Requests:
 cpu: 1
 memory: 896Mi
 Environment:
 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
 SPARK_LOCAL_DIRS: /var/data/spark-9c315655-aba4-47fb-821c-30268d02af7e
 SPARK_CONF_DIR: /opt/spark/conf

```

 

https://github.com/apache/spark/blob/afe70b3b5321439318a456c7d19b7074171a286a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala#L73-L79

We need SPARK_APPLICATION_ID inside the pod to organize logs 
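
A rough sketch of what the requested change could look like with the fabric8 
client that Spark uses to build pod specs (illustrative only; the real change 
would live in BasicDriverFeatureStep, and the appId value here is a placeholder):
{code:scala}
import io.fabric8.kubernetes.api.model.{ContainerBuilder, EnvVarBuilder}

// Illustrative sketch: add SPARK_APPLICATION_ID to a driver container spec,
// mirroring what executor pods already receive.
def withApplicationIdEnv(container: ContainerBuilder, appId: String): ContainerBuilder =
  container.addToEnv(
    new EnvVarBuilder()
      .withName("SPARK_APPLICATION_ID")
      .withValue(appId)
      .build())
{code}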



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-01-23 Thread Ivelin Tchangalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022354#comment-17022354
 ] 

Ivelin Tchangalov commented on SPARK-27913:
---

I'm curious if there's any progress or solution for this issue.
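
The description below ties the regression to moving spark.sql.orc.impl from 
'hive' to 'native'. Until the native path forwards the requested schema, one 
session-level fallback (a workaround sketch, assuming the config may be set at 
runtime; otherwise pass it via --conf at submit time) is:
{code:scala}
// Workaround sketch only: fall back to the Hive ORC reader for sessions
// affected by the schema-evolution regression described below.
spark.conf.set("spark.sql.orc.impl", "hive")
{code}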

> Spark SQL's native ORC reader implements its own schema evolution
> -
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Owen O'Malley
>Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30625) Add `escapeChar` parameter to the `like` function

2020-01-23 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022398#comment-17022398
 ] 

Maxim Gekk commented on SPARK-30625:


I have implemented the feature but I am not sure it is useful. Should I submit 
a PR for that, WDYT [~dongjoon] [~Gengliang.Wang] [~cloud_fan] [~beliefer] 
[~maropu] [~hyukjin.kwon] ?

> Add `escapeChar` parameter to the `like` function
> -
>
> Key: SPARK-30625
> URL: https://issues.apache.org/jira/browse/SPARK-30625
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> SPARK-28083 supported LIKE ... ESCAPE syntax
> {code:sql}
> spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_';
> true
> {code}
> but the `like` function can accept only 2 parameters. If we pass the third 
> one, it fails with:
> {code:sql}
> spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
> Error in query: Invalid number of arguments for function like. Expected: 2; 
> Found: 3; line 1 pos 7
> {code}
> The ticket aims to support the third parameter in `like` as `escapeChar`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2020-01-23 Thread Sofia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022348#comment-17022348
 ] 

Sofia commented on SPARK-27282:
---

No idea [~tgraves], I'm still working with spark-sql and spark-core ==> 
2.3.2.3.1.0.0-78 (for HDP 3.1)

and scala ==> 2.11.8.

When I tried debugging (at a first level) using explain(true), I found that 
the main cause of this error is the misuse of *+ReuseExchange in the optimized 
plan+*. I used a workaround to handle this issue.
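
The comment does not say which workaround was used. One possibility, if a 
wrongly reused exchange is indeed the cause, is to disable exchange reuse for 
the affected session; treat this as a diagnostic aid rather than a fix:
{code:scala}
// Diagnostic sketch: turn off exchange reuse so each UNION branch keeps its
// own shuffle instead of sharing a cached one.
spark.conf.set("spark.sql.exchange.reuse", "false")
{code}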

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Blocker
>  Labels: correctness
>
> When using UNION clause after a GROUP BY clause in spark, the results 
> obtained are wrong.
> The following example explicit this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
>  And i obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when i remove the last row of the table, i obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30625) Add `escapeChar` parameter to the `like` function

2020-01-23 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30625:
--

 Summary: Add `escapeChar` parameter to the `like` function
 Key: SPARK-30625
 URL: https://issues.apache.org/jira/browse/SPARK-30625
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


SPARK-28083 supported LIKE ... ESCAPE syntax
{code:sql}
spark-sql> SELECT '_Apache Spark_' like '__%Spark__' escape '_';
true
{code}
but the `like` function can accept only 2 parameters. If we pass the third one, 
it fails with:
{code:sql}
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; 
Found: 3; line 1 pos 7
{code}
The ticket aims to support the third parameter in `like` as `escapeChar`.
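
Until the third parameter is supported, the ESCAPE syntax from SPARK-28083 is 
already reachable from the DataFrame API through expr; a small sketch (the 
column name s is just an example):
{code:scala}
import org.apache.spark.sql.functions.expr

// Sketch: the LIKE ... ESCAPE syntax already works inside expr(); the ticket
// is about exposing the same escape character as a third argument of like().
val matched = spark.sql("SELECT '_Apache Spark_' AS s")
  .filter(expr("s LIKE '__%Spark__' ESCAPE '_'"))
{code}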



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-23 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022343#comment-17022343
 ] 

Burak Yavuz commented on SPARK-30612:
-

SPARK-30314 should help make this work easier

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, 
> resolution fails for v2 tables.
> v1 tables are fine, as we always wrap the v1 relation with a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads

2020-01-23 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022342#comment-17022342
 ] 

Min Shen commented on SPARK-29206:
--

With more investigation into the Netty side issues, we are addressing this with 
a different approach in https://issues.apache.org/jira/browse/SPARK-30512.

> Number of shuffle Netty server threads should be a multiple of number of 
> chunk fetch handler threads
> 
>
> Key: SPARK-29206
> URL: https://issues.apache.org/jira/browse/SPARK-29206
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool 
> to handle the slow-to-process chunk fetch requests in order to improve the 
> responsiveness of shuffle service for RPC requests.
> Initially, we thought by making the number of Netty server threads larger 
> than the number of chunk fetch handler threads, it would reserve some threads 
> for RPC requests thus resolving the various RPC request timeout issues we 
> experienced previously. The solution worked in our cluster initially. 
> However, as the number of Spark applications in our cluster continues to 
> increase, we saw the RPC request (SASL authentication specifically) timeout 
> issue again:
> {noformat}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>  {noformat}
> After further investigation, we realized that as the number of concurrent 
> clients connecting to a shuffle service increases, it becomes _VERY_ 
> important to configure the number of Netty server threads and number of chunk 
> fetch handler threads correctly. Specifically, the number of Netty server 
> threads needs to be a multiple of the number of chunk fetch handler threads. 
> The reason is explained in details below:
> When a channel is established on the Netty server, it is registered with both 
> the Netty server default EventLoopGroup and the chunk fetch handler 
> EventLoopGroup. Once registered, this channel sticks with a given thread in 
> both EventLoopGroups, i.e. all requests from this channel are going to be 
> handled by the same thread. Right now, Spark shuffle Netty server uses the 
> default Netty strategy to select a thread from a EventLoopGroup to be 
> associated with a new channel, which is simply round-robin (Netty's 
> DefaultEventExecutorChooserFactory).
> In SPARK-24355, with the introduced chunk fetch handler thread pool, all 
> chunk fetch requests from a given channel will be first added to the task 
> queue of the chunk fetch handler thread associated with that channel. When 
> the requests get processed, the chunk fetch request handler thread will 
> submit a task to the task queue of the Netty server thread that's also 
> associated with this channel. If the number of Netty server threads is not a 
> multiple of the number of chunk fetch handler threads, it would become a 
> problem when the server has a large number of concurrent connections.
> Assume we configure the number of Netty server threads as 40 and the 
> percentage of chunk fetch handler threads as 87, which leads to 35 chunk 
> fetch handler threads. Then according to the round-robin policy, channel 0, 
> 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty 
> server thread in the default EventLoopGroup. However, since the chunk fetch 
> handler thread pool only has 35 threads, out of these 8 channels, only 
> channel 0 and 280 will be associated with the same chunk fetch handler 
> thread. Thus, channel 0, 40, 80, 120, 160, 200, 240 will all be associated 
> with different chunk fetch handler threads but associated with the same Netty 
> server thread. This means, the 7 different chunk fetch handler threads 
> associated with these channels could potentially submit tasks to the task 
> queue of the same Netty server thread at 

[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2020-01-23 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022296#comment-17022296
 ] 

Sean R. Owen commented on SPARK-30275:
--

How is it different from just building the software normally? I get that maybe 
it pushes the buttons for you to run mvn package, but just weighing that 
against maintaining yet another integration.

> Add gitlab-ci.yml file for reproducible builds
> --
>
> Key: SPARK-30275
> URL: https://issues.apache.org/jira/browse/SPARK-30275
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jim Kleckner
>Priority: Minor
>
> It would be desirable to have public reproducible builds such as provided by 
> gitlab or others.
>  
> Here is a candidate patch set to build spark using gitlab-ci:
> * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> Let me know if there is interest in a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28794) Document CREATE TABLE in SQL Reference.

2020-01-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-28794:


Assignee: pavithra ramachandran

> Document CREATE TABLE in SQL Reference.
> ---
>
> Key: SPARK-28794
> URL: https://issues.apache.org/jira/browse/SPARK-28794
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28794) Document CREATE TABLE in SQL Reference.

2020-01-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-28794.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26759
[https://github.com/apache/spark/pull/26759]

> Document CREATE TABLE in SQL Reference.
> ---
>
> Key: SPARK-28794
> URL: https://issues.apache.org/jira/browse/SPARK-28794
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28794) Document CREATE TABLE in SQL Reference.

2020-01-23 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-28794:
-
Priority: Minor  (was: Major)

> Document CREATE TABLE in SQL Reference.
> ---
>
> Key: SPARK-28794
> URL: https://issues.apache.org/jira/browse/SPARK-28794
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30620) avoid unnecessary serialization in AggregateExpression

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30620.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27342
[https://github.com/apache/spark/pull/27342]

> avoid unnecessary serialization in AggregateExpression
> --
>
> Key: SPARK-30620
> URL: https://issues.apache.org/jira/browse/SPARK-30620
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30556:
--
Affects Version/s: 2.3.4

> Copy sparkContext.localproperties to child thread 
> inSubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing jobs and the thread pools have idle threads which are 
> reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext as they are the 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has threads which are idle and are reused for a 
> subsequent new query, the thread-local properties will not be inherited from the 
> spark context (thread properties are inherited only on thread creation), so the 
> threads end up having old or no properties set. This will cause taskset properties 
> to be missing when properties are transferred by the child thread via 
> {{sparkContext.runJob/submitJob}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext

2020-01-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17022273#comment-17022273
 ] 

Dongjoon Hyun commented on SPARK-30556:
---

Thank you for confirming, [~ajithshetty].
This is backported to branch-2.4 via https://github.com/apache/spark/pull/27340

> Copy sparkContext.localproperties to child thread 
> inSubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing jobs and the thread pools have idle threads which are 
> reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext as they are the 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has threads which are idle and are reused for a 
> subsequent new query, the thread-local properties will not be inherited from the 
> spark context (thread properties are inherited only on thread creation), so the 
> threads end up having old or no properties set. This will cause taskset properties 
> to be missing when properties are transferred by the child thread via 
> {{sparkContext.runJob/submitJob}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30556:
--
Fix Version/s: 2.4.5

> Copy sparkContext.localproperties to child thread 
> inSubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing jobs and the thread pools have idle threads which are 
> reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext as they are the 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has threads which are idle and are reused for a 
> subsequent new query, the thread-local properties will not be inherited from the 
> spark context (thread properties are inherited only on thread creation), so the 
> threads end up having old or no properties set. This will cause taskset properties 
> to be missing when properties are transferred by the child thread via 
> {{sparkContext.runJob/submitJob}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30601) Add a Google Maven Central as a primary repository

2020-01-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30601:
--
Fix Version/s: 2.4.5

> Add a Google Maven Central as a primary repository
> --
>
> Key: SPARK-30601
> URL: https://issues.apache.org/jira/browse/SPARK-30601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> See 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html]
> This Jira aims to switch the main repo to Google Maven Central.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30624) JDBCV2 with catalog functionalities

2020-01-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30624:
---

 Summary: JDBCV2 with catalog functionalities
 Key: SPARK-30624
 URL: https://issues.apache.org/jira/browse/SPARK-30624
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30623) Spark external shuffle allow disable of separate event loop group

2020-01-23 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30623:
-

 Summary: Spark external shuffle allow disable of separate event 
loop group
 Key: SPARK-30623
 URL: https://issues.apache.org/jira/browse/SPARK-30623
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.4.4, 3.0.0
Reporter: Thomas Graves


In SPARK-24355, changes were made to add a separate event loop group for 
processing ChunkFetchRequests; this allows the other threads to handle 
regular connection requests when the configuration value is set. However, this 
seems to have added some latency (see the comments at the end of PR 22173).

To help with this, we could make sure the secondary event loop group isn't used 
when the configuration spark.shuffle.server.chunkFetchHandlerThreadsPercent 
isn't explicitly set. This should result in the same behavior as before.
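
A sketch of the intended opt-in, with an illustrative value; only deployments 
that explicitly set the key would get the separate chunk fetch event loop 
group, and leaving it unset would keep the pre-SPARK-24355 path:
{code:scala}
import org.apache.spark.SparkConf

// Illustrative opt-in: set the percentage explicitly to keep the separate
// chunk fetch handler pool; leave the key unset to disable it.
val conf = new SparkConf()
  .set("spark.shuffle.server.chunkFetchHandlerThreadsPercent", "90")
{code}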



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-23 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-30557.
--
Resolution: Won't Fix

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30622) commands should return dummy statistics

2020-01-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30622:
---

 Summary: commands should return dummy statistics
 Key: SPARK-30622
 URL: https://issues.apache.org/jira/browse/SPARK-30622
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30605) move defaultNamespace from SupportsNamespace to CatalogPlugin

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30605.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27319
[https://github.com/apache/spark/pull/27319]

> move defaultNamespace from SupportsNamespace to CatalogPlugin
> -
>
> Key: SPARK-30605
> URL: https://issues.apache.org/jira/browse/SPARK-30605
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30621) Dynamic Pruning thread propagates the localProperties to task

2020-01-23 Thread Ajith S (Jira)
Ajith S created SPARK-30621:
---

 Summary: Dynamic Pruning thread propagates the localProperties to 
task
 Key: SPARK-30621
 URL: https://issues.apache.org/jira/browse/SPARK-30621
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Ajith S


Local properties set via sparkContext are not available as TaskContext 
properties when executing parallel jobs and the thread pools have idle threads.

Explanation:
When executing parallel jobs via SubqueryBroadcastExec, the {{relationFuture}} 
is evaluated via a separate thread. The threads inherit the {{localProperties}} 
from sparkContext as they are the child threads.
These threads are controlled via the executionContext (thread pools). Each 
thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
In scenarios where the thread pool has threads which are idle and are reused for a 
subsequent new query, the thread-local properties will not be inherited from the 
spark context (thread properties are inherited only on thread creation), so the 
threads end up having old or no properties set. This will cause taskset properties 
to be missing when properties are transferred by the child thread.
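
A minimal JVM-only sketch (plain Scala, not Spark code) of why idle, reused 
pool threads end up with stale values: inheritable thread-locals are copied 
only when a thread is created.
{code:scala}
import java.util.concurrent.Executors

object StaleInheritedPropsDemo {
  // Stand-in for sparkContext.localProperties, which is also an InheritableThreadLocal.
  val prop = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1) // one reusable worker thread

    prop.set("query-1")
    pool.submit(new Runnable {
      def run(): Unit = println(s"first task sees:  ${prop.get}") // query-1, inherited at thread creation
    }).get()

    prop.set("query-2")
    pool.submit(new Runnable {
      def run(): Unit = println(s"second task sees: ${prop.get}") // still query-1: the thread was reused
    }).get()

    pool.shutdown()
  }
}
{code}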



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30620) avoid unnecessary serialization in AggregateExpression

2020-01-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30620:
---

 Summary: avoid unnecessary serialization in AggregateExpression
 Key: SPARK-30620
 URL: https://issues.apache.org/jira/browse/SPARK-30620
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30592) Interval support for csv and json functions

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30592:

Description: to_json already supports intervals in Spark 2.4. To be 
consistent, we should support intervals in from_json, from_csv and to_csv as 
well.  (was: to_csv
from_csv
to_json
from_json)
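
Taking the description at its word that to_json already handles intervals in 
2.4, a sketch of the round trip this ticket asks for (the from_json call is the 
target behavior, not a statement about what currently works):
{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch of the consistency goal: serialize an interval with to_json and
// parse it back with an interval field in the from_json schema.
val df = spark.sql("SELECT interval 1 day AS i")
val asJson = df.select(to_json(struct(col("i"))).as("j"))
val roundTrip = asJson.select(
  from_json(col("j"), new StructType().add("i", CalendarIntervalType)))
{code}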

> Interval support for csv and json functions
> ---
>
> Key: SPARK-30592
> URL: https://issues.apache.org/jira/browse/SPARK-30592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> to_json already supports intervals in Spark 2.4. To be consistent, we should 
> support intervals in from_json, from_csv and to_csv as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30546) Make interval type more future-proof

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30546.
-
Fix Version/s: 3.0.0
 Assignee: Kent Yao
   Resolution: Fixed

> Make interval type more future-proof
> 
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> We've decided to not follow the SQL standard to define the interval type in 
> 3.0. We should try our best to hide intervals from data sources/external 
> catalogs as much as possible, to not leak internals to external systems.
> In Spark 2.4, intervals are exposed in the following places:
> 1. The `CalendarIntervalType` is public
> 2. `Column.cast` accepts `CalendarIntervalType` and can cast string to 
> interval.
> 3. `DataFrame.collect` can return `CalendarInterval` objects.
> 4. UDFs can take `CalendarInterval` as input.
> 5. Data sources can return InternalRow directly, which may contain 
> `CalendarInterval`.
> In Spark 3.0, we don't want to break Spark 2.4 applications, but we should 
> not expose intervals wider than 2.4. In general, we should avoid leaking 
> intervals to DS v2 and catalog plugins. We should also revert some 
> PostgresSQL specific interval features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30546) Make interval type more future-proof

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30546:

Description: 
We've decided not to follow the SQL standard when defining the interval type in 
3.0. We should try our best to hide intervals from data sources/external 
catalogs as much as possible, so as not to leak internals to external systems.

In Spark 2.4, intervals are exposed in the following places:
1. The `CalendarIntervalType` is public.
2. `Column.cast` accepts `CalendarIntervalType` and can cast string to interval.
3. `DataFrame.collect` can return `CalendarInterval` objects.
4. UDFs can take `CalendarInterval` as input.
5. Data sources can return `InternalRow` directly, which may contain 
`CalendarInterval`.

In Spark 3.0, we don't want to break Spark 2.4 applications, but we should not 
expose intervals more widely than in 2.4. In general, we should avoid leaking intervals 
to DS v2 and catalog plugins. We should also revert some PostgreSQL-specific 
interval features.

  was:
Before 3.0 we may make some efforts to make the current interval type
more future-proof, e.g.:
1. Add an unstable annotation to the CalendarInterval class. People already use
it as UDF input, so it's better to make it clear that it's unstable.
2. Add a schema checker to prohibit creating v2 custom catalog tables with
intervals, the same as what we do for the built-in catalog.
3. Add a schema checker for DataFrameWriterV2 too.
4. Make the interval type incomparable, as in version 2.4, to avoid ambiguity in
comparisons between year-month and day-time fields.
5. The newly added to_csv in 3.0 should either not support outputting intervals, the
same as the CSV file format, or fully support them as normal strings.
6. The function to_json should either not allow using interval as a key field, the
same as the value field and the JSON data source (with a legacy config to
restore the old behavior), or fully support it as normal strings.
7. Revert the interval ISO/ANSI SQL Standard output, since we decided not to
follow ANSI, so there is no round trip.


> Make interval type more future-proof
> 
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> We've decided not to follow the SQL standard when defining the interval type in 
> 3.0. We should try our best to hide intervals from data sources/external 
> catalogs as much as possible, so as not to leak internals to external systems.
> In Spark 2.4, intervals are exposed in the following places:
> 1. The `CalendarIntervalType` is public.
> 2. `Column.cast` accepts `CalendarIntervalType` and can cast string to 
> interval.
> 3. `DataFrame.collect` can return `CalendarInterval` objects.
> 4. UDFs can take `CalendarInterval` as input.
> 5. Data sources can return `InternalRow` directly, which may contain 
> `CalendarInterval`.
> In Spark 3.0, we don't want to break Spark 2.4 applications, but we should 
> not expose intervals more widely than in 2.4. In general, we should avoid leaking 
> intervals to DS v2 and catalog plugins. We should also revert some 
> PostgreSQL-specific interval features.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30546) Make interval type more future-proof

2020-01-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30546:

Summary: Make interval type more future-proof  (was: Make interval type 
more future-proofing)

> Make interval type more future-proof
> 
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Before 3.0 we may make some efforts to make the current interval type
> more future-proof, e.g.:
> 1. Add an unstable annotation to the CalendarInterval class. People already use
> it as UDF input, so it's better to make it clear that it's unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the built-in catalog.
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, to avoid ambiguity in
> comparisons between year-month and day-time fields.
> 5. The newly added to_csv in 3.0 should either not support outputting intervals, the
> same as the CSV file format, or fully support them as normal strings.
> 6. The function to_json should either not allow using interval as a key field, the
> same as the value field and the JSON data source (with a legacy config to
> restore the old behavior), or fully support it as normal strings.
> 7. Revert the interval ISO/ANSI SQL Standard output, since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext

2020-01-23 Thread Ajith S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021937#comment-17021937
 ] 

Ajith S commented on SPARK-30556:
-

Yes, it exists in lower versions like 2.3.x too.

> Copy sparkContext.localproperties to child thread 
> inSubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing jobs while thread pools have idle threads that are 
> reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated on a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they are 
> child threads.
> These threads are managed via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), so the 
> threads end up with old or no properties set. This causes taskset properties to 
> be missing when properties are transferred by the child thread via 
> {{sparkContext.runJob/submitJob}}.
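The root cause is a general JVM thread-pool behaviour; here is a minimal sketch (plain Scala, not Spark's own code, with illustrative names) of why a reused pool thread misses properties set after it was created:
{quote}// InheritableThreadLocal values are copied only when a thread is created, so a pooled
// thread created earlier keeps stale values when it is reused for a later task.
import java.util.concurrent.Executors

object LocalPropertyDemo {
  val prop = new InheritableThreadLocal[String] { override def initialValue() = "unset" }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)          // one reusable worker thread

    pool.submit(new Runnable {                          // worker thread is created here
      def run(): Unit = println(s"first task sees: ${prop.get}")    // "unset"
    }).get()

    prop.set("query-42")                                // caller sets a "local property"

    pool.submit(new Runnable {                          // reused worker: no re-inheritance
      def run(): Unit = println(s"second task sees: ${prop.get}")   // still "unset"
    }).get()

    pool.shutdown()
  }
}
{quote}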



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30556) Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext

2020-01-23 Thread Ajith S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021936#comment-17021936
 ] 

Ajith S commented on SPARK-30556:
-

Raised a backport PR for branch-2.4: [https://github.com/apache/spark/pull/27340]

> Copy sparkContext.localproperties to child thread 
> inSubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing jobs while thread pools have idle threads that are 
> reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated on a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they are 
> child threads.
> These threads are managed via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), so the 
> threads end up with old or no properties set. This causes taskset properties to 
> be missing when properties are transferred by the child thread via 
> {{sparkContext.runJob/submitJob}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-01-23 Thread Abhishek Rao (Jira)
Abhishek Rao created SPARK-30619:


 Summary: org.slf4j.Logger and org.apache.commons.collections 
classes not built as part of hadoop-provided profile
 Key: SPARK-30619
 URL: https://issues.apache.org/jira/browse/SPARK-30619
 Project: Spark
  Issue Type: Question
  Components: Build
Affects Versions: 2.4.4, 2.4.2
 Environment: Spark on kubernetes
Reporter: Abhishek Rao


We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
example (org.apache.spark.examples.JavaWordCount) on local files.

But we're seeing that it expects the org.slf4j.Logger and 
org.apache.commons.collections classes to be available in order to run.

We expected the binary to work as-is for local files. Is there anything 
we're missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable

2020-01-23 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021909#comment-17021909
 ] 

Yuanjian Li commented on SPARK-29175:
-

Thanks for the review; the config name is changed in the follow-up: 
[https://github.com/apache/spark/pull/27339].

> Make maven central repository in IsolatedClientLoader configurable
> --
>
> Key: SPARK-29175
> URL: https://issues.apache.org/jira/browse/SPARK-29175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> We need to connect to a central repository in IsolatedClientLoader for 
> downloading Hive jars. Here we added a new config, 
> `spark.sql.additionalRemoteRepositories`, a comma-delimited string of 
> optional additional remote Maven mirror repositories; these can be used as 
> additional remote repositories alongside the default Maven central repo.
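For illustration, a sketch of setting the new config at session build time; the key is the one quoted in the description above (the follow-up PR mentioned in the comment may rename it), and the repository URLs are placeholders:
{quote}// Sketch only: config key taken from the ticket description; the second URL is a
// placeholder mirror, not a real repository.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-client-repo-demo")
  .config("spark.sql.additionalRemoteRepositories",
    "https://repo1.maven.org/maven2/,https://mirror.example.com/maven2/")
  .enableHiveSupport()
  .getOrCreate()
{quote}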



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-23 Thread weiwenda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021899#comment-17021899
 ] 

weiwenda commented on SPARK-30617:
--

One possible solution is proposed at [https://github.com/apache/spark/pull/27338]

> Is there any possible that spark no longer restrict enumerate types of 
> spark.sql.catalogImplementation
> --
>
> Key: SPARK-30617
> URL: https://issues.apache.org/jira/browse/SPARK-30617
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: weiwenda
>Priority: Minor
> Fix For: 3.1.0, 2.4.6
>
>
> # We have implemented a complex ExternalCatalog which is used for retrieving 
> metadata from multiple heterogeneous databases (such as Elasticsearch and 
> PostgreSQL), so that we can run mixed queries over Hive and our online data.
>  # But as Spark requires that the value of spark.sql.catalogImplementation be 
> one of in-memory/hive, we have to modify SparkSession and rebuild Spark to 
> make our project work.
>  # Finally, we hope Spark removes the above restriction, so that it will be 
> much easier for us to keep pace with new Spark versions. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30618) Why does SparkSQL allow `WHERE` to be table alias?

2020-01-23 Thread Chunjun Xiao (Jira)
Chunjun Xiao created SPARK-30618:


 Summary: Why does SparkSQL allow `WHERE` to be table alias?
 Key: SPARK-30618
 URL: https://issues.apache.org/jira/browse/SPARK-30618
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.4
Reporter: Chunjun Xiao


A `WHERE` keyword with no expression is valid in Spark SQL, as in: `SELECT * FROM XXX 
WHERE`. Here `WHERE` is parsed as a table alias.

I think this surprises most SQL users, as this is an invalid statement in SQL 
engines like MySQL.

I checked the source code and found that many keywords (reserved in most SQL 
systems) are treated as `nonReserved` and allowed to be table aliases.

Could anyone please explain the rationale behind this decision?
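A hedged spark-shell style sketch of the parse described above; the behaviour is assumed from the description, and the output is not shown:
{quote}// Spark-shell sketch; "WHERE" is parsed as a table alias rather than starting a
// filter clause, so the statement is accepted and returns the table unfiltered.
val df = Seq((1, "x"), (2, "y")).toDF("id", "v")
df.createOrReplaceTempView("t")

spark.sql("SELECT * FROM t WHERE").show()
{quote}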



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30617) Is there any possible that spark no longer restrict enumerate types of spark.sql.catalogImplementation

2020-01-23 Thread weiwenda (Jira)
weiwenda created SPARK-30617:


 Summary: Is there any possible that spark no longer restrict 
enumerate types of spark.sql.catalogImplementation
 Key: SPARK-30617
 URL: https://issues.apache.org/jira/browse/SPARK-30617
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: weiwenda
 Fix For: 3.1.0, 2.4.6


# We have implemented a complex ExternalCatalog which is used for retrieving 
metadata from multiple heterogeneous databases (such as Elasticsearch and PostgreSQL), 
so that we can run mixed queries over Hive and our online data.
 # But as Spark requires that the value of spark.sql.catalogImplementation be 
one of in-memory/hive, we have to modify SparkSession and rebuild Spark to make 
our project work.
 # Finally, we hope Spark removes the above restriction, so that it will be much 
easier for us to keep pace with new Spark versions. Thanks!
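For illustration, a hypothetical sketch of what the request would look like; the "custom" value and the config key naming the catalog class are made-up placeholders, and today SparkSession's validation accepts only "hive" or "in-memory", so this is rejected before any custom catalog could be loaded:
{quote}// Hypothetical sketch only: "custom" and spark.sql.customExternalCatalog are placeholders,
// not real Spark options. The catalogImplementation value is currently validated against
// the fixed set {hive, in-memory}, so this fails at session creation.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.catalogImplementation", "custom")
  .config("spark.sql.customExternalCatalog", "com.example.MixedExternalCatalog")
  .getOrCreate()
{quote}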



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30543) RandomForest add Param bootstrap to control sampling method

2020-01-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30543.
--
Resolution: Resolved

> RandomForest add Param bootstrap to control sampling method
> ---
>
> Key: SPARK-30543
> URL: https://issues.apache.org/jira/browse/SPARK-30543
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Currently, RF with numTrees=1 will directly build a tree using the original 
> dataset,
> while with numTrees>1 it will use bootstrap samples to build trees.
> This design exists to train a DecisionTreeModel via the RandomForest 
> implementation; however, it is somewhat strange.
> In scikit-learn, there is a param bootstrap to control whether bootstrap samples are 
> used.
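A short sketch of the proposed API; the setter name follows the scikit-learn-style param named in the description and is an assumption about the eventual Spark ML surface:
{quote}// Sketch of the proposed param; setBootstrap is assumed from the ticket, not quoted
// from released code.
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(10)
  .setBootstrap(false)   // build every tree on the original dataset, no resampling
{quote}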



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30543) RandomForest add Param bootstrap to control sampling method

2020-01-23 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30543:


Assignee: zhengruifeng

> RandomForest add Param bootstrap to control sampling method
> ---
>
> Key: SPARK-30543
> URL: https://issues.apache.org/jira/browse/SPARK-30543
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Currently, RF with numTrees=1 will directly build a tree using the original 
> dataset,
> while with numTrees>1 it will use bootstrap samples to build trees.
> This design exists to train a DecisionTreeModel via the RandomForest 
> implementation; however, it is somewhat strange.
> In scikit-learn, there is a param bootstrap to control whether bootstrap samples are 
> used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org