[jira] [Resolved] (SPARK-23736) High-order function: concat(array1, array2, ..., arrayN) → array

2018-04-19 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23736.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20858
[https://github.com/apache/spark/pull/20858]

> High-order function: concat(array1, array2, ..., arrayN) → array
> 
>
> Key: SPARK-23736
> URL: https://issues.apache.org/jira/browse/SPARK-23736
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Extend the _concat_ function to also support array columns.
> Example:
> {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 
> 20, 30, 100, 200]}}
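
For reference, a minimal usage sketch of the array form of concat (Spark 2.4+), assuming a 
SparkSession named spark is available as in spark-shell; the SQL form below is illustrative only:

{code}
// Hedged sketch of the array form of concat described above (Spark 2.4+).
// Assumes a SparkSession named `spark`, as in spark-shell.
spark.sql("SELECT concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) AS c")
  .show(truncate = false)
// Expected: [1, 2, 3, 10, 20, 30, 100, 200]
{code}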






[jira] [Assigned] (SPARK-23736) High-order function: concat(array1, array2, ..., arrayN) → array

2018-04-19 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23736:
-

Assignee: Marek Novotny

> High-order function: concat(array1, array2, ..., arrayN) → array
> 
>
> Key: SPARK-23736
> URL: https://issues.apache.org/jira/browse/SPARK-23736
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
>
> Extend the _concat_ function to also support array columns.
> Example:
> {{concat(array(1, 2, 3), array(10, 20, 30), array(100, 200)) => [1, 2, 3, 10, 
> 20, 30, 100, 200]}}






[jira] [Commented] (SPARK-22448) Add functions like Mode(), NumNulls(), etc. in Summarizer

2018-04-19 Thread Dedunu Dhananjaya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445355#comment-16445355
 ] 

Dedunu Dhananjaya commented on SPARK-22448:
---

I'm thinking about implementing this. Is it okay to start working on this?

> Add functions like Mode(), NumNulls(), etc. in Summarizer
> -
>
> Key: SPARK-22448
> URL: https://issues.apache.org/jira/browse/SPARK-22448
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Abdeali Kothari
>Priority: Trivial
>
> It would be very useful to have a MODE() function among the summary statistics 
> currently supported by Datasets.
> I can see that the Summarizer has many useful functions in 2.3.0, and it would 
> be useful to add the following to it:
>  - Mode - the element that occurs the maximum number of times
>  - CSS - cumulative sum of squares, i.e. Sum((x - mean)^2)
>  - NumNull - the number of values that are NULL in the column
>  - SUM - just the sum of the column
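
Until such functions exist in Summarizer, a hedged workaround sketch with plain DataFrame 
aggregations (assumes a SparkSession named spark, as in spark-shell; the column "x" and the 
synthetic data are illustrative):

{code}
// Workaround sketch only, not the proposed Summarizer API.
import org.apache.spark.sql.functions._

val df = spark.range(100).select((col("id") % 7).cast("double").as("x"))

// NumNull and SUM
df.agg(sum(when(col("x").isNull, 1).otherwise(0)).as("numNulls"),
       sum("x").as("sum")).show()

// CSS = Sum((x - mean)^2), computed in two passes
val mean = df.agg(avg("x")).first().getDouble(0)
df.agg(sum(pow(col("x") - mean, 2.0)).as("css")).show()

// Mode: the most frequent value
df.groupBy("x").count().orderBy(desc("count")).limit(1).show()
{code}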






[jira] [Commented] (SPARK-23877) Metadata-only queries do not push down filter conditions

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445302#comment-16445302
 ] 

Apache Spark commented on SPARK-23877:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2

> Metadata-only queries do not push down filter conditions
> 
>
> Key: SPARK-23877
> URL: https://issues.apache.org/jira/browse/SPARK-23877
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Metadata-only queries currently load all partitions in a table instead of 
> passing filter conditions to list only matching partitions.
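
For context, a hedged sketch of the kind of query the metadata-only optimization targets; 
the table name, partition column, and values are illustrative, and a SparkSession named 
spark is assumed:

{code}
// Illustrative only. With the metadata-only optimization enabled, a query that
// touches only partition columns can be answered from the catalog; this issue is
// about pushing the WHERE clause into the partition listing itself.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, dt STRING) USING parquet PARTITIONED BY (dt)")

spark.sql("SELECT DISTINCT dt FROM events WHERE dt >= '2018-04-01'").explain()
{code}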






[jira] [Resolved] (SPARK-23877) Metadata-only queries do not push down filter conditions

2018-04-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23877.
-
Resolution: Fixed

Issue resolved by pull request 20988
[https://github.com/apache/spark/pull/20988]

> Metadata-only queries do not push down filter conditions
> 
>
> Key: SPARK-23877
> URL: https://issues.apache.org/jira/browse/SPARK-23877
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Metadata-only queries currently load all partitions in a table instead of 
> passing filter conditions to list only matching partitions.






[jira] [Assigned] (SPARK-23877) Metadata-only queries do not push down filter conditions

2018-04-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23877:
---

Assignee: Ryan Blue

> Metadata-only queries do not push down filter conditions
> 
>
> Key: SPARK-23877
> URL: https://issues.apache.org/jira/browse/SPARK-23877
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> Metadata-only queries currently load all partitions in a table instead of 
> passing filter conditions to list only matching partitions.






[jira] [Commented] (SPARK-24016) Yarn does not update node blacklist in static allocation

2018-04-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445252#comment-16445252
 ] 

Saisai Shao commented on SPARK-24016:
-

I think this can be useful if "spark.blacklist.killBlacklistedExecutors" is 
enabled: the NM could then avoid relaunching executors on the bad nodes.

> Yarn does not update node blacklist in static allocation
> 
>
> Key: SPARK-24016
> URL: https://issues.apache.org/jira/browse/SPARK-24016
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Task-based blacklisting keeps track of bad nodes and updates YARN with that 
> set of nodes so that Spark will not receive more containers on those nodes.  
> However, that only happens with dynamic allocation.  Though it's far more 
> important with dynamic allocation, this matters even with static allocation: 
> if executors die, or if the cluster was too busy at the original resource 
> request to grant all the containers, the Spark application will add new 
> containers mid-run, and we want an updated node blacklist for that.
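
For reference, a hedged sketch of the blacklisting settings this thread refers to, set 
programmatically; the values are illustrative and this is not the change proposed here:

{code}
// Sketch only: standard 2.3 blacklisting settings with static allocation.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
  .set("spark.dynamicAllocation.enabled", "false") // static allocation
{code}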






[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side

2018-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445104#comment-16445104
 ] 

Hyukjin Kwon commented on SPARK-24025:
--

It would be nicer if we took an action: if this is nearly a duplicate, or the fix 
is expected to be the same or done together, let's resolve it; otherwise, we 
could leave a link.

> Join of bucketed and non-bucketed tables can give two exchanges and sorts for 
> non-bucketed side
> ---
>
> Key: SPARK-24025
> URL: https://issues.apache.org/jira/browse/SPARK-24025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ./bin/spark-shell --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162
> Branch master
> Compiled by user sameera on 2018-02-22T19:24:29Z
> Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
> Url g...@github.com:sameeragarwal/spark.git
> Type --help for more information.{code}
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: join-jira.png
>
>
> While exploring bucketing I found the following join query of non-bucketed 
> and bucketed tables that ends up with two exchanges and two sorts in the 
> physical plan for the non-bucketed join side.
> {code}
> // Make sure that you don't end up with a BroadcastHashJoin and a 
> BroadcastExchange
> // Disable auto broadcasting
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val bucketedTableName = "bucketed_4_id"
> val large = spark.range(100)
> large.write
>   .bucketBy(4, "id")
>   .sortBy("id")
>   .mode("overwrite")
>   .saveAsTable(bucketedTableName)
> // Describe the table and include bucketing spec only
> val descSQL = sql(s"DESC FORMATTED 
> $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === 
> "Sort Columns")
> scala> descSQL.show(truncate = false)
> +--------------+---------+-------+
> |col_name      |data_type|comment|
> +--------------+---------+-------+
> |Num Buckets   |4        |       |
> |Bucket Columns|[`id`]   |       |
> |Sort Columns  |[`id`]   |       |
> +--------------+---------+-------+
> val bucketedTable = spark.table(bucketedTableName)
> val t1 = spark.range(4)
>   .repartition(2, $"id")  // Use just 2 partitions
>   .sortWithinPartitions("id") // sort partitions
> val q = t1.join(bucketedTable, "id")
> // Note two exchanges and sorts
> scala> q.explain
> == Physical Plan ==
> *(5) Project [id#79L]
> +- *(5) SortMergeJoin [id#79L], [id#77L], Inner
>:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#79L, 4)
>: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0
>:+- Exchange hashpartitioning(id#79L, 2)
>:   +- *(1) Range (0, 4, step=1, splits=8)
>+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0
>   +- *(4) Project [id#77L]
>  +- *(4) Filter isnotnull(id#77L)
> +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct
> q.foreach(_ => ())
> {code}
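
A hedged workaround sketch (not a fix): repartitioning the non-bucketed side to the bucket 
count up front may leave only a single exchange and sort on that side. The name t1matched is 
illustrative, and the snippet assumes the same session and table as above:

{code}
// Workaround sketch only: match the non-bucketed side to the table's bucket
// count (4) so the planner should not need a second hashpartitioning exchange.
val t1matched = spark.range(4)
  .repartition(4, $"id")          // 4 partitions, same as the bucket count
  .sortWithinPartitions("id")
t1matched.join(spark.table("bucketed_4_id"), "id").explain
{code}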






[jira] [Commented] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.

2018-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445103#comment-16445103
 ] 

Hyukjin Kwon commented on SPARK-24030:
--

It would be easier to debug and find the cause if there were a simple reproducer 
and, if possible, logs or screen captures from the UI.
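
A minimal reproducer of the kind being asked for might look like the sketch below (synthetic 
data and illustrative sizes; assumes a SparkSession named spark, as in spark-shell):

{code}
// Synthetic-data sketch only; the reporter's data comes from JDBC/Parquet.
import org.apache.spark.sql.functions.rand

val df = spark.range(1100000L).select((rand() * 1000).as("v"))
df.createOrReplaceTempView("t")

spark.time {
  spark.sql("SELECT percentile_approx(v, 0.5) FROM t").show()
}
{code}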

> SparkSQL percentile_approx function is too slow for over 1,060,000 records.
> ---
>
> Key: SPARK-24030
> URL: https://issues.apache.org/jira/browse/SPARK-24030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop.
>Reporter: Seok-Joon,Yun
>Priority: Major
>
> I used the percentile_approx function for over 1,060,000 records. It is too 
> slow: it takes about 90 minutes. When I tried with 1,040,000 records, it took 
> about 10 seconds.
> I tested with data read over JDBC and from Parquet; both take the same amount 
> of time.
> I wonder whether the function is designed for multiple workers.
> I looked at Ganglia and the Spark history server; it ran on a single worker.






[jira] [Created] (SPARK-24030) SparkSQL percentile_approx function is too slow for over 1,060,000 records.

2018-04-19 Thread Seok-Joon,Yun (JIRA)
Seok-Joon,Yun created SPARK-24030:
-

 Summary: SparkSQL percentile_approx function is too slow for over 
1,060,000 records.
 Key: SPARK-24030
 URL: https://issues.apache.org/jira/browse/SPARK-24030
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
 Environment: Zeppelin + Spark 2.2.1 on Amazon EMR and a local laptop.
Reporter: Seok-Joon,Yun


I used the percentile_approx function for over 1,060,000 records. It is too slow: 
it takes about 90 minutes. When I tried with 1,040,000 records, it took about 10 
seconds.

I tested with data read over JDBC and from Parquet; both take the same amount of 
time.

I wonder whether the function is designed for multiple workers.

I looked at Ganglia and the Spark history server; it ran on a single worker.






[jira] [Resolved] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-24028.

Resolution: Cannot Reproduce

Closing this for now - we're continuing to investigate this internally and our 
findings lead away from this hypothesis. Sorry for the extraneous ticket! I'll 
re-open this if we have a good reason to think this is actually the problem.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.
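
For illustration, a hedged sketch of what step 4 amounts to with the fabric8 client Spark 
uses; the namespace, pod, and config map names are placeholders, and this is not the actual 
submission code:

{code}
// Sketch only, modeled on the fabric8 kubernetes-client API; all names are placeholders.
import io.fabric8.kubernetes.api.model.OwnerReferenceBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

val client = new DefaultKubernetesClient() // uses the local kubeconfig

// In the real flow these would be the driver pod and config map just created.
val driverPod = client.pods().inNamespace("default").withName("spark-pi-driver").get()
val configMap = client.configMaps().inNamespace("default").withName("spark-pi-driver-conf-map").get()

val ownerRef = new OwnerReferenceBuilder()
  .withApiVersion(driverPod.getApiVersion)
  .withKind(driverPod.getKind)
  .withName(driverPod.getMetadata.getName)
  .withUid(driverPod.getMetadata.getUid)
  .withController(true)
  .build()

// Step 4: attach the owner reference so the config map is garbage-collected with the driver pod.
configMap.getMetadata.setOwnerReferences(java.util.Collections.singletonList(ownerRef))
client.resourceList(configMap).createOrReplace()
{code}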






[jira] [Assigned] (SPARK-24029) Set "reuse address" flag on listen sockets

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24029:


Assignee: (was: Apache Spark)

> Set "reuse address" flag on listen sockets
> --
>
> Key: SPARK-24029
> URL: https://issues.apache.org/jira/browse/SPARK-24029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Setting the "reuse address" option allows a server socket to be bound to a 
> port that may still have connections in the "close wait" state. Without it, 
> for example, re-starting the history server could result in a BindException.
> This is more important for things that have explicit ports, but doesn't hurt 
> also in other places, especially since Spark allows most servers to bind to a 
> port range by setting the root port + the retry count.
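
For reference, a minimal sketch of the flag in plain java.net terms (illustrative only; 
Spark's servers are Netty/Jetty based rather than raw ServerSocket, and the port is arbitrary):

{code}
// Minimal illustration of SO_REUSEADDR on a listen socket.
import java.net.{InetSocketAddress, ServerSocket}

val server = new ServerSocket()  // create unbound so the flag can be set before bind
server.setReuseAddress(true)     // allow binding while old connections linger
server.bind(new InetSocketAddress(18080))
{code}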






[jira] [Assigned] (SPARK-24029) Set "reuse address" flag on listen sockets

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24029:


Assignee: Apache Spark

> Set "reuse address" flag on listen sockets
> --
>
> Key: SPARK-24029
> URL: https://issues.apache.org/jira/browse/SPARK-24029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Setting the "reuse address" option allows a server socket to be bound to a 
> port that may still have connections in the "close wait" state. Without it, 
> for example, re-starting the history server could result in a BindException.
> This is more important for things that have explicit ports, but doesn't hurt 
> also in other places, especially since Spark allows most servers to bind to a 
> port range by setting the root port + the retry count.






[jira] [Commented] (SPARK-24029) Set "reuse address" flag on listen sockets

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445028#comment-16445028
 ] 

Apache Spark commented on SPARK-24029:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21110

> Set "reuse address" flag on listen sockets
> --
>
> Key: SPARK-24029
> URL: https://issues.apache.org/jira/browse/SPARK-24029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Setting the "reuse address" option allows a server socket to be bound to a 
> port that may still have connections in the "close wait" state. Without it, 
> for example, re-starting the history server could result in a BindException.
> This is more important for things that have explicit ports, but doesn't hurt 
> also in other places, especially since Spark allows most servers to bind to a 
> port range by setting the root port + the retry count.






[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side

2018-04-19 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444998#comment-16444998
 ] 

Jacek Laskowski commented on SPARK-24025:
-

The other issue seems similar.

> Join of bucketed and non-bucketed tables can give two exchanges and sorts for 
> non-bucketed side
> ---
>
> Key: SPARK-24025
> URL: https://issues.apache.org/jira/browse/SPARK-24025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ./bin/spark-shell --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162
> Branch master
> Compiled by user sameera on 2018-02-22T19:24:29Z
> Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
> Url g...@github.com:sameeragarwal/spark.git
> Type --help for more information.{code}
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: join-jira.png
>
>
> While exploring bucketing I found the following join query of non-bucketed 
> and bucketed tables that ends up with two exchanges and two sorts in the 
> physical plan for the non-bucketed join side.
> {code}
> // Make sure that you don't end up with a BroadcastHashJoin and a 
> BroadcastExchange
> // Disable auto broadcasting
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val bucketedTableName = "bucketed_4_id"
> val large = spark.range(100)
> large.write
>   .bucketBy(4, "id")
>   .sortBy("id")
>   .mode("overwrite")
>   .saveAsTable(bucketedTableName)
> // Describe the table and include bucketing spec only
> val descSQL = sql(s"DESC FORMATTED 
> $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === 
> "Sort Columns")
> scala> descSQL.show(truncate = false)
> +--------------+---------+-------+
> |col_name      |data_type|comment|
> +--------------+---------+-------+
> |Num Buckets   |4        |       |
> |Bucket Columns|[`id`]   |       |
> |Sort Columns  |[`id`]   |       |
> +--------------+---------+-------+
> val bucketedTable = spark.table(bucketedTableName)
> val t1 = spark.range(4)
>   .repartition(2, $"id")  // Use just 2 partitions
>   .sortWithinPartitions("id") // sort partitions
> val q = t1.join(bucketedTable, "id")
> // Note two exchanges and sorts
> scala> q.explain
> == Physical Plan ==
> *(5) Project [id#79L]
> +- *(5) SortMergeJoin [id#79L], [id#77L], Inner
>:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#79L, 4)
>: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0
>:+- Exchange hashpartitioning(id#79L, 2)
>:   +- *(1) Range (0, 4, step=1, splits=8)
>+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0
>   +- *(4) Project [id#77L]
>  +- *(4) Filter isnotnull(id#77L)
> +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct
> q.foreach(_ => ())
> {code}






[jira] [Created] (SPARK-24029) Set "reuse address" flag on listen sockets

2018-04-19 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24029:
--

 Summary: Set "reuse address" flag on listen sockets
 Key: SPARK-24029
 URL: https://issues.apache.org/jira/browse/SPARK-24029
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


Setting the "reuse address" option allows a server socket to be bound to a port 
that may still have connections in the "close wait" state. Without it, for 
example, re-starting the history server could result in a BindException.

This is more important for things that have explicit ports, but doesn't hurt 
also in other places, especially since Spark allows most servers to bind to a 
port range by setting the root port + the retry count.






[jira] [Commented] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-04-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444930#comment-16444930
 ] 

Xiao Li commented on SPARK-24005:
-

cc [~maxgekk]

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function for this instead of using Scala parallel 
> collections.
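
One possible shape for such a utility, as a hedged sketch: Futures on an explicit thread pool 
that can be shut down, unlike .par. This mirrors the snippet above with illustrative sizes and 
is not the change the ticket ultimately makes:

{code}
// Sketch only: the work can be interrupted via shutdownNow().
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

val pool = Executors.newFixedThreadPool(8)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val futures = (1 to 100).map { _ =>
  Future { Thread.sleep(1000); 1 to 1000 }
}
try {
  val par = Await.result(Future.sequence(futures), 10.minutes).flatten
} finally {
  pool.shutdownNow() // interrupts any tasks still running
}
{code}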






[jira] [Comment Edited] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2018-04-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444909#comment-16444909
 ] 

Miao Wang edited comment on SPARK-23996 at 4/19/18 10:34 PM:
-

[~timhunter] There is a Java implementation: 
[https://github.com/DataSketches/sketches-core/tree/master/src/main/java/com/yahoo/sketches/kll]

I am reading the code now.

Is 
[QuantileSummaries.scala|https://github.com/apache/spark/pull/15002/files#diff-48cfea2315f9ae39c33801873c4dbf84]
 your implementation?

Thanks!

Miao


was (Author: wm624):
[~timhunter] There is a Java implementation: 
[https://github.com/DataSketches/sketches-core/tree/master/src/main/java/com/yahoo/sketches/kll]

I am reading the code now.

Can you point me your spark implementation?

Thanks!

Miao

> Implement the optimal KLL algorithms for quantiles in streams
> -
>
> Key: SPARK-23996
> URL: https://issues.apache.org/jira/browse/SPARK-23996
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current implementation for approximate quantiles - a variant of 
> Greenwald-Khanna, which I implemented - is not the best in light of recent 
> papers:
>  - it is not exactly the one from the paper for performance reasons, but the 
> changes are not documented beyond comments on the code
>  - there are now more optimal algorithms with proven bounds (unlike q-digest, 
> the other contender at the time)
> I propose that we revisit the current implementation and look at the 
> Karnin-Lang-Liberty algorithm (KLL) for example:
> [https://arxiv.org/abs/1603.05346]
> [https://edoliberty.github.io//papers/streamingQuantiles.pdf]
> This algorithm seems to have favorable characteristics for streaming and a 
> distributed implementation, and there is a python implementation for 
> reference.
> It is a fairly standalone piece, and in that respect available to people who 
> don't know too much about spark internals.
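
For context, a hedged sketch of the existing approximate-quantile entry point whose 
implementation the ticket proposes to revisit (synthetic data and an illustrative error 
bound; assumes a SparkSession named spark, as in spark-shell):

{code}
// Context sketch only: DataFrameStatFunctions.approxQuantile, backed by the
// current QuantileSummaries (Greenwald-Khanna variant) implementation.
val df = spark.range(100000).selectExpr("id * 1.0 AS x")
val Array(p25, p50, p75) =
  df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01) // relative error 0.01
{code}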






[jira] [Commented] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side

2018-04-19 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444911#comment-16444911
 ] 

Takeshi Yamamuro commented on SPARK-24025:
--

Is this the same as https://issues.apache.org/jira/browse/SPARK-17570?

> Join of bucketed and non-bucketed tables can give two exchanges and sorts for 
> non-bucketed side
> ---
>
> Key: SPARK-24025
> URL: https://issues.apache.org/jira/browse/SPARK-24025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ./bin/spark-shell --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162
> Branch master
> Compiled by user sameera on 2018-02-22T19:24:29Z
> Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
> Url g...@github.com:sameeragarwal/spark.git
> Type --help for more information.{code}
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: join-jira.png
>
>
> While exploring bucketing I found the following join query of non-bucketed 
> and bucketed tables that ends up with two exchanges and two sorts in the 
> physical plan for the non-bucketed join side.
> {code}
> // Make sure that you don't end up with a BroadcastHashJoin and a 
> BroadcastExchange
> // Disable auto broadcasting
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val bucketedTableName = "bucketed_4_id"
> val large = spark.range(100)
> large.write
>   .bucketBy(4, "id")
>   .sortBy("id")
>   .mode("overwrite")
>   .saveAsTable(bucketedTableName)
> // Describe the table and include bucketing spec only
> val descSQL = sql(s"DESC FORMATTED 
> $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === 
> "Sort Columns")
> scala> descSQL.show(truncate = false)
> +--------------+---------+-------+
> |col_name      |data_type|comment|
> +--------------+---------+-------+
> |Num Buckets   |4        |       |
> |Bucket Columns|[`id`]   |       |
> |Sort Columns  |[`id`]   |       |
> +--------------+---------+-------+
> val bucketedTable = spark.table(bucketedTableName)
> val t1 = spark.range(4)
>   .repartition(2, $"id")  // Use just 2 partitions
>   .sortWithinPartitions("id") // sort partitions
> val q = t1.join(bucketedTable, "id")
> // Note two exchanges and sorts
> scala> q.explain
> == Physical Plan ==
> *(5) Project [id#79L]
> +- *(5) SortMergeJoin [id#79L], [id#77L], Inner
>:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#79L, 4)
>: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0
>:+- Exchange hashpartitioning(id#79L, 2)
>:   +- *(1) Range (0, 4, step=1, splits=8)
>+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0
>   +- *(4) Project [id#77L]
>  +- *(4) Filter isnotnull(id#77L)
> +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct
> q.foreach(_ => ())
> {code}






[jira] [Commented] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2018-04-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444909#comment-16444909
 ] 

Miao Wang commented on SPARK-23996:
---

[~timhunter] There is a Java implementation: 
[https://github.com/DataSketches/sketches-core/tree/master/src/main/java/com/yahoo/sketches/kll]

I am reading the code now.

Can you point me your spark implementation?

Thanks!

Miao

> Implement the optimal KLL algorithms for quantiles in streams
> -
>
> Key: SPARK-23996
> URL: https://issues.apache.org/jira/browse/SPARK-23996
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current implementation for approximate quantiles - a variant of 
> Greenwald-Khanna, which I implemented - is not the best in light of recent 
> papers:
>  - it is not exactly the one from the paper for performance reasons, but the 
> changes are not documented beyond comments on the code
>  - there are now more optimal algorithms with proven bounds (unlike q-digest, 
> the other contender at the time)
> I propose that we revisit the current implementation and look at the 
> Karnin-Lang-Liberty algorithm (KLL) for example:
> [https://arxiv.org/abs/1603.05346]
> [https://edoliberty.github.io//papers/streamingQuantiles.pdf]
> This algorithm seems to have favorable characteristics for streaming and a 
> distributed implementation, and there is a python implementation for 
> reference.
> It is a fairly standalone piece, and in that respect available to people who 
> don't know too much about spark internals.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444902#comment-16444902
 ] 

Anirudh Ramanathan commented on SPARK-24028:


My suspicion here is that this has to do with timing. An easy way to check may 
be to add a sleep() of a few seconds during driver pod startup and seeing if 
the issue resolves itself. Looks like there may have been a race condition with 
the storage mounting logic in the past, but if you're seeing this fresh in 
1.9.4, that is something we should file a bug about in upstream. 

All the recent runs of 
https://k8s-testgrid.appspot.com/sig-big-data#spark-periodic-latest-gke on 
v1.9.6 have been green. Any ideas on how we can reproduce this?

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Comment Edited] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444902#comment-16444902
 ] 

Anirudh Ramanathan edited comment on SPARK-24028 at 4/19/18 10:22 PM:
--

My suspicion here is that this has to do with timing. An easy way to check may 
be to add a sleep() of a few seconds during driver pod startup and seeing if 
the issue resolves itself. Looks like there may have been a race condition with 
the storage mounting logic in the past, but if you're seeing this fresh in 
1.9.4, that is something we should file a bug about in upstream Kubernetes. 

All the recent runs of 
https://k8s-testgrid.appspot.com/sig-big-data#spark-periodic-latest-gke on 
v1.9.6 have been green. Any ideas on how we can reproduce this?


was (Author: foxish):
My suspicion here is that this has to do with timing. An easy way to check may 
be to add a sleep() of a few seconds during driver pod startup and seeing if 
the issue resolves itself. Looks like there may have been a race condition with 
the storage mounting logic in the past, but if you're seeing this fresh in 
1.9.4, that is something we should file a bug about in upstream. 

All the recent runs of 
https://k8s-testgrid.appspot.com/sig-big-data#spark-periodic-latest-gke on 
v1.9.6 have been green. Any ideas on how we can reproduce this?

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444899#comment-16444899
 ] 

Yinan Li commented on SPARK-24028:
--

2.3.0 does create a configmap for the init-container if one is used. See 
[https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverInitContainerBootstrapStep.scala#L54.]
 The content of this configmap is used when the init-container starts.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444898#comment-16444898
 ] 

Matt Cheah commented on SPARK-24028:


[~liyinan926] just noticed your comment edit - agreed that the init-container 
would use a config map too and would hit similar issues in theory.

But also, do init-containers have different volume mount guarantees than main 
containers? I.e. do init-containers wait while the main container tries to run 
without the volume?

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444894#comment-16444894
 ] 

Matt Cheah commented on SPARK-24028:


I don't think the 2.3.0 release creates any volume mounts except in the 
credentials step: 
[https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverKubernetesCredentialsStep.scala#L85]

Whereas now we create volumes and immediately depend on them.

We made a minor adjustment to our spark-submit implementation to include the 
small-files mounting step that was in the prototype but not yet in mainline: 
[https://github.com/palantir/spark/pull/358/files#diff-4aff876dd463610b94530f92b0cf600aR31]
 - these files are the ones that we're missing.

But the config map for the properties file is concerning enough.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Comment Edited] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444890#comment-16444890
 ] 

Yinan Li edited comment on SPARK-24028 at 4/19/18 10:14 PM:


I run a 1.9.6 cluster. No, I was using the 2.3.0 release. The configmap I was 
referring to was for the init-container.


was (Author: liyinan926):
I run a 1.9.6 cluster. No, I was using the 2.3.0 release.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444890#comment-16444890
 ] 

Yinan Li commented on SPARK-24028:
--

I run a 1.9.6 cluster. No, I was using the 2.3.0 release.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.






[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444887#comment-16444887
 ] 

Matt Cheah commented on SPARK-24028:


[~liyinan926] what specific point version of Kubernetes are you running? And 
are you using the tip of master for spark-submit where we now use the 
properties file?

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4. cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in their code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret.]
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444881#comment-16444881
 ] 

Anirudh Ramanathan edited comment on SPARK-24028 at 4/19/18 10:09 PM:
--

This is unexpected. Is this a recent change? You mention a 1.9.4 cluster has 
this issue. 
Like Yinan said, it doesn't sound like the right behavior if an empty file is 
found instead of what you expect - the expectation is that the secret/configmap 
will exist and be mounted, or not exist and cause the pod to go pending and 
retry until mounting succeeds.


was (Author: foxish):
This is curious. Is this a recent change? You mention a 1.9.4 cluster has this 
issue. 
Like Yinan said, it doesn't sound like the right behavior if an empty file is 
found instead of what you expect - the expectation is that the secret/configmap 
will exist and be mounted, or not exist and cause the pod to go pending and 
retry until mounting succeeds.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4 cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in its code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret].
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24022.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21105
[https://github.com/apache/spark/pull/21105]

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> A flaky pattern found and fixed in SPARK-23775 but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444881#comment-16444881
 ] 

Anirudh Ramanathan commented on SPARK-24028:


This is curious. Is this a recent change? You mention a 1.9.4 cluster has this 
issue. 
Like Yinan said, it doesn't sound like the right behavior if an empty file is 
found instead of what you expect - the expectation is that the secret/configmap 
will exist and be mounted, or not exist and cause the pod to go pending and 
retry until mounting succeeds.

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4 cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in its code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret].
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24022:
--

Assignee: Gabor Somogyi

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> A flaky pattern found and fixed in SPARK-23775 but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444856#comment-16444856
 ] 

Yinan Li commented on SPARK-24028:
--

I am also running a 1.9 cluster on GKE and I have never run into the issue you 
mentioned above. I do often see events on the driver pod showing that the 
configmap failed to mount, but eventually the retries succeeded. I believe a 
pod won't start running if any of the specified volumes (be it a secret 
volume, a configmap volume, or something else) fails to mount, and Kubernetes 
also retries mounting volumes that it failed to mount when the pod first 
started. 

> [K8s] Creating secrets and config maps before creating the driver pod has 
> unpredictable behavior
> 
>
> Key: SPARK-24028
> URL: https://issues.apache.org/jira/browse/SPARK-24028
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Critical
>
> Currently we create the Kubernetes resources the driver depends on - such as 
> the properties config map and secrets to mount into the pod - only after we 
> create the driver pod. This is because we want these extra objects to 
> immediately have an owner reference to be tied to the driver pod.
> On our Kubernetes 1.9.4 cluster, we're seeing that sometimes this works 
> fine, but other times the driver ends up being started with empty volumes 
> instead of volumes with the contents of the secrets we expect. The result is 
> that sometimes the driver will start without these files mounted, which leads 
> to various failures if the driver requires these files to be present early on 
> in its code. Missing the properties file config map, for example, would 
> mean spark-submit doesn't have a properties file to read at all. See the 
> warning on [https://kubernetes.io/docs/concepts/storage/volumes/#secret].
> Unfortunately we cannot link owner references to non-existent objects, so we 
> have to do this instead:
>  # Create the auxiliary resources without any owner references.
>  # Create the driver pod mounting these resources into volumes, as before.
>  # If #2 fails, clean up the resources created in #1.
>  # Edit the auxiliary resources to have an owner reference for the driver pod.
> The multi-step approach leaves a small chance for us to leak resources - for 
> example, if we fail to make the resource edits in #4 for some reason. This 
> also changes the permissioning mode required for spark-submit - credentials 
> provided to spark-submit need to be able to edit resources in addition to 
> creating them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24028) [K8s] Creating secrets and config maps before creating the driver pod has unpredictable behavior

2018-04-19 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-24028:
--

 Summary: [K8s] Creating secrets and config maps before creating 
the driver pod has unpredictable behavior
 Key: SPARK-24028
 URL: https://issues.apache.org/jira/browse/SPARK-24028
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Matt Cheah


Currently we create the Kubernetes resources the driver depends on - such as 
the properties config map and secrets to mount into the pod - only after we 
create the driver pod. This is because we want these extra objects to 
immediately have an owner reference to be tied to the driver pod.

On our Kubernetes 1.9.4 cluster, we're seeing that sometimes this works fine, 
but other times the driver ends up being started with empty volumes instead of 
volumes with the contents of the secrets we expect. The result is that 
sometimes the driver will start without these files mounted, which leads to 
various failures if the driver requires these files to be present early on in 
its code. Missing the properties file config map, for example, would mean 
spark-submit doesn't have a properties file to read at all. See the warning on 
[https://kubernetes.io/docs/concepts/storage/volumes/#secret].

Unfortunately we cannot link owner references to non-existent objects, so we 
have to do this instead:
 # Create the auxiliary resources without any owner references.
 # Create the driver pod mounting these resources into volumes, as before.
 # If #2 fails, clean up the resources created in #1.
 # Edit the auxiliary resources to have an owner reference for the driver pod.

The multi-step approach leaves a small chance for us to leak resources - for 
example, if we fail to make the resource edits in #4 for some reason. This also 
changes the permissioning mode required for spark-submit - credentials provided 
to spark-submit need to be able to edit resources in addition to creating them.
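
For illustration, the create-then-adopt sequence above can be sketched with the 
fabric8 Kubernetes client that spark-submit already builds on. This is only a 
minimal sketch of the general pattern, not the actual submission-client code; 
the namespace, resource names, and ConfigMap contents are placeholders.

{code:java}
import io.fabric8.kubernetes.api.model.{ConfigMapBuilder, OwnerReferenceBuilder, PodBuilder}
import io.fabric8.kubernetes.client.DefaultKubernetesClient

val client = new DefaultKubernetesClient()
val ns = "default"  // placeholder namespace

// Step 1: create the auxiliary resource without any owner reference.
val configMap = client.configMaps().inNamespace(ns).create(
  new ConfigMapBuilder()
    .withNewMetadata().withName("spark-driver-conf").endMetadata()
    .addToData("spark.properties", "spark.app.name=example")
    .build())

// Step 2: create the driver pod that mounts the resource (volume mounts omitted).
val driverPod = client.pods().inNamespace(ns).create(
  new PodBuilder()
    .withNewMetadata().withName("spark-driver").endMetadata()
    .withNewSpec()
      .addNewContainer().withName("driver").withImage("spark:2.3.0").endContainer()
    .endSpec()
    .build())

// Step 3 (not shown): if pod creation fails, delete the config map again.

// Step 4: add an owner reference so the config map is garbage-collected
// together with the driver pod.
val ownerRef = new OwnerReferenceBuilder()
  .withApiVersion("v1")
  .withKind("Pod")
  .withName(driverPod.getMetadata.getName)
  .withUid(driverPod.getMetadata.getUid)
  .build()

client.configMaps().inNamespace(ns).createOrReplace(
  new ConfigMapBuilder(configMap)
    .editMetadata().addToOwnerReferences(ownerRef).endMetadata()
    .build())
{code}

The edit in step 4 is what requires the additional edit permission mentioned above.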



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24020) Sort-merge join inner range optimization

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24020:


Assignee: Apache Spark

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Assignee: Apache Spark
>Priority: Major
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  
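
For reference, a minimal Spark SQL sketch of the query shape this optimization 
targets; the table contents, the band width of 10, and the configuration key are 
illustrative only (the key is the tentative name from the proposal, not a 
released setting).

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("smj-inner-range-sketch").getOrCreate()

// Toy inputs: both relations share the equi-join key X and are sorted by Y.
spark.range(0, 1000000).selectExpr("id % 100 AS X", "id AS Y").createOrReplaceTempView("t1")
spark.range(0, 1000000).selectExpr("id % 100 AS X", "id AS Y").createOrReplaceTempView("t2")

// Equi-join on X plus a range ("band") condition on Y - the pattern that the
// proposed sort-merge inner range join would serve with a moving window
// instead of a full cross-join per X value.
val joined = spark.sql(
  """SELECT t1.X, t1.Y AS Y1, t2.Y AS Y2
    |FROM t1 JOIN t2
    |  ON t1.X = t2.X
    | AND t1.Y BETWEEN t2.Y - 10 AND t2.Y + 10""".stripMargin)

joined.explain()

// Tentative switch from the proposal (name not final):
// spark.conf.set("spark.sql.join.smj.useInnerRangeOptimization", "false")
{code}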



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24020) Sort-merge join inner range optimization

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24020:


Assignee: (was: Apache Spark)

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Priority: Major
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444814#comment-16444814
 ] 

Apache Spark commented on SPARK-24020:
--

User 'zecevicp' has created a pull request for this issue:
https://github.com/apache/spark/pull/21109

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Priority: Major
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24027) Support MapType(StringType, DataType) as root type by from_json

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24027:


Assignee: Apache Spark

> Support MapType(StringType, DataType) as root type by from_json
> ---
>
> Key: SPARK-24027
> URL: https://issues.apache.org/jira/browse/SPARK-24027
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, *MapType* is not supported by the *from_json* function as the root 
> type. For example, the following code doesn't work on Spark 2.3:
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val schema = MapType(StringType, IntegerType)
> schema: org.apache.spark.sql.types.MapType = 
> MapType(StringType,IntegerType,true)
> scala> val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
> in: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> in.select(from_json($"value", schema, Map[String, String]())).collect()
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'jsontostructs(`value`)' due to data type mismatch: Input schema 
> map<string,int> must be a struct or an array of structs.
> {code}
> Purpose of the ticket is to support _MapType with StringType as keys type_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24027) Support MapType(StringType, DataType) as root type by from_json

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24027:


Assignee: (was: Apache Spark)

> Support MapType(StringType, DataType) as root type by from_json
> ---
>
> Key: SPARK-24027
> URL: https://issues.apache.org/jira/browse/SPARK-24027
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, *MapType* is not supported by the *from_json* function as the root 
> type. For example, the following code doesn't work on Spark 2.3:
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val schema = MapType(StringType, IntegerType)
> schema: org.apache.spark.sql.types.MapType = 
> MapType(StringType,IntegerType,true)
> scala> val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
> in: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> in.select(from_json($"value", schema, Map[String, String]())).collect()
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'jsontostructs(`value`)' due to data type mismatch: Input schema 
> map<string,int> must be a struct or an array of structs.
> {code}
> Purpose of the ticket is to support _MapType with StringType as keys type_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24027) Support MapType(StringType, DataType) as root type by from_json

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444724#comment-16444724
 ] 

Apache Spark commented on SPARK-24027:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21108

> Support MapType(StringType, DataType) as root type by from_json
> ---
>
> Key: SPARK-24027
> URL: https://issues.apache.org/jira/browse/SPARK-24027
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, *MapType* is not supported by the *from_json* function as the root 
> type. For example, the following code doesn't work on Spark 2.3:
> {code}
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> val schema = MapType(StringType, IntegerType)
> schema: org.apache.spark.sql.types.MapType = 
> MapType(StringType,IntegerType,true)
> scala> val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
> in: org.apache.spark.sql.Dataset[String] = [value: string]
> scala> in.select(from_json($"value", schema, Map[String, String]())).collect()
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'jsontostructs(`value`)' due to data type mismatch: Input schema 
> map<string,int> must be a struct or an array of structs.
> {code}
> Purpose of the ticket is to support _MapType with StringType as keys type_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24027) Support MapType(StringType, DataType) as root type by from_json

2018-04-19 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24027:
--

 Summary: Support MapType(StringType, DataType) as root type by 
from_json
 Key: SPARK-24027
 URL: https://issues.apache.org/jira/browse/SPARK-24027
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Currently, *MapType* is not supported by the *from_json* function as the root 
type. For example, the following code doesn't work on Spark 2.3:
{code}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val schema = MapType(StringType, IntegerType)
schema: org.apache.spark.sql.types.MapType = 
MapType(StringType,IntegerType,true)

scala> val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()
in: org.apache.spark.sql.Dataset[String] = [value: string]

scala> in.select(from_json($"value", schema, Map[String, String]())).collect()
org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`value`)' 
due to data type mismatch: Input schema map<string,int> must be a struct or an 
array of structs.
{code}

Purpose of the ticket is to support _MapType with StringType as keys type_.
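
For illustration, the intended usage once the ticket is implemented might look 
as follows. This is only a sketch of the expected behavior and assumes the 
existing from_json signature is kept; the result described in the comment is 
the intended outcome, not output produced by Spark 2.3.

{code:java}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// As in the spark-shell session above, spark.implicits._ is assumed in scope.
val schema = MapType(StringType, IntegerType)
val in = Seq("""{"a": 1, "b": 2, "c": 3}""").toDS()

// Expected once MapType is accepted as a root type: a single map column
// containing a -> 1, b -> 2, c -> 3.
in.select(from_json($"value", schema).as("entries")).show(truncate = false)
{code}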



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19826) spark.ml Python API for PIC

2018-04-19 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444564#comment-16444564
 ] 

Huaxin Gao commented on SPARK-19826:


I am working on this and will have a PR soon. Thanks!

> spark.ml Python API for PIC
> ---
>
> Key: SPARK-19826
> URL: https://issues.apache.org/jira/browse/SPARK-19826
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-04-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1679#comment-1679
 ] 

Miao Wang commented on SPARK-19827:
---

[~felixcheung] The Scala code has just been merged. I can work on this now.

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22362) Add unit test for Window Aggregate Functions

2018-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-22362.
---
   Resolution: Fixed
 Assignee: Attila Zsolt Piros
Fix Version/s: 2.4.0

> Add unit test for Window Aggregate Functions
> 
>
> Key: SPARK-22362
> URL: https://issues.apache.org/jira/browse/SPARK-22362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> * Declarative
> * Imperative
> * UDAF
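
As a rough illustration of the UDAF category above, a user-defined aggregate 
applied over a window could be exercised like this; this is only a sketch 
against the public Spark 2.x UDAF API, not the test code being added.

{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction, Window}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Minimal UDAF: a sum over longs.
object LongSum extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = new StructType().add("v", LongType)
  override def bufferSchema: StructType = new StructType().add("sum", LongType)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getLong(0) + b2.getLong(0)
  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// Applying the UDAF over a window, next to which built-in declarative and
// imperative aggregates would be tested the same way (df, k, t, x are placeholders):
// df.withColumn("runningSum", LongSum(col("x")).over(
//   Window.partitionBy(col("k")).orderBy(col("t"))))
{code}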



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3

2018-04-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23340:

Fix Version/s: 2.3.1

> Upgrade Apache ORC to 1.4.3
> ---
>
> Key: SPARK-23340
> URL: https://issues.apache.org/jira/browse/SPARK-23340
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> This issue updates Apache ORC dependencies to 1.4.3 released on February 9th.
> The Apache ORC 1.4.2 release removes unnecessary dependencies, and 1.4.3 adds 5 
> more patches, including bug fixes (https://s.apache.org/Fll8).
> In particular, ORC-285 is fixed in 1.4.3.
> {code}
> scala> val df = Seq(Array.empty[Float]).toDF()
> scala> df.write.format("orc").save("/tmp/floatarray")
> scala> spark.read.orc("/tmp/floatarray")
> res1: org.apache.spark.sql.DataFrame = [value: array<float>]
> scala> spark.read.orc("/tmp/floatarray").show()
> 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.io.IOException: Error reading file: 
> file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc
>   at 
> org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191)
>   at 
> org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
> ...
> Caused by: java.io.EOFException: Read past EOF for compressed stream Stream 
> for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23990:
--
Shepherd: Joseph K. Bradley

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23990:
-

Assignee: Weichen Xu

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24026.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21090
[https://github.com/apache/spark/pull/21090]

> spark.ml Scala/Java API for PIC
> ---
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444358#comment-16444358
 ] 

Apache Spark commented on SPARK-24026:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21090

> spark.ml Scala/Java API for PIC
> ---
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24026:


Assignee: Miao Wang  (was: Apache Spark)

> spark.ml Scala/Java API for PIC
> ---
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24026:


Assignee: Apache Spark  (was: Miao Wang)

> spark.ml Scala/Java API for PIC
> ---
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24026:
-

 Summary: spark.ml Scala/Java API for PIC
 Key: SPARK-24026
 URL: https://issues.apache.org/jira/browse/SPARK-24026
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley
Assignee: Miao Wang


See parent JIRA



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24020) Sort-merge join inner range optimization

2018-04-19 Thread Petar Zecevic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444335#comment-16444335
 ] 

Petar Zecevic commented on SPARK-24020:
---

No, this implementation only applies to equi-joins that have range conditions 
on different columns. You can think of it as an equi-join with "sub-band" 
conditions. Hence the name we gave it ("sort-merge inner range join").

> Sort-merge join inner range optimization
> 
>
> Key: SPARK-24020
> URL: https://issues.apache.org/jira/browse/SPARK-24020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Petar Zecevic
>Priority: Major
>
> The problem we are solving is the case where you have two big tables 
> partitioned by X column, but also sorted by Y column (within partitions) and 
> you need to calculate an expensive function on the joined rows. During a 
> sort-merge join, Spark will do cross-joins of all rows that have the same X 
> values and calculate the function's value on all of them. If the two tables 
> have a large number of rows per X, this can result in a huge number of 
> calculations.
> We hereby propose an optimization that would allow you to reduce the number 
> of matching rows per X using a range condition on Y columns of the two 
> tables. Something like:
> ... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d
> The way SMJ is currently implemented, these extra conditions have no 
> influence on the number of rows (per X) being checked because these extra 
> conditions are put in the same block with the function being calculated.
> Here we propose a change to the sort-merge join so that, when these extra 
> conditions are specified, a queue is used instead of the 
> ExternalAppendOnlyUnsafeRowArray class. This queue would then be used as a 
> moving window across the values from the right relation as the left row 
> changes. You could call this a combination of an equi-join and a theta join 
> (we call it "sort-merge inner range join").
> Potential use-cases for this are joins based on spatial or temporal distance 
> calculations.
> The optimization should be triggered automatically when an equi-join 
> expression is present AND lower and upper range conditions on a secondary 
> column are specified. If the tables aren't sorted by both columns, 
> appropriate sorts should be added.
> To limit the impact of this change we also propose adding a new parameter 
> (tentatively named "spark.sql.join.smj.useInnerRangeOptimization") which 
> could be used to switch off the optimization entirely.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23976) UTF8String.concat() or ByteArray.concat() may allocate shorter structure.

2018-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23976.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

> UTF8String.concat() or ByteArray.concat() may allocate shorter structure.
> -
>
> Key: SPARK-23976
> URL: https://issues.apache.org/jira/browse/SPARK-23976
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> When the three inputs have `0x7FFF_FF00`, `0x7FFF_FF00`, and `0xE00`, the 
> current algorithm allocates the result structure with 0x1000 length due to 
> integer sum overflow.
> We should detect overflow.
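
A minimal sketch of the arithmetic in question (not the actual 
UTF8String/ByteArray code): summing the three lengths as 32-bit ints wraps 
around, so the check has to be performed in a wider type before allocating.

{code:java}
// Lengths from the report: 0x7FFFFF00 + 0x7FFFFF00 + 0xE00 does not fit in an Int.
val lengths = Seq(0x7FFFFF00, 0x7FFFFF00, 0xE00)

val naiveTotal = lengths.sum                   // Int arithmetic wraps to a small positive value
val safeTotal  = lengths.foldLeft(0L)(_ + _)   // 4294970368L, larger than Int.MaxValue

// Detecting the overflow before allocating the result buffer:
require(safeTotal <= Int.MaxValue, s"concat result too large: $safeTotal bytes")
{code}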



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten

2018-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23989.
---
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.4.0
   2.3.1

> When using `SortShuffleWriter`, the data will be overwritten
> 
>
> Key: SPARK-23989
> URL: https://issues.apache.org/jira/browse/SPARK-23989
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> When using `SortShuffleWriter`, we only insert 'AnyRef' into 
> 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'.
> For this function:
> override def write(records: Iterator[Product2[K, V]])
> the value of 'records' is `UnsafeRow`, so the value will be overwritten
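
The underlying pitfall can be shown in isolation: UnsafeRow instances handed 
out by an iterator are typically reused buffers, so storing the reference 
instead of a copy leaves every stored element pointing at the last row's bytes. 
The helper below is a hypothetical sketch, not Spark's writer code.

{code:java}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Hypothetical helper: buffer rows coming out of an iterator.
def bufferRows(records: Iterator[UnsafeRow]): Seq[UnsafeRow] = {
  val buf = new ArrayBuffer[UnsafeRow]()
  records.foreach { row =>
    // Without copy(), every element of `buf` would reference the same reused
    // backing buffer and end up holding only the final row's contents.
    buf += row.copy()
  }
  buf
}
{code}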



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444202#comment-16444202
 ] 

Barry Becker commented on SPARK-24019:
--

Lowering to minor because I found a way to specify the derivative window 
function without getting the above error. The main fix was to remove 
rangeBetween from the window spec.

Here is what I now use and it seems to give the result I am looking for without 
error:
{code:java}

val window = Window.partitionBy("category").orderBy("sequence_num")

// Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, 
Ylead).
// This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
// If the lead or lag points are null, then we fall back on using the middle 
point.
val yLead = coalesce(lead("value", 1).over(window), col("value"))
val yLag = coalesce(lag("value", 1).over(window), col("value"))
val xLead = coalesce(lead("sequence_num", 1).over(window), col("sequence_num"))
val xLag = coalesce(lag("sequence_num", 1).over(window), col("sequence_num"))
val derivative: Column = (yLead - yLag) / (xLead - xLag)

val resultDf = simpleDf.withColumn("derivative", derivative)

resultDf.show()
assertResult(strip("""1, b, 100.0, -30.0
  |2, b, 70.0, -20.0
  |3, b, 60.0, -10.0
  |1, a, 2.1, 0.2998
  |2, a, 2.4, 0.8
  |3, a, 3.7, 0.6001
  |4, a, 3.6, -0.10009""")
) {
  resultDf.collect().map(row => row.mkString(", ")).mkString("\n")
}{code}

> AnalysisException for Window function expression to compute derivative
> --
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
>Reporter: Barry Becker
>Priority: Minor
>
> I am using spark 2.1.1 currently.
> I created an expression to compute the derivative of some series data using a 
> window function.
> I have a simple reproducible case of the error.
> I'm only filing this bug because the error message says "Please file a bug 
> report with this error message, stack trace, and the query."
> Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 
> value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple 
> Window Specifications (ArrayBuffer(windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
> BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - 
> coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> derivative#14 has multiple Window Specifications 
> (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, 

[jira] [Comment Edited] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444202#comment-16444202
 ] 

Barry Becker edited comment on SPARK-24019 at 4/19/18 3:07 PM:
---

Lowering to minor because I found a way to specify the derivative window 
function without getting the above error. The main fix was to remove 
rangeBetween from the window spec.

Here is what I now use and it seems to give the result I am looking for without 
error:
{code:java}
val window = Window.partitionBy("category").orderBy("sequence_num")

// Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, 
Ylead).
// This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
// If the lead or lag points are null, then we fall back on using the middle 
point.
val yLead = coalesce(lead("value", 1).over(window), col("value"))
val yLag = coalesce(lag("value", 1).over(window), col("value"))
val xLead = coalesce(lead("sequence_num", 1).over(window), col("sequence_num"))
val xLag = coalesce(lag("sequence_num", 1).over(window), col("sequence_num"))
val derivative: Column = (yLead - yLag) / (xLead - xLag)

val resultDf = simpleDf.withColumn("derivative", derivative)

resultDf.show()
assertResult(strip("""1, b, 100.0, -30.0
  |2, b, 70.0, -20.0
  |3, b, 60.0, -10.0
  |1, a, 2.1, 0.2998
  |2, a, 2.4, 0.8
  |3, a, 3.7, 0.6001
  |4, a, 3.6, -0.10009""")
) {
  resultDf.collect().map(row => row.mkString(", ")).mkString("\n")
}{code}


was (Author: barrybecker4):
Lowering to minor because I found a way to specify the deriviative window 
function without getting the above error. The main fix was to remove 
rangeBetween from the window spec.

Here is what I now use and it seems to give the result I am looking for without 
error:
{code:java}

val window = Window.partitionBy("category").orderBy("sequence_num")

// Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, 
Ylead).
// This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag)
// If the lead or lag points are null, then we fall back on using the middle 
point.
val yLead = coalesce(lead("value", 1).over(window), col("value"))
val yLag = coalesce(lag("value", 1).over(window), col("value"))
val xLead = coalesce(lead("sequence_num", 1).over(window), col("sequence_num"))
val xLag = coalesce(lag("sequence_num", 1).over(window), col("sequence_num"))
val derivative: Column = (yLead - yLag) / (xLead - xLag)

val resultDf = simpleDf.withColumn("derivative", derivative)

resultDf.show()
assertResult(strip("""1, b, 100.0, -30.0
  |2, b, 70.0, -20.0
  |3, b, 60.0, -10.0
  |1, a, 2.1, 0.2998
  |2, a, 2.4, 0.8
  |3, a, 3.7, 0.6001
  |4, a, 3.6, -0.10009""")
) {
  resultDf.collect().map(row => row.mkString(", ")).mkString("\n")
}{code}

> AnalysisException for Window function expression to compute derivative
> --
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
>Reporter: Barry Becker
>Priority: Minor
>
> I am using spark 2.1.1 currently.
> I created an expression to compute the derivative of some series data using a 
> window function.
> I have a simple reproducible case of the error.
> I'm only filing this bug because the error message says "Please file a bug 
> report with this error message, stack trace, and the query."
> Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 
> value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple 
> Window Specifications (ArrayBuffer(windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
> BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug 

[jira] [Updated] (SPARK-24019) AnalysisException for Window function expression to compute derivative

2018-04-19 Thread Barry Becker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barry Becker updated SPARK-24019:
-
Priority: Minor  (was: Major)

> AnalysisException for Window function expression to compute derivative
> --
>
> Key: SPARK-24019
> URL: https://issues.apache.org/jira/browse/SPARK-24019
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Ubuntu, spark 2.1.1, standalone.
>Reporter: Barry Becker
>Priority: Minor
>
> I am using spark 2.1.1 currently.
> I created an expression to compute the derivative of some series data using a 
> window function.
> I have a simple reproducible case of the error.
> I'm only filing this bug because the error message says "Please file a bug 
> report with this error message, stack trace, and the query."
> Here they are:
> {code:java}
> ((coalesce(lead(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 
> value#9) - coalesce(lag(value#9, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> value#9)) / cast((coalesce(lead(sequence_num#7, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - coalesce(lag(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), sequence_num#7)) as double)) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, RANGE 
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS derivative#14 has multiple 
> Window Specifications (ArrayBuffer(windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT 
> ROW), windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS 
> BETWEEN 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> org.apache.spark.sql.AnalysisException: ((coalesce(lead(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), value#9) - coalesce(lag(value#9, 1, null) 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 PRECEDING AND 1 PRECEDING), value#9)) / cast((coalesce(lead(sequence_num#7, 
> 1, null) windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), sequence_num#7) - 
> coalesce(lag(sequence_num#7, 1, null) windowspecdefinition(category#8, 
> sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 
> sequence_num#7)) as double)) windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> derivative#14 has multiple Window Specifications 
> (ArrayBuffer(windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, 
> RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 
> windowspecdefinition(category#8, sequence_num#7 ASC NULLS FIRST, ROWS BETWEEN 
> 1 FOLLOWING AND 1 FOLLOWING), windowspecdefinition(category#8, sequence_num#7 
> ASC NULLS FIRST, ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))).
> Please file a bug report with this error message, stack trace, and the query.;
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$78.apply(Analyzer.scala:1772){code}
> And here is a simple unit test that can be used to reproduce the problem:
> {code:java}
> import com.mineset.spark.testsupport.SparkTestCase.SPARK_SESSION
> import org.apache.spark.sql.Column
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
> import org.scalatest.FunSuite
> import com.mineset.spark.testsupport.SparkTestCase._
> /**
> * Test to see that window functions work as expected on spark.
> * @author Barry Becker
> */
> class WindowFunctionSuite extends FunSuite {
>   val simpleDf = createSimpleData()
>
>   test("Window function for finding derivatives for 2 series") {
>     val window =
>       Window.partitionBy("category").orderBy("sequence_num") //.rangeBetween(-1, 1)
>     // Consider the three sequential series points (Xlag, Ylag), (X, Y), (Xlead, Ylead).
>     // This defines the derivative as (Ylead - Ylag) / (Xlead - Xlag).
>     // If the lead or lag points are null, then we fall back on using the middle
> 
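One way around the multiple-window-specification error, sketched below under assumed names (a DataFrame df with columns category, sequence_num and value): compute each lead/lag value as its own column so that every expression carries exactly one window specification, then derive the ratio from the plain columns. This is an illustration, not the reporter's code.

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical names: assumes a DataFrame `df` with columns category, sequence_num, value.
val w = Window.partitionBy("category").orderBy("sequence_num")

val withNeighbors = df
  .withColumn("value_lead", coalesce(lead("value", 1).over(w), col("value")))
  .withColumn("value_lag",  coalesce(lag("value", 1).over(w), col("value")))
  .withColumn("seq_lead",   coalesce(lead("sequence_num", 1).over(w), col("sequence_num")))
  .withColumn("seq_lag",    coalesce(lag("sequence_num", 1).over(w), col("sequence_num")))

// Each column above carries exactly one window specification, so the final expression
// is a plain arithmetic combination of already-materialized columns.
val withDerivative = withNeighbors.withColumn(
  "derivative",
  (col("value_lead") - col("value_lag")) /
    (col("seq_lead") - col("seq_lag")).cast("double"))
{code}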

[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444099#comment-16444099
 ] 

Apache Spark commented on SPARK-23711:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21106

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23711:


Assignee: (was: Apache Spark)

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23711:


Assignee: Apache Spark

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-24021.
--
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21104
[https://github.com/apache/spark/pull/21104]

> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.4.0, 2.3.1
>
>
> There's a mistake in BlacklistTracker's updateBlacklistForFetchFailure:
>  
> {code:java}
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec{code}
>  
> where first *exec* should be *host*.
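For reference, the corrected form implied by the description (keying the lookup by host instead of exec) would read:

{code:java}
val blacklistedExecsOnNode =
  nodeToBlacklistedExecs.getOrElseUpdate(host, HashSet[String]())
blacklistedExecsOnNode += exec
{code}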



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24021) Fix bug in BlacklistTracker's updateBlacklistForFetchFailure

2018-04-19 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-24021:


Assignee: wuyi

> Fix bug in BlacklistTracker's updateBlacklistForFetchFailure
> 
>
> Key: SPARK-24021
> URL: https://issues.apache.org/jira/browse/SPARK-24021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: easyfix
> Fix For: 2.3.1, 2.4.0
>
>
> There's a mistake in BlacklistTracker's updateBlacklistForFetchFailure:
>  
> {code:java}
> val blacklistedExecsOnNode =
>     nodeToBlacklistedExecs.getOrElseUpdate(exec, HashSet[String]())
> blacklistedExecsOnNode += exec{code}
>  
> where first *exec* should be *host*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21811) Inconsistency when finding the widest common type of a combination of DateType, StringType, and NumericType

2018-04-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21811.
--
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/21074

> Inconsistency when finding the widest common type of a combination of 
> DateType, StringType, and NumericType
> ---
>
> Key: SPARK-21811
> URL: https://issues.apache.org/jira/browse/SPARK-21811
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Bald
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.4.0
>
>
> Finding the widest common type for the arguments of a variadic function (such 
> as IN or COALESCE) when the types of the arguments are a combination of 
> DateType/TimestampType, StringType, and NumericType fails with an 
> AnalysisException for some orders of the arguments and succeeds with a common 
> type of StringType for other orders of the arguments.
> The below examples used to reproduce the error assume a schema of:
> {{[c1: date, c2: string, c3: int]}}
> The following succeeds:
> {{SELECT coalesce(c1, c2, c3) FROM table}}
> While the following produces an exception:
> {{SELECT coalesce(c1, c3, c2) FROM table}}
> The order of arguments affects the behavior because the widest common type appears to 
> be found by repeatedly looking at two arguments at a time: the widest common type found 
> so far and the next argument. My initial thought on a fix is that the way the widest 
> common type is found would have to be changed to look at all arguments first before 
> deciding what the widest common type should be.
> As my boss is out of office for the rest of the day I will give a pull request a shot, 
> but as I am not super familiar with Scala or Spark's coding style guidelines, a pull 
> request is not promised. Going forward with my attempted pull request, I will assume 
> that having DateType/TimestampType, StringType, and NumericType arguments in an IN 
> expression or COALESCE function (and any other function/expression where this 
> combination of argument types can occur) is valid. I also find it quite reasonable for 
> this combination of argument types to be invalid, so if that's what is 
> decided, then so be it.
> If I were a betting man, I'd say the fix would be made in the following file: 
> [TypeCoercion.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala]
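A toy sketch of why a pairwise left-to-right fold over types is order dependent; the widening rules below are simplified stand-ins for Spark's real coercion rules, not the actual TypeCoercion logic:

{code:java}
// Toy model only: simplified stand-in widening rules.
sealed trait T
case object DateT extends T
case object StringT extends T
case object IntT extends T

def widen(a: T, b: T): Option[T] = (a, b) match {
  case (x, y) if x == y                    => Some(x)
  case (DateT, StringT) | (StringT, DateT) => Some(StringT)
  case (IntT, StringT) | (StringT, IntT)   => Some(StringT)
  case _                                   => None // e.g. DateT vs IntT: no common type
}

def widestPairwise(types: Seq[T]): Option[T] =
  types.tail.foldLeft(Option(types.head))((acc, t) => acc.flatMap(widen(_, t)))

widestPairwise(Seq(DateT, StringT, IntT)) // Some(StringT): date+string -> string, string+int -> string
widestPairwise(Seq(DateT, IntT, StringT)) // None: date+int fails before string is ever considered
{code}

The two calls mirror the COALESCE examples above: the (date, string, int) order succeeds at every pairwise step, while (date, int, string) fails at the very first pair.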



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23964) why does Spillable wait for 32 elements?

2018-04-19 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444043#comment-16444043
 ] 

Thomas Graves commented on SPARK-23964:
---

So far in my testing I haven't seen any performance regressions. Doing the 
accounting to acquire more memory takes no time at all. Obviously, if you have a 
small heap and it can't acquire more memory, it will spill, but that is what you 
want so you don't OOM.

> why does Spillable wait for 32 elements?
> 
>
> Key: SPARK-23964
> URL: https://issues.apache.org/jira/browse/SPARK-23964
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> The Spillable class has a check in maybeSpill that controls when it tries to acquire 
> more memory and determine whether it should spill:
> if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
> before it looks to see if it should spill.
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala#L83]
> I'm wondering why it has the elementsRead % 32 in it. If I have a small number of 
> elements that are huge, this can easily cause an OOM before we actually spill.
> I saw a few conversations on this and one related JIRA: 
> https://issues.apache.org/jira/browse/SPARK-4456, but I've never seen an 
> answer to this.
> Does anyone have the history on this?
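For context, a simplified stand-in for the check being discussed (names follow the snippet above; the real Spillable class has additional bookkeeping, so this is only a sketch): more memory is only requested every 32nd element, so large records can accumulate between checks without any spill decision being made.

{code:java}
// Simplified stand-in, not the real Spillable.maybeSpill.
def maybeSpill(elementsRead: Long, currentMemory: Long,
               myMemoryThreshold: Long, acquireMemory: Long => Long): Boolean = {
  if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
    // Try to roughly double the tracked memory; spill if the request is not fully granted.
    val amountToRequest = 2 * currentMemory - myMemoryThreshold
    val granted = acquireMemory(amountToRequest)
    currentMemory >= myMemoryThreshold + granted
  } else {
    false // between checks, nothing is requested and nothing spills
  }
}
{code}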



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23924) High-order function: element_at

2018-04-19 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23924:
-

Assignee: Kazuaki Ishizaki

> High-order function: element_at
> ---
>
> Key: SPARK-23924
> URL: https://issues.apache.org/jira/browse/SPARK-23924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and 
> https://prestodb.io/docs/current/functions/map.html 
> * element_at(array, index) → E
> Returns element of array at given index. If index > 0, this function provides 
> the same functionality as the SQL-standard subscript operator ([]). If index 
> < 0, element_at accesses elements from the last to the first.
> * element_at(map, key) → V
> Returns value for given key, or NULL if the key is not contained in the map.
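A usage sketch based on the semantics above, assuming a SparkSession named spark on a build that ships this function (2.4.0 or later):

{code:java}
spark.sql("SELECT element_at(array(10, 20, 30), 2)").show()    // 20   (1-based index)
spark.sql("SELECT element_at(array(10, 20, 30), -1)").show()   // 30   (counts from the end)
spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 3)").show()  // null (key not in the map)
{code}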



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23584) Add interpreted execution to NewInstance expression

2018-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23584.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Add interpreted execution to NewInstance expression
> ---
>
> Key: SPARK-23584
> URL: https://issues.apache.org/jira/browse/SPARK-23584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23588) Add interpreted execution for CatalystToExternalMap expression

2018-04-19 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23588.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Add interpreted execution for CatalystToExternalMap expression
> --
>
> Key: SPARK-23588
> URL: https://issues.apache.org/jira/browse/SPARK-23588
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443959#comment-16443959
 ] 

Liang-Chi Hsieh commented on SPARK-23711:
-

Ok.

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23924) High-order function: element_at

2018-04-19 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23924.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21053
[https://github.com/apache/spark/pull/21053]

> High-order function: element_at
> ---
>
> Key: SPARK-23924
> URL: https://issues.apache.org/jira/browse/SPARK-23924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html and 
> https://prestodb.io/docs/current/functions/map.html 
> * element_at(array, index) → E
> Returns element of array at given index. If index > 0, this function provides 
> the same functionality as the SQL-standard subscript operator ([]). If index 
> < 0, element_at accesses elements from the last to the first.
> * element_at(map, key) → V
> Returns value for given key, or NULL if the key is not contained in the map.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443934#comment-16443934
 ] 

Herman van Hovell commented on SPARK-23711:
---

Can you make a small PR initially so we can discuss the design a bit?

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Gabor Somogyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-24022:
--
Affects Version/s: (was: 2.4.0)
   2.3.0

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Gabor Somogyi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-24022:
--
Component/s: (was: SQL)
 Spark Core

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443924#comment-16443924
 ] 

Liang-Chi Hsieh commented on SPARK-23711:
-

Yeah, I agree that is a good rule. I will prepare a PR for this soon.

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side

2018-04-19 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-24025:

Attachment: join-jira.png

> Join of bucketed and non-bucketed tables can give two exchanges and sorts for 
> non-bucketed side
> ---
>
> Key: SPARK-24025
> URL: https://issues.apache.org/jira/browse/SPARK-24025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
> Environment: {code:java}
> ./bin/spark-shell --version
> Welcome to
>  __
> / __/__ ___ _/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /___/ .__/\_,_/_/ /_/\_\ version 2.3.0
> /_/
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162
> Branch master
> Compiled by user sameera on 2018-02-22T19:24:29Z
> Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
> Url g...@github.com:sameeragarwal/spark.git
> Type --help for more information.{code}
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: join-jira.png
>
>
> While exploring bucketing I found the following join query of non-bucketed 
> and bucketed tables that ends up with two exchanges and two sorts in the 
> physical plan for the non-bucketed join side.
> {code}
> // Make sure that you don't end up with a BroadcastHashJoin and a 
> BroadcastExchange
> // Disable auto broadcasting
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val bucketedTableName = "bucketed_4_id"
> val large = spark.range(100)
> large.write
>   .bucketBy(4, "id")
>   .sortBy("id")
>   .mode("overwrite")
>   .saveAsTable(bucketedTableName)
> // Describe the table and include bucketing spec only
> val descSQL = sql(s"DESC FORMATTED 
> $bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === 
> "Sort Columns")
> scala> descSQL.show(truncate = false)
> +--+-+---+
> |col_name  |data_type|comment|
> +--+-+---+
> |Num Buckets   |4|   |
> |Bucket Columns|[`id`]   |   |
> |Sort Columns  |[`id`]   |   |
> +--+-+---+
> val bucketedTable = spark.table(bucketedTableName)
> val t1 = spark.range(4)
>   .repartition(2, $"id")  // Use just 2 partitions
>   .sortWithinPartitions("id") // sort partitions
> val q = t1.join(bucketedTable, "id")
> // Note two exchanges and sorts
> scala> q.explain
> == Physical Plan ==
> *(5) Project [id#79L]
> +- *(5) SortMergeJoin [id#79L], [id#77L], Inner
>:- *(3) Sort [id#79L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(id#79L, 4)
>: +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0
>:+- Exchange hashpartitioning(id#79L, 2)
>:   +- *(1) Range (0, 4, step=1, splits=8)
>+- *(4) Sort [id#77L ASC NULLS FIRST], false, 0
>   +- *(4) Project [id#77L]
>  +- *(4) Filter isnotnull(id#77L)
> +- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id],
>  PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
> struct
> q.foreach(_ => ())
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24025) Join of bucketed and non-bucketed tables can give two exchanges and sorts for non-bucketed side

2018-04-19 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-24025:
---

 Summary: Join of bucketed and non-bucketed tables can give two 
exchanges and sorts for non-bucketed side
 Key: SPARK-24025
 URL: https://issues.apache.org/jira/browse/SPARK-24025
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.3.1
 Environment: {code:java}
./bin/spark-shell --version
Welcome to
 __
/ __/__ ___ _/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.0
/_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_162
Branch master
Compiled by user sameera on 2018-02-22T19:24:29Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url g...@github.com:sameeragarwal/spark.git
Type --help for more information.{code}
Reporter: Jacek Laskowski


While exploring bucketing I found the following join query of non-bucketed and 
bucketed tables that ends up with two exchanges and two sorts in the physical 
plan for the non-bucketed join side.

{code}
// Make sure that you don't end up with a BroadcastHashJoin and a 
BroadcastExchange
// Disable auto broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

val bucketedTableName = "bucketed_4_id"
val large = spark.range(100)
large.write
  .bucketBy(4, "id")
  .sortBy("id")
  .mode("overwrite")
  .saveAsTable(bucketedTableName)

// Describe the table and include bucketing spec only
val descSQL = sql(s"DESC FORMATTED 
$bucketedTableName").filter($"col_name".contains("Bucket") || $"col_name" === 
"Sort Columns")
scala> descSQL.show(truncate = false)
+--+-+---+
|col_name  |data_type|comment|
+--+-+---+
|Num Buckets   |4|   |
|Bucket Columns|[`id`]   |   |
|Sort Columns  |[`id`]   |   |
+--+-+---+

val bucketedTable = spark.table(bucketedTableName)
val t1 = spark.range(4)
  .repartition(2, $"id")  // Use just 2 partitions
  .sortWithinPartitions("id") // sort partitions

val q = t1.join(bucketedTable, "id")
// Note two exchanges and sorts
scala> q.explain
== Physical Plan ==
*(5) Project [id#79L]
+- *(5) SortMergeJoin [id#79L], [id#77L], Inner
   :- *(3) Sort [id#79L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#79L, 4)
   : +- *(2) Sort [id#79L ASC NULLS FIRST], false, 0
   :+- Exchange hashpartitioning(id#79L, 2)
   :   +- *(1) Range (0, 4, step=1, splits=8)
   +- *(4) Sort [id#77L ASC NULLS FIRST], false, 0
  +- *(4) Project [id#77L]
 +- *(4) Filter isnotnull(id#77L)
+- *(4) FileScan parquet default.bucketed_4_id[id#77L] Batched: 
true, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/jacek/dev/oss/spark/spark-warehouse/bucketed_4_id],
 PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: 
struct

q.foreach(_ => ())
{code}
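A follow-up experiment (behavior not verified here, just a way to probe the issue): give the non-bucketed side the same number of partitions as the table has buckets and compare the resulting physical plan with the one above.

{code:java}
// Assumes the spark-shell session above, so $-notation and bucketedTableName are in scope.
val t1b = spark.range(4)
  .repartition(4, $"id")          // match the bucket count of bucketed_4_id
  .sortWithinPartitions("id")

t1b.join(spark.table(bucketedTableName), "id").explain()
{code}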



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-19 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443892#comment-16443892
 ] 

Teng Peng commented on SPARK-24024:
---

I will first reproduce the issue he has, check how R handles this, and see 
whether any other fix is needed.

> Fix deviance calculations in GLM to handle corner cases
> ---
>
> Key: SPARK-24024
> URL: https://issues.apache.org/jira/browse/SPARK-24024
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> It is reported by Spark users that the deviance calculation does not handle 
> a corner case, so the correct model summary cannot be obtained. The user has 
> confirmed that the issue is in
> override def deviance(y: Double, mu: Double, weight: Double): Double = {
>   2.0 * weight * (y * math.log(y / mu) - (y - mu))
> }
> when y = 0.
>  
> The user also mentioned there are many other places where he believes we 
> should check the same thing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-19 Thread Teng Peng (JIRA)
Teng Peng created SPARK-24024:
-

 Summary: Fix deviance calculations in GLM to handle corner cases
 Key: SPARK-24024
 URL: https://issues.apache.org/jira/browse/SPARK-24024
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Teng Peng


It is reported by Spark users that the deviance calculation does not handle a 
corner case, so the correct model summary cannot be obtained. The user has 
confirmed that the issue is in

override def deviance(y: Double, mu: Double, weight: Double): Double = {
  2.0 * weight * (y * math.log(y / mu) - (y - mu))
}

when y = 0.

The user also mentioned there are many other places where he believes we should 
check the same thing.
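One common guard for this corner case is the convention y * log(y / mu) := 0 when y = 0. The sketch below only illustrates that convention; it is not necessarily the fix the project will adopt.

{code:java}
// Guarded variant: use the convention y * log(y / mu) = 0 when y == 0.
def deviance(y: Double, mu: Double, weight: Double): Double = {
  val yLogY = if (y > 0.0) y * math.log(y / mu) else 0.0
  2.0 * weight * (yLogY - (y - mu))
}

deviance(0.0, 1.5, 1.0) // 3.0 instead of NaN
{code}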



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443884#comment-16443884
 ] 

Hyukjin Kwon commented on SPARK-24023:
--

cc [~shivaram] too of course.

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> It could be messy if there are duplicates for the R side in SPARK-23899. A 
> follow-up for each JIRA might be possible, but that again would be messy to manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443882#comment-16443882
 ] 

Herman van Hovell commented on SPARK-23711:
---

ObjectHashAggregateExec uses code generation for the projections it uses.

I think that, as a rule, we should never create code-generated objects directly, and 
should use factories with fallback logic instead.
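A minimal, self-contained sketch of that rule; the types below are stand-ins rather than Spark's real classes: the factory tries the code-generated builder first and falls back to the interpreted implementation if codegen fails.

{code:java}
// Stand-in types, not Spark internals.
trait Projection { def apply(row: Seq[Any]): Seq[Any] }

object ProjectionFactory {
  // codegen may throw (e.g. a compilation failure); interpreted is the safe fallback.
  def create(codegen: () => Projection, interpreted: () => Projection): Projection =
    try codegen()
    catch {
      case e: Exception =>
        Console.err.println(s"Codegen failed, falling back to interpreted mode: ${e.getMessage}")
        interpreted()
    }
}
{code}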

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443879#comment-16443879
 ] 

Liang-Chi Hsieh commented on SPARK-23711:
-

Object hash aggregate? If you mean {{ObjectHashAggregateExec}}, it seems it 
doesn't support codegen and only runs in interpreted mode?

For encoders, I think it is because they directly use something like 
{{GenerateUnsafeProjection}} and don't fall back to the interpreted version for 
now. Other places using this kind of codegen code path are the same.

I didn't check window functions yet. Thanks for pointing it out.

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24023:
-
Description: 
This JIRA targets adding R functions corresponding to SPARK-23899. We have 
usually been adding a function with Scala alone or with both Scala and Python 
APIs.
It could be messy if there are duplicates for the R side in SPARK-23899. A 
follow-up for each JIRA might be possible, but that again would be messy to manage.

  was:
This JIRA targets to add an R functions corresponding to SPARK-23899. We have 
been usually adding a function with Scala alone or with both Scala and Python 
APIs.
This JIRA targets to add the functions correspondingly to R side too.

It's could be messy and identify to each JIRA if there are duplicates for R 
sides too there. Followup for each JIRA might be possible but again messy to 
manage.


> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> It could be messy if there are duplicates for the R side in SPARK-23899. A 
> follow-up for each JIRA might be possible, but that again would be messy to manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24023:
-
Issue Type: Umbrella  (was: Improvement)

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> This JIRA targets adding the functions correspondingly on the R side too.
> It could be messy to identify, for each JIRA, whether there are duplicates on 
> the R side too. A follow-up for each JIRA might be possible, but again messy to 
> manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24023:
-
Description: 
This JIRA targets adding R functions corresponding to SPARK-23899. We have 
usually been adding a function with Scala alone or with both Scala and Python 
APIs.
This JIRA targets adding the functions correspondingly on the R side too.

It could be messy to identify, for each JIRA, whether there are duplicates on the R 
side too. A follow-up for each JIRA might be possible, but again messy to 
manage.

  was:
This JIRA targets to add an R functions corresponding to SPARK-23899. We have 
been usually adding a function with Scala alone or with both Scala and Python 
APIs.

This JIRA targets to add the functions correspondingly to R side too.



> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> This JIRA targets adding the functions correspondingly on the R side too.
> It could be messy to identify, for each JIRA, whether there are duplicates on 
> the R side too. A follow-up for each JIRA might be possible, but again messy to 
> manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443870#comment-16443870
 ] 

Hyukjin Kwon commented on SPARK-24023:
--

This is meant to be an umbrella. Please add a subtask. 

From my side: currently, the R functions are not being added. Since we know which 
ones to add from SPARK-23899, it would be better to open a JIRA and PR 
correspondingly in place (of course, it's not a requirement, just a thought).

Another option might be to open each JIRA when the corresponding subtask of 
SPARK-23899 is merged and resolved.

From Felix: we probably don't need it as fine-grained, with so many subtasks, if 
the code is simple enough for each function.

To sum up, you can add multiple functions together when each function's code is 
expected to be relatively small; otherwise, I think you can also open a JIRA per 
function. I don't feel strongly. Just make sure the function name is 
written in the JIRA title to reduce confusion.

[~felixcheung] and [~smilegator].

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding R functions corresponding to SPARK-23899. We have 
> usually been adding a function with Scala alone or with both Scala and Python 
> APIs.
> This JIRA targets adding the functions correspondingly on the R side too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-19 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-24023:


 Summary: Built-in SQL Functions improvement in SparkR
 Key: SPARK-24023
 URL: https://issues.apache.org/jira/browse/SPARK-24023
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


This JIRA targets adding R functions corresponding to SPARK-23899. We have 
usually been adding a function with Scala alone or with both Scala and Python 
APIs.

This JIRA targets adding the functions correspondingly on the R side too.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22797:


Assignee: zhengruifeng  (was: Apache Spark)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-04-19 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443866#comment-16443866
 ] 

Herman van Hovell commented on SPARK-23711:
---

There are a lot of places where we do not fall back to interpreted mode and just 
fail, for example: window functions, object hash aggregate, encoders, etc.

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22797:


Assignee: Apache Spark  (was: zhengruifeng)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-19 Thread chris_j (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chris_j updated SPARK-24009:

Description: 
In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully.

On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not a 
permission problem either.

 

1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date" writes to the local directory successfully.

2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date" writes to HDFS successfully.

3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
TEXTFILE select * from default.dim_date" on YARN fails to write to the local 
directory.

 

 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 
[file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
 (exists=false, 
cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
 at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
 ... 8 more
 Caused by: java.io.IOException: Mkdirs failed to create 
[file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
 (exists=false, 
cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
 at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
 at 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)

 

  was:
1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  write local directory successful

3.spark-sql  --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
default.dim_date"  write hdfs successful

2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
'/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
TEXTFILE select * from default.dim_date"  on yarn writr local directory failed

 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 

[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-19 Thread chris_j (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443860#comment-16443860
 ] 

chris_j commented on SPARK-24009:
-

Thanks for your answer, but it didn't help. In local mode, Spark executes "INSERT 
OVERWRITE LOCAL DIRECTORY" successfully.

On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not a 
permission problem either.

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> 1.spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  write local directory successful
> 3.spark-sql  --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  write hdfs successful
> 2.spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date"  on YARN fails to write to the local directory
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>  ... 8 more
>  Caused by: java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>  at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443797#comment-16443797
 ] 

Apache Spark commented on SPARK-24022:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/21105

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24022:


Assignee: Apache Spark

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24022:


Assignee: (was: Apache Spark)

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443779#comment-16443779
 ] 

Gabor Somogyi commented on SPARK-24022:
---

Working on this.

> Flaky test: SparkContextSuite
> -
>
> Key: SPARK-24022
> URL: https://issues.apache.org/jira/browse/SPARK-24022
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
> SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24022) Flaky test: SparkContextSuite

2018-04-19 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-24022:
-

 Summary: Flaky test: SparkContextSuite
 Key: SPARK-24022
 URL: https://issues.apache.org/jira/browse/SPARK-24022
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Gabor Somogyi


A flaky pattern was found and fixed in SPARK-23775, but similar things exist in 
SparkContextSuite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org