[jira] [Resolved] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28204.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25003
[https://github.com/apache/spark/pull/25003]

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 3.0.0
>
>
> SPARK-27534 missed addressing my own comments at 
> https://github.com/WeichenXu123/spark/pull/8
> It's better to push this in since the code is already cleaned up.






[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28204:


Assignee: Hyukjin Kwon

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> SPARK-27534 missed addressing my own comments at 
> https://github.com/WeichenXu123/spark/pull/8
> It's better to push this in since the code is already cleaned up.






[jira] [Commented] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-06-28 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875335#comment-16875335
 ] 

Dongjoon Hyun commented on SPARK-28208:
---

As I commented on ORC-525, this is an unexpected behavior change for users in a 
bug-fix release.
> Why do we enforce such a behavior change in a bug-fix release, from 1.5.5 to 
> 1.5.6?

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> ---
>
> Key: SPARK-28208
> URL: https://issues.apache.org/jira/browse/SPARK-28208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Owen O'Malley
>Priority: Major
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the 
> RecordReader. Users were seeing this as a regression when moving from the old 
> Spark ORC reader via hive to the new native reader, because it opened twice 
> as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
> the Reader until it is either closed or a RecordReader is created from it. 
> This has cut down the number of file open requests on the NameNode by half in 
> typical Spark applications. (Hive's ORC code avoided this problem by putting 
> the file footer into the input splits, but that has other problems.)
> To get the new optimization without leaking file handles, Spark needs to 
> close the readers that aren't used to create RecordReaders.






[jira] [Created] (SPARK-28214) Flaky test: org.apache.spark.streaming.CheckpointSuite.basic rdd checkpoints + dstream graph checkpoint recovery

2019-06-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-28214:
--

 Summary: Flaky test: 
org.apache.spark.streaming.CheckpointSuite.basic rdd checkpoints + dstream 
graph checkpoint recovery
 Key: SPARK-28214
 URL: https://issues.apache.org/jira/browse/SPARK-28214
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Tests
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


This test has failed a few times in some PRs. Example of a failure:

{noformat}
Error Message
org.scalatest.exceptions.TestFailedException: Map() was empty No checkpointed 
RDDs in state stream before first failure
Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Map() was 
empty No checkpointed RDDs in state stream before first failure
at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
at 
org.apache.spark.streaming.CheckpointSuite.$anonfun$new$3(CheckpointSuite.scala:266)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at 
org.apache.spark.streaming.CheckpointSuite.org$scalatest$BeforeAndAfter$$super$runTest(CheckpointSuite.scala:209)
{noformat}

On top of that, when this failure happens, the test leaves a running 
{{SparkContext}} behind, which makes every unit test that runs after it in 
that project fail.
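
One common way to keep such a failure from leaking its context is to tear it down 
unconditionally. A minimal sketch of that pattern (a hypothetical helper, not the 
actual CheckpointSuite code):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper: run a test body against a StreamingContext and stop it
// (including the underlying SparkContext) even if the body's assertions fail,
// so later suites do not inherit a live context.
def withStreamingContext[T](conf: SparkConf)(body: StreamingContext => T): T = {
  val ssc = new StreamingContext(conf, Seconds(1))
  try {
    body(ssc)
  } finally {
    ssc.stop(stopSparkContext = true)
  }
}
{code}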






[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc

2019-06-28 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28206:

Target Version/s: 3.0.0

> "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
> -
>
> Key: SPARK-28206
> URL: https://issues.apache.org/jira/browse/SPARK-28206
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.4.1
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png
>
>
> Just noticed that in [pandas_udf API doc 
> |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
>  "@pandas_udf" is render as ":pandas_udf".
> cc: [~hyukjin.kwon] [~smilegator]






[jira] [Updated] (SPARK-28200) Decimal overflow handling in ExpressionEncoder

2019-06-28 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28200:
---
Summary: Decimal overflow handling in ExpressionEncoder  (was: Overflow 
handling in `ExpressionEncoder`)

> Decimal overflow handling in ExpressionEncoder
> --
>
> Key: SPARK-28200
> URL: https://issues.apache.org/jira/browse/SPARK-28200
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in https://github.com/apache/spark/pull/20350, we are 
> currently not checking the overflow when serializing a java/scala 
> `BigDecimal` in `ExpressionEncoder` / `ScalaReflection`.
> We should add this check there too.
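
To make the gap concrete, here is a hypothetical toy example (not from the ticket) 
of the kind of silent data loss this implies; the exact observed behavior can vary 
by version, but without an overflow check the oversized value is not rejected:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical toy case. The default encoder maps BigDecimal to
// DecimalType(38, 18); a value with more than 20 integer digits cannot be
// represented at that precision, and without an overflow check in
// ExpressionEncoder/ScalaReflection it can silently become null instead of
// raising an error.
case class Amount(value: BigDecimal)

object DecimalOverflowExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("decimal-overflow")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Amount(BigDecimal("123456789012345678901234567890"))).toDS()
    ds.show(truncate = false)  // the oversized value does not round-trip intact

    spark.stop()
  }
}
{code}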






[jira] [Assigned] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28213:


Assignee: Apache Spark

> Remove duplication between columnar and ColumnarBatchScan
> -
>
> Key: SPARK-28213
> URL: https://issues.apache.org/jira/browse/SPARK-28213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
>
> There is a lot of duplicate code between Columnar.scala and 
> ColumnarBatchScan. This should fix that.






[jira] [Assigned] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28213:


Assignee: (was: Apache Spark)

> Remove duplication between columnar and ColumnarBatchScan
> -
>
> Key: SPARK-28213
> URL: https://issues.apache.org/jira/browse/SPARK-28213
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> There is a lot of duplicate code between Columnar.scala and 
> ColumnarBatchScan. This should fix that.






[jira] [Created] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan

2019-06-28 Thread Robert Joseph Evans (JIRA)
Robert Joseph Evans created SPARK-28213:
---

 Summary: Remove duplication between columnar and ColumnarBatchScan
 Key: SPARK-28213
 URL: https://issues.apache.org/jira/browse/SPARK-28213
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Robert Joseph Evans


There is a lot of duplicate code between Columnar.scala and ColumnarBatchScan. 
This should fix that.






[jira] [Commented] (SPARK-28192) Data Source - State - Write side

2019-06-28 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875198#comment-16875198
 ] 

Jungtaek Lim commented on SPARK-28192:
--

Totally interested! Once I decide to implement this with DSv2, SPARK-23889 becomes 
a blocker for this issue, so the sooner the better. If SPARK-23889 requires huge 
effort and could become a blocker for Spark 3.0, then I could wait or try to go 
ahead with DSv1.

SPARK-23889 was filed a year ago (with discussion prior to filing the issue), so it 
is probably better to take it on sooner.

> Data Source - State - Write side
> 
>
> Key: SPARK-28192
> URL: https://issues.apache.org/jira/browse/SPARK-28192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to address batch write for the state data source.
> It could include "state repartition" if that doesn't require huge effort with the 
> new DSv2, but it can also be moved out to a separate issue.






[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing

2019-06-28 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-27945:
-

Assignee: Robert Joseph Evans

> Make minimal changes to support columnar processing
> ---
>
> Key: SPARK-27945
> URL: https://issues.apache.org/jira/browse/SPARK-27945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Major
>
> As the first step for SPARK-27396, this is to put in the minimum changes 
> needed to allow a plugin to support columnar processing.






[jira] [Resolved] (SPARK-27945) Make minimal changes to support columnar processing

2019-06-28 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27945.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Make minimal changes to support columnar processing
> ---
>
> Key: SPARK-27945
> URL: https://issues.apache.org/jira/browse/SPARK-27945
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Assignee: Robert Joseph Evans
>Priority: Major
> Fix For: 3.0.0
>
>
> As the first step for SPARK-27396, this is to put in the minimum changes 
> needed to allow a plugin to support columnar processing.






[jira] [Assigned] (SPARK-28209) Shuffle Storage API: Writes

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28209:


Assignee: Apache Spark

> Shuffle Storage API: Writes
> ---
>
> Key: SPARK-28209
> URL: https://issues.apache.org/jira/browse/SPARK-28209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> Adds the write-side API for storing shuffle data in arbitrary storage 
> systems. Also refactor the existing shuffle write code so that it uses this 
> API.






[jira] [Assigned] (SPARK-28209) Shuffle Storage API: Writes

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28209:


Assignee: (was: Apache Spark)

> Shuffle Storage API: Writes
> ---
>
> Key: SPARK-28209
> URL: https://issues.apache.org/jira/browse/SPARK-28209
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Matt Cheah
>Priority: Major
>
> Adds the write-side API for storing shuffle data in arbitrary storage 
> systems. Also refactor the existing shuffle write code so that it uses this 
> API.






[jira] [Created] (SPARK-28212) Shuffle Storage API: Shuffle Cleanup

2019-06-28 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-28212:
--

 Summary: Shuffle Storage API: Shuffle Cleanup
 Key: SPARK-28212
 URL: https://issues.apache.org/jira/browse/SPARK-28212
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Matt Cheah


In the shuffle storage API, there should be a plugin point for removing 
shuffles that are no longer used.






[jira] [Created] (SPARK-28211) Shuffle Storage API: Driver Lifecycle

2019-06-28 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-28211:
--

 Summary: Shuffle Storage API: Driver Lifecycle
 Key: SPARK-28211
 URL: https://issues.apache.org/jira/browse/SPARK-28211
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Matt Cheah


As part of the shuffle storage API, allow users to hook in application-wide 
startup and shutdown methods. This can do things like create tables in the 
shuffle storage database, or register / unregister against file servers.
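
The concrete interface was still being designed in the linked SPIP at this point; 
purely as an illustration of the idea described above, a hypothetical sketch of such 
hooks could look like this (names and signatures are invented, not the eventual API):

{code:scala}
// Hypothetical sketch only: driver-side, application-wide lifecycle hooks for
// a pluggable shuffle storage backend. Names and signatures are illustrative.
trait ShuffleStorageDriverLifecycle {
  /** Called once at application start, e.g. to create tables in the shuffle
   *  storage database or register with a remote file server. */
  def onApplicationStart(appId: String): Unit

  /** Called once at application shutdown, e.g. to drop per-application state
   *  or unregister from the storage backend. */
  def onApplicationEnd(appId: String): Unit
}
{code}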






[jira] [Created] (SPARK-28210) Shuffle Storage API: Reads

2019-06-28 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-28210:
--

 Summary: Shuffle Storage API: Reads
 Key: SPARK-28210
 URL: https://issues.apache.org/jira/browse/SPARK-28210
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Matt Cheah


As part of the effort to store shuffle data in arbitrary places, this issue 
tracks implementing an API for reading the shuffle data stored by the write 
API. Also ensure that the existing shuffle implementation is refactored to use 
the API.






[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-28 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-25299:
---
Description: 
In Spark, the shuffle primitive requires Spark executors to persist data to the 
local disk of the worker nodes. If executors crash, the external shuffle 
service can continue to serve the shuffle data that was written beyond the 
lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
external shuffle service is deployed on every worker node. The shuffle service 
shares local disk with the executors that run on its node.

There are some shortcomings with the way shuffle is fundamentally implemented 
right now. Particularly:
 * If any external shuffle service process or node becomes unavailable, all 
applications that had an executor that ran on that node must recompute the 
shuffle blocks that were lost.
 * Similarly to the above, the external shuffle service must be kept running at 
all times, which may waste resources when no applications are using that 
shuffle service node.
 * Mounting local storage can prevent users from taking advantage of desirable 
isolation benefits from using containerized environments, like Kubernetes. We 
had an external shuffle service implementation in an early prototype of the 
Kubernetes backend, but it was rejected due to its strict requirement to be 
able to mount hostPath volumes or other persistent volume setups.

In the following [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
 (note: _not_ an SPIP), we brainstorm various high level architectures for 
improving the external shuffle service in a way that addresses the above 
problems. The purpose of this umbrella JIRA is to promote additional discussion 
on how we can approach these problems, both at the architecture level and the 
implementation level. We anticipate filing sub-issues that break down the tasks 
that must be completed to achieve this goal.

Edit June 28 2019: Our SPIP is here: 
[https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]

  was:
In Spark, the shuffle primitive requires Spark executors to persist data to the 
local disk of the worker nodes. If executors crash, the external shuffle 
service can continue to serve the shuffle data that was written beyond the 
lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
external shuffle service is deployed on every worker node. The shuffle service 
shares local disk with the executors that run on its node.

There are some shortcomings with the way shuffle is fundamentally implemented 
right now. Particularly:
 * If any external shuffle service process or node becomes unavailable, all 
applications that had an executor that ran on that node must recompute the 
shuffle blocks that were lost.
 * Similarly to the above, the external shuffle service must be kept running at 
all times, which may waste resources when no applications are using that 
shuffle service node.
 * Mounting local storage can prevent users from taking advantage of desirable 
isolation benefits from using containerized environments, like Kubernetes. We 
had an external shuffle service implementation in an early prototype of the 
Kubernetes backend, but it was rejected due to its strict requirement to be 
able to mount hostPath volumes or other persistent volume setups.

In the following [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
 (note: _not_ an SPIP), we brainstorm various high level architectures for 
improving the external shuffle service in a way that addresses the above 
problems. The purpose of this umbrella JIRA is to promote additional discussion 
on how we can approach these problems, both at the architecture level and the 
implementation level. We anticipate filing sub-issues that break down the tasks 
that must be completed to achieve this goal.


> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on 

[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-28 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875106#comment-16875106
 ] 

Matt Cheah commented on SPARK-25299:


I also noticed the SPIP document wasn't ever posted on this ticket, so sorry 
about that! Here's the link for everyone who wasn't following along on the 
mailing list: 
[https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Assigned] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28208:


Assignee: (was: Apache Spark)

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> ---
>
> Key: SPARK-28208
> URL: https://issues.apache.org/jira/browse/SPARK-28208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Owen O'Malley
>Priority: Major
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the 
> RecordReader. Users were seeing this as a regression when moving from the old 
> Spark ORC reader via hive to the new native reader, because it opened twice 
> as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
> the Reader until it is either closed or a RecordReader is created from it. 
> This has cut down the number of file open requests on the NameNode by half in 
> typical Spark applications. (Hive's ORC code avoided this problem by putting 
> the file footer into the input splits, but that has other problems.)
> To get the new optimization without leaking file handles, Spark needs to 
> close the readers that aren't used to create RecordReaders.






[jira] [Assigned] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28208:


Assignee: Apache Spark

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> ---
>
> Key: SPARK-28208
> URL: https://issues.apache.org/jira/browse/SPARK-28208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Owen O'Malley
>Assignee: Apache Spark
>Priority: Major
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the 
> RecordReader. Users were seeing this as a regression when moving from the old 
> Spark ORC reader via hive to the new native reader, because it opened twice 
> as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
> the Reader until it is either closed or a RecordReader is created from it. 
> This has cut down the number of file open requests on the NameNode by half in 
> typical Spark applications. (Hive's ORC code avoided this problem by putting 
> the file footer into the input splits, but that has other problems.)
> To get the new optimization without leaking file handles, Spark needs to 
> close the readers that aren't used to create RecordReaders.






[jira] [Created] (SPARK-28209) Shuffle Storage API: Writes

2019-06-28 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-28209:
--

 Summary: Shuffle Storage API: Writes
 Key: SPARK-28209
 URL: https://issues.apache.org/jira/browse/SPARK-28209
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Matt Cheah


Adds the write-side API for storing shuffle data in arbitrary storage systems. 
Also refactor the existing shuffle write code so that it uses this API.
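
As an illustration only, a write-side plugin surface along the lines described above 
could look roughly like this (hypothetical names; the API that eventually ships may 
differ substantially):

{code:scala}
import java.io.OutputStream

// Hypothetical sketch of a write-side shuffle storage API. A map task asks for
// one stream per reduce partition, then commits all partitions at once so the
// driver can record where each block lives.
trait ShuffleMapOutputWriterSketch {
  /** Open a stream for one reduce partition of this map task's output. */
  def openPartitionStream(reduceId: Int): OutputStream

  /** Commit every partition written by this map task; returns the byte length
   *  of each partition, indexed by reduce id. */
  def commitAllPartitions(): Array[Long]

  /** Discard any data written so far because the map task failed. */
  def abort(error: Throwable): Unit
}
{code}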






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-28 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875101#comment-16875101
 ] 

Matt Cheah commented on SPARK-25299:


Let's start by making sub-issues. I have a patch staged for master that I intend 
to post by end of day.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Created] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-06-28 Thread Owen O'Malley (JIRA)
Owen O'Malley created SPARK-28208:
-

 Summary: When upgrading to ORC 1.5.6, the reader needs to be 
closed.
 Key: SPARK-28208
 URL: https://issues.apache.org/jira/browse/SPARK-28208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Owen O'Malley


As part of the ORC 1.5.6 release, we optimized the common pattern of:
{code:java}
Reader reader = OrcFile.createReader(...);
RecordReader rows = reader.rows(...);{code}

which used to open one file handle in the Reader and a second one in the 
RecordReader. Users were seeing this as a regression when moving from the old 
Spark ORC reader via hive to the new native reader, because it opened twice as 
many files on the NameNode.

In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
the Reader until it is either closed or a RecordReader is created from it. This 
has cut down the number of file open requests on the NameNode by half in 
typical Spark applications. (Hive's ORC code avoided this problem by putting 
the file footer into the input splits, but that has other problems.)

To get the new optimization without leaking file handles, Spark needs to 
close the readers that aren't used to create RecordReaders.
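
A minimal sketch of what "closing the readers that aren't used to create 
RecordReaders" looks like, assuming the standard org.apache.orc API (the actual 
call sites in Spark's ORC data source will differ):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, Reader, TypeDescription}

// Sketch: read only the footer metadata (schema) and close the Reader
// explicitly. With ORC 1.5.6 the Reader keeps the file handle open until it
// is closed or a RecordReader is created from it, so skipping close() here
// would leak a file handle.
def readSchemaAndClose(path: Path, conf: Configuration): TypeDescription = {
  val reader: Reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
  try {
    reader.getSchema
  } finally {
    reader.close()
  }
}
{code}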






[jira] [Closed] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile

2019-06-28 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-28114.
---

> Add Jenkins job for `Hadoop-3.2` profile
> 
>
> Key: SPARK-28114
> URL: https://issues.apache.org/jira/browse/SPARK-28114
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: shane knapp
>Priority: Major
>
> Spark 3.0 is a major version change. We want to have the following new jobs:
> 1. SBT with hadoop-3.2
> 2. Maven with hadoop-3.2 (on JDK8 and JDK11)
> Also, shall we limit the number of concurrent runs for the following existing 
> job? Currently, it invokes multiple jobs concurrently. We can save resources 
> by limiting it to 1, like the other jobs.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing
> We will drop four `branch-2.3` jobs at the end of August, 2019.






[jira] [Reopened] (SPARK-22207) High memory usage when converting relational data to Hierarchical data

2019-06-28 Thread kanika dhuria (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria reopened SPARK-22207:
---

The same issue is seen in Spark 2.4.

> High memory usage when converting relational data to Hierarchical data
> --
>
> Key: SPARK-22207
> URL: https://issues.apache.org/jira/browse/SPARK-22207
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: kanika dhuria
>Priority: Major
>  Labels: bulk-closed
>
> We have 4 tables:
> lineitems ~1.4 GB
> orders ~330 MB
> customer ~47 MB
> nations ~2.2K
> These tables are related as follows:
> There are multiple lineitems per order (pk, fk: orderkey)
> There are multiple orders per customer (pk, fk: cust_key)
> There are multiple customers per nation (pk, fk: nation key)
> Data is almost evenly distributed.
> Building the hierarchy up to 3 levels, i.e. joining lineitems, orders, and customers, 
> works fine with 4 GB / 2 cores of executor memory.
> Adding nations requires 8 GB / 2 cores or 4 GB / 1 core of memory.
> ==
> {noformat}
> val sqlContext = SparkSession.builder()
>   .enableHiveSupport()
>   .config("spark.sql.retainGroupColumns", false)
>   .config("spark.sql.crossJoin.enabled", true)
>   .getOrCreate()
>
> val orders   = sqlContext.sql("select * from orders")
> val lineItem = sqlContext.sql("select * from lineitems")
> val customer = sqlContext.sql("select * from customers")
> val nation   = sqlContext.sql("select * from nations")
>
> val lineitemOrders = lineItem
>   .groupBy(col("l_orderkey"))
>   .agg(col("l_orderkey"), collect_list(struct(col("l_partkey"), col("l_suppkey"),
>     col("l_linenumber"), col("l_quantity"), col("l_extendedprice"), col("l_discount"),
>     col("l_tax"), col("l_returnflag"), col("l_linestatus"), col("l_shipdate"),
>     col("l_commitdate"), col("l_receiptdate"), col("l_shipinstruct"),
>     col("l_shipmode"))).as("lineitem"))
>   .join(orders, orders("O_ORDERKEY") === lineItem("l_orderkey"))
>   .select(col("O_ORDERKEY"), col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"),
>     col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), col("O_SHIPPRIORITY"),
>     col("O_COMMENT"), col("lineitem"))
>
> val customerList = lineitemOrders
>   .groupBy(col("o_custkey"))
>   .agg(col("o_custkey"), collect_list(struct(col("O_ORDERKEY"), col("O_CUSTKEY"),
>     col("O_ORDERSTATUS"), col("O_TOTALPRICE"), col("O_ORDERDATE"), col("O_ORDERPRIORITY"),
>     col("O_CLERK"), col("O_SHIPPRIORITY"), col("O_COMMENT"), col("lineitem"))).as("items"))
>   .join(customer, customer("c_custkey") === lineitemOrders("o_custkey"))
>   .select(col("c_custkey"), col("c_name"), col("c_nationkey"), col("items"))
>
> val nationList = customerList
>   .groupBy(col("c_nationkey"))
>   .agg(col("c_nationkey"), collect_list(struct(col("c_custkey"), col("c_name"),
>     col("c_nationkey"), col("items"))).as("custList"))
>   .join(nation, nation("n_nationkey") === customerList("c_nationkey"))
>   .select(col("n_nationkey"), col("n_name"), col("custList"))
>
> nationList.write.mode("overwrite").json("filePath")
> {noformat}
> 
> If the customerList is saved to a file and the last agg/join is then run 
> separately, it does run fine with 4 GB / 2 cores.
> I can provide the data if needed.
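
A sketch of the workaround described in the last paragraph, assuming the definitions 
from the snippet above (path and format are illustrative):

{code:scala}
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Sketch: materialize customerList first, then run the final aggregation/join
// from the saved copy rather than from the full lineage.
customerList.write.mode("overwrite").parquet("/tmp/customerList")
val savedCustomerList = sqlContext.read.parquet("/tmp/customerList")

val nationList = savedCustomerList
  .groupBy(col("c_nationkey"))
  .agg(col("c_nationkey"), collect_list(struct(col("c_custkey"), col("c_name"),
    col("c_nationkey"), col("items"))).as("custList"))
  .join(nation, nation("n_nationkey") === savedCustomerList("c_nationkey"))
  .select(col("n_nationkey"), col("n_name"), col("custList"))

nationList.write.mode("overwrite").json("filePath")
{code}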






[jira] [Deleted] (SPARK-28207) https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit

2019-06-28 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin deleted SPARK-28207:
---


> https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit
> --
>
> Key: SPARK-28207
> URL: https://issues.apache.org/jira/browse/SPARK-28207
> Project: Spark
>  Issue Type: Bug
>Reporter: Roufique Hossain
>Priority: Minor
>  Labels: http://schemas.xmlsoap.org/ws/2004/09/policy
>







[jira] [Created] (SPARK-28207) https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit

2019-06-28 Thread Roufique Hossain (JIRA)
Roufique Hossain created SPARK-28207:


 Summary: 
https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit
 Key: SPARK-28207
 URL: https://issues.apache.org/jira/browse/SPARK-28207
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 2.4.3
Reporter: Roufique Hossain









[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc

2019-06-28 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-28206:
--
Attachment: Screen Shot 2019-06-28 at 9.55.13 AM.png

> "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
> -
>
> Key: SPARK-28206
> URL: https://issues.apache.org/jira/browse/SPARK-28206
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.4.1
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png
>
>
> Just noticed that in [pandas_udf API doc 
> |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
>  "@pandas_udf" is render as ":pandas_udf".
> cc: [~hyukjin.kwon] [~smilegator]






[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc

2019-06-28 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-28206:
--
Issue Type: Bug  (was: Documentation)

> "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
> -
>
> Key: SPARK-28206
> URL: https://issues.apache.org/jira/browse/SPARK-28206
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.4.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> Just noticed that in [pandas_udf API doc 
> |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
>  "@pandas_udf" is render as ":pandas_udf".
> cc: [~hyukjin.kwon] [~smilegator]






[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc

2019-06-28 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-28206:
--
Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API 
doc  (was: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html)

> "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
> -
>
> Key: SPARK-28206
> URL: https://issues.apache.org/jira/browse/SPARK-28206
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.4.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> Just noticed that in [pandas_udf API doc 
> |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
>  "@pandas_udf" is render as ":pandas_udf".
> cc: [~hyukjin.kwon] [~smilegator]






[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html

2019-06-28 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-28206:
--
Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html  
(was: "@" is rendered as ":" in doctest)

> "@pandas_udf" in doctest is rendered as ":pandas_udf" in html
> -
>
> Key: SPARK-28206
> URL: https://issues.apache.org/jira/browse/SPARK-28206
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.4.1
>Reporter: Xiangrui Meng
>Priority: Major
>
> Just noticed that in [pandas_udf API doc 
> |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
>  "@pandas_udf" is render as ":pandas_udf".
> cc: [~hyukjin.kwon] [~smilegator]






[jira] [Created] (SPARK-28206) "@" is rendered as ":" in doctest

2019-06-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-28206:
-

 Summary: "@" is rendered as ":" in doctest
 Key: SPARK-28206
 URL: https://issues.apache.org/jira/browse/SPARK-28206
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Affects Versions: 2.4.1
Reporter: Xiangrui Meng


Just noticed that in [pandas_udf API doc 
|https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf],
 "@pandas_udf" is render as ":pandas_udf".

cc: [~hyukjin.kwon] [~smilegator]






[jira] [Commented] (SPARK-28192) Data Source - State - Write side

2019-06-28 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875046#comment-16875046
 ] 

Ryan Blue commented on SPARK-28192:
---

It sounds like what you want is for a source to be able to communicate the 
required clustering and sort order for a write, is that correct?

I opened an issue for this a while ago, but it probably won't be on the roadmap 
for Spark 3.0: SPARK-23889. We can do that sooner if you're interested in it!
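
For readers not familiar with SPARK-23889: the idea is that a sink can tell Spark how 
the incoming data must be clustered and sorted before it is written. A purely 
hypothetical sketch of that shape (not an actual Spark interface at the time of this 
thread):

{code:scala}
// Hypothetical sketch: a write-side source declares the distribution and
// per-partition ordering it needs; Spark would add the shuffle/sort for it.
trait RequiresWriteDistribution {
  /** Column names the incoming rows must be clustered (partitioned) by. */
  def requiredClustering: Seq[String]

  /** Column names each partition must be sorted by before writing. */
  def requiredOrdering: Seq[String]
}
{code}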

> Data Source - State - Write side
> 
>
> Key: SPARK-28192
> URL: https://issues.apache.org/jira/browse/SPARK-28192
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort to address batch write for the state data source.
> It could include "state repartition" if that doesn't require huge effort with the 
> new DSv2, but it can also be moved out to a separate issue.






[jira] [Resolved] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28145.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24952
[https://github.com/apache/spark/pull/24952]

> Executor pods polling source can fail to replace dead executors
> ---
>
> Key: SPARK-28145
> URL: https://issues.apache.org/jira/browse/SPARK-28145
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Minor
> Fix For: 3.0.0
>
>
> The scheduled task responsible for reporting executor snapshots to the executor 
> allocator in Kubernetes will die on any error, killing subsequent runs of the 
> same task. 






[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors

2019-06-28 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28145:
-

Assignee: Onur Satici

> Executor pods polling source can fail to replace dead executors
> ---
>
> Key: SPARK-28145
> URL: https://issues.apache.org/jira/browse/SPARK-28145
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Minor
>
> The scheduled task responsible for reporting executor snapshots to the executor 
> allocator in Kubernetes will die on any error, killing subsequent runs of the 
> same task. 






[jira] [Assigned] (SPARK-28205) useV1SourceList configuration should be for all data sources

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28205:


Assignee: Apache Spark

> useV1SourceList configuration should be for all data sources
> 
>
> Key: SPARK-28205
> URL: https://issues.apache.org/jira/browse/SPARK-28205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> In the migration PR of Kafka V2: 
> https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645
> We find that the useV1SourceList 
> configuration (spark.sql.sources.read.useV1SourceList and 
> spark.sql.sources.write.useV1SourceList) should apply to all data sources, 
> instead of only file source V2.






[jira] [Assigned] (SPARK-28205) useV1SourceList configuration should be for all data sources

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28205:


Assignee: (was: Apache Spark)

> useV1SourceList configuration should be for all data sources
> 
>
> Key: SPARK-28205
> URL: https://issues.apache.org/jira/browse/SPARK-28205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> In the migration PR of Kafka V2: 
> https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645
> We find that the useV1SourceList 
> configuration (spark.sql.sources.read.useV1SourceList and 
> spark.sql.sources.write.useV1SourceList) should apply to all data sources, 
> instead of only file source V2.






[jira] [Created] (SPARK-28205) useV1SourceList configuration should be for all data sources

2019-06-28 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-28205:
--

 Summary: useV1SourceList configuration should be for all data 
sources
 Key: SPARK-28205
 URL: https://issues.apache.org/jira/browse/SPARK-28205
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In the migration PR of Kafka V2: 
https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645
We find that the useV1SourceList 
configuration (spark.sql.sources.read.useV1SourceList and 
spark.sql.sources.write.useV1SourceList) should apply to all data sources, 
instead of only file source V2.
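
For reference, a small sketch of setting these flags (config keys as quoted in the 
description above; the list of source names is illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: route the named sources through the V1 code path for reads and
// writes. Per this issue, the lists should apply to any data source, not only
// the file-based ones.
val spark = SparkSession.builder()
  .appName("useV1SourceList-example")
  .master("local[*]")
  .config("spark.sql.sources.read.useV1SourceList", "csv,json,orc,parquet,kafka")
  .config("spark.sql.sources.write.useV1SourceList", "csv,json,orc,parquet,kafka")
  .getOrCreate()
{code}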







[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28204:


Assignee: (was: Apache Spark)

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> SPARK-27534 missed addressing my own comments at 
> https://github.com/WeichenXu123/spark/pull/8
> It's better to push this in since the code is already cleaned up.






[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28204:


Assignee: Apache Spark

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> SPARK-27534 missed addressing my own comments at 
> https://github.com/WeichenXu123/spark/pull/8
> It's better to push this in since the code is already cleaned up.






[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28204:
-
Description: 
SPARK-27534 missed addressing my own comments at 
https://github.com/WeichenXu123/spark/pull/8

It's better to push this in since the code is already cleaned up.

  was:
SPARK-27534 missed addressing my own comments at 
https://github.com/HyukjinKwon?tab=overview=2019-04-01=2019-04-30

It's better to push this in since the code is already cleaned up.


> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> SPARK-27534 missed addressing my own comments at 
> https://github.com/WeichenXu123/spark/pull/8
> It's better to push this in since the code is already cleaned up.






[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28204:
-
Issue Type: Test  (was: New Feature)

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> SPARK-27534 missed to address my own comments at 
> https://github.com/HyukjinKwon?tab=overview=2019-04-01=2019-04-30
> It's better to push this in since the codes are already cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28204:
-
Component/s: Tests

> Make separate two test cases for column pruning in binary files
> ---
>
> Key: SPARK-28204
> URL: https://issues.apache.org/jira/browse/SPARK-28204
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> SPARK-27534 missed to address my own comments at 
> https://github.com/HyukjinKwon?tab=overview=2019-04-01=2019-04-30
> It's better to push this in since the codes are already cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28204) Make separate two test cases for column pruning in binary files

2019-06-28 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28204:


 Summary: Make separate two test cases for column pruning in binary 
files
 Key: SPARK-28204
 URL: https://issues.apache.org/jira/browse/SPARK-28204
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


SPARK-27534 missed to address my own comments at 
https://github.com/HyukjinKwon?tab=overview=2019-04-01=2019-04-30

It's better to push this in since the codes are already cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28203:


Assignee: (was: Apache Spark)

> PythonRDD should respect SparkContext's conf when passing user confMap
> --
>
> Key: SPARK-28203
> URL: https://issues.apache.org/jira/browse/SPARK-28203
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.3
>Reporter: Xianjin YE
>Priority: Minor
>
> PythonRDD has several APIs that accept user configs from the Python side. The 
> parameter is called confAsMap and is intended to be merged with the RDD's Hadoop 
> configuration.
>  However, confAsMap is first mapped to a fresh Configuration and then merged into 
> SparkContext's Hadoop configuration. The fresh Configuration loads default values 
> from core-default.xml etc., and those keys may already have been overridden in 
> SparkContext's Hadoop configuration, so the defaults override the updated values 
> during the merge.
> I will submit a PR to fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28203:


Assignee: Apache Spark

> PythonRDD should respect SparkContext's conf when passing user confMap
> --
>
> Key: SPARK-28203
> URL: https://issues.apache.org/jira/browse/SPARK-28203
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.3
>Reporter: Xianjin YE
>Assignee: Apache Spark
>Priority: Minor
>
> PythonRDD has several APIs that accept user configs from the Python side. The 
> parameter is called confAsMap and is intended to be merged with the RDD's Hadoop 
> configuration.
>  However, confAsMap is first mapped to a fresh Configuration and then merged into 
> SparkContext's Hadoop configuration. The fresh Configuration loads default values 
> from core-default.xml etc., and those keys may already have been overridden in 
> SparkContext's Hadoop configuration, so the defaults override the updated values 
> during the merge.
> I will submit a PR to fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28107) Interval type conversion syntax support

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28107:


Assignee: (was: Apache Spark)

> Interval type conversion syntax support
> ---
>
> Key: SPARK-28107
> URL: https://issues.apache.org/jira/browse/SPARK-28107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> According to the ANSI SQL:2003 standard for interval type conversion, 
> Spark SQL can currently support only:
>  * Interval year to month
>  * Interval day to second
>  * Interval hour to second
> There are other forms that are supported in both PostgreSQL and ANSI 
> SQL:2003:
>  * Interval day to hour
>  * Interval day to minute
>  * Interval hour to minute
>  * Interval minute to second
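
For illustration, a minimal sketch of how the additional qualifiers listed above would be used through Spark SQL; the literal values are made up, and the queries assume the requested ANSI forms are accepted:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: these are the interval qualifiers the issue asks Spark SQL to accept.
// Literal values are illustrative and not guaranteed to parse on current Spark.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("SELECT INTERVAL '3 12' DAY TO HOUR").show()
spark.sql("SELECT INTERVAL '3 12:30' DAY TO MINUTE").show()
spark.sql("SELECT INTERVAL '12:30' HOUR TO MINUTE").show()
spark.sql("SELECT INTERVAL '30:15.5' MINUTE TO SECOND").show()
{code}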



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28107) Interval type conversion syntax support

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28107:


Assignee: Apache Spark

> Interval type conversion syntax support
> ---
>
> Key: SPARK-28107
> URL: https://issues.apache.org/jira/browse/SPARK-28107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> According to the ANSI SQL:2003 standard for interval type conversion, 
> Spark SQL can currently support only:
>  * Interval year to month
>  * Interval day to second
>  * Interval hour to second
> There are other forms that are supported in both PostgreSQL and ANSI 
> SQL:2003:
>  * Interval day to hour
>  * Interval day to minute
>  * Interval hour to minute
>  * Interval minute to second



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28107) Interval type conversion syntax support

2019-06-28 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874842#comment-16874842
 ] 

Apache Spark commented on SPARK-28107:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/25000

> Interval type conversion syntax support
> ---
>
> Key: SPARK-28107
> URL: https://issues.apache.org/jira/browse/SPARK-28107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> According to the ANSI SQL:2003 standard for interval type conversion, 
> Spark SQL can currently support only:
>  * Interval year to month
>  * Interval day to second
>  * Interval hour to second
> There are other forms that are supported in both PostgreSQL and ANSI 
> SQL:2003:
>  * Interval day to hour
>  * Interval day to minute
>  * Interval hour to minute
>  * Interval minute to second



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap

2019-06-28 Thread Xianjin YE (JIRA)
Xianjin YE created SPARK-28203:
--

 Summary: PythonRDD should respect SparkContext's conf when passing 
user confMap
 Key: SPARK-28203
 URL: https://issues.apache.org/jira/browse/SPARK-28203
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 2.4.3
Reporter: Xianjin YE


PythonRDD has several APIs that accept user configs from the Python side. The 
parameter is called confAsMap and is intended to be merged with the RDD's Hadoop 
configuration.

However, confAsMap is first mapped to a fresh Configuration and then merged into 
SparkContext's Hadoop configuration. The fresh Configuration loads default values 
from core-default.xml etc., and those keys may already have been overridden in 
SparkContext's Hadoop configuration, so the defaults override the updated values 
during the merge.

I will submit a PR to fix this.
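
To make the intended merge order concrete, here is a rough Scala sketch (not the actual PythonRDD code; `mergeConfAsMap` is a hypothetical helper used only for illustration):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration

// Hypothetical helper, not Spark's implementation: start from the SparkContext's
// Hadoop configuration and apply the user-supplied confAsMap entries on top, so
// defaults from core-default.xml cannot override values the user already set.
def mergeConfAsMap(base: Configuration, confAsMap: java.util.Map[String, String]): Configuration = {
  val merged = new Configuration(base)                            // copy of the base Hadoop conf
  confAsMap.asScala.foreach { case (k, v) => merged.set(k, v) }   // user entries win
  merged
}
{code}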



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28077) ANSI SQL: OVERLAY function(T312)

2019-06-28 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-28077.
---
   Resolution: Fixed
 Assignee: jiaan.geng
Fix Version/s: 3.0.0

Issue resolved by pull request 24918
https://github.com/apache/spark/pull/24918

> ANSI SQL: OVERLAY function(T312)
> 
>
> Key: SPARK-28077
> URL: https://issues.apache.org/jira/browse/SPARK-28077
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Function||Return Type||Description||Example||Result||
> |{{overlay(_string_ placing _string_ from _int_ [for 
> _int_])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 
> 2 for 4)}}|{{Thomas}}|
> For example:
> {code:sql}
> SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba";
> SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo";
> SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba";
> {code}
> https://www.postgresql.org/docs/11/functions-string.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28083:


Assignee: Apache Spark

> ANSI SQL: LIKE predicate: ESCAPE clause
> ---
>
> Key: SPARK-28083
> URL: https://issues.apache.org/jira/browse/SPARK-28083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Format:
> {noformat}
> <like predicate> ::=
>   <character like predicate>
>   | <octet like predicate>
> <character like predicate> ::=
>   <row value predicand> <character like predicate part 2>
> <character like predicate part 2> ::=
>   [ NOT ] LIKE <character pattern> [ ESCAPE <escape character> ]
> <character pattern> ::=
>   <character value expression>
> <escape character> ::=
>   <character value expression>
> <octet like predicate> ::=
>   <row value predicand> <octet like predicate part 2>
> <octet like predicate part 2> ::=
>   [ NOT ] LIKE <octet pattern> [ ESCAPE <escape octet> ]
> <octet pattern> ::=
>   <octet value expression>
> <escape octet> ::=
>   <octet value expression>
> {noformat}
>  
> [https://www.postgresql.org/docs/11/functions-matching.html]
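
A minimal sketch of the requested syntax through Spark SQL (the table and data are made up, and the query assumes the ESCAPE clause is supported as proposed):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: with '!' declared as the escape character, the pattern '100!%'
// matches a literal percent sign instead of treating '%' as a wildcard.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq("100%", "100 units").toDF("s").createOrReplaceTempView("t")
spark.sql("SELECT s FROM t WHERE s LIKE '100!%' ESCAPE '!'").show()
{code}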



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28083:


Assignee: (was: Apache Spark)

> ANSI SQL: LIKE predicate: ESCAPE clause
> ---
>
> Key: SPARK-28083
> URL: https://issues.apache.org/jira/browse/SPARK-28083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Format:
> {noformat}
> <like predicate> ::=
>   <character like predicate>
>   | <octet like predicate>
> <character like predicate> ::=
>   <row value predicand> <character like predicate part 2>
> <character like predicate part 2> ::=
>   [ NOT ] LIKE <character pattern> [ ESCAPE <escape character> ]
> <character pattern> ::=
>   <character value expression>
> <escape character> ::=
>   <character value expression>
> <octet like predicate> ::=
>   <row value predicand> <octet like predicate part 2>
> <octet like predicate part 2> ::=
>   [ NOT ] LIKE <octet pattern> [ ESCAPE <escape octet> ]
> <octet pattern> ::=
>   <octet value expression>
> <escape octet> ::=
>   <octet value expression>
> {noformat}
>  
> [https://www.postgresql.org/docs/11/functions-matching.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28202:


Assignee: Apache Spark

> [Core] [Test] Avoid noises of system props in SparkConfSuite
> 
>
> Key: SPARK-28202
> URL: https://issues.apache.org/jira/browse/SPARK-28202
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ShuMing Li
>Assignee: Apache Spark
>Priority: Trivial
>
> When the SPARK_HOME environment variable is set and its directory contains a 
> specific `spark-defaults.conf`, the `org.apache.spark.util.loadDefaultSparkProperties` 
> method may pollute the system properties. As a result, running the `core/test` 
> module can make `SparkConfSuite` fail.
>  
> This is easy to fix by setting `loadDefaults` to false in `SparkConf`.
> ```
> [info] - accumulators (5 seconds, 565 milliseconds)
> [info] - deprecated configs *** FAILED *** (79 milliseconds)
> [info] 7 did not equal 4 (SparkConfSuite.scala:266)
> [info] org.scalatest.exceptions.TestFailedException:
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
> [info] at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info] at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info] at 
> org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
> [info] at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info] at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28202:


Assignee: (was: Apache Spark)

> [Core] [Test] Avoid noises of system props in SparkConfSuite
> 
>
> Key: SPARK-28202
> URL: https://issues.apache.org/jira/browse/SPARK-28202
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: ShuMing Li
>Priority: Trivial
>
> When the SPARK_HOME environment variable is set and its directory contains a 
> specific `spark-defaults.conf`, the `org.apache.spark.util.loadDefaultSparkProperties` 
> method may pollute the system properties. As a result, running the `core/test` 
> module can make `SparkConfSuite` fail.
>  
> This is easy to fix by setting `loadDefaults` to false in `SparkConf`.
> ```
> [info] - accumulators (5 seconds, 565 milliseconds)
> [info] - deprecated configs *** FAILED *** (79 milliseconds)
> [info] 7 did not equal 4 (SparkConfSuite.scala:266)
> [info] org.scalatest.exceptions.TestFailedException:
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
> [info] at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
> [info] at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info] at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info] at 
> org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
> [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
> [info] at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> [info] at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite

2019-06-28 Thread ShuMing Li (JIRA)
ShuMing Li created SPARK-28202:
--

 Summary: [Core] [Test] Avoid noises of system props in 
SparkConfSuite
 Key: SPARK-28202
 URL: https://issues.apache.org/jira/browse/SPARK-28202
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: ShuMing Li


When the SPARK_HOME environment variable is set and its directory contains a 
specific `spark-defaults.conf`, the `org.apache.spark.util.loadDefaultSparkProperties` 
method may pollute the system properties. As a result, running the `core/test` 
module can make `SparkConfSuite` fail.

This is easy to fix by setting `loadDefaults` to false in `SparkConf`, as in the 
sketch after the log below.

```

[info] - accumulators (5 seconds, 565 milliseconds)
[info] - deprecated configs *** FAILED *** (79 milliseconds)
[info] 7 did not equal 4 (SparkConfSuite.scala:266)
[info] org.scalatest.exceptions.TestFailedException:
[info] at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
[info] at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
[info] at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info] at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
[info] at 
org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266)
[info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
[info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
[info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)

```
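
A small sketch of the suggested remedy (illustrative only, not the actual SparkConfSuite change):

{code:scala}
import org.apache.spark.SparkConf

// loadDefaults = false keeps spark.* system properties (e.g. ones injected from a
// spark-defaults.conf found via SPARK_HOME) out of the SparkConf under test, so the
// suite only sees what it sets explicitly.
val conf = new SparkConf(loadDefaults = false).set("spark.app.name", "conf-suite-sketch")
println(conf.getAll.toSeq)  // contains only the explicitly set entries
{code}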



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874765#comment-16874765
 ] 

Marco Gaido commented on SPARK-28201:
-

I'll create a PR for this ASAP.

> Revisit MakeDecimal behavior on overflow
> 
>
> Key: SPARK-28201
> URL: https://issues.apache.org/jira/browse/SPARK-28201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in 
> https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
> cases of decimal aggregation we are using the `MakeDecimal` operator.
> This operator's behavior on overflow is not well defined; currently:
>  - if codegen is enabled, it returns null;
>  - in interpreted mode, it throws an `IllegalArgumentException`.
> We should make its behavior uniform with other similar cases and, in 
> particular, honor the value of the conf introduced in SPARK-23179, i.e.:
>  - return null if the flag is true;
>  - throw an `ArithmeticException` if the flag is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28201:
---

 Summary: Revisit MakeDecimal behavior on overflow
 Key: SPARK-28201
 URL: https://issues.apache.org/jira/browse/SPARK-28201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in 
https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
cases of decimal aggregation we are using the `MakeDecimal` operator.

This operator's behavior on overflow is not well defined; currently:

 - if codegen is enabled, it returns null;
 - in interpreted mode, it throws an `IllegalArgumentException`.

We should make its behavior uniform with other similar cases and, in particular, 
honor the value of the conf introduced in SPARK-23179, i.e.:

 - return null if the flag is true;
 - throw an `ArithmeticException` if the flag is false (see the sketch below).
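
A hedged sketch of that pattern (illustrative only, not Spark's MakeDecimal implementation; `nullOnOverflow` stands in for the conf value from SPARK-23179):

{code:scala}
import java.math.BigDecimal

// Illustration of the requested behavior: one flag decides between returning null
// and throwing ArithmeticException when the decimal does not fit the target precision.
def makeDecimal(unscaled: Long, precision: Int, scale: Int, nullOnOverflow: Boolean): BigDecimal = {
  val value = BigDecimal.valueOf(unscaled, scale)
  if (value.precision() > precision) {
    if (nullOnOverflow) null
    else throw new ArithmeticException(s"Decimal precision ${value.precision()} exceeds max precision $precision")
  } else {
    value
  }
}
{code}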



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28198:


Assignee: (was: Apache Spark)

> Add mapPartitionsInPandas to allow an iterator of DataFrames
> 
>
> Key: SPARK-28198
> URL: https://issues.apache.org/jira/browse/SPARK-28198
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It would be 
> good to be able to use this without the limitation on length.
> This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas 
> UDF and the Arrow / Pandas integration in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28198:


Assignee: Apache Spark

> Add mapPartitionsInPandas to allow an iterator of DataFrames
> 
>
> Key: SPARK-28198
> URL: https://issues.apache.org/jira/browse/SPARK-28198
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It would be 
> good to be able to use this without the limitation on length.
> This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas 
> UDF and the Arrow / Pandas integration in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28185.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24986
[https://github.com/apache/spark/pull/24986]

> Trigger pandas iterator UDF closing stuff when iterator stop early
> --
>
> Key: SPARK-28185
> URL: https://issues.apache.org/jira/browse/SPARK-28185
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Fix the issue where the Pandas UDF's cleanup logic is not triggered when the 
> iterator stops early.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early

2019-06-28 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28185:


Assignee: Weichen Xu

> Trigger pandas iterator UDF closing stuff when iterator stop early
> --
>
> Key: SPARK-28185
> URL: https://issues.apache.org/jira/browse/SPARK-28185
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.4.3
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Fix the issue where the Pandas UDF's cleanup logic is not triggered when the 
> iterator stops early.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28200) Overflow handling in `ExpressionEncoder`

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28200:
---

 Summary: Overflow handling in `ExpressionEncoder`
 Key: SPARK-28200
 URL: https://issues.apache.org/jira/browse/SPARK-28200
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in https://github.com/apache/spark/pull/20350, we currently do 
not check for overflow when serializing a Java/Scala `BigDecimal` in 
`ExpressionEncoder` / `ScalaReflection`.

We should add this check there too.
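
A minimal sketch of the kind of check meant here (`checkFits` is a hypothetical helper for illustration, not the ScalaReflection code):

{code:scala}
import scala.math.BigDecimal.RoundingMode

// Hypothetical helper: verify that a BigDecimal fits the target (precision, scale)
// before serializing, and fail loudly instead of silently producing a wrong value.
def checkFits(value: BigDecimal, precision: Int, scale: Int): BigDecimal = {
  val adjusted = value.setScale(scale, RoundingMode.HALF_UP)
  if (adjusted.precision > precision) {
    throw new ArithmeticException(s"$value cannot be represented as Decimal($precision, $scale)")
  }
  adjusted
}
{code}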



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28199:


Assignee: Apache Spark

> Remove usage of ProcessingTime in Spark codebase
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Minor
>
> Even though ProcessingTime was deprecated in 2.2.0, it is still used in the Spark 
> codebase, and the alternative Spark proposes itself relies on deprecated methods, 
> which is circular - the usage can never be removed.
> This issue targets removing the usage of ProcessingTime from the Spark codebase 
> by adding a new class to replace it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase

2019-06-28 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28199:


Assignee: (was: Apache Spark)

> Remove usage of ProcessingTime in Spark codebase
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Even though ProcessingTime was deprecated in 2.2.0, it is still used in the Spark 
> codebase, and the alternative Spark proposes itself relies on deprecated methods, 
> which is circular - the usage can never be removed.
> This issue targets removing the usage of ProcessingTime from the Spark codebase 
> by adding a new class to replace it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase

2019-06-28 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874741#comment-16874741
 ] 

Jungtaek Lim commented on SPARK-28199:
--

Working on this. I originally treated it as a minor change, but realized I had to 
introduce a new class, so I filed an issue.

> Remove usage of ProcessingTime in Spark codebase
> 
>
> Key: SPARK-28199
> URL: https://issues.apache.org/jira/browse/SPARK-28199
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> Even though ProcessingTime was deprecated in 2.2.0, it is still used in the Spark 
> codebase, and the alternative Spark proposes itself relies on deprecated methods, 
> which is circular - the usage can never be removed.
> This issue targets removing the usage of ProcessingTime from the Spark codebase 
> by adding a new class to replace it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase

2019-06-28 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-28199:


 Summary: Remove usage of ProcessingTime in Spark codebase
 Key: SPARK-28199
 URL: https://issues.apache.org/jira/browse/SPARK-28199
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


Even though ProcessingTime was deprecated in 2.2.0, it is still used in the Spark 
codebase, and the alternative Spark proposes itself relies on deprecated methods, 
which is circular - the usage can never be removed.

This issue targets removing the usage of ProcessingTime from the Spark codebase 
by adding a new class to replace it.
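
For context, the user-facing replacement already exists; a sketch of the deprecated style versus the supported one (user code, not the internal refactoring this issue covers):

{code:scala}
import org.apache.spark.sql.streaming.Trigger

// Deprecated since 2.2.0: org.apache.spark.sql.streaming.ProcessingTime("5 seconds")
// Supported public API:
val trigger = Trigger.ProcessingTime("5 seconds")
// e.g. df.writeStream.trigger(trigger)... (df here is a hypothetical streaming DataFrame)
{code}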



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames

2019-06-28 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28198:


 Summary: Add mapPartitionsInPandas to allow an iterator of 
DataFrames
 Key: SPARK-28198
 URL: https://issues.apache.org/jira/browse/SPARK-28198
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It would be 
good to be able to use this without the limitation on length.

This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas 
UDF and the Arrow / Pandas integration in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-06-28 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874703#comment-16874703
 ] 

Saisai Shao commented on SPARK-25299:
-

The vote has passed, so what is our plan for code submission? [~yifeih] [~mcheah]

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28197) Failed to query on external JSon Partitioned table

2019-06-28 Thread zhangbin (JIRA)
zhangbin created SPARK-28197:


 Summary: Failed to query on external JSon Partitioned table
 Key: SPARK-28197
 URL: https://issues.apache.org/jira/browse/SPARK-28197
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.1
Reporter: zhangbin


{noformat}
2019-06-28 13:37:18 WARN TaskSetManager:66 - Lost task 7.0 in stage 5.0 (TID 12, cnbjsjqp-bdp-dn-20, executor 4): java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord
  at org.apache.hive.hcatalog.data.HCatRecordObjectInspector.getStructFieldData(HCatRecordObjectInspector.java:45)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:438)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$2$$anon$1.hasNext(InMemoryRelation.scala:149)
  at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1092)
  at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1083)
  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1018)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1083)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:809)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org