[jira] [Resolved] (SPARK-29095) add extractInstances

2019-09-24 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29095.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25802
[https://github.com/apache/spark/pull/25802]

> add extractInstances
> 
>
> Key: SPARK-29095
> URL: https://issues.apache.org/jira/browse/SPARK-29095
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> There was a method, extractLabeledPoints, for ML algorithms to transform a
> dataset into an RDD of LabeledPoints.
> Now that more and more algorithms support sample weighting, extractLabeledPoints
> is used less, so we should support extracting the weight in the common methods.
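As a rough illustration (not the actual Spark implementation), a shared
extractInstances-style helper could map each row to a (label, weight, features)
triple, defaulting the weight to 1.0 when no weight column is configured. The
Instance case class and the column names below are assumptions for this sketch.

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical container mirroring ML's internal Instance(label, weight, features).
case class Instance(label: Double, weight: Double, features: Vector)

// Minimal sketch: extract (label, weight, features) from a Dataset, using a
// constant weight of 1.0 when no weight column is set.
def extractInstances(
    dataset: Dataset[_],
    labelCol: String = "label",
    weightCol: Option[String] = None,
    featuresCol: String = "features"): RDD[Instance] = {
  val w = weightCol.map(col).getOrElse(lit(1.0))
  dataset.select(col(labelCol).cast("double"), w.cast("double"), col(featuresCol))
    .rdd
    .map { case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
    }
}
{code}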






[jira] [Assigned] (SPARK-29095) add extractInstances

2019-09-24 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29095:
-

Assignee: zhengruifeng

> add extractInstances
> 
>
> Key: SPARK-29095
> URL: https://issues.apache.org/jira/browse/SPARK-29095
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> There was a method, extractLabeledPoints, for ML algorithms to transform a
> dataset into an RDD of LabeledPoints.
> Now that more and more algorithms support sample weighting, extractLabeledPoints
> is used less, so we should support extracting the weight in the common methods.






[jira] [Assigned] (SPARK-26848) Introduce new option to Kafka source - specify timestamp to start and end offset

2019-09-23 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26848:
-

Assignee: Jungtaek Lim

> Introduce new option to Kafka source - specify timestamp to start and end 
> offset
> 
>
> Key: SPARK-26848
> URL: https://issues.apache.org/jira/browse/SPARK-26848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> The Kafka source (for SQL/SS) provides options to set a specific offset per topic
> partition so that the source starts reading from the start offsets and stops
> reading at the end offsets ("startingOffsets" and "endingOffsets" in the document
> below).
> http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
> I'd like to introduce new options, "startingOffsetsByTimestamp" and
> "endingOffsetsByTimestamp", to set a specific timestamp per topic (since we're
> unlikely to set a different value per partition) so that the source starts
> reading from offsets whose timestamp is equal or greater, and stops reading at
> offsets whose timestamp is equal or greater.
> The options would of course be optional, with a precedence rule: the timestamp
> option applies first, and if it is not set, the offset option applies.
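For illustration only, a user-facing call with the proposed options might look
like the sketch below; the JSON value format (topic to timestamp in milliseconds)
and the timestamps are assumptions, not taken from the proposal text.

{code:scala}
// Minimal sketch of a bounded batch read from Kafka using the proposed
// timestamp-based options. Option value format and timestamps are illustrative
// assumptions; "spark" is an existing SparkSession.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topicA")
  .option("startingOffsetsByTimestamp", """{"topicA": 1546300800000}""")
  .option("endingOffsetsByTimestamp", """{"topicA": 1546387200000}""")
  .load()
{code}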






[jira] [Resolved] (SPARK-26848) Introduce new option to Kafka source - specify timestamp to start and end offset

2019-09-23 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26848.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 23747
[https://github.com/apache/spark/pull/23747]

> Introduce new option to Kafka source - specify timestamp to start and end 
> offset
> 
>
> Key: SPARK-26848
> URL: https://issues.apache.org/jira/browse/SPARK-26848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The Kafka source (for SQL/SS) provides options to set a specific offset per topic
> partition so that the source starts reading from the start offsets and stops
> reading at the end offsets ("startingOffsets" and "endingOffsets" in the document
> below).
> http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
> I'd like to introduce new options, "startingOffsetsByTimestamp" and
> "endingOffsetsByTimestamp", to set a specific timestamp per topic (since we're
> unlikely to set a different value per partition) so that the source starts
> reading from offsets whose timestamp is equal or greater, and stops reading at
> offsets whose timestamp is equal or greater.
> The options would of course be optional, with a precedence rule: the timestamp
> option applies first, and if it is not set, the offset option applies.






[jira] [Updated] (SPARK-29053) Sort does not work on some columns

2019-09-23 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-29053:
--
Fix Version/s: 2.4.5

> Sort does not work on some columns
> --
>
> Key: SPARK-29053
> URL: https://issues.apache.org/jira/browse/SPARK-29053
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
> Attachments: Duration_1.png, ExecutionTime_1.png, Sort Icon.png
>
>
> In the Spark Thrift JDBC/ODBC Server application UI, *sorting* does not work for
> the *Duration* and *Execution time* fields.
> *Test Steps*
>  1. Install Spark
>  2. Start Spark Beeline
>  3. Submit some SQL queries
>  4. Close some Spark applications
>  5. Check the JDBC/ODBC Server tab in the Spark Web UI.
> *Issue:*
>  *Sorting [ascending or descending]* by *Duration* and *Execution time*
> does not work correctly in the *JDBC/ODBC Server* tab.
>  The issue is present in the *Session Statistics* and *SQL Statistics* tables;
> please check it.
> Screenshots are attached.
> !Duration_1.png|width=826,height=410!
> !ExecutionTime_1.png|width=823,height=407!
>  






[jira] [Commented] (SPARK-29204) Remove `Spark Release` Jenkins tab and its four jobs

2019-09-22 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935432#comment-16935432
 ] 

Sean Owen commented on SPARK-29204:
---

Is this just a matter of deleting the view? I have permissions to delete it. 
I'm OK with doing so.

> Remove `Spark Release` Jenkins tab and its four jobs
> 
>
> Key: SPARK-29204
> URL: https://issues.apache.org/jira/browse/SPARK-29204
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Spark Release Jobs.png
>
>
> We have not used the `Spark Release` Jenkins jobs for the last two years. Although
> we have kept them until now, they have become outdated because we are using the
> Docker `spark-rm` image.
>  !Spark Release Jobs.png! 
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Release/
> We should remove them.






[jira] [Resolved] (SPARK-29121) Support Dot Product for Vectors

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29121.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25818
[https://github.com/apache/spark/pull/25818]

> Support Dot Product for Vectors
> ---
>
> Key: SPARK-29121
> URL: https://issues.apache.org/jira/browse/SPARK-29121
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.5, 2.4.5, 3.0.0
>Reporter: Patrick Pisciuneri
>Assignee: Patrick Pisciuneri
>Priority: Major
> Fix For: 3.0.0
>
>
> I believe *org.apache.spark.ml.linalg.Vectors* and
> *org.apache.spark.mllib.linalg.Vectors* should support the dot product. The
> necessary BLAS routines are already there; only a simple wrapper is needed.
> I know there has been a lot of discussion about how much of a linear algebra
> package Spark should attempt to be, but I have found that the dot product
> comes up quite a bit in feature engineering and scoring. In the past we
> have created our own *org.apache.spark.ml.linalg* package to expose the
> private methods, but it's an annoying hack.
> See also:
> SPARK-6442
> SPARK-10989
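As context for the "simple wrapper" point, a user-side sketch of a dot product
over the public Vector API (not the BLAS-backed method added by this issue)
could look like this:

{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

// Minimal user-side sketch; inside Spark, the private BLAS routines would be
// the preferred implementation.
def dot(x: Vector, y: Vector): Double = {
  require(x.size == y.size, s"Vector sizes differ: ${x.size} vs ${y.size}")
  (x, y) match {
    case (sx: SparseVector, _) =>
      // Iterate only over the stored entries of the sparse vector.
      sx.indices.zip(sx.values).map { case (i, v) => v * y(i) }.sum
    case (_, _: SparseVector) =>
      dot(y, x)
    case (dx: DenseVector, dy: DenseVector) =>
      dx.values.zip(dy.values).map { case (a, b) => a * b }.sum
  }
}
{code}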






[jira] [Assigned] (SPARK-29121) Support Dot Product for Vectors

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29121:
-

Assignee: Patrick Pisciuneri

> Support Dot Product for Vectors
> ---
>
> Key: SPARK-29121
> URL: https://issues.apache.org/jira/browse/SPARK-29121
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.5, 2.4.5, 3.0.0
>Reporter: Patrick Pisciuneri
>Assignee: Patrick Pisciuneri
>Priority: Major
>
> I believe *org.apache.spark.ml.linalg.Vectors* and
> *org.apache.spark.mllib.linalg.Vectors* should support the dot product. The
> necessary BLAS routines are already there; only a simple wrapper is needed.
> I know there has been a lot of discussion about how much of a linear algebra
> package Spark should attempt to be, but I have found that the dot product
> comes up quite a bit in feature engineering and scoring. In the past we
> have created our own *org.apache.spark.ml.linalg* package to expose the
> private methods, but it's an annoying hack.
> See also:
> SPARK-6442
> SPARK-10989






[jira] [Resolved] (SPARK-29053) Sort does not work on some columns

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29053.
---
Fix Version/s: 3.0.0
 Assignee: Aman Omer
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/25855

> Sort does not work on some columns
> --
>
> Key: SPARK-29053
> URL: https://issues.apache.org/jira/browse/SPARK-29053
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Assignee: Aman Omer
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: Duration_1.png, ExecutionTime_1.png, Sort Icon.png
>
>
> In the Spark Thrift JDBC/ODBC Server application UI, *sorting* does not work for
> the *Duration* and *Execution time* fields.
> *Test Steps*
>  1. Install Spark
>  2. Start Spark Beeline
>  3. Submit some SQL queries
>  4. Close some Spark applications
>  5. Check the JDBC/ODBC Server tab in the Spark Web UI.
> *Issue:*
>  *Sorting [ascending or descending]* by *Duration* and *Execution time*
> does not work correctly in the *JDBC/ODBC Server* tab.
>  The issue is present in the *Session Statistics* and *SQL Statistics* tables;
> please check it.
> Screenshots are attached.
> !Duration_1.png|width=826,height=410!
> !ExecutionTime_1.png|width=823,height=407!
>  






[jira] [Updated] (SPARK-19147) netty throw NPE

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19147:
--
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> netty throw NPE
> ---
>
> Key: SPARK-19147
> URL: https://issues.apache.org/jira/browse/SPARK-19147
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Colin Ma
>Priority: Minor
>  Labels: bulk-closed
> Fix For: 3.0.0
>
>
> {code}
> 17/01/10 19:17:20 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
> from bigdata-hdp-apache1828.xg01.diditaxi.com:7337
> java.lang.NullPointerException: group
>   at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:203)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:354)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Resolved] (SPARK-19147) netty throw NPE

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19147.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25759
[https://github.com/apache/spark/pull/25759]

> netty throw NPE
> ---
>
> Key: SPARK-19147
> URL: https://issues.apache.org/jira/browse/SPARK-19147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Colin Ma
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.0.0
>
>
> {code}
> 17/01/10 19:17:20 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
> from bigdata-hdp-apache1828.xg01.diditaxi.com:7337
> java.lang.NullPointerException: group
>   at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:203)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:354)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> 

[jira] [Assigned] (SPARK-19147) netty throw NPE

2019-09-21 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19147:
-

Assignee: Colin Ma

> netty throw NPE
> ---
>
> Key: SPARK-19147
> URL: https://issues.apache.org/jira/browse/SPARK-19147
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Assignee: Colin Ma
>Priority: Major
>  Labels: bulk-closed
>
> {code}
> 17/01/10 19:17:20 ERROR ShuffleBlockFetcherIterator: Failed to get block(s) 
> from bigdata-hdp-apache1828.xg01.diditaxi.com:7337
> java.lang.NullPointerException: group
>   at io.netty.bootstrap.AbstractBootstrap.group(AbstractBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:203)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:354)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at 

[jira] [Assigned] (SPARK-29144) Binarizer handle sparse vectors incorrectly with negative threshold

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29144:
-

Assignee: zhengruifeng

> Binarizer handle sparse vectors incorrectly with negative threshold
> ---
>
> Key: SPARK-29144
> URL: https://issues.apache.org/jira/browse/SPARK-29144
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> The processing of sparse vectors is wrong if threshold < 0:
> {code:java}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
> Vectors.dense(Array(0.0, 0.5, 0.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
> (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new 
> Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-------------+-----------------+
> | id|      feature|binarized_feature|
> +---+-------------+-----------------+
> |  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
> +---+-------------+-----------------+
> {code}
> The expected outputs for the above two input vectors should be the same.
>  
> To deal with sparse vectors with threshold < 0, we have two options:
> 1) return 1 for non-active items, but this converts sparse vectors to
> dense ones;
> 2) throw an exception, like Scikit-Learn's
> [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html]
>  does:
> {code:java}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     binarizer.transform(a)
>   File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1874, in transform
>     return binarize(X, threshold=self.threshold, copy=copy)
>   File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1774, in binarize
>     raise ValueError('Cannot binarize a sparse matrix with threshold '
> ValueError: Cannot binarize a sparse matrix with threshold < 0
> {code}
>  
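A minimal sketch of option 2 for a single vector (rejecting a negative threshold
for sparse input instead of densifying it); this is a standalone helper written
for illustration, not the actual Binarizer change:

{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Dense vectors are binarized for any threshold; sparse vectors are rejected
// when threshold < 0, because every implicit zero would become 1.0 and the
// result would effectively be dense.
def binarize(v: Vector, threshold: Double): Vector = v match {
  case dv: DenseVector =>
    Vectors.dense(dv.values.map(x => if (x > threshold) 1.0 else 0.0))
  case sv: SparseVector =>
    require(threshold >= 0,
      s"Cannot binarize a sparse vector with threshold < 0 (got $threshold)")
    // With a non-negative threshold, implicit zeros stay zero, so sparsity is kept.
    val (idx, vals) = sv.indices.zip(sv.values)
      .filter { case (_, x) => x > threshold }
      .map { case (i, _) => (i, 1.0) }
      .unzip
    Vectors.sparse(sv.size, idx, vals)
}
{code}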






[jira] [Resolved] (SPARK-29144) Binarizer handle sparse vectors incorrectly with negative threshold

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29144.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25829
[https://github.com/apache/spark/pull/25829]

> Binarizer handle sparse vectors incorrectly with negative threshold
> ---
>
> Key: SPARK-29144
> URL: https://issues.apache.org/jira/browse/SPARK-29144
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> The processing of sparse vectors is wrong if threshold < 0:
> {code:java}
> scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, 
> Vectors.dense(Array(0.0, 0.5, 0.0))))
> data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), 
> (1,[0.0,0.5,0.0]))
> scala> val df = data.toDF("id", "feature")
> df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
> scala> val binarizer: Binarizer = new 
> Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
> binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
> scala> binarizer.transform(df).show()
> +---+-------------+-----------------+
> | id|      feature|binarized_feature|
> +---+-------------+-----------------+
> |  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
> |  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
> +---+-------------+-----------------+
> {code}
> The expected outputs for the above two input vectors should be the same.
>  
> To deal with sparse vectors with threshold < 0, we have two options:
> 1) return 1 for non-active items, but this converts sparse vectors to
> dense ones;
> 2) throw an exception, like Scikit-Learn's
> [Binarizer|https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html]
>  does:
> {code:java}
> import numpy as np
> from scipy.sparse import csr_matrix
> from sklearn.preprocessing import Binarizer
> row = np.array([0, 0, 1, 2, 2, 2])
> col = np.array([0, 2, 2, 0, 1, 2])
> data = np.array([1, 2, 3, 4, 5, 6])
> a = csr_matrix((data, (row, col)), shape=(3, 3))
> binarizer = Binarizer(threshold=-1.0)
> binarizer.transform(a)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     binarizer.transform(a)
>   File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1874, in transform
>     return binarize(X, threshold=self.threshold, copy=copy)
>   File 
> "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py",
>  line 1774, in binarize
>     raise ValueError('Cannot binarize a sparse matrix with threshold '
> ValueError: Cannot binarize a sparse matrix with threshold < 0
> {code}
>  






[jira] [Commented] (SPARK-28772) Upgrade breeze to 1.0

2019-09-20 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934806#comment-16934806
 ] 

Sean Owen commented on SPARK-28772:
---

Wait, no, I've made a mistake, misread this. We do have to do this for 2.13.

> Upgrade breeze to 1.0
> -
>
> Key: SPARK-28772
> URL: https://issues.apache.org/jira/browse/SPARK-28772
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest release is 1.0, which is cross-built against Scala 2.11, 2.12, and 
> 2.13.
> [https://github.com/scalanlp/breeze/releases/tag/releases%2Fv1.0]
> [https://mvnrepository.com/artifact/org.scalanlp/breeze_2.13/1.0]
>  
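For reference, the equivalent sbt coordinate for the cross-built release (Spark
itself pins the version through its Maven build, so this line is only an
illustration):

{code:scala}
// Resolves to breeze_2.11 / breeze_2.12 / breeze_2.13 depending on scalaVersion.
libraryDependencies += "org.scalanlp" %% "breeze" % "1.0"
{code}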






[jira] [Updated] (SPARK-28772) Upgrade breeze to 1.0

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28772:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Task)

> Upgrade breeze to 1.0
> -
>
> Key: SPARK-28772
> URL: https://issues.apache.org/jira/browse/SPARK-28772
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest release is 1.0, which is cross-built against Scala 2.11, 2.12, and 
> 2.13.
> [https://github.com/scalanlp/breeze/releases/tag/releases%2Fv1.0]
> [https://mvnrepository.com/artifact/org.scalanlp/breeze_2.13/1.0]
>  






[jira] [Updated] (SPARK-28772) Upgrade breeze to 1.0

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28772:
--
Parent: (was: SPARK-25075)
Issue Type: Task  (was: Sub-task)

> Upgrade breeze to 1.0
> -
>
> Key: SPARK-28772
> URL: https://issues.apache.org/jira/browse/SPARK-28772
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest release is 1.0, which is cross-built against Scala 2.11, 2.12, and 
> 2.13.
> [https://github.com/scalanlp/breeze/releases/tag/releases%2Fv1.0]
> [https://mvnrepository.com/artifact/org.scalanlp/breeze_2.13/1.0]
>  






[jira] [Commented] (SPARK-28772) Upgrade breeze to 1.0

2019-09-20 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934804#comment-16934804
 ] 

Sean Owen commented on SPARK-28772:
---

We should do this, and I'll open a PR, but it is not required for Scala 2.13.

> Upgrade breeze to 1.0
> -
>
> Key: SPARK-28772
> URL: https://issues.apache.org/jira/browse/SPARK-28772
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest release is 1.0, which is cross-built against Scala 2.11, 2.12, and 
> 2.13.
> [https://github.com/scalanlp/breeze/releases/tag/releases%2Fv1.0]
> [https://mvnrepository.com/artifact/org.scalanlp/breeze_2.13/1.0]
>  






[jira] [Commented] (SPARK-29129) Test failure: org.apache.spark.sql.hive.JavaDataFrameSuite (hadoop-2.7/JDK 11 combination)

2019-09-20 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934581#comment-16934581
 ] 

Sean Owen commented on SPARK-29129:
---

Oh, heh, we have a Hadoop 2 + JDK 11 build? I haven't been paying attention. 
That shouldn't necessarily work, right? I thought Hadoop 3 was required? In any 
event, it is at least going to be strongly recommended. So I don't particularly 
care about Hadoop 2 + JDK 11, yes.

> Test failure: org.apache.spark.sql.hive.JavaDataFrameSuite (hadoop-2.7/JDK 11 
> combination)
> --
>
> Key: SPARK-29129
> URL: https://issues.apache.org/jira/browse/SPARK-29129
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Some of the tests in org.apache.spark.sql.hive.JavaDataFrameSuite are failing 
> intermittently in CI builds.
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1564/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1563/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1562/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1559/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1558/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1557/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1541/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1540/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1539/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1538/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing/1537/testReport/]
>  






[jira] [Updated] (SPARK-29082) Spark driver cannot start with only delegation tokens

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-29082:
--
Fix Version/s: (was: 3.0.0)

> Spark driver cannot start with only delegation tokens
> -
>
> Key: SPARK-29082
> URL: https://issues.apache.org/jira/browse/SPARK-29082
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> If you start a Spark application with just delegation tokens, it fails. For 
> example, from an Oozie launch, you see things like this (line numbers may be 
> different):
> {noformat}
> No child hadoop job is executed.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: hrt_qa javax.security.auth.login.LoginException: Unable 
> to obtain password from user
> at 
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at 
> org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:616)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.doLogin(HadoopDelegationTokenManager.scala:276)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:140)
> at 
> org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:305)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1057)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1178)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1584)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
> {noformat}






[jira] [Reopened] (SPARK-29082) Spark driver cannot start with only delegation tokens

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-29082:
---

Reopened because we had to revert it for now

> Spark driver cannot start with only delegation tokens
> -
>
> Key: SPARK-29082
> URL: https://issues.apache.org/jira/browse/SPARK-29082
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> If you start a Spark application with just delegation tokens, it fails. For 
> example, from an Oozie launch, you see things like this (line numbers may be 
> different):
> {noformat}
> No child hadoop job is executed.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: hrt_qa javax.security.auth.login.LoginException: Unable 
> to obtain password from user
> at 
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at 
> org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:616)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.doLogin(HadoopDelegationTokenManager.scala:276)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:140)
> at 
> org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:305)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1057)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1178)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1584)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
> {noformat}






[jira] [Resolved] (SPARK-26338) Use scala-xml explicitly

2019-09-20 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26338.
---
Resolution: Invalid

Reopen if there is any detail here

> Use scala-xml explicitly
> 
>
> Key: SPARK-26338
> URL: https://issues.apache.org/jira/browse/SPARK-26338
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Darcy Shen
>Priority: Minor
>







[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2019-09-19 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933973#comment-16933973
 ] 

Sean Owen commented on SPARK-28900:
---

The meaning of JIRAs and umbrellas is already kind of nebulous, and I prefer 
smaller, actionable milestones. If the umbrella is about finishing the changes 
that are necessary to get a passing build on JDK 11, including Pyspark, then I 
think it's done, and I'd be fine to close it. We may have follow-on issues, but 
that is always true. And we should keep this open as we do need a Jenkins job 
to test what you are testing. But the umbrella is fine to close as far as I 
know.

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds look like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   fi
>   exit 1
> fi
> {code}
> The PR builder (one of them at least) looks like:
> {code}
> #!/bin/bash
> set -e  # fail on any non-zero exit code
> set -x
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> 

[jira] [Resolved] (SPARK-29172) Fix some exception issue of explain commands

2019-09-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29172.
---
Resolution: Not A Problem

> Fix some exception issue of explain commands
> 
>
> Key: SPARK-29172
> URL: https://issues.apache.org/jira/browse/SPARK-29172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: cost.png, extemded.png
>
>
> The behavior of running commands during exception handling differs depending
> on the explain command.
> See the attachments.






[jira] [Commented] (SPARK-28900) Test Pyspark, SparkR on JDK 11 with run-tests

2019-09-19 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933892#comment-16933892
 ] 

Sean Owen commented on SPARK-28900:
---

Meh, I think it's kind of important, in order to declare JDK 11 support complete,
to fully test Spark against JDK 11. However, right now it does seem to work when
running Pyspark tests, and we'd re-test before a preview release of 3.0, etc. I
am therefore kind of neutral about whether this is essential to call this 'done'.

> Test Pyspark, SparkR on JDK 11 with run-tests
> -
>
> Key: SPARK-28900
> URL: https://issues.apache.org/jira/browse/SPARK-28900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Priority: Major
>
> Right now, we are testing JDK 11 with a Maven-based build, as in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/
> It looks like _all_ of the Maven-based jobs 'manually' build and invoke 
> tests, and only run tests via Maven -- that is, they do not run Pyspark or 
> SparkR tests. The SBT-based builds do, because they use the {{dev/run-tests}} 
> script that is meant for this purpose.
> In fact, there seem to be a couple flavors of copy-pasted build configs. SBT 
> builds look like:
> {code}
> #!/bin/bash
> set -e
> # Configure per-build-executor Ivy caches to avoid SBT Ivy lock contention
> export HOME="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER"
> mkdir -p "$HOME"
> export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
> export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> git clean -fdx
> ./dev/run-tests
> {code}
> Maven builds look like:
> {code}
> #!/bin/bash
> set -x
> set -e
> rm -rf ./work
> git clean -fdx
> # Generate random point for Zinc
> export ZINC_PORT
> ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")
> # Use per-build-executor Ivy caches to avoid SBT Ivy lock contention:
> export 
> SPARK_VERSIONS_SUITE_IVY_PATH="/home/sparkivy/per-executor-caches/$EXECUTOR_NUMBER/.ivy2"
> mkdir -p "$SPARK_VERSIONS_SUITE_IVY_PATH"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental 
> compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the 
> flaky download step.
> export 
> PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> MVN="build/mvn -DzincPort=$ZINC_PORT"
> set +e
> if [[ $HADOOP_PROFILE == hadoop-1 ]]; then
> # Note that there is no -Pyarn flag here for Hadoop 1:
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Dhadoop.version="$HADOOP_VERSION" \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> else
> $MVN \
> -DskipTests \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> clean package
> retcode1=$?
> $MVN \
> -P"$HADOOP_PROFILE" \
> -Pyarn \
> -Phive \
> -Phive-thriftserver \
> -Pkinesis-asl \
> -Pmesos \
> --fail-at-end \
> test
> retcode2=$?
> fi
> if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
>   if [[ $retcode1 -ne 0 ]]; then
> echo "Packaging Spark with Maven failed"
>   fi
>   if [[ $retcode2 -ne 0 ]]; then
> echo "Testing Spark with Maven failed"
>   fi
>   exit 1
> fi
> {code}
> The PR builder (one of them at least) looks like:
> {code}
> #!/bin/bash
> set -e  # fail on any non-zero exit code
> set -x
> export AMPLAB_JENKINS=1
> export PATH="$PATH:/home/anaconda/envs/py3k/bin"
> # Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental compiler seems to
> # ignore our JAVA_HOME and use the system javac instead.
> export PATH="$JAVA_HOME/bin:$PATH"
> # Add a pre-downloaded version of Maven to the path so that we avoid the flaky download step.
> export PATH="/home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.3.9/bin/:$PATH"
> echo "fixing target dir permissions"
> chmod -R +w target/* || true  # stupid hack by sknapp to ensure that the 
> chmod 

[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.4

2019-09-19 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933885#comment-16933885
 ] 

Sean Owen commented on SPARK-29183:
---

I fully support Shane doing all the work to get this done!

> Upgrade JDK 11 Installation to 11.0.4
> -
>
> Key: SPARK-29183
> URL: https://issues.apache.org/jira/browse/SPARK-29183
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Every JDK 11.0.x release has many fixes, including performance regression 
> fixes. We had better upgrade to the latest, 11.0.4.
> - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2019-09-19 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933700#comment-16933700
 ] 

Sean Owen commented on SPARK-24417:
---

I don't care at all, myself, as only a minor contributor. The JIRA already 
shows lots of people contributed.

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict

2019-09-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28985.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25776
[https://github.com/apache/spark/pull/25776]

> Pyspark ClassificationModel and RegressionModel support column 
> setters/getters/predict
> --
>
> Key: SPARK-28985
> URL: https://issues.apache.org/jira/browse/SPARK-28985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> 1, add common abstract classes like JavaClassificationModel & 
> JavaProbabilisticClassificationModel
> 2, add column setters/getters, and predict method
> 3, update the test suites to verify newly added functions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28985) Pyspark ClassificationModel and RegressionModel support column setters/getters/predict

2019-09-19 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28985:
-

Assignee: Huaxin Gao

> Pyspark ClassificationModel and RegressionModel support column 
> setters/getters/predict
> --
>
> Key: SPARK-28985
> URL: https://issues.apache.org/jira/browse/SPARK-28985
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>
> 1, add common abstract classes like JavaClassificationModel & 
> JavaProbabilisticClassificationModel
> 2, add column setters/getters, and predict method
> 3, update the test suites to verify newly added functions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29118.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25815
[https://github.com/apache/spark/pull/25815]

> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> In SPARK-27944, the computation for output columns with empty name is skipped.
> Now, I find that we can further optimize:
> 1, GMM.transform by directly obtaining the prediction(double) from its 
> probability prediction(vector), like what ProbabilisticClassificationModel and 
> ClassificationModel do.
> 2, GLR.transform by obtaining the prediction(double) from its link 
> prediction(double)
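
For illustration, a minimal sketch of point 1 (this is not the actual GaussianMixtureModel code; it only assumes the public org.apache.spark.ml.linalg.Vector API): once the probability vector for a row is known, the hard prediction is just its argmax, so the per-cluster densities do not need to be recomputed.

{code:scala}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch only: derive the prediction column from the probability column
// instead of recomputing the per-cluster densities a second time.
def probabilityToPrediction(probability: Vector): Double =
  probability.argmax.toDouble

val probability = Vectors.dense(0.1, 0.7, 0.2)
val prediction = probabilityToPrediction(probability)  // 1.0, the index of the largest entry
{code}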



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29118:
-

Assignee: zhengruifeng

> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> In SPARK-27944, the computation for output columns with empty name is skipped.
> Now, I find that we can further optimize:
> 1, GMM.transform by directly obtaining the prediction(double) from its 
> probability prediction(vector), like what ProbabilisticClassificationModel and 
> ClassificationModel do.
> 2, GLR.transform by obtaining the prediction(double) from its link 
> prediction(double)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28927) Improve error for ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28927.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25789
[https://github.com/apache/spark/pull/25789]

> Improve error for ArrayIndexOutOfBoundsException and Not-stable AUC metrics 
> in ALS for datasets with 12 billion instances
> -
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset capacity: ~12 billion ratings
>  Here is our code:
> {code:java}
> val hivedata = sc.sql(sqltext).select("id", "dpid", "score", "tag")
> .repartition(6000).persist(StorageLevel.MEMORY_AND_DISK_SER)
> val zeroValueArrItem = ArrayBuffer[(String, 

[jira] [Updated] (SPARK-28927) Improve error for ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets with 12 billion instances

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28927:
--
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)
   Summary: Improve error for ArrayIndexOutOfBoundsException and Not-stable 
AUC metrics in ALS for datasets with 12 billion instances  (was: 
ArrayIndexOutOfBoundsException and Not-stable AUC metrics in ALS for datasets 
with 12 billion instances)

> Improve error for ArrayIndexOutOfBoundsException and Not-stable AUC metrics 
> in ALS for datasets with 12 billion instances
> -
>
> Key: SPARK-28927
> URL: https://issues.apache.org/jira/browse/SPARK-28927
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Qiang Wang
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Attachments: image-2019-09-02-11-55-33-596.png
>
>
> The stack trace is below:
> {quote}19/08/28 07:00:40 WARN Executor task launch worker for task 325074 
> BlockManager: Block rdd_10916_493 could not be removed as it was not found on 
> disk or in memory 19/08/28 07:00:41 ERROR Executor task launch worker for 
> task 325074 Executor: Exception in task 3.0 in stage 347.1 (TID 325074) 
> java.lang.ArrayIndexOutOfBoundsException: 6741 at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1460)
>  at 
> org.apache.spark.dpshade.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1440)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$40$$anonfun$apply$41.apply(PairRDDFunctions.scala:760)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1041)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1032)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:972) at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1032) 
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:763) 
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285) at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:141)
>  at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:137)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:137) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> This exception happened sometimes. We also found that the AUC metric was 
> not stable when evaluating the inner product of the user factors and the item 
> factors with the same dataset and configuration: AUC varied from 0.60 to 0.67, 
> which is not stable enough for a production environment. 
> Dataset capacity: ~12 billion ratings
>  Here is our code:
> {code:java}
> val hivedata = 

[jira] [Resolved] (SPARK-28842) Cleanup the formatting/trailing spaces in resource-managers/kubernetes/integration-tests/README.md

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28842.
---
Resolution: Not A Problem

> Cleanup the formatting/trailing spaces in 
> resource-managers/kubernetes/integration-tests/README.md
> --
>
> Key: SPARK-28842
> URL: https://issues.apache.org/jira/browse/SPARK-28842
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> The K8s integration testing guide currently has a bunch of trailing spaces on 
> lines which we could cleanup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28972) [Spark] spark.memory.offHeap.size description require to update in document

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28972.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25689
[https://github.com/apache/spark/pull/25689]

> [Spark] spark.memory.offHeap.size description require to update in document
> ---
>
> Key: SPARK-28972
> URL: https://issues.apache.org/jira/browse/SPARK-28972
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>
>  
> spark.memory.offHeap.size also accepts values such as 1G or 1KB, so the user can 
> give a size suffix, but the description only says *'absolute amount of memory in bytes'.*
> This should be updated like *spark.driver.memory*, which is documented as 
> accepting *a size unit suffix ("k", "m", "g" or "t") (e.g. {{512m}}, {{2g}}).* 
>  
> |{{spark.memory.offHeap.size}}|0|The *absolute amount of memory in bytes* 
> which can be used for off-heap allocation. This setting has no impact on heap 
> memory usage, so if your executors' total memory consumption must fit within 
> some hard limit then be sure to shrink your JVM heap size accordingly. This 
> must be set to a positive value when {{spark.memory.offHeap.enabled=true}}.|
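
For illustration, a minimal sketch of the behaviour the updated description should reflect (the exact doc wording is in the linked PR; the snippet only assumes the public SparkSession config API): both a suffixed size and a raw byte count are accepted.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: spark.memory.offHeap.size accepts a size suffix, not just bytes.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("offheap-size-example")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "2g")            // suffixed form
  // .config("spark.memory.offHeap.size", "2147483648") // equivalent raw byte count
  .getOrCreate()
{code}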



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28972) [Spark] spark.memory.offHeap.size description require to update in document

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28972:
--
Issue Type: Improvement  (was: Bug)

> [Spark] spark.memory.offHeap.size description require to update in document
> ---
>
> Key: SPARK-28972
> URL: https://issues.apache.org/jira/browse/SPARK-28972
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
>
>  
> spark.memory.offHeap.size also accepts values such as 1G or 1KB, so the user can 
> give a size suffix, but the description only says *'absolute amount of memory in bytes'.*
> This should be updated like *spark.driver.memory*, which is documented as 
> accepting *a size unit suffix ("k", "m", "g" or "t") (e.g. {{512m}}, {{2g}}).* 
>  
> |{{spark.memory.offHeap.size}}|0|The *absolute amount of memory in bytes* 
> which can be used for off-heap allocation. This setting has no impact on heap 
> memory usage, so if your executors' total memory consumption must fit within 
> some hard limit then be sure to shrink your JVM heap size accordingly. This 
> must be set to a positive value when {{spark.memory.offHeap.enabled=true}}.|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28972) [Spark] spark.memory.offHeap.size description require to update in document

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28972:
-

Assignee: pavithra ramachandran

> [Spark] spark.memory.offHeap.size description require to update in document
> ---
>
> Key: SPARK-28972
> URL: https://issues.apache.org/jira/browse/SPARK-28972
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
>
>  
> spark.memory.offHeap.size also accepts values such as 1G or 1KB, so the user can 
> give a size suffix, but the description only says *'absolute amount of memory in bytes'.*
> This should be updated like *spark.driver.memory*, which is documented as 
> accepting *a size unit suffix ("k", "m", "g" or "t") (e.g. {{512m}}, {{2g}}).* 
>  
> |{{spark.memory.offHeap.size}}|0|The *absolute amount of memory in bytes* 
> which can be used for off-heap allocation. This setting has no impact on heap 
> memory usage, so if your executors' total memory consumption must fit within 
> some hard limit then be sure to shrink your JVM heap size accordingly. This 
> must be set to a positive value when {{spark.memory.offHeap.enabled=true}}.|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28799) Document TRUNCATE TABLE in SQL Reference.

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28799:
-

Assignee: pavithra ramachandran

> Document TRUNCATE TABLE in SQL Reference.
> -
>
> Key: SPARK-28799
> URL: https://issues.apache.org/jira/browse/SPARK-28799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28799) Document TRUNCATE TABLE in SQL Reference.

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28799:
--
Priority: Minor  (was: Major)

> Document TRUNCATE TABLE in SQL Reference.
> -
>
> Key: SPARK-28799
> URL: https://issues.apache.org/jira/browse/SPARK-28799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28799) Document TRUNCATE TABLE in SQL Reference.

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28799.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25557
[https://github.com/apache/spark/pull/25557]

> Document TRUNCATE TABLE in SQL Reference.
> -
>
> Key: SPARK-28799
> URL: https://issues.apache.org/jira/browse/SPARK-28799
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29051) Spark Application UI search is not working for some fields

2019-09-17 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931530#comment-16931530
 ] 

Sean Owen commented on SPARK-29051:
---

I don't know, I think it's because the text is formatted in javascript? I am 
not sure of the best way to fix it. The original change was to improve 
performance.

> Spark Application UI search is not working for some fields
> --
>
> Key: SPARK-29051
> URL: https://issues.apache.org/jira/browse/SPARK-29051
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3, 2.4.4
>Reporter: jobit mathew
>Priority: Minor
> Attachments: Duration Search.png, Duration Search1.png, Search 
> Missing.png, Search Missing.png
>
>
> Spark Application UI *Search is not working* for some fields in *Spark Web UI 
> Executors TAB* and Spark job History Server page
> *Test Steps*
>  1.Install spark
>  2.Start Spark SQL/Shell/beeline
>  3.Submit some SQL queries 
>  4.Close some spark applications
>  5.Check the Spark Web UI Executors TAB and verify search
>  6.Check Spark job History Server page and verify search
> *Issue 1*
> Searching some field contents does not work in the *Spark Web UI Executors 
> TAB* (Spark SQL/Shell/JDBC server UIs).
> • *Input column*: search works incorrectly. For example, if the value is 34.5KB, 
> searching for 34.5 finds nothing, but searching for 345 returns the result, which is wrong.
> • Task time search is OK, but *GC time* search is not working.
> • *Thread Dump*: search is not working [need to confirm whether this should be 
> searchable, but since stdout text is searchable, Thread Dump text should be too].
> • *Storage memory*: for example, 384.1 is not found by search.
> !Search Missing.png!
> *Issue 2:*
> On the *Spark job History Server page*, for completed tasks, search is not working 
> on *Duration column values*. We get the proper search result if we search 
> content from any column except Duration. *For example, if Duration is 6.1 min*, 
> we cannot find a result for 6.1 min or even 6.1.
> !Duration Search.png!
>   !Duration Search1.png!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28929) Spark Logging level should be INFO instead of Debug in Executor Plugin API[SPARK-24918]

2019-09-17 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28929:
--
Priority: Trivial  (was: Minor)

> Spark Logging level should be INFO instead of Debug in Executor Plugin 
> API[SPARK-24918]
> ---
>
> Key: SPARK-28929
> URL: https://issues.apache.org/jira/browse/SPARK-28929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.2, 2.4.3
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Spark Logging level should be INFO instead of Debug in Executor Plugin 
> API[SPARK-24918].
> Currently logging level for Executor Plugin API[SPARK-24918] is DEBUG
> logDebug(s"Initializing the following plugins: $\{pluginNames.mkString(", 
> ")}")
> logDebug(s"Successfully loaded plugin " + 
> plugin.getClass().getCanonicalName())
> logDebug("Finished initializing plugins")
> It is better to change these to INFO instead of DEBUG.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28264) Revisiting Python / pandas UDF

2019-09-17 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28264:
--
Priority: Critical  (was: Major)

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
>  
> See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26565) modify dev/create-release/release-build.sh to let jenkins build packages w/o publishing

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26565.
---
Resolution: Not A Problem

> modify dev/create-release/release-build.sh to let jenkins build packages w/o 
> publishing
> ---
>
> Key: SPARK-26565
> URL: https://issues.apache.org/jira/browse/SPARK-26565
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: fine.png, no-idea.jpg
>
>
> about a year+ ago, we stopped publishing releases directly from jenkins...
> this means that the spark-\{branch}-packaging builds are failing due to gpg 
> signing failures, and i would like to update these builds to *just* perform 
> packaging.
> example:
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2183/console]
> i propose to change dev/create-release/release-build.sh...
> when the script is called w/the 'package' option, add an {{if}} statement to 
> skip the following sections when run on jenkins:
> 1) gpg signing of the source tarball (lines 184-187)
> 2) gpg signing of the sparkR dist (lines 243-248)
> 3) gpg signing of the python dist (lines 256-261)
> 4) gpg signing of the regular binary dist (lines 264-271)
> 5) the svn push of the signed dists (lines 317-332)
>  
> -another, and probably much better option, is to nuke the 
> spark-\{branch}-packaging builds and create new ones that just build things 
> w/o touching this incredibly fragile shell scripting nightmare.-



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22381) Add StringParam that supports valid options

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22381.
---
Resolution: Won't Fix

> Add StringParam that supports valid options
> ---
>
> Key: SPARK-22381
> URL: https://issues.apache.org/jira/browse/SPARK-22381
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> During testing with https://issues.apache.org/jira/browse/SPARK-22331, I found 
> it might be a good idea to include the possible options in a StringParam.
> A StringParam extends Param[String] and allows the user to specify the valid 
> options in an Array[String] (case-insensitive).
> So far it can help achieve three goals:
> 1. Make the StringParam aware of its possible options and support native 
> validations.
> 2. StringParam can list the supported options when the user inputs a wrong value.
> 3. Allow automatic unit test coverage for case-insensitive String params.
> And IMO it also decreases code redundancy.
> The StringParam is designed to be completely compatible with existing 
> Param[String], just adding the extra logic for supporting options, which 
> means we don't need to convert all Param[String] to StringParam until we feel 
> comfortable to do that.
> 
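
For illustration, a minimal sketch of the idea built only on the existing public Param API (this is not the API proposed in the JIRA, just an approximation of it):

{code:scala}
import org.apache.spark.ml.param.{Param, Params}

// Sketch only: a string Param that knows its supported options, validates
// case-insensitively, and lists the options in its doc string.
class StringOptionParam(
    parent: Params,
    name: String,
    doc: String,
    val options: Array[String])
  extends Param[String](
    parent, name, s"$doc Supported options: ${options.mkString(", ")}",
    (value: String) => options.exists(_.equalsIgnoreCase(value)))
{code}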



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10408) Autoencoder

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10408.
---
Resolution: Won't Fix

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Major
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22111) OnlineLDAOptimizer should filter out empty documents beforehand

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22111.
---
Resolution: Won't Fix

> OnlineLDAOptimizer should filter out empty documents beforehand 
> 
>
> Key: SPARK-22111
> URL: https://issues.apache.org/jira/browse/SPARK-22111
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Minor
>
> OnlineLDAOptimizer should filter out empty documents beforehand in order to 
> make corpusSize, batchSize, and nonEmptyDocsN all refer to the same filtered 
> corpus with all non-empty docs. 
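
For illustration, a minimal sketch of the filtering step (assuming the usual RDD[(Long, Vector)] corpus layout used by the LDA optimizers; the names are illustrative, not the actual OnlineLDAOptimizer code):

{code:scala}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch only: drop empty documents once, up front, so that corpusSize,
// batchSize and nonEmptyDocsN are all computed over the same filtered corpus.
def filterEmptyDocs(docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  docs.filter { case (_, termCounts) => termCounts.numNonzeros > 0 }
{code}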



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24806) Brush up generated code so that JDK Java compilers can handle it

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24806.
---
Resolution: Won't Fix

> Brush up generated code so that JDK Java compilers can handle it
> 
>
> Key: SPARK-24806
> URL: https://issues.apache.org/jira/browse/SPARK-24806
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23694) The staging directory should under hive.exec.stagingdir if we set hive.exec.stagingdir but not under the table directory

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23694.
---
Resolution: Won't Fix

> The staging directory should under hive.exec.stagingdir if we set 
> hive.exec.stagingdir but not under the table directory 
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir but not under the table directory, for 
> example: /tmp/hive-staging, I think the staging directory should be under 
> /tmp/hive-staging, not under /tmp/ like /tmp/hive-staging_xxx



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24671) DataFrame length using a dunder/magic method in PySpark

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24671.
---
Resolution: Won't Fix

> DataFrame length using a dunder/magic method in PySpark
> ---
>
> Key: SPARK-24671
> URL: https://issues.apache.org/jira/browse/SPARK-24671
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ondrej Kokes
>Priority: Minor
>
> In Python, if a class implements a method called __len__, one can use the 
> builtin `len` function to get a length of an instance of said class, whatever 
> that means in its context. This is e.g. how you get the number of rows of a 
> pandas DataFrame.
> It should be straightforward to add this functionality to PySpark, because 
> df.count() is already implemented, so the patch I'm proposing is just two 
> lines of code (and two lines of tests). It's in this commit, I'll submit a PR 
> shortly.
> https://github.com/kokes/spark/commit/4d0afaf3cd046b11e8bae43dc00ddf4b1eb97732



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19184) Improve numerical stability for method tallSkinnyQR.

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19184.
---
Resolution: Won't Fix

> Improve numerical stability for method tallSkinnyQR.
> 
>
> Key: SPARK-19184
> URL: https://issues.apache.org/jira/browse/SPARK-19184
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Huamin Li
>Priority: Minor
>  Labels: None
>
> In method tallSkinnyQR, the final Q is calculated by A * inv(R) ([Github 
> Link|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L562]).
>  When the upper triangular matrix R is ill-conditioned, computing the inverse 
> of R can result in catastrophic cancellation. Instead, we should consider 
> using a forward solve for solving Q such that Q * R = A.
> I first create a 4 by 4 RowMatrix A = 
> (1,1,1,1; 0,1E-5,1,1; 0,0,1E-10,1; 0,0,0,1E-14), and then I apply method 
> tallSkinnyQR to A to find RowMatrix Q and Matrix R such that A = Q*R. In this 
> case, A is ill-conditioned and so is R.
> See codes in Spark Shell:
> {code:none}
> import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors}
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> // Create RowMatrix A.
> val mat = Seq(Vectors.dense(1,1,1,1), Vectors.dense(0, 1E-5, 1,1), 
> Vectors.dense(0,0,1E-10,1), Vectors.dense(0,0,0,1E-14))
> val denseMat = new RowMatrix(sc.parallelize(mat, 2))
> // Apply tallSkinnyQR to A.
> val result = denseMat.tallSkinnyQR(true)
> // Print the calculated Q and R.
> result.Q.rows.collect.foreach(println)
> result.R
> // Calculate Q*R. Ideally, this should be close to A.
> val reconstruct = result.Q.multiply(result.R)
> reconstruct.rows.collect.foreach(println)
> // Calculate Q'*Q. Ideally, this should be close to the identity matrix.
> result.Q.computeGramianMatrix()
> System.exit(0)
> {code}
> it will output the following results:
> {code:none}
> scala> result.Q.rows.collect.foreach(println)
> [1.0,0.0,0.0,1.5416524685312E13]
> [0.0,0.,0.0,8011776.0]
> [0.0,0.0,1.0,0.0]
> [0.0,0.0,0.0,1.0]
> scala> result.R
> 1.0  1.0 1.0  1.0
> 0.0  1.0E-5  1.0  1.0
> 0.0  0.0 1.0E-10  1.0
> 0.0  0.0 0.0  1.0E-14
> scala> reconstruct.rows.collect.foreach(println)
> [1.0,1.0,1.0,1.15416524685312]
> [0.0,9.999E-6,0.,1.0008011776]
> [0.0,0.0,1.0E-10,1.0]
> [0.0,0.0,0.0,1.0E-14]
> scala> result.Q.computeGramianMatrix()
> 1.0 0.0 0.0  1.5416524685312E13
> 0.0 0.9998  0.0  8011775.9
> 0.0 0.0 1.0  0.0
> 1.5416524685312E13  8011775.9   0.0  2.3766923337289844E26
> {code}
> With forward solve for solving Q such that Q * R = A rather than computing 
> the inverse of R, it will output the following results instead:
> {code:none}
> scala> result.Q.rows.collect.foreach(println)
> [1.0,0.0,0.0,0.0]
> [0.0,1.0,0.0,0.0]
> [0.0,0.0,1.0,0.0]
> [0.0,0.0,0.0,1.0]
> scala> result.R
> 1.0  1.0 1.0  1.0
> 0.0  1.0E-5  1.0  1.0
> 0.0  0.0 1.0E-10  1.0
> 0.0  0.0 0.0  1.0E-14
> scala> reconstruct.rows.collect.foreach(println)
> [1.0,1.0,1.0,1.0]
> [0.0,1.0E-5,1.0,1.0]
> [0.0,0.0,1.0E-10,1.0]
> [0.0,0.0,0.0,1.0E-14]
> scala> result.Q.computeGramianMatrix()
> 1.0  0.0  0.0  0.0
> 0.0  1.0  0.0  0.0
> 0.0  0.0  1.0  0.0
> 0.0  0.0  0.0  1.0
> {code}
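
For illustration, a minimal sketch of the proposed forward-solve approach using Breeze (which MLlib already depends on); this is not the actual RowMatrix code, only the core idea of solving Q from Q * R = A rather than forming inv(R):

{code:scala}
import breeze.linalg._

// Sketch only: Q * R = A  <=>  R.t * Q.t = A.t, so Q can be obtained with a
// linear solve against the transposed triangular factor instead of inv(R).
def solveQ(a: DenseMatrix[Double], r: DenseMatrix[Double]): DenseMatrix[Double] =
  (r.t \ a.t).t
{code}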



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26524) If the application directory fails to be created on the SPARK_WORKER_DIR on some worker nodes (for example, bad disk or disk has no capacity), the application executor w

2019-09-16 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26524.
---
Resolution: Won't Fix

> If the application directory fails to be created on the SPARK_WORKER_DIR on 
> some worker nodes (for example, bad disk or disk has no capacity), the 
> application executor will be allocated indefinitely.
> --
>
> Key: SPARK-26524
> URL: https://issues.apache.org/jira/browse/SPARK-26524
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hantiantian
>Priority: Major
>
> When the spark worker is started, the workerdir is created successfully. When 
> the application is submitted, the disks mounted for the worker121 workerdir and 
> the worker122 workerdir are damaged.
> When a worker allocates an executor, it creates a working directory and a 
> temporary directory. If the creation fails, the executor allocation fails. 
> The application directory fails to be created on the SPARK_WORKER_DIR on 
> worker121 and worker122, so the application executor will be allocated 
> indefinitely.
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5762 because it is FAILED
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5765 on worker 
> worker-20190103154858-worker121-37199
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5764 because it is FAILED
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5766 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5766 because it is FAILED
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5767 on worker 
> worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing 
> executor app-20190103154954-/5765 because it is FAILED
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching 
> executor app-20190103154954-/5768 on worker 
> worker-20190103154858-worker121-37199
> ...
> I observed the code and found that spark has some processing for the failure 
> of the executor allocation. However, it can only handle the case where the 
> current application does not have an executor that has been successfully 
> assigned.
> if (!normalExit
>  && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
>  && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
>  val execs = appInfo.executors.values
>  if (!execs.exists(_.state == ExecutorState.RUNNING)) {
>  logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
>  s"${appInfo.retryCount} times; removing it")
>  removeApplication(appInfo, ApplicationState.FAILED)
>  }
> }
> Master will only judge whether the worker is available according to the 
> resources of the worker. 
> // Filter out workers that don't have enough resources to launch an executor
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>  worker.coresFree >= coresPerExecutor)
>  .sortBy(_.coresFree).reverse
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2019-09-13 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23539.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 22282
[https://github.com/apache/spark/pull/22282]

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Lee Dongjin
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka headers were added in 0.11. We should expose them through our kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the version of Kafka in Spark from 
> 0.10.1 to 1.0+ (SPARK-18057).
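
For illustration, a rough sketch of what consuming headers could look like once exposed; the option name "includeHeaders" and the headers column are taken from the linked PR and should be checked against the final docs:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kafka-headers").getOrCreate()

// Sketch only: ask the Kafka source to surface record headers as a column.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
{code}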



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2019-09-13 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23539:
-

Assignee: Lee Dongjin

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Lee Dongjin
>Priority: Major
>
> Kafka headers were added in 0.11. We should expose them through our kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the version of Kafka in Spark from 
> 0.10.1 to 1.0+ (SPARK-18057).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-13 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28969:
-

Assignee: Huaxin Gao

> OneVsRestModel in the py side should not set WeightCol and Classifier
> -
>
> Key: SPARK-28969
> URL: https://issues.apache.org/jira/browse/SPARK-28969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: release-notes
>
> 'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-13 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28969.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25715
[https://github.com/apache/spark/pull/25715]

> OneVsRestModel in the py side should not set WeightCol and Classifier
> -
>
> Key: SPARK-28969
> URL: https://issues.apache.org/jira/browse/SPARK-28969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> 'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-12 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928541#comment-16928541
 ] 

Sean Owen commented on SPARK-22796:
---

Yes, but you can see that it was reverted. There are pointers to more 
discussion in the PR. You are welcome to work on it.
CC [~huaxingao] and [~podongfeng] who I know have looked at making related 
changes recently.

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26731) remove EOLed spark jobs from jenkins

2019-09-12 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26731.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Duplicate

> remove EOLed spark jobs from jenkins
> 
>
> Key: SPARK-26731
> URL: https://issues.apache.org/jira/browse/SPARK-26731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.6.3, 2.0.2, 2.1.3
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 
> and 2.1 on jenkins.
> these include all test builds, as well as docs, lint, compile, and packaging.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29066) Remove old Jenkins jobs for EOL versions or obsolete combinations

2019-09-12 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928539#comment-16928539
 ] 

Sean Owen commented on SPARK-29066:
---

Sounds good. This is the same as 
https://issues.apache.org/jira/browse/SPARK-26731 really, which I'll mark as a 
duplicate.
I don't know how to remove the jobs - [~shaneknapp] is that easy enough? 
anything before 2.4 can go now IMHO. I hope that makes it easier to reason 
about cleaning up what's left.

> Remove old Jenkins jobs for EOL versions or obsolete combinations
> -
>
> Key: SPARK-29066
> URL: https://issues.apache.org/jira/browse/SPARK-29066
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to remove the old Jenkins jobs for EOL versions (1.6 ~ 2.3) 
> and some obsolete combinations.
> 1. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
> 2. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/ (Here, 
> `spark-master-compile-maven-hadoop-2.6` is an invalid combination.)
> 3. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/
> 4. 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
> 5. https://amplab.cs.berkeley.edu/jenkins/view/spark%20k8s%20builds/
> For 1~3, we need additional scroll-down in laptop environments. It's 
> inconvenient.
> This cleanup will give us more room when we add `branch-3.0` later. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29050) Fix typo in some docs

2019-09-11 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-29050:
--
Issue Type: Improvement  (was: Bug)
  Priority: Trivial  (was: Major)

This can't be considered a bug, or even major. I fixed it. Please read 
https://spark.apache.org/contributing.html

> Fix typo in some docs
> -
>
> Key: SPARK-29050
> URL: https://issues.apache.org/jira/browse/SPARK-29050
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3, 2.4.3, 3.0.0
>Reporter: dengziming
>Priority: Trivial
>
> 'a hdfs' change into  'an hdfs'
> 'an unique' change into 'a unique'
> 'an url' change into 'a url'
> 'a error' change into 'an error'



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28987) DiskBlockManager#createTempShuffleBlock should skip directory which is read-only

2019-09-11 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28987:
--
Priority: Minor  (was: Major)

> DiskBlockManager#createTempShuffleBlock should skip directory which is 
> read-only
> 
>
> Key: SPARK-28987
> URL: https://issues.apache.org/jira/browse/SPARK-28987
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: deshanxiao
>Priority: Minor
>
> DiskBlockManager#createTempShuffleBlock only checks that the path does not 
> already exist. I think we should also check whether the path is writable. 
> That is reasonable because we invoke createTempShuffleBlock to create a new 
> path to write files into, so it should be writable.
> stack:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1765 in stage 368592.0 failed 4 times, most recent failure: Lost task 
> 1765.3 in stage 368592.0 (TID 66021932, test-hadoop-prc-st2808.bj, executor 
> 251): java.io.FileNotFoundException: 
> /home/work/hdd6/yarn/test-hadoop/nodemanager/usercache/sql_test/appcache/application_1560996968289_16320/blockmgr-14608b48-7efd-4fd3-b050-2ac9953390d4/1e/temp_shuffle_00c7b87f-d7ed-49f3-90e7-1c8358bcfd74
>  (No such file or directory)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.(FileOutputStream.java:213)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:139)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:150)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:268)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:159)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:100)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1515)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1503)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1502)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1502)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:816)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:816)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:816)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1740)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1695)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1684)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
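A minimal sketch of the check being proposed above (illustrative only, with a made-up helper name and retry bound; the real DiskBlockManager logic differs):

{code:scala}
import java.io.File
import java.util.UUID
import scala.util.Random

// Hypothetical helper: pick a temp shuffle file under a writable local dir,
// skipping read-only directories instead of only checking for non-existence.
def createTempShuffleFile(localDirs: Array[File], maxAttempts: Int = 10): File = {
  var attempts = 0
  while (attempts < maxAttempts) {
    attempts += 1
    val dir = localDirs(Random.nextInt(localDirs.length))
    // Skip directories we cannot write to, rather than failing later in the writer.
    if (dir.isDirectory && dir.canWrite) {
      val file = new File(dir, s"temp_shuffle_${UUID.randomUUID()}")
      if (!file.exists()) {
        return file
      }
    }
  }
  throw new java.io.IOException(
    s"Could not find a writable shuffle directory after $maxAttempts attempts")
}
{code}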



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28987) DiskBlockManager#createTempShuffleBlock should skip directory which is read-only

2019-09-11 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28987.
---
Resolution: Won't Fix

> DiskBlockManager#createTempShuffleBlock should skip directory which is 
> read-only
> 
>
> Key: SPARK-28987
> URL: https://issues.apache.org/jira/browse/SPARK-28987
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: deshanxiao
>Priority: Minor
>
> DiskBlockManager#createTempShuffleBlock only checks that the path does not 
> already exist. I think we should also check whether the path is writable. 
> That is reasonable because we invoke createTempShuffleBlock to create a new 
> path to write files into, so it should be writable.
> stack:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1765 in stage 368592.0 failed 4 times, most recent failure: Lost task 
> 1765.3 in stage 368592.0 (TID 66021932, test-hadoop-prc-st2808.bj, executor 
> 251): java.io.FileNotFoundException: 
> /home/work/hdd6/yarn/test-hadoop/nodemanager/usercache/sql_test/appcache/application_1560996968289_16320/blockmgr-14608b48-7efd-4fd3-b050-2ac9953390d4/1e/temp_shuffle_00c7b87f-d7ed-49f3-90e7-1c8358bcfd74
>  (No such file or directory)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.(FileOutputStream.java:213)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:139)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:150)
> at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:268)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:159)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:100)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1515)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1503)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1502)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1502)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:816)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:816)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:816)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1740)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1695)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1684)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-09-11 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28906.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 25655
[https://github.com/apache/spark/pull/25655]

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 
> 2.4.4, 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows wrong information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>       /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28906) `bin/spark-submit --version` shows incorrect info

2019-09-11 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28906:
-

Assignee: Kazuaki Ishizaki

> `bin/spark-submit --version` shows incorrect info
> -
>
> Key: SPARK-28906
> URL: https://issues.apache.org/jira/browse/SPARK-28906
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 
> 2.4.4, 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Attachments: image-2019-08-29-05-50-13-526.png
>
>
> Since Spark 2.3.1, `spark-submit` shows wrong information.
> {code}
> $ bin/spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.3
>       /_/
> Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_222
> Branch
> Compiled by user  on 2019-02-04T13:00:46Z
> Revision
> Url
> Type --help for more information.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md

2019-09-10 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29028.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25737
[https://github.com/apache/spark/pull/25737]

> Add links to IBM Cloud Object Storage connector in cloud-integration.md
> ---
>
> Key: SPARK-29028
> URL: https://issues.apache.org/jira/browse/SPARK-29028
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29028) Add links to IBM Cloud Object Storage connector in cloud-integration.md

2019-09-10 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29028:
-

Assignee: Dilip Biswal

> Add links to IBM Cloud Object Storage connector in cloud-integration.md
> ---
>
> Key: SPARK-29028
> URL: https://issues.apache.org/jira/browse/SPARK-29028
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28657) Fix currentContext Instance failed sometimes

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28657:
--
Fix Version/s: (was: 2.4.5)

> Fix currentContext Instance failed sometimes
> 
>
> Key: SPARK-28657
> URL: https://issues.apache.org/jira/browse/SPARK-28657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment:  
>  
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: warn.jpg
>
>
> When running Spark on YARN, I got 
> {code:java}
> // java.lang.ClassCastException: org.apache.hadoop.ipc.CallerContext$Builder 
> cannot be cast to scala.runtime.Nothing$ 
> {code}
>   !warn.jpg!
> {{Utils.classForName returns Class[Nothing]; I think it should be declared as 
> Class[_] to resolve this issue}}
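A small sketch of the type issue (assumed helper names, not the actual Spark code): a Class[Nothing] return type forces scalac to insert a cast to scala.runtime.Nothing$ at use sites, while an existential Class[_] does not.

{code:scala}
// Returning Class[Nothing] gives every value created from the loaded class the
// static type Nothing, which triggers the runtime cast failure reported above.
def classForNameBad(name: String): Class[Nothing] =
  Class.forName(name).asInstanceOf[Class[Nothing]]

// Declaring the result as Class[_] avoids the spurious cast.
def classForNameGood(name: String): Class[_] =
  Class.forName(name)

val loaded: Class[_] = classForNameGood("java.lang.StringBuilder")
val instance = loaded.getDeclaredConstructor().newInstance()
println(instance.getClass) // class java.lang.StringBuilder
{code}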



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28657) Fix currentContext Instance failed sometimes

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28657.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 25389
[https://github.com/apache/spark/pull/25389]

> Fix currentContext Instance failed sometimes
> 
>
> Key: SPARK-28657
> URL: https://issues.apache.org/jira/browse/SPARK-28657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment:  
>  
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
> Attachments: warn.jpg
>
>
> When running Spark on YARN, I got 
> {code:java}
> // java.lang.ClassCastException: org.apache.hadoop.ipc.CallerContext$Builder 
> cannot be cast to scala.runtime.Nothing$ 
> {code}
>   !warn.jpg!
> {{Utils.classForName returns Class[Nothing]; I think it should be declared as 
> Class[_] to resolve this issue}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28657) Fix currentContext Instance failed sometimes

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28657:
-

Assignee: hong dongdong

> Fix currentContext Instance failed sometimes
> 
>
> Key: SPARK-28657
> URL: https://issues.apache.org/jira/browse/SPARK-28657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment:  
>  
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Minor
> Attachments: warn.jpg
>
>
> When running Spark on YARN, I got 
> {code:java}
> // java.lang.ClassCastException: org.apache.hadoop.ipc.CallerContext$Builder 
> cannot be cast to scala.runtime.Nothing$ 
> {code}
>   !warn.jpg!
> {{Utils.classForName returns Class[Nothing]; I think it should be declared as 
> Class[_] to resolve this issue}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29011) Upgrade netty-all to 4.1.39-Final

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-29011:
--
Fix Version/s: 2.4.5

> Upgrade netty-all to 4.1.39-Final
> -
>
> Key: SPARK-29011
> URL: https://issues.apache.org/jira/browse/SPARK-29011
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Nicholas Marion
>Assignee: Nicholas Marion
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> We should use the newest version of netty-all.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14098:
--
Labels: release-notes  (was: releasenotes)

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Major
>  Labels: release-notes
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The following is the basic design based on discussions with 
> Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector that are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, int-delta, and so on) for each 
> ColumnVector in ColumnarBatch, depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** a. Get a value directly from ColumnarBatch in code generated by whole-stage 
> codegen (the primary path)
> ** b. Get a value through an iterator if whole-stage codegen is disabled (e.g. 
> when the number of columns is more than 100), as the backup path
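For readers unfamiliar with the columnar representation mentioned above, a rough sketch of filling and reading a ColumnarBatch by hand (class and package names as found in recent Spark versions; the actual cache uses generated code rather than this loop):

{code:scala}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Fill a single int column and wrap it in a ColumnarBatch.
val capacity = 4
val column = new OnHeapColumnVector(capacity, IntegerType)
(0 until capacity).foreach(i => column.putInt(i, i * 10))

val batch = new ColumnarBatch(Array[ColumnVector](column))
batch.setNumRows(capacity)

// Hand-written stand-in for the generated "read directly from ColumnVector" path.
var row = 0
while (row < batch.numRows()) {
  println(batch.column(0).getInt(row))
  row += 1
}
batch.close()
{code}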



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14643) Remove overloaded methods which become ambiguous in Scala 2.12

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14643.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Won't Fix

Reopen if anyone has different ideas.

> Remove overloaded methods which become ambiguous in Scala 2.12
> --
>
> Key: SPARK-14643
> URL: https://issues.apache.org/jira/browse/SPARK-14643
> Project: Spark
>  Issue Type: Task
>  Components: Build, Project Infra
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
>
> Spark 1.x's Dataset API runs into subtle source incompatibility problems for 
> Java 8 and Scala 2.12 users when Spark is built against Scala 2.12. In a 
> nutshell, the current API has overloaded methods whose signatures are 
> ambiguous when resolving calls that use the Java 8 lambda syntax (only if 
> Spark is build against Scala 2.12).
> This issue is somewhat subtle, so there's a full writeup at 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit?usp=sharing
>  which describes the exact circumstances under which the current APIs are 
> problematic. The writeup also proposes a solution which involves the removal 
> of certain overloads only in Scala 2.12 builds of Spark and the introduction 
> of implicit conversions for retaining source compatibility.
> We don't need to implement any of these changes until we add Scala 2.12 
> support since the changes must only be applied when building against Scala 
> 2.12 and will be done via traits + shims which are mixed in via 
> per-Scala-version source directories (like how we handle the 
> Scala-version-specific parts of the REPL). For now, this JIRA acts as a 
> placeholder so that the parent JIRA reflects the complete set of tasks which 
> need to be finished for 2.12 support.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14098) Generate Java code to build CachedColumnarBatch and get values from CachedColumnarBatch when DataFrame.cache() is called

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14098.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Done

> Generate Java code to build CachedColumnarBatch and get values from 
> CachedColumnarBatch when DataFrame.cache() is called
> 
>
> Key: SPARK-14098
> URL: https://issues.apache.org/jira/browse/SPARK-14098
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Priority: Major
>  Labels: releasenotes
>
> [Here|https://docs.google.com/document/d/1-2BnW5ibuHIeQzmHEGIGkEcuMUCTk87pmPis2DKRg-Q/edit?usp=sharing]
>  is a design document for this change (***TODO: Update the document***).
> This JIRA implements a new in-memory cache feature used by DataFrame.cache 
> and Dataset.cache. The following is the basic design based on discussions with 
> Sameer, Weichen, Xiao, Herman, and Nong.
> * Use ColumnarBatch with ColumnVector that are common data representations 
> for columnar storage
> * Use multiple compression schemes (such as RLE, int-delta, and so on) for each 
> ColumnVector in ColumnarBatch, depending on its data type
> * Generate code that is simple and specialized for each in-memory cache to 
> build an in-memory cache
> * Generate code that directly reads data from ColumnVector for the in-memory 
> cache by whole-stage codegen.
> * Enhance ColumnVector to keep UnsafeArrayData
> * Use primitive-type array for primitive uncompressed data type in 
> ColumnVector
> * Use byte[] for UnsafeArrayData and compressed data
> Based on this design, this JIRA generates two kinds of Java code for 
> DataFrame.cache()/Dataset.cache()
> * Generate Java code to build CachedColumnarBatch, which keeps data in 
> ColumnarBatch
> * Generate Java code to get a value of each column from ColumnarBatch
> ** a. Get a value directly from ColumnarBatch in code generated by whole-stage 
> codegen (the primary path)
> ** b. Get a value through an iterator if whole-stage codegen is disabled (e.g. 
> when the number of columns is more than 100), as the backup path



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16275) Implement all the Hive fallback functions

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16275.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Done

> Implement all the Hive fallback functions
> -
>
> Key: SPARK-16275
> URL: https://issues.apache.org/jira/browse/SPARK-16275
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> As of Spark 2.0, Spark falls back to Hive for only the following built-in 
> functions:
> {code}
> "elt", "hash", "java_method", "histogram_numeric",
> "map_keys", "map_values",
> "parse_url", "percentile", "percentile_approx", "reflect", "sentences", 
> "stack", "str_to_map",
> "xpath", "xpath_boolean", "xpath_double", "xpath_float", "xpath_int", 
> "xpath_long",
> "xpath_number", "xpath_short", "xpath_string",
> // table generating function
> "inline", "posexplode"
> {code}
> The goal of the ticket is to implement all of these in Spark so we don't need 
> to fall back into Hive's UDFs.
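For a concrete sense of the goal, two of the listed functions exercised through plain Spark SQL (a sketch that assumes a SparkSession named {{spark}}); with native implementations these calls no longer route through Hive UDFs:

{code:scala}
// parse_url: extract the host component of a URL.
spark.sql(
  "SELECT parse_url('https://spark.apache.org/docs/latest?x=1', 'HOST') AS host"
).show()  // host = spark.apache.org

// posexplode: table-generating function yielding (pos, col) rows.
spark.sql("SELECT posexplode(array(10, 20, 30))").show()
{code}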



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16390) Dataset API improvements

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16390.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Done

> Dataset API improvements
> 
>
> Key: SPARK-16390
> URL: https://issues.apache.org/jira/browse/SPARK-16390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
>
> This is an umbrella ticket for improving the user experience of Dataset API.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18455) General support for correlated subquery processing

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18455.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Done

Looks like the subtasks are done?

> General support for correlated subquery processing
> --
>
> Key: SPARK-18455
> URL: https://issues.apache.org/jira/browse/SPARK-18455
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Nattavut Sutyanyong
>Priority: Major
> Attachments: SPARK-18455-scoping-doc.pdf
>
>
> Subquery support was introduced in Spark 2.0. The initial implementation 
> covers the most common subquery use cases: the ones used in TPC queries, for 
> instance.
> Spark currently supports the following subqueries:
> * Uncorrelated Scalar Subqueries. All cases are supported.
> * Correlated Scalar Subqueries. We only allow subqueries that are aggregated 
> and use equality predicates.
> * Predicate Subqueries. IN or Exists type of queries. We allow most 
> predicates, except when they are pulled from under an Aggregate or Window 
> operator. In that case we only support equality predicates.
> However this does not cover the full range of possible subqueries. This, in 
> part, has to do with the fact that we currently rewrite all correlated 
> subqueries into a (LEFT/LEFT SEMI/LEFT ANTI) join.
> We currently lack support for the following use cases:
> * The use of predicate subqueries in a projection.
> * The use of non-equality predicates below Aggregates and or Window operators.
> * The use of non-Aggregate subqueries for correlated scalar subqueries.
> This JIRA aims to lift these current limitations in subquery processing.
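For reference, a correlated scalar subquery of the kind the current implementation already handles (aggregated, equality-correlated); the table and column names below are made up, and a SparkSession named {{spark}} with matching temp views is assumed:

{code:scala}
val supported = spark.sql(
  """
    |SELECT o.order_id,
    |       (SELECT MAX(p.amount)
    |        FROM payments p
    |        WHERE p.order_id = o.order_id) AS max_payment
    |FROM orders o
  """.stripMargin)
supported.show()
{code}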



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22236) CSV I/O: does not respect RFC 4180

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22236:
--
Target Version/s:   (was: 3.0.0)

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.
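The same workaround expressed with the Scala reader/writer options (a sketch assuming a SparkSession named {{spark}}; paths are placeholders):

{code:scala}
// Read an RFC 4180-style file by overriding the default backslash escape.
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("testfile.csv")

// Write with the same convention so other RFC 4180 readers can parse the output.
df.write
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("testout.csv")
{code}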



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25261) Standardize the default units of spark.driver|executor.memory

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25261.
---
   Fix Version/s: 2.4.0
Target Version/s:   (was: 3.0.0)
Assignee: huangtengfei
  Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/22252

> Standardize the default units of spark.driver|executor.memory
> -
>
> Key: SPARK-25261
> URL: https://issues.apache.org/jira/browse/SPARK-25261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Minor
> Fix For: 2.4.0
>
>
> From  
> [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
>  and 
> [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265], we 
> can see that spark.driver.memory and spark.executor.memory are parsed as 
> bytes if no units are specified. But in the docs, they are described as MB by 
> default, which may lead to some misunderstanding.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25157) Streaming of image files from directory

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25157:
--
Target Version/s:   (was: 3.0.0)

> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> them using Structured Streaming and Datasets. We are using Kafka to stream 
> images and then doing the processing in Spark. We need a method in Spark to 
> stream images from a directory. Currently *{{DataStreamReader}}* doesn't 
> support image files. With the introduction of the 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if it doesn't support 
> some Structured Streaming features, since these are binary files. This 
> method could be similar to the *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25783) Spark shell fails because of jline incompatibility

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25783.
---
Target Version/s:   (was: 3.0.0)
  Resolution: Duplicate

> Spark shell fails because of jline incompatibility
> --
>
> Key: SPARK-25783
> URL: https://issues.apache.org/jira/browse/SPARK-25783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.4.0
> Environment: spark 2.4.0-rc3 on hadoop 2.6.0 (cdh 5.15.1) with 
> -Phadoop-provided
>Reporter: koert kuipers
>Priority: Minor
>
> The error I get when launching spark-shell is:
> {code:bash}
> Spark context Web UI available at http://client:4040
> Spark context available as 'sc' (master = yarn, app id = application_xxx).
> Spark session available as 'spark'.
> Exception in thread "main" java.lang.NoSuchMethodError: 
> jline.console.completer.CandidateListCompletionHandler.setPrintSpaceAfterFullCompletion(Z)V
>   at 
> scala.tools.nsc.interpreter.jline.JLineConsoleReader.initCompletion(JLineReader.scala:139)
>   at 
> scala.tools.nsc.interpreter.jline.InteractiveReader.postInit(JLineReader.scala:54)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$1.apply(SparkILoop.scala:190)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$1.apply(SparkILoop.scala:188)
>   at 
> scala.tools.nsc.interpreter.SplashReader.postInit(InteractiveReader.scala:130)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply$mcV$sp(SparkILoop.scala:214)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply(SparkILoop.scala:199)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1$1.apply(SparkILoop.scala:199)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$mumly$1.apply(ILoop.scala:189)
>   at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:221)
>   at scala.tools.nsc.interpreter.ILoop.mumly(ILoop.scala:186)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.org$apache$spark$repl$SparkILoop$$anonfun$$loopPostInit$1(SparkILoop.scala:199)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:267)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$startup$1$1.apply(SparkILoop.scala:247)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.withSuppressedSettings$1(SparkILoop.scala:235)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.startup$1(SparkILoop.scala:247)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:282)
>   at org.apache.spark.repl.SparkILoop.runClosure(SparkILoop.scala:159)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:182)
>   at org.apache.spark.repl.Main$.doMain(Main.scala:78)
>   at org.apache.spark.repl.Main$.main(Main.scala:58)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Spark 2.4.0-rc3, which I built with:
> {code:bash}
> dev/make-distribution.sh --name provided --tgz -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Pyarn -Phadoop-provided
> {code}
> and deployed with the following in spark-env.sh:
> {code:bash}
> export SPARK_DIST_CLASSPATH=$(hadoop classpath)
> {code}
> hadoop version is 2.6.0 (CDH 5.15.1)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26247:
--
Target Version/s:   (was: 3.0.0)

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
> Attachments: SPIPMlModelExtensionForOnlineServing.pdf, diff.out, 
> diff.reduceLoadLatency, diff.scoreInstance
>
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of Spark. We developed 
> our set of changes against version 2.1.0 and can port them forward to other 
> versions (e.g., we have ported them forward to 2.3.2).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28227) Spark can’t support TRANSFORM with aggregation

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28227:
--
Target Version/s:   (was: 3.0.0)

> Spark can’t  support TRANSFORM with aggregation
> ---
>
> Key: SPARK-28227
> URL: https://issues.apache.org/jira/browse/SPARK-28227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Spark can't support using TRANSFORM with aggregation, such as:
> {code:java}
> SELECT TRANSFORM(T.A, SUM(T.B))
> USING 'func' AS (X STRING Y STRING)
> FROM DEFAULT.TEST T
> WHERE T.C > 0
> GROUP BY T.A{code}
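One possible workaround until this is supported (an untested sketch that assumes script transformation is available, e.g. with Hive support enabled, and a SparkSession named {{spark}}): aggregate in a subquery first and run TRANSFORM over the already-aggregated rows.

{code:scala}
val transformed = spark.sql(
  """
    |SELECT TRANSFORM(agg.a, agg.total_b)
    |USING 'func' AS (x STRING, y STRING)
    |FROM (
    |  SELECT t.a, SUM(t.b) AS total_b
    |  FROM default.test t
    |  WHERE t.c > 0
    |  GROUP BY t.a
    |) agg
  """.stripMargin)
{code}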



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29020) Unifying behaviour between array_sort and sort_array

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-29020:
--
Target Version/s:   (was: 2.4.5, 3.0.0)
  Labels:   (was: functions sql)

> Unifying behaviour between array_sort and sort_array
> 
>
> Key: SPARK-29020
> URL: https://issues.apache.org/jira/browse/SPARK-29020
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.4
>Reporter: German Schiavon Matteo
>Priority: Major
>
> I've noticed that there are two functions to sort arrays: *sort_array* and 
> *array_sort*.
> *sort_array* dates from 1.5.0 and can order both 
> ascending and descending.
> *array_sort* dates from 2.4.0 and can only order in 
> ascending order.
> Basically I just added the possibility of ordering either ascending or 
> descending using *array_sort*. 
> I think it would be good to have unified behaviours. 
>  
> This is the link to the [PR|https://github.com/apache/spark/pull/25728]
>  
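The current behaviour from the Scala API, for context (a quick sketch assuming a SparkSession named {{spark}}):

{code:scala}
import org.apache.spark.sql.functions.{array_sort, col, sort_array}

val df = spark.createDataFrame(Seq(Tuple1(Seq(3, 1, 2)))).toDF("xs")

// sort_array supports both directions via its second argument.
df.select(sort_array(col("xs"), asc = false)).show()  // [3, 2, 1]

// array_sort (since 2.4.0) only sorts ascending.
df.select(array_sort(col("xs"))).show()               // [1, 2, 3]
{code}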



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28980) Remove most remaining deprecated items since <= 2.2.0 for 3.0

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28980.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25684
[https://github.com/apache/spark/pull/25684]

> Remove most remaining deprecated items since <= 2.2.0 for 3.0
> -
>
> Key: SPARK-28980
> URL: https://issues.apache.org/jira/browse/SPARK-28980
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, PySpark, Spark Core, SQL, Structured Streaming, 
> YARN
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Following on https://issues.apache.org/jira/browse/SPARK-25908 I'd like to 
> propose removing the rest of the items that have been deprecated since <= 
> Spark 2.2.0, before Spark 3.0.
> This appears to be:
> - Remove SQLContext.createExternalTable and Catalog.createExternalTable, 
> deprecated in favor of createTable since 2.2.0, plus tests of deprecated 
> methods
> - Remove HiveContext, deprecated in 2.0.0, in favor of 
> SparkSession.builder.enableHiveSupport
> - Remove deprecated KinesisUtils.createStream methods, plus tests of 
> deprecated methods, deprecated in 2.2.0
> - Remove deprecated MLlib (not Spark ML) linear method support, mostly 
> utility constructors and 'train' methods, and associated docs. This includes 
> methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. 
> These have been deprecated since 2.0.0
> - Remove deprecated Pyspark MLlib linear method support, including 
> LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
> - Remove 'runs' argument in KMeans.train() method, which has been a no-op 
> since 2.0.0
> - Remove deprecated ChiSqSelector isSorted protected method
> - Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor 
> of 'yarn' and deploy mode 'cluster', etc
> But while preparing the change, I found:
> - I was not able to remove deprecated DataFrameReader.json(RDD) in favor of 
> DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is 
> still needed to support Pyspark's .json() method, which can't use a Dataset.
> - Looks like SQLContext.createExternalTable was not actually deprecated in 
> Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable 
> was.
> - I afterwards noted that the toDegrees, toRadians functions were almost 
> removed fully in SPARK-25908, but Felix suggested keeping just the R version 
> as they hadn't been technically deprecated. I'd like to revisit that. Do we 
> really want the inconsistency? I'm not against reverting it again, but then 
> that implies leaving SQLContext.createExternalTable just in Pyspark too, 
> which seems weird.
> - I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, 
> RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove 
> them (still used by StreamingLogisticRegressionWithSGD?) and they are not 
> fully removed in Scala. Maybe should not have been deprecated.
> I will open a PR accordingly for more detailed review.
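For users hitting the HiveContext and createExternalTable removals, the replacements look roughly like this (a migration sketch; the table name and path are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

// HiveContext -> SparkSession with Hive support enabled.
val spark = SparkSession.builder()
  .appName("migration-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Catalog.createExternalTable -> Catalog.createTable.
spark.catalog.createTable("my_table", "/path/to/data")
{code}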



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28980) Remove most remaining deprecated items since <= 2.2.0 for 3.0

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28980:
--
Description: 
Following on https://issues.apache.org/jira/browse/SPARK-25908 I'd like to 
propose removing the rest of the items that have been deprecated since <= Spark 
2.2.0, before Spark 3.0.

This appears to be:

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, 
deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of 
SparkSession.builder.enableHiveSupport
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated 
methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility 
constructors and 'train' methods, and associated docs. This includes methods in 
LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been 
deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including 
LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 
2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor 
of 'yarn' and deploy mode 'cluster', etc

But while preparing the change, I found:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of 
DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is 
still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in 
Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost 
removed fully in SPARK-25908, but Felix suggested keeping just the R version as 
they hadn't been technically deprecated. I'd like to revisit that. Do we really 
want the inconsistency? I'm not against reverting it again, but then that 
implies leaving SQLContext.createExternalTable just in Pyspark too, which seems 
weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, 
RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove 
them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully 
removed in Scala. Maybe should not have been deprecated.

I will open a PR accordingly for more detailed review.

  was:
Following on https://issues.apache.org/jira/browse/SPARK-25908 I'd like to 
propose removing the rest of the items that have been deprecated since <= Spark 
2.2.0, before Spark 3.0.

This appears to be:

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, 
deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of 
SparkSession.builder.enableHiveSupport
- Remove deprecated toDegrees, toRadians SQL functions (see below)
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated 
methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility 
constructors and 'train' methods, and associated docs. This includes methods in 
LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been 
deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including 
LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 
2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor 
of 'yarn' and deploy mode 'cluster', etc

But while preparing the change, I found:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of 
DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is 
still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in 
Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost 
removed fully in SPARK-25908, but Felix suggested keeping just the R version as 
they hadn't been technically deprecated. I'd like to revisit that. Do we really 
want the inconsistency? I'm not against reverting it again, but then that 
implies leaving SQLContext.createExternalTable just in Pyspark too, which seems 
weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, 
RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove 
them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully 
removed 

[jira] [Updated] (SPARK-28969) OneVsRestModel in the py side should not set WeightCol and Classifier

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28969:
--
Docs Text: The setClassifier method in Pyspark's OneVsRestModel has been 
removed in 3.0 for parity with the Scala implementation. Callers should not 
need to set the classifier in the model after creation.
   Labels: release-notes  (was: )

> OneVsRestModel in the py side should not set WeightCol and Classifier
> -
>
> Key: SPARK-28969
> URL: https://issues.apache.org/jira/browse/SPARK-28969
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>  Labels: release-notes
>
> 'WeightCol' and 'Classifier' can only be set in the estimator.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException"

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28340.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25674
[https://github.com/apache/spark/pull/25674]

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Colin Ma
>Priority: Minor
> Fix For: 3.0.0
>
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.
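One way to special-case the expected exception (an illustrative sketch only, not necessarily how the linked PR resolves it): catch ClosedByInterruptException around the revert and log it at a quieter level.

{code:scala}
import java.nio.channels.ClosedByInterruptException
import org.slf4j.LoggerFactory

object RevertHelper {
  private val log = LoggerFactory.getLogger(getClass)

  // Treat the interrupt-driven failure as expected (task killed mid-write) and
  // log it quietly; keep ERROR for anything genuinely unexpected.
  def quietlyRevert(revert: () => Unit): Unit = {
    try {
      revert()
    } catch {
      case e: ClosedByInterruptException =>
        log.debug("Partial-write revert interrupted by task kill", e)
      case e: Exception =>
        log.error("Uncaught exception while reverting partial writes to file", e)
    }
  }
}
{code}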



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28340) Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file: java.nio.channels.ClosedByInterruptException"

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28340:
-

Assignee: Colin Ma

> Noisy exceptions when tasks are killed: "DiskBlockObjectWriter: Uncaught 
> exception while reverting partial writes to file: 
> java.nio.channels.ClosedByInterruptException"
> 
>
> Key: SPARK-28340
> URL: https://issues.apache.org/jira/browse/SPARK-28340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Colin Ma
>Priority: Minor
>
> If a Spark task is killed while writing blocks to disk (due to intentional 
> job kills, automated killing of redundant speculative tasks, etc) then Spark 
> may log exceptions like
> {code:java}
> 19/07/10 21:31:08 ERROR storage.DiskBlockObjectWriter: Uncaught exception 
> while reverting partial writes to file /
> java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter$$anonfun$revertPartialWritesAndClose$2.apply$mcV$sp(DiskBlockObjectWriter.scala:218)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1369)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:214)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:237)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:105)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}
> If {{BypassMergeSortShuffleWriter}} is being used then a single cancelled 
> task can result in hundreds of these stacktraces being logged.
> Here are some StackOverflow questions asking about this:
>  * [https://stackoverflow.com/questions/40027870/spark-jobserver-job-crash]
>  * 
> [https://stackoverflow.com/questions/50646953/why-is-java-nio-channels-closedbyinterruptexceptio-called-when-caling-multiple]
>  * 
> [https://stackoverflow.com/questions/41867053/java-nio-channels-closedbyinterruptexception-in-spark]
>  * 
> [https://stackoverflow.com/questions/56845041/are-closedbyinterruptexception-exceptions-expected-when-spark-speculation-kills]
>  
> Can we prevent this exception from occurring? If not, can we treat this 
> "expected exception" in a special manner to avoid log spam? My concern is 
> that the presence of large numbers of spurious exceptions is confusing to 
> users when they are inspecting Spark logs to diagnose other issues.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28884) [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode deployment

2019-09-09 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28884.
---
Resolution: Not A Problem

> [Spark]Spark driver cores is showing 0 instead of 1 in UI for cluster mode 
> deployment
> -
>
> Key: SPARK-28884
> URL: https://issues.apache.org/jira/browse/SPARK-28884
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
> Attachments: Core0.png, Core1.png
>
>
> Launch Spark in local and YARN mode:
> bin/spark-shell --master local
> bin/spark-shell --master yarn (client and cluster both)
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode cluster --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> vm1:/opt/HA/C10/install/spark/sparkJdbc # bin/spark-submit --master yarn 
> --deploy-mode client --class org.apache.spark.examples.SparkPi 
> /opt/HA/C10/install/spark/spark/jars/original-spark-examples_2.11-2.3.2.jar 10
> Open the UI and check the driver cores: in cluster mode it displays 0, but in local mode it displays 1.
> Expectation: it should display 1 by default.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28886) Kubernetes DepsTestsSuite fails on OSX with minikube 1.3.1 due to formatting

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28886.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25599
[https://github.com/apache/spark/pull/25599]

> Kubernetes DepsTestsSuite fails on OSX with minikube 1.3.1 due to formatting
> 
>
> Key: SPARK-28886
> URL: https://issues.apache.org/jira/browse/SPARK-28886
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 3.0.0
>
>
> With minikube 1.3.1 on OSX, the service discovery command returns an extra 
> leading "* " which doesn't parse into a URL, causing the DepsTestsSuite to fail.
>  
> I've got a fix; I just need to double-check some stuff.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28953) Integration tests fail due to malformed URL

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28953.
---
Resolution: Duplicate

> Integration tests fail due to malformed URL
> ---
>
> Key: SPARK-28953
> URL: https://issues.apache.org/jira/browse/SPARK-28953
> Project: Spark
>  Issue Type: Bug
>  Components: jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Tests failed on Ubuntu, verified on two different machines:
> KubernetesSuite:
> - Launcher client dependencies *** FAILED ***
>  java.net.MalformedURLException: no protocol: * http://172.31.46.91:30706
>  at java.net.URL.<init>(URL.java:600)
>  at java.net.URL.<init>(URL.java:497)
>  at java.net.URL.<init>(URL.java:446)
>  at 
> org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160)
>  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>  at org.scalatest.Transformer.apply(Transformer.scala:22)
>  at org.scalatest.Transformer.apply(Transformer.scala:20)
>  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>  
> Welcome to
>   __
>  / __/__ ___ _/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT
>  /_/
>  
>  Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
>  Type in expressions to have them evaluated.
>  Type :help for more information.
>  
> scala> val pb = new ProcessBuilder().command("bash", "-c", "minikube service 
> ceph-nano-s3 -n spark --url")
>  pb: ProcessBuilder = java.lang.ProcessBuilder@46092840
> scala> pb.redirectErrorStream(true)
>  res0: ProcessBuilder = java.lang.ProcessBuilder@46092840
> scala> val proc = pb.start()
>  proc: Process = java.lang.UNIXProcess@5e9650d3
> scala> val r = org.apache.commons.io.IOUtils.toString(proc.getInputStream())
>  r: String =
>  "* http://172.31.46.91:30706
>  "
> Although running the command directly shows no asterisk:
> $ minikube service ceph-nano-s3 -n spark --url
> [http://172.31.46.91:30706|http://172.31.46.91:30706/]
>  
> This is weird because it only fails at the Java level; where does the asterisk 
> come from?
> $ minikube version
> minikube version: v1.3.1
> commit: ca60a424ce69a4d79f502650199ca2b52f29e631
>  
>  
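> A hedged sketch of how the test could sanitize the output before parsing (not necessarily the fix that landed for SPARK-28886):
> {code:scala}
> import java.net.URL
>
> // Hypothetical sanitizer: keep only the token that actually looks like a URL,
> // so minikube 1.3.x's decorative leading "* " no longer breaks java.net.URL.
> def parseMinikubeServiceUrl(raw: String): URL = {
>   val cleaned = raw.split("\\r?\\n")
>     .map(_.trim.stripPrefix("*").trim)
>     .find(_.startsWith("http"))
>     .getOrElse(throw new IllegalArgumentException(s"No URL in minikube output: $raw"))
>   new URL(cleaned)
> }
>
> // parseMinikubeServiceUrl("* http://172.31.46.91:30706\n")  // works with or without the asterisk
> {code}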



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27420) KinesisInputDStream should expose a way to configure CloudWatch metrics

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27420.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24651
[https://github.com/apache/spark/pull/24651]

> KinesisInputDStream should expose a way to configure CloudWatch metrics
> ---
>
> Key: SPARK-27420
> URL: https://issues.apache.org/jira/browse/SPARK-27420
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output
>Affects Versions: 2.3.3
>Reporter: Jerome Gagnon
>Assignee: Kengo Seki
>Priority: Major
> Fix For: 3.0.0
>
>
> KinesisInputDStream currently does not provide a way to disable the CloudWatch 
> metrics push. The Kinesis Client Library (KCL), which is used under the hood, 
> provides this ability through its `withMetrics` methods.
> To make things worse, the default level is "DETAILED", which pushes tens of 
> metrics every 10 seconds. When dealing with multiple streaming jobs this adds 
> up pretty quickly, leading to thousands of dollars in cost. 
> Exposing a way to disable monitoring, or set the proper level, is critical to 
> us. We had to send invalid credentials and suppress the resulting logs as a 
> less-than-ideal workaround: see 
> [https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002]
>  
>  
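> For reference, a hedged sketch of the KCL-side knob involved (KCL 1.x class and method names as I recall them; the exact Spark builder API may differ):
> {code:scala}
> import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
> import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration
> import com.amazonaws.services.kinesis.metrics.interfaces.MetricsLevel
>
> // Turning CloudWatch metrics off (or down) at the KCL level; this is the
> // setting KinesisInputDStream would need to expose through its builder.
> val kclConf = new KinesisClientLibConfiguration(
>     "my-app", "my-stream", new DefaultAWSCredentialsProviderChain(), "worker-1")
>   .withMetricsLevel(MetricsLevel.NONE)  // alternatives: SUMMARY, DETAILED (default)
> {code}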



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27420) KinesisInputDStream should expose a way to configure CloudWatch metrics

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27420:
-

Assignee: Kengo Seki

> KinesisInputDStream should expose a way to configure CloudWatch metrics
> ---
>
> Key: SPARK-27420
> URL: https://issues.apache.org/jira/browse/SPARK-27420
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output
>Affects Versions: 2.3.3
>Reporter: Jerome Gagnon
>Assignee: Kengo Seki
>Priority: Major
>
> KinesisInputDStream currently does not provide a way to disable the CloudWatch 
> metrics push. The Kinesis Client Library (KCL), which is used under the hood, 
> provides this ability through its `withMetrics` methods.
> To make things worse, the default level is "DETAILED", which pushes tens of 
> metrics every 10 seconds. When dealing with multiple streaming jobs this adds 
> up pretty quickly, leading to thousands of dollars in cost. 
> Exposing a way to disable monitoring, or set the proper level, is critical to 
> us. We had to send invalid credentials and suppress the resulting logs as a 
> less-than-ideal workaround: see 
> [https://stackoverflow.com/questions/41811039/disable-cloudwatch-for-aws-kinesis-at-spark-streaming/55599002#55599002]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28942) [Spark][WEB UI]Spark in local mode hostname display localhost in the Host Column of Task Summary Page

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28942:
--
Issue Type: Improvement  (was: Bug)

> [Spark][WEB UI]Spark in local mode hostname display localhost in the Host 
> Column of Task Summary Page
> -
>
> Key: SPARK-28942
> URL: https://issues.apache.org/jira/browse/SPARK-28942
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> On the stage page, the Host column of the Task Summary shows 'localhost' 
> instead of the host IP or host name listed as the Driver Host Name.
> Steps:
> spark-shell --master local
> create table emp(id int);
> insert into emp values(100);
> select * from emp;
> Go to the Stage UI page and check the Task Summary.
> The Host column displays 'localhost' instead of the driver host.
>  
> Note that with spark-shell --master yarn, the UI displays the correct host 
> name in this column.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28942) [Spark][WEB UI]Spark in local mode hostname display localhost in the Host Column of Task Summary Page

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28942.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25645
[https://github.com/apache/spark/pull/25645]

> [Spark][WEB UI]Spark in local mode hostname display localhost in the Host 
> Column of Task Summary Page
> -
>
> Key: SPARK-28942
> URL: https://issues.apache.org/jira/browse/SPARK-28942
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Shivu Sondur
>Priority: Minor
> Fix For: 3.0.0
>
>
> On the stage page, the Host column of the Task Summary shows 'localhost' 
> instead of the host IP or host name listed as the Driver Host Name.
> Steps:
> spark-shell --master local
> create table emp(id int);
> insert into emp values(100);
> select * from emp;
> Go to the Stage UI page and check the Task Summary.
> The Host column displays 'localhost' instead of the driver host.
>  
> Note that with spark-shell --master yarn, the UI displays the correct host 
> name in this column.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28942) [Spark][WEB UI]Spark in local mode hostname display localhost in the Host Column of Task Summary Page

2019-09-08 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28942:
-

Assignee: Shivu Sondur

> [Spark][WEB UI]Spark in local mode hostname display localhost in the Host 
> Column of Task Summary Page
> -
>
> Key: SPARK-28942
> URL: https://issues.apache.org/jira/browse/SPARK-28942
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Shivu Sondur
>Priority: Minor
>
> On the stage page, the Host column of the Task Summary shows 'localhost' 
> instead of the host IP or host name listed as the Driver Host Name.
> Steps:
> spark-shell --master local
> create table emp(id int);
> insert into emp values(100);
> select * from emp;
> Go to the Stage UI page and check the Task Summary.
> The Host column displays 'localhost' instead of the driver host.
>  
> Note that with spark-shell --master yarn, the UI displays the correct host 
> name in this column.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28967) ConcurrentModificationException is thrown from EventLoggingListener

2019-09-06 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28967.
---
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/25672

> ConcurrentModificationException is thrown from EventLoggingListener
> ---
>
> Key: SPARK-28967
> URL: https://issues.apache.org/jira/browse/SPARK-28967
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.0.0
>
>
> While testing SPARK-28869 manually, I found that a simple Structured Streaming 
> query continuously throws ConcurrentModificationException from 
> EventLoggingListener.
> Stack trace follows:
> {code:java}
> 19/09/04 09:48:49 ERROR AsyncEventQueue: Listener EventLoggingListener threw 
> an exception19/09/04 09:48:49 ERROR AsyncEventQueue: Listener 
> EventLoggingListener threw an 
> exceptionjava.util.ConcurrentModificationException at 
> java.util.Hashtable$Enumerator.next(Hashtable.java:1387) at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:424)
>  at 
> scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$6.next(Wrappers.scala:420)
>  at scala.collection.Iterator.foreach(Iterator.scala:941) at 
> scala.collection.Iterator.foreach$(Iterator.scala:941) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74) at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73) at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:237) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:230) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.util.JsonProtocol$.mapToJson(JsonProtocol.scala:514) at 
> org.apache.spark.util.JsonProtocol$.$anonfun$propertiesToJson$1(JsonProtocol.scala:520)
>  at scala.Option.map(Option.scala:163) at 
> org.apache.spark.util.JsonProtocol$.propertiesToJson(JsonProtocol.scala:519) 
> at org.apache.spark.util.JsonProtocol$.jobStartToJson(JsonProtocol.scala:155) 
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:79) 
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:149)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.onJobStart(EventLoggingListener.scala:217)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:37)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:99) at 
> org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:84) at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:102)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:102)
>  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at 
> scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:97)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:93)
>  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:93)
>  {code}
>  
> It also occurs with the current master branch.
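> A hedged sketch of the defensive-copy idea (the actual fix may differ): snapshot the job's Properties before they are posted to the listener bus, so the JSON-encoding thread never iterates a Hashtable that the driver thread is still mutating.
> {code:scala}
> import java.util.Properties
>
> // Hashtable.clone() is synchronized, so taking the snapshot cannot itself race
> // with concurrent setProperty calls; the listener bus then only ever sees a
> // private copy that nothing else mutates.
> def snapshotProperties(props: Properties): Properties =
>   props.clone().asInstanceOf[Properties]
> {code}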



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28980) Remove most remaining deprecated items since <= 2.2.0 for 3.0

2019-09-05 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28980:
--
Description: 
Following on https://issues.apache.org/jira/browse/SPARK-25908 I'd like to 
propose removing the rest of the items that have been deprecated since <= Spark 
2.2.0, before Spark 3.0.

This appears to be:

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, 
deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of 
SparkSession.builder.enableHiveSupport
- Remove deprecated toDegrees, toRadians SQL functions (see below)
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated 
methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility 
constructors and 'train' methods, and associated docs. This includes methods in 
LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been 
deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including 
LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 
2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor 
of 'yarn' and deploy mode 'cluster', etc

But while preparing the change, I found:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of 
DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is 
still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in 
Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost 
removed fully in SPARK-25908, but Felix suggested keeping just the R version as 
they hadn't been technically deprecated. I'd like to revisit that. Do we really 
want the inconsistency? I'm not against reverting it again, but then that 
implies leaving SQLContext.createExternalTable just in Pyspark too, which seems 
weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, 
RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove 
them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully 
removed in Scala. Maybe should not have been deprecated.

I will open a PR accordingly for more detailed review.

  was:
Following on https://issues.apache.org/jira/browse/SPARK-25908 I'd like to 
propose removing the rest of the items that have been deprecated since <= Spark 
2.2.0, before Spark 3.0.

This appears to be:

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, 
deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove deprecated toDegrees, toRadians SQL functions (see below)
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated 
methods, deprecated in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility 
constructors and 'train' methods, and associated docs. This includes methods in 
LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been 
deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including 
LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 
2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor 
of 'yarn' and deploy mode 'cluster', etc

But while preparing the change, I found:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of 
DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is 
still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in 
Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost 
removed fully in SPARK-25908, but Felix suggested keeping just the R version as 
they hadn't been technically deprecated. I'd like to revisit that. Do we really 
want the inconsistency? I'm not against reverting it again, but then that 
implies leaving SQLContext.createExternalTable just in Pyspark too, which seems 
weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, 
RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove 
them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully 
removed in Scala. Maybe should not have been deprecated.

[jira] [Commented] (SPARK-28981) Missing library for reading/writing Snappy-compressed files

2019-09-05 Thread Sean Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923682#comment-16923682
 ] 

Sean Owen commented on SPARK-28981:
---

(Really we could say it's a duplicate of 
https://issues.apache.org/jira/browse/SPARK-26995 )

> Missing library for reading/writing Snappy-compressed files
> ---
>
> Key: SPARK-28981
> URL: https://issues.apache.org/jira/browse/SPARK-28981
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Paul Schweigert
>Priority: Minor
>
> The current Dockerfile for Spark on Kubernetes is missing the 
> "ld-linux-x86-64.so.2" library needed to read / write Snappy-compressed 
> files. 
>  
> Sample error message when trying to read a parquet file compressed with 
> snappy:
>  
> {code:java}
> 19/09/02 05:33:19 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
> 172.30.189.77, executor 2): org.apache.spark.SparkException: Task failed 
> while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
> 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
> 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
> 
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.UnsatisfiedLinkError: 
> /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so: 
> Error loading shared library ld-linux-x86-64.so.2: No such file or directory 
> (needed by 
> /tmp/snappy-1.1.7-04145e2f-cc82-4217-99b8-641cdd755a87-libsnappyjava.so)
> at java.lang.ClassLoader$NativeLibrary.load(Native Method)
> at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
> at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
> at java.lang.Runtime.load0(Runtime.java:809)
> at java.lang.System.load(System.java:1086)
> at 
> org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:179)
> at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)
> at org.xerial.snappy.Snappy.(Snappy.java:47)
> at 
> org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
> 
> at 
> org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
> 
> at 
> org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
> 
> at 
> org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.compress(CodecFactory.java:165)
> 
> at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:95)
> 
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:147)
> 
> at 
> org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:235)  
>   
> at 
> org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:122)
> 
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:172)
> 
> at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:114)
> 
> at 
> org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
> 
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
> 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:57)
> 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:74)
> 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
> 
> at 
> 

[jira] [Updated] (SPARK-28977) JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page

2019-09-05 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28977:
--
Fix Version/s: 2.4.5

> JDBC Dataframe Reader Doc Doesn't Match JDBC Data Source Page
> -
>
> Key: SPARK-28977
> URL: https://issues.apache.org/jira/browse/SPARK-28977
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: Christopher Hoshino-Fish
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> [https://spark.apache.org/docs/2.4.3/sql-data-sources-jdbc.html]
> Specifically in the partitionColumn section, this page says:
> "{{partitionColumn}} must be a numeric, date, or timestamp column from the 
> table in question."
>  
> But then in this doc: 
> [https://spark.apache.org/docs/2.4.3/api/scala/index.html#org.apache.spark.sql.DataFrameReader]
> in def jdbc(url: String, table: String, columnName: String, lowerBound: Long, 
> upperBound: Long, numPartitions: Int, connectionProperties: Properties): 
> DataFrame
> we have:
> columnName
> the name of a column of integral type that will be used for partitioning.
>  
> This appears to go back pretty far, to 1.6.3, but I'm not sure when this was 
> accurate.
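> For reference, a minimal sketch of the partitioned-read call the scaladoc describes (the connection details and the SparkSession "spark" here are placeholders):
> {code:scala}
> import java.util.Properties
>
> val props = new Properties()
> props.setProperty("user", "spark")        // placeholder credentials
> props.setProperty("password", "secret")
>
> // Splits the scan into 8 ranges over the integral "id" column.
> val df = spark.read.jdbc(
>   url = "jdbc:postgresql://db-host:5432/sales",  // hypothetical database
>   table = "orders",
>   columnName = "id",
>   lowerBound = 0L,
>   upperBound = 1000000L,
>   numPartitions = 8,
>   connectionProperties = props)
> {code}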



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


