[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-15 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581056#comment-16581056
 ] 

Saisai Shao commented on LIVY-489:
--

Hi [~mgaido], let's combine tasks 2, 3, and 4 together, and handle 5 and 6 
separately, then submit PRs accordingly. I've already done a partial review; can 
you please submit a PR against Livy master?

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation on its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, Spark currently exposes a JDBC interface, but it has many 
> limitations, including that all the queries are submitted to the same 
> application, so there is no isolation/security; Livy can offer both, making 
> a Livy JDBC API a better solution for companies/users who want to use Spark 
> to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  
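For a sense of what the endpoint would enable, here is a minimal client sketch, assuming the endpoint speaks the HiveServer2/Thrift protocol like the STS; the host, port, database, and table names are illustrative, and the Hive JDBC driver must be on the classpath:

{code:scala}
import java.sql.DriverManager

// Illustrative connection values; a Livy JDBC endpoint would define its own host/port.
val url = "jdbc:hive2://livy-server:10090/default"
val conn = DriverManager.getConnection(url, "someUser", "")
try {
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
  rs.close()
  stmt.close()
} finally {
  conn.close()
}
{code}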



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-13 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578207#comment-16578207
 ] 

Saisai Shao commented on LIVY-489:
--

BTW, [~mgaido], do you have a branch or a diff that I can review?

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation on its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, Spark currently exposes a JDBC interface, but it has many 
> limitations, including that all the queries are submitted to the same 
> application, so there is no isolation/security; Livy can offer both, making 
> a Livy JDBC API a better solution for companies/users who want to use Spark 
> to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-13 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578133#comment-16578133
 ] 

Saisai Shao commented on LIVY-489:
--

I see. But such a big change is hard to review; it would be better to split it 
into small pieces. You don't have to make them fully independent, a series of 
dependent tasks is OK.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation on its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, Spark currently exposes a JDBC interface, but it has many 
> limitations, including that all the queries are submitted to the same 
> application, so there is no isolation/security; Livy can offer both, making 
> a Livy JDBC API a better solution for companies/users who want to use Spark 
> to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25020) Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8

2018-08-12 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577846#comment-16577846
 ] 

Saisai Shao commented on SPARK-25020:
-

Yes, we have hit the same issue. This looks more like a Hadoop problem than a 
Spark problem. Would you please also report this issue to the Hadoop community?

CC [~ste...@apache.org], I think I discussed this same issue with you before; I 
believe it is something that should be fixed on the Hadoop side.

> Unable to Perform Graceful Shutdown in Spark Streaming with Hadoop 2.8
> --
>
> Key: SPARK-25020
> URL: https://issues.apache.org/jira/browse/SPARK-25020
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: Spark Streaming
> -- Tested on 2.2 & 2.3 (more than likely affects all versions with graceful 
> shutdown) 
> Hadoop 2.8
>Reporter: Ricky Saltzer
>Priority: Major
>
> Opening this up to give you guys some insight into an issue that will occur 
> when using Spark Streaming with Hadoop 2.8. 
> Hadoop 2.8 added HADOOP-12950, which adds an upper limit of a 10-second timeout 
> for its shutdown hook. From our tests, if the Spark job takes longer than 10 
> seconds to shut down gracefully, the Hadoop shutdown thread seems to trample 
> over the graceful shutdown and throw an exception like
> {code:java}
> 18/08/03 17:21:04 ERROR Utils: Uncaught exception in thread pool-1-thread-1
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
> at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.stop(ReceiverTracker.scala:177)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:114)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$stop$1.apply$mcV$sp(StreamingContext.scala:682)
> at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1317)
> at 
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:681)
> at 
> org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:715)
> at 
> org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:599)
> at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
> at scala.util.Try$.apply(Try.scala:192)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){code}
> I hit this issue because we recently upgraded to EMR 5.15, which ships both 
> Spark 2.3 & Hadoop 2.8. The following workaround has proven successful for us 
> (in limited testing).
> Instead of just running
> {code:java}
> ...
> ssc.start()
> ssc.awaitTermination(){code}
> We needed to do the following:
> {code:java}
> ...
> ssc.start()
> sys.ShutdownHookThread {
>   ssc.stop(true, true)
> }
> ssc.awaitTermination(){code}
> As far as I can tell, there is no way to override the default {{10 second}} 
> timeout in HADOOP-12950, which is why we had to go with the workaround. 
> Note: I also verified this bug exists even with EMR 5.12.1 which runs Spark 
> 2.2.x & Hadoop 2.8. 
> Ricky
>  Epic Games
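Putting the reporter's pieces together, a minimal self-contained sketch of the workaround; the socket source and batch interval are illustrative, and the key part is registering a shutdown hook that performs the graceful stop before awaiting termination:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-shutdown-workaround")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Illustrative source and action; any streaming job shape applies.
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()

    // Workaround from the report: stop gracefully from our own shutdown hook
    // instead of relying on Spark's stopOnShutdown hook, which Hadoop 2.8's
    // 10-second shutdown-hook timeout (HADOOP-12950) can interrupt.
    sys.ShutdownHookThread {
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    ssc.awaitTermination()
  }
}
{code}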



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

--

[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2018-08-12 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1650#comment-1650
 ] 

Saisai Shao commented on LIVY-489:
--

Hi [~mgaido], can you please also list the tasks you're planning to do and 
create subtasks accordingly?

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Priority: Major
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation on its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, Spark currently exposes a JDBC interface, but it has many 
> limitations, including that all the queries are submitted to the same 
> application, so there is no isolation/security; Livy can offer both, making 
> a Livy JDBC API a better solution for companies/users who want to use Spark 
> to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575778#comment-16575778
 ] 

Saisai Shao commented on SPARK-25084:
-

I see. Unfortunately I've already cut RC4; if this is worth including in 2.3.2, 
I will cut a new RC.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575774#comment-16575774
 ] 

Saisai Shao commented on SPARK-25084:
-

Is this a regression, or a bug that also exists in older versions?

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25084) "distribute by" on multiple columns may lead to codegen issue

2018-08-09 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575772#comment-16575772
 ] 

Saisai Shao commented on SPARK-25084:
-

I'm already preparing the new RC4. If this is not a severe issue, I would prefer 
not to block the RC4 release.

> "distribute by" on multiple columns may lead to codegen issue
> -
>
> Key: SPARK-25084
> URL: https://issues.apache.org/jira/browse/SPARK-25084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
>
> Test Query:
> {code:java}
> select * from store_sales distribute by (ss_sold_time_sk, ss_item_sk, 
> ss_customer_sk, ss_cdemo_sk, ss_addr_sk, ss_promo_sk, ss_ext_list_price, 
> ss_net_profit) limit 1000;{code}
> Wrong Codegen:
> {code:java}
> /* 146 */ private int computeHashForStruct_0(InternalRow 
> mutableStateArray[0], int value1) {
> /* 147 */
> /* 148 */
> /* 149 */ if (!mutableStateArray[0].isNullAt(0)) {
> /* 150 */
> /* 151 */ final int element = mutableStateArray[0].getInt(0);
> /* 152 */ value1 = 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element, value1);
> /* 153 */
> /* 154 */ }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Fix Version/s: 2.2.3

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.
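For context, a sketch of the difference between a base-permission-bits check and asking the filesystem itself, using standard Hadoop FileSystem APIs; this is a simplification of whatever check the SHS actually performs:

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsAction
import org.apache.hadoop.security.AccessControlException

// Naive check: inspect only the base permission bits (user/group/other).
def naiveCanRead(fs: FileSystem, path: Path, user: String, groups: Set[String]): Boolean = {
  val status = fs.getFileStatus(path)
  val perm = status.getPermission
  if (status.getOwner == user) perm.getUserAction.implies(FsAction.READ)
  else if (groups.contains(status.getGroup)) perm.getGroupAction.implies(FsAction.READ)
  else perm.getOtherAction.implies(FsAction.READ)
}

// Filesystem-aware check: let the filesystem decide, which also covers ACLs
// and per-filesystem policies such as superuser access.
def canActuallyRead(fs: FileSystem, path: Path): Boolean =
  try { fs.access(path, FsAction.READ); true }
  catch { case _: AccessControlException => false }
{code}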



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24948:
---

Assignee: Marco Gaido

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Fix Version/s: 2.3.2

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency

2018-08-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572542#comment-16572542
 ] 

Saisai Shao commented on SPARK-22634:
-

[~srowen] I'm wondering if it is possible to upgrade to version 1.6.0, as this 
version fixed two CVEs (https://www.bouncycastle.org/latest_releases.html).

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.3.0
>
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version, as the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions ship with 
> 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.
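For downstream projects hitting this, a sketch of forcing a newer Bouncy Castle onto the classpath from an sbt build; the artifact name bcprov-jdk15on and the version are assumptions based on the versions mentioned above:

{code:scala}
// build.sbt (sketch): override the transitive Bouncy Castle version pulled in via Spark/jets3t.
dependencyOverrides += "org.bouncycastle" % "bcprov-jdk15on" % "1.58"
{code}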



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25030) SparkSubmit.doSubmit will not return result if the mainClass submitted creates a Timer()

2018-08-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570995#comment-16570995
 ] 

Saisai Shao commented on SPARK-25030:
-

I would like to see more details about this issue.

> SparkSubmit.doSubmit will not return result if the mainClass submitted 
> creates a Timer()
> 
>
> Key: SPARK-25030
> URL: https://issues.apache.org/jira/browse/SPARK-25030
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Jiang Xingbo
>Priority: Major
>
> Creating a Timer() in the mainClass submitted to SparkSubmit makes it unable 
> to fetch the result; the issue is very easy to reproduce.
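A minimal sketch of the kind of main class the description refers to; the class name is illustrative. java.util.Timer() starts a non-daemon thread, which keeps the submitting JVM alive after main() returns:

{code:scala}
import java.util.{Timer, TimerTask}

object TimerApp {
  def main(args: Array[String]): Unit = {
    val timer = new Timer()                  // non-daemon thread by default
    timer.schedule(new TimerTask {
      override def run(): Unit = println("tick")
    }, 1000L, 1000L)
    println("main() is done, but the timer thread keeps the JVM alive")
  }
}
{code}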



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Priority: Blocker  (was: Major)

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Blocker
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24948) SHS filters wrongly some applications due to permission check

2018-08-02 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24948:

Target Version/s: 2.2.3, 2.3.2, 2.4.0

> SHS filters wrongly some applications due to permission check
> -
>
> Key: SPARK-24948
> URL: https://issues.apache.org/jira/browse/SPARK-24948
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Marco Gaido
>Priority: Major
>
> SHS filters out the event logs it doesn't have permission to read. 
> Unfortunately, this check is quite naive, as it takes into account only the 
> base permissions (i.e. user, group, and other permissions). For instance, if 
> ACLs are enabled, they are ignored in this check; moreover, each filesystem 
> may have different policies (e.g. it may consider spark a superuser who can 
> access everything).
> This results in some applications not being displayed in the SHS, even though 
> the Spark user (or whatever user the SHS is started with) can actually read 
> their event logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564562#comment-16564562
 ] 

Saisai Shao commented on SPARK-24615:
-

Leveraging dynamic allocation to tear down and bring up the desired executors is 
a non-goal here; we will address it in the future. Currently we're still focusing 
on static allocation, e.g. spark.executor.gpus.
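As a rough illustration of that static-allocation style; note that spark.executor.gpus is the example key proposed in this discussion, not an existing Spark configuration:

{code:scala}
import org.apache.spark.SparkConf

// Sketch: request accelerators statically at submit time, in the same spirit as
// spark.executor.cores. "spark.executor.gpus" is a proposed key, not a released one.
val conf = new SparkConf()
  .setAppName("gpu-job")
  .set("spark.executor.cores", "4")
  .set("spark.executor.gpus", "2")
{code}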

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are installed.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using CPU 
> cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node has CPUs, but this is not 
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) requires 
> the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564562#comment-16564562
 ] 

Saisai Shao edited comment on SPARK-24615 at 8/1/18 12:35 AM:
--

Leveraging dynamic allocation to tear down and bring up the desired executors is 
a non-goal here; we will address it in the future. Currently we're still focusing 
on static allocation, e.g. spark.executor.gpus.


was (Author: jerryshao):
Leveraging dynamic allocation to tear down and bring up desired executor is a 
non-goal here, we will address it in feature, currently we're still focusing on 
using static allocation like spark.executor.gpus.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are installed.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using CPU 
> cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node has CPUs, but this is not 
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) requires 
> the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563730#comment-16563730
 ] 

Saisai Shao edited comment on SPARK-24615 at 7/31/18 2:16 PM:
--

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> transforming 
could possibly run in a single stage, but using eval would break it into several 
stages, and I'm not sure that is the right choice. Also, if we use eval to break 
the lineage, how do we store the intermediate data: via shuffle, in memory, or on 
disk?

Yes, it is hard for users to know where the boundaries should be, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve the conflicts, failing might be one choice. In the SQL or 
DataFrame area, I don't think we have to expose such low-level RDD APIs to users; 
maybe some hints would be enough (though I haven't thought this through).

Currently in my design, withResources only applies to the stage in which the RDD 
is executed; the following stages will still be ordinary stages without 
additional resources.


was (Author: jerryshao):
Hi [~tgraves], I think eval() might unnecessarily break the lineage which could 
execute in one stage, for example: data transforming -> training -> 
transforming, this could possibly run in one stage, using eval will break into 
several stages, I'm not sure if it is the good choice. Also if we use eval to 
break the lineage, how do we store the intermediate data, like shuffle, or in 
memory/ on disk?

Yes, how to break the boundaries is hard for user to know, but currently I 
cannot figure out a good solution, unless we use eval() to explicitly separate 
them. To solve the conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which RDD 
will be executed, the following stages will still be ordinary stages without 
additional resouces.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are installed.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators:
>  # There are usually more CPU cores than accelerators on one node, so using CPU 
> cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that each node has CPUs, but this is not 
> true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) requires 
> the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16563730#comment-16563730
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I think eval() might unnecessarily break a lineage that could 
execute in one stage. For example, data transforming -> training -> transforming 
could possibly run in a single stage, but using eval would break it into several 
stages, and I'm not sure that is the right choice. Also, if we use eval to break 
the lineage, how do we store the intermediate data: via shuffle, in memory, or on 
disk?

Yes, it is hard for users to know where the boundaries should be, but currently I 
cannot figure out a good solution unless we use eval() to explicitly separate 
them. To resolve the conflicts, failing might be one choice.

Currently in my design, withResources only applies to the stage in which the RDD 
is executed; the following stages will still be ordinary stages without 
additional resources.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc. It doesn't cover all the implementation 
> details, just highlight the parts should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24975) Spark history server REST API /api/v1/version returns error 404

2018-07-31 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24975.
-
Resolution: Duplicate

> Spark history server REST API /api/v1/version returns error 404
> ---
>
> Key: SPARK-24975
> URL: https://issues.apache.org/jira/browse/SPARK-24975
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.3.1
>Reporter: shanyu zhao
>Priority: Major
>
> Spark history server REST API provides /api/v1/version, according to doc:
> [https://spark.apache.org/docs/latest/monitoring.html]
> However, for Spark 2.3, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> 
> 
> 
> Error 404 Not Found
> 
> HTTP ERROR 404
> Problem accessing /api/v1/version. Reason:
>  Not Found
> Powered by Jetty:// 9.3.z-SNAPSHOT
> 
> {code}
> On a Spark 2.2 cluster, we see:
> {code:java}
> curl http://localhost:18080/api/v1/version
> {
> "spark" : "2.2.0"
> }{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24957:

Target Version/s: 2.4.0

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct value 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct. 
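Given the last observation, a possible temporary mitigation is to turn codegen off for the affected query; the sketch below assumes the whole-stage codegen switch is the relevant one and continues the spark-shell repro above:

{code:scala}
// Mitigation sketch, not a fix: disable whole-stage codegen and re-run the
// aggregation that produced the wrong value.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
val df_grouped_2_nocodegen = df_grouped_1
  .groupBy(df_grouped_1.col("text"))
  .agg(functions.sum(df_grouped_1.col("number")).as("number"))
df_grouped_2_nocodegen.collect()
{code}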



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24957) Decimal arithmetic can lead to wrong values using codegen

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561334#comment-16561334
 ] 

Saisai Shao commented on SPARK-24957:
-

It's not necessary to mark this as a blocker; we still have plenty of time before 
the 2.4 release. I will mark the target version as 2.4.0.

> Decimal arithmetic can lead to wrong values using codegen
> -
>
> Key: SPARK-24957
> URL: https://issues.apache.org/jira/browse/SPARK-24957
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: David Vogelbacher
>Priority: Major
>
> I noticed a bug when doing arithmetic on a dataframe containing decimal 
> values with codegen enabled.
> I tried to narrow it down to a small repro and got this (executed in 
> spark-shell):
> {noformat}
> scala> val df = Seq(
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("12.0")),
>  | ("a", BigDecimal("11.88")),
>  | ("a", BigDecimal("11.88"))
>  | ).toDF("text", "number")
> df: org.apache.spark.sql.DataFrame = [text: string, number: decimal(38,18)]
> scala> val df_grouped_1 = 
> df.groupBy(df.col("text")).agg(functions.avg(df.col("number")).as("number"))
> df_grouped_1: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_1.collect()
> res0: Array[org.apache.spark.sql.Row] = Array([a,11.94857142857143])
> scala> val df_grouped_2 = 
> df_grouped_1.groupBy(df_grouped_1.col("text")).agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_grouped_2: org.apache.spark.sql.DataFrame = [text: string, number: 
> decimal(38,22)]
> scala> df_grouped_2.collect()
> res1: Array[org.apache.spark.sql.Row] = 
> Array([a,11948571.4285714285714285714286])
> scala> val df_total_sum = 
> df_grouped_1.agg(functions.sum(df_grouped_1.col("number")).as("number"))
> df_total_sum: org.apache.spark.sql.DataFrame = [number: decimal(38,22)]
> scala> df_total_sum.collect()
> res2: Array[org.apache.spark.sql.Row] = Array([11.94857142857143])
> {noformat}
> The results of {{df_grouped_1}} and {{df_total_sum}} are correct, whereas the 
> result of {{df_grouped_2}} is clearly incorrect (it is the correct value 
> multiplied by {{10^14}}).
> When codegen is disabled, all results are correct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24932:

Target Version/s:   (was: 2.3.2)

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we added support for the update output mode for 
> non-aggregation streaming queries. This should also apply to streaming joins, 
> to keep the semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using the update _output_ mode, the join will work exactly as in _append_ 
> mode. However, this will, for example, allow users to run an 
> aggregation-after-join query in update mode in order to get more real-time 
> result output.
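For concreteness, a sketch of the aggregation-after-join shape this change would allow in update mode; the rate sources and column names are illustrative, and this query shape is not accepted by the Spark versions listed above:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("join-then-aggregate").getOrCreate()

// Two illustrative streaming sources; "rate" emits (timestamp, value) rows.
val left = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .withColumnRenamed("value", "key")
val right = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .withColumnRenamed("value", "key")

// Stream-stream join followed by an aggregation, emitted in update mode.
val query = left.join(right, "key")
  .groupBy("key")
  .count()
  .writeStream
  .outputMode("update")   // the output mode this ticket proposes to allow for joins
  .format("console")
  .start()

query.awaitTermination()
{code}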



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24932) Allow update mode for streaming queries with join

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561331#comment-16561331
 ] 

Saisai Shao commented on SPARK-24932:
-

I'm going to remove the target version for this JIRA. Committers will set the 
fix version properly when the change is merged.

> Allow update mode for streaming queries with join
> -
>
> Key: SPARK-24932
> URL: https://issues.apache.org/jira/browse/SPARK-24932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Eric Fu
>Priority: Major
>
> In issue SPARK-19140 we added support for the update output mode for 
> non-aggregation streaming queries. This should also apply to streaming joins, 
> to keep the semantics consistent.
> PS. The streaming join feature was added after SPARK-19140. 
> When using the update _output_ mode, the join will work exactly as in _append_ 
> mode. However, this will, for example, allow users to run an 
> aggregation-after-join query in update mode in order to get more real-time 
> result output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24964) Please add OWASP Dependency Check to all comonent builds(pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24964:

Target Version/s:   (was: 2.3.2, 2.4.0, 3.0.0, 2.3.3)

>  Please add OWASP Dependency Check to all comonent builds(pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third-party libraries/dependencies that get pulled 
> in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all comonent builds(pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16561327#comment-16561327
 ] 

Saisai Shao commented on SPARK-24964:
-

I'm going to remove the target version that was set; usually we don't set this 
for a feature request. Committers will set a fix version when the change is merged.

>  Please add OWASP Dependency Check to all comonent builds(pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & 
> Exposures (CVE) to perform a lookup for each dependent .jar and list any/all 
> known vulnerabilities for each jar. This step is needed because a manual 
> MITRE CVE lookup/check on the main component does not include checking for 
> vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate 
> command to the nightly build to generate a report of all known 
> vulnerabilities in any/all third-party libraries/dependencies that get pulled 
> in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-25 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556760#comment-16556760
 ] 

Saisai Shao commented on SPARK-24867:
-

I see, thanks! Please let me know when the JIRA is opened.

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.2
>
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555080#comment-16555080
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves] [~irashid] thanks a lot for your comments.

Currently in my design I don't insert a specific stage boundary for different 
resources; the stage boundary is still the same (by shuffle or by result). So 
{{withResources}} is not an eval() action which triggers a stage. Instead, it 
just adds a resource hint to the RDD.

This means RDDs with different resource requirements in one stage may have 
conflicts. For example, in {{rdd1.withResources.mapPartitions \{ xxx 
\}.withResources.mapPartitions \{ xxx \}.collect}}, the resources for rdd1 may be 
different from those for the mapped RDD, so currently the options I can think of are:

1. Always pick the latter, with a warning log saying that multiple different 
resource requirements in one stage are illegal.
2. Fail the stage, with a warning log saying that multiple different resource 
requirements in one stage are illegal.
3. Merge conflicts by taking the maximum resource needs. For example, if rdd1 requires 3 
GPUs per task and rdd2 requires 4 GPUs per task, then the merged requirement would 
be 4 GPUs per task. (This is the high-level description; the details will be per 
partition based merging.) [chosen]

Take join for example, where rdd1 and rdd2 may have different resource 
requirements, and the joined RDD will potentially have yet other resource requirements.

For example:

{code}
val rddA = rdd.mapPartitions().withResources
val rddB = rdd.mapPartitions().withResources
val rddC = rddA.join(rddB).withResources
rddC.collect()
{code}

Here these 3 {{withResources}} calls may have different requirements. Since {{rddC}} 
runs in a different stage, there's no need to merge its resource 
requirements. But {{rddA}} and {{rddB}} run in the same stage as 
different tasks (partitions). So the merging strategy I'm thinking of is partition 
based: tasks running partitions from {{rddA}} will use the 
resources specified by {{rddA}}, and likewise for {{rddB}}.
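
Purely as an illustration of option 3 above (per-resource maximum merge), here is a 
minimal sketch; the {{withResources}} API does not exist in Spark today, and the 
{{ResourceProfile}} shape below is an assumption made only for this example.

{code}
// Hypothetical sketch: merge two per-task resource requirements that collide
// inside one stage by taking the maximum of each resource type.
case class ResourceProfile(gpusPerTask: Int, fpgasPerTask: Int = 0)

object ResourceProfile {
  def merge(a: ResourceProfile, b: ResourceProfile): ResourceProfile =
    ResourceProfile(
      gpusPerTask = math.max(a.gpusPerTask, b.gpusPerTask),
      fpgasPerTask = math.max(a.fpgasPerTask, b.fpgasPerTask))
}

// rdd1 asks for 3 GPUs per task, the mapped RDD asks for 4; the merged
// requirement is 4 GPUs per task, as in option 3 above.
val merged = ResourceProfile.merge(ResourceProfile(3), ResourceProfile(4))
// merged == ResourceProfile(4, 0)
{code}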





> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554986#comment-16554986
 ] 

Saisai Shao commented on SPARK-24867:
-

[~smilegator] what's the ETA of this issue?

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> The cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it for Spark 2.3.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24297:
---

Assignee: Imran Rashid

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that does not use stream-to-disk and is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.
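
For reference, users can already cap this threshold explicitly today; a minimal 
sketch (the "200m" value is only an illustrative choice, not the new default 
being discussed here):

{code}
import org.apache.spark.sql.SparkSession

// Any remote block larger than this threshold is streamed to disk instead of
// being buffered fully in memory.
val spark = SparkSession.builder()
  .appName("fetch-to-mem-example")
  .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
  .getOrCreate()
{code}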



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24297.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21474
[https://github.com/apache/spark/pull/21474]

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that does not use stream-to-disk and is sending over 
> 2GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-23 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553679#comment-16553679
 ] 

Saisai Shao commented on SPARK-24615:
-

[~Tagar] yes!

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24594.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21635
[https://github.com/apache/spark/pull/21635]

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 it came up to introduce metrics for YARN executor 
> allocation problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24594:
---

Assignee: Attila Zsolt Piros

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 it came up to introduce metrics for YARN executor 
> allocation problems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-22 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552233#comment-16552233
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], what you mentioned above is also what we have been thinking about and 
trying to figure out a way to solve (this problem also exists in barrier execution).

From the user's point of view, specifying resources through the RDD is the only feasible 
way I can currently think of, even though a resource is bound to a stage/task, not a 
particular RDD. This means a user could specify resources for different RDDs in 
a single stage, while Spark can only use one resource requirement within that stage. This 
brings out several problems, as you mentioned:

*Which RDD the resources are specified on*

For example, {{rddA.withResource.mapPartition \{ xxx \}.collect()}} is no 
different from {{rddA.mapPartition \{ xxx \}.withResource.collect}}, since all 
the RDDs are executed in the same stage. So in the current design, no matter whether 
the resource is specified on {{rddA}} or the mapped RDD, the result is the same.

*One-to-one dependency RDDs with different resources*

For example, in {{rddA.withResource.mapPartition \{ xxx \}.withResource.collect()}}, 
assuming the resource requests for {{rddA}} and the mapped RDD are different, 
since they're running in a single stage we need to resolve such a conflict.

*Multiple-dependency RDDs with different resources*

For example:

{code}
val rddA = rdd.withResources.mapPartitions()

val rddB = rdd.withResources.mapPartitions()

val rddC = rddA.join(rddB)
{code}

If the resources for {{rddA}} are different from those for {{rddB}}, then we should also 
resolve such conflicts.

Previously I proposed using the largest resource requirement to satisfy all the 
needs. But that may also waste resources; [~mengxr] suggested 
setting/merging resources per partition to avoid the waste. Meanwhile, if there were 
an API exposed to set resources at the stage level, then this problem would not 
exist, but Spark doesn't expose such an API to users; the only thing a user can 
specify is at the RDD level. I'm still thinking of a good way to fix it.
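
Just to make that gap concrete, a hypothetical stage-level API could look roughly like 
the sketch below; nothing here exists in Spark, it is only an illustration of the idea, 
and all names are assumptions.

{code}
// Hypothetical only: Spark exposes no stage-level resource API today.
case class StageResources(gpusPerTask: Int, cpusPerTask: Int = 1)

trait StageResourceSupport[T] {
  // Attach one requirement to the whole stage instead of to individual RDDs,
  // so there is nothing to merge and no per-RDD conflict to resolve.
  def withStageResources(req: StageResources): T
}

// Usage sketch: the ETL stage runs with default resources, and only the
// training stage asks for GPUs:
//   etlStage.withStageResources(StageResources(gpusPerTask = 0))
//   trainStage.withStageResources(StageResources(gpusPerTask = 4))
{code}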
 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550266#comment-16550266
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm still not sure how to handle memory per stage. Unlike MR, 
Spark shares the task runtime within a single JVM, and I'm not sure how to control 
memory usage inside the JVM. Are you suggesting an approach similar to the GPU 
case: when the memory requirement cannot be satisfied, release the current executors 
and request new executors via dynamic resource allocation?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550182#comment-16550182
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I don't think YARN has such a feature to configure password-less 
SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment 
(Ambari) we don't use password-less SSH.
{quote}And does container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?
{quote}
If the container is not dockerized, it shares the system's sshd, and it 
is the system's responsibility to start/terminate this daemon.

If the container is dockerized, I think the docker container should be 
responsible for starting sshd (IIUC).

Maybe we should check whether sshd is started before starting the MPI job; if sshd is 
not started, we simply cannot run the MPI job, no matter who is responsible for the sshd 
daemon.
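
As a rough illustration of that pre-flight check, a minimal sketch (the host names 
and the port/timeout values are placeholders, not anything Spark or YARN provides):

{code}
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

// Probe whether something is listening on the SSH port of each host before
// launching the MPI job; hosts, port and timeout are illustrative placeholders.
def sshdReachable(host: String, port: Int = 22, timeoutMs: Int = 2000): Boolean =
  Try {
    val socket = new Socket()
    try socket.connect(new InetSocketAddress(host, port), timeoutMs)
    finally socket.close()
  }.isSuccess

val hosts = Seq("worker-1", "worker-2")
if (!hosts.forall(sshdReachable(_))) {
  sys.error("sshd is not reachable on every container host; cannot launch the MPI job")
}
{code}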

[~leftnoteasy] might have some thoughts, since he is the originator of 
mpich2-yarn.

 

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24195:
---

Assignee: Li Yuanjian

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Li Yuanjian
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config
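
One possible direction for a fix, sketched under the assumption that we only special-case 
paths that already carry a URI scheme such as {{local:}} and keep the old behaviour for 
plain file-system paths (this is a sketch, not necessarily what the merged patch does):

{code}
import java.io.File
import java.net.URI

// Only canonicalize scheme-less paths; leave "local:" (and other schemed) URIs
// untouched so java.io.File never sees the scheme prefix.
def resolveAddFilePath(path: String): String = {
  val uri = new URI(path)
  if (uri.getScheme == null) new File(path).getCanonicalFile.toURI.toString
  else path
}

resolveAddFilePath("local:///home/user/demo/logger.config")
// -> "local:///home/user/demo/logger.config"
resolveAddFilePath("/home/user/demo/logger.config")
// -> "file:/home/user/demo/logger.config"
{code}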



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24195.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21533
[https://github.com/apache/spark/pull/21533]

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550167#comment-16550167
 ] 

Saisai Shao commented on SPARK-24307:
-

Issue resolved by pull request 21440
[https://github.com/apache/spark/pull/21440]

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support bloc

[jira] [Resolved] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24307.
-
Resolution: Fixed
  Assignee: Imran Rashid

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached 

[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24307:

Fix Version/s: 2.4.0

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}. Sending large FileRegions works, as netty supports large 
> FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly 
> a problem for sending large datasets that are already in memory, e.g. cached 
> RDD blocks.
> E.g. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached in memory as 
> deserialized values would need to have the *enti

[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: (was: 2.4.0)

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: 2.4.0

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549276#comment-16549276
 ] 

Saisai Shao commented on SPARK-24615:
-

Thanks [~tgraves] for the suggestion. 
{quote}Once I get to the point I want to do the ML I want to ask for the gpu's 
as well as ask for more memory during that stage because I didn't need more 
before this stage for all the etl work.  I realize you already have executors, 
but ideally spark with the cluster manager could potentially release the 
existing ones and ask for new ones with those requirements.
{quote}
Yes, I already discussed this with my colleague offline; it is a valid scenario, 
but I think to achieve it we would have to change the current dynamic resource 
allocation mechanism. Currently I have marked this as a non-goal in this proposal and 
only focus on static resource requests (--executor-cores, 
--executor-gpus). I think we should support it later.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-18 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548662#comment-16548662
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm rewriting the design doc based on the comments mentioned 
above, so I temporarily made it inaccessible; sorry about that, I will reopen it.

I think it is hard to control memory usage per stage/task, because tasks 
run in an executor that is shared within a single JVM. For CPU, yes, I think we can 
do it, but I'm not sure about the usage scenario for it.

For the requirement of using different types of machines, what I can think of is 
leveraging dynamic resource allocation. For example, if a user wants to run some MPI 
jobs with barrier enabled, then Spark could allocate some new executors with 
accelerator resources via the cluster manager (for example using node labels if it is 
running on YARN). But I will not target this as a goal in this design, since a) 
it is currently a non-goal for the barrier scheduler; b) it makes the design too 
complex, and it would be better to separate it into another piece of work.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-17 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547263#comment-16547263
 ] 

Saisai Shao commented on SPARK-24615:
-

Sure, I will also add it, as Xiangrui raised the same concern.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-17 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546590#comment-16546590
 ] 

Saisai Shao commented on SPARK-24615:
-

Yes, currently the user is responsible for asking YARN for resources like 
GPUs, for example via something like {{--num-gpus}}. 

Yes, I agree with you, the concerns you mentioned above are valid. But currently 
the design in this Jira only targets task-level scheduling with 
accelerator resources already available. It assumes that accelerator resources 
have already been acquired by the executors and reported back to the driver. The driver 
will schedule the tasks based on the available resources.

Having Spark communicate with the cluster manager to request other resources like 
GPUs is currently not covered in this design doc.

Xiangrui mentioned that Spark communicating with the cluster manager should also 
be covered in this SPIP, so I'm still drafting.

 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-16 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545962#comment-16545962
 ] 

Saisai Shao commented on SPARK-24615:
-

Sorry [~tgraves] for the late response. Yes, when requesting executors, the user 
should know whether accelerators are required or not. If there are no accelerators 
that satisfy the request, the job will be pending or not launched. 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks will introduce a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544386#comment-16544386
 ] 

Saisai Shao edited comment on SPARK-24804 at 7/15/18 1:29 AM:
--

Please don't set the target version or fix version. Committers will help to set 
them when this issue is resolved.

I'm removing the target version of this Jira to avoid blocking the release of 
2.3.2.


was (Author: jerryshao):
Please don't set the target version or fix version. Committers will help to set 
them when this issue is resolved.

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544386#comment-16544386
 ] 

Saisai Shao commented on SPARK-24804:
-

Please don't set the target version or fix version. Committers will help to set 
them when this issue is resolved.

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Priority: Trivial  (was: Minor)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Target Version/s:   (was: 2.3.2)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Fix Version/s: (was: 2.3.2)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.

2018-07-10 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539635#comment-16539635
 ] 

Saisai Shao commented on SPARK-24781:
-

I see. I will wait for this before cutting a new 2.3.2 RC release.

> Using a reference from Dataset in Filter/Sort might not work.
> -
>
> Key: SPARK-24781
> URL: https://issues.apache.org/jira/browse/SPARK-24781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> When we use a reference from {{Dataset}} in {{filter}} or {{sort}}, which was 
> not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g.,
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(df("name")).filter(df("id") === 0).show()
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing 
> from name#5 in operator !Filter (id#6 = 0).;;
> !Filter (id#6 = 0)
>+- AnalysisBarrier
>   +- Project [name#5]
>  +- Project [_1#2 AS name#5, _2#3 AS id#6]
> +- LocalRelation [_1#2, _2#3]
> {noformat}
> If we use {{col}} instead, it works:
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(col("name")).filter(col("id") === 0).show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.

2018-07-10 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539630#comment-16539630
 ] 

Saisai Shao commented on SPARK-24781:
-

Thanks Felix. Does this have to be in 2.3.2? [~ueshin]

> Using a reference from Dataset in Filter/Sort might not work.
> -
>
> Key: SPARK-24781
> URL: https://issues.apache.org/jira/browse/SPARK-24781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> When we use a reference from {{Dataset}} in {{filter}} or {{sort}}, which was 
> not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g.,
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(df("name")).filter(df("id") === 0).show()
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing 
> from name#5 in operator !Filter (id#6 = 0).;;
> !Filter (id#6 = 0)
>+- AnalysisBarrier
>   +- Project [name#5]
>  +- Project [_1#2 AS name#5, _2#3 AS id#6]
> +- LocalRelation [_1#2, _2#3]
> {noformat}
> If we use {{col}} instead, it works:
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(col("name")).filter(col("id") === 0).show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-07-10 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24678:
---

Assignee: sharkd tu

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Assignee: sharkd tu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, `BlockRDD.getPreferredLocations` only returns the host info of 
> blocks, so the subsequent scheduling level is no better than 'NODE_LOCAL'. 
> With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-07-10 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24678.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21658
[https://github.com/apache/spark/pull/21658]

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, `BlockRDD.getPreferredLocations` only returns the host info of 
> blocks, so the subsequent scheduling level is no better than 'NODE_LOCAL'. 
> With a small change, the scheduling level can be improved to 
> 'PROCESS_LOCAL'.
>  
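
The gist of the change can be sketched as follows (a rough sketch, not the 
exact patch in PR 21658; assumptions are noted in the code comments):
{code:scala}
// Return executor-aware task locations instead of bare host names. A location
// of the form "executor_<host>_<executorId>" lets the scheduler match at the
// PROCESS_LOCAL level, while a plain host name only allows NODE_LOCAL.
// BlockLocation is a stand-in type for the block-manager info, not a Spark class.
case class BlockLocation(host: String, executorId: String)

def preferredLocations(locs: Seq[BlockLocation]): Seq[String] =
  locs.map(l => s"executor_${l.host}_${l.executorId}")

// preferredLocations(Seq(BlockLocation("host-1", "3"))) == Seq("executor_host-1_3")
{code}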



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-07-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24646.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21633
[https://github.com/apache/spark/pull/21633]

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> When obtaining tokens via a customized {{ServiceCredentialProvider}}, the 
> {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to be downloaded locally.
> To make this configuration easier to use, we propose to add wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and to clarify the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-07-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24646:
---

Assignee: Saisai Shao

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> When obtaining tokens via a customized {{ServiceCredentialProvider}}, the 
> {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to be downloaded locally.
> To make this configuration easier to use, we propose to add wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and to clarify the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534565#comment-16534565
 ] 

Saisai Shao edited comment on SPARK-24723 at 7/6/18 8:25 AM:
-

[~mengxr] [~jiangxb1987]

There's one solution to handle the password-less SSH problem for all cluster 
managers in a programmatic way. It is borrowed from the MPI on YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In that framework, before launching the MPI job, the application master 
(master) generates an SSH private/public key pair and propagates the public 
key to all the containers (workers). During container start, each container 
writes the public key to its local authorized_keys file, so the MPI job 
started from the master node can SSH to all the containers in a password-less 
manner. After the MPI job is finished, all the containers delete this public 
key from the authorized_keys file to revert the environment.

In our case, we could do this in a similar way: before launching the MPI job, 
the 0-th task could also generate an SSH private/public key pair and then 
propagate the public key to all the barrier tasks (maybe through 
BarrierTaskContext). The other tasks could receive the public key from the 
0-th task and write it to their authorized_keys file (maybe via 
BarrierTaskContext). After this, password-less SSH is set up, and mpirun can 
be started from the 0-th task without a password. After the MPI job is 
finished, all the barrier tasks could delete this public key from the 
authorized_keys file to revert the environment.

The example code is like below:
{code:java}
 rdd.barrier().mapPartitions { (iter, context) => 
  // Write iter to disk. ??? 
  // Wait until all tasks finished writing. 
  context.barrier() 
  // The 0-th task launches an MPI job. 
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos. ???
    // Launch the MPI job by calling mpirun. ???
  } else {
    // get and setup public key
    // notify 0-th task that public key is set up.
  }

  // Wait until the MPI job finished. 
  context.barrier()

  // Delete SSH key and revert the environment. 
  // Collect output and return. ??? 
 }
{code}
 

What is your opinion about this solution?


was (Author: jerryshao):
[~mengxr] [~jiangxb1987]

There's one solution to handle password-less SSH problem for all cluster 
manager in a programming way. This is referred from MPI on YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In this MPI on YARN framework, before launching MPI job, application master 
(master) will generate ssh private key and public key and then propagate the 
public key to all the containers (worker), during container start, it will 
write public key to local authorized_keys file, so after that, MPI job started 
from master node can ssh with all the containers in password-less manner. After 
MPI job is finished, all the containers would delete this public key from 
authorized_keys file to revert the environment.

In our case, we could do this in a similar way, before launching MPI job, 0-th 
task could also generate ssh private key and public key, and then propagate the 
public keys to all the barrier task (maybe through BarrierTaskContext). For 
other tasks, they could receive public key from 0-th task and write public key 
to authorized_keys file (maybe by BarrierTaskContext). After this, 
password-less ssh is set up, mpirun from 0-th task could be started without 
password. After MPI job is finished, all the barrier tasks could delete this 
public key from authorized_keys file to revert the environment.

The example code is like below:

 
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk.???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
// generate and propagate ssh keys.
// Wait for keys to set up in other tasks.

val hosts = context.getTaskInfos().map(_.host)  
// Set up MPI machine file using host infos.  ???
// Launch the MPI job by calling mpirun.  ???
  } else {
// get and setup public key
// notify 0-th task that pubic key is setup.
  }  
  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.  
  // Collect output and return.???  
}
 

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affec

[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534565#comment-16534565
 ] 

Saisai Shao commented on SPARK-24723:
-

[~mengxr] [~jiangxb1987]

There's one solution to handle the password-less SSH problem for all cluster 
managers in a programmatic way. It is borrowed from the MPI on YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In that framework, before launching the MPI job, the application master 
(master) generates an SSH private/public key pair and propagates the public 
key to all the containers (workers). During container start, each container 
writes the public key to its local authorized_keys file, so the MPI job 
started from the master node can SSH to all the containers in a password-less 
manner. After the MPI job is finished, all the containers delete this public 
key from the authorized_keys file to revert the environment.

In our case, we could do this in a similar way: before launching the MPI job, 
the 0-th task could also generate an SSH private/public key pair and then 
propagate the public key to all the barrier tasks (maybe through 
BarrierTaskContext). The other tasks could receive the public key from the 
0-th task and write it to their authorized_keys file (maybe via 
BarrierTaskContext). After this, password-less SSH is set up, and mpirun can 
be started from the 0-th task without a password. After the MPI job is 
finished, all the barrier tasks could delete this public key from the 
authorized_keys file to revert the environment.

The example code is like below:
{code:java}
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk. ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos. ???
    // Launch the MPI job by calling mpirun. ???
  } else {
    // get and setup public key
    // notify 0-th task that public key is set up.
  }

  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.
  // Collect output and return. ???
}
{code}

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534529#comment-16534529
 ] 

Saisai Shao commented on SPARK-24723:
-

After discussing with Xiangrui offline, resource reservation is not the key 
focus here. The main problem is how to provide the necessary information for 
barrier tasks to start an MPI job in a password-less manner.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-480) Apache livy and Ajax

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533241#comment-16533241
 ] 

Saisai Shao commented on LIVY-480:
--

Please ask questions on the mailing list, to see if someone can answer 
them.
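
For reference, the REST calls such an Ajax request would have to make look 
roughly like this (a minimal sketch using curl; host, port, session id and the 
code snippet are placeholders):
{noformat}
# create an interactive session (kind: spark, pyspark, sparkr, ...)
curl -X POST -H "Content-Type: application/json" \
     -d '{"kind": "spark"}' http://livy-host:8998/sessions

# submit code to session 0, then poll statement 0 for its result
curl -X POST -H "Content-Type: application/json" \
     -d '{"code": "1 + 1"}' http://livy-host:8998/sessions/0/statements
curl http://livy-host:8998/sessions/0/statements/0
{noformat}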

> Apache livy and Ajax
> 
>
> Key: LIVY-480
> URL: https://issues.apache.org/jira/browse/LIVY-480
> Project: Livy
>  Issue Type: Request
>  Components: API, Interpreter, REPL, Server
>Affects Versions: 0.5.0
>Reporter: Melchicédec NDUWAYO
>Priority: Major
>
> Good morning everyone,
> can someone help me with how to send a POST request by Ajax to the Apache 
> Livy server? I tried my best but I don't see how to do that. I have a 
> textarea in which a user can put his Spark code. I want to send that code to 
> Apache Livy by Ajax and get the result. I don't know if my question is 
> clear. Thank you.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (LIVY-480) Apache livy and Ajax

2018-07-04 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved LIVY-480.
--
Resolution: Invalid

> Apache livy and Ajax
> 
>
> Key: LIVY-480
> URL: https://issues.apache.org/jira/browse/LIVY-480
> Project: Livy
>  Issue Type: Request
>  Components: API, Interpreter, REPL, Server
>Affects Versions: 0.5.0
>Reporter: Melchicédec NDUWAYO
>Priority: Major
>
> Good morning everyone,
> can someone help me with how to send a POST request by Ajax to the Apache 
> Livy server? I tried my best but I don't see how to do that. I have a 
> textarea in which a user can put his Spark code. I want to send that code to 
> Apache Livy by Ajax and get the result. I don't know if my question is 
> clear. Thank you.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533205#comment-16533205
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I would like to understand the goal of this ticket. The goal of 
the barrier scheduler is to offer gang semantics at the task scheduling level, 
whereas gang semantics at the YARN level is more about resources.

I discussed the feasibility of supporting gang semantics on YARN with 
[~leftnoteasy]. YARN has a Reservation System which supports gang-like 
semantics (reserving requested resources), but it is not designed for gang 
scheduling. Here are some thoughts about supporting it on YARN 
[https://docs.google.com/document/d/1OA-iVwuHB8wlzwwlrEHOK6Q2SlKy3-5QEB5AmwMVEUU/edit?usp=sharing],
 though I'm not sure whether it aligns with the goal of this ticket.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24739) PySpark does not work with Python 3.7.0

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533189#comment-16533189
 ] 

Saisai Shao commented on SPARK-24739:
-

Do we have to fix it in 2.3.2? I don't think it is even critical. For example, 
Java 9 has been out for a long time, but we still don't support it. So this 
does not seem like a big problem.

> PySpark does not work with Python 3.7.0
> ---
>
> Key: SPARK-24739
> URL: https://issues.apache.org/jira/browse/SPARK-24739
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Python 3.7 was released a few days ago and PySpark does not work with it. 
> For example
> {code}
> sc.parallelize([1, 2]).take(1)
> {code}
> {code}
> File "/.../spark/python/pyspark/rdd.py", line 1343, in __main__.RDD.take
> Failed example:
> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
> Exception raised:
> Traceback (most recent call last):
>   File "/.../3.7/lib/python3.7/doctest.py", line 1329, in __run
> compileflags, 1), test.globs)
>   File "", line 1, in 
> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
>   File "/.../spark/python/pyspark/rdd.py", line 1377, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "/.../spark/python/pyspark/context.py", line 1013, in runJob
> sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
> mappedRDD._jrdd, partitions)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0 in stage 143.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 143.0 (TID 688, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/rdd.py", line 1373, in takeUpToNumLeft
> yield next(iterator)
> StopIteration
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 320, 
> in main
> process()
>   File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 315, 
> in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 378, in dump_stream
> vs = list(itertools.islice(iterator, batch))
> RuntimeError: generator raised StopIteration
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:309)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:449)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:432)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:263)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
>   at 
> org.apache.spark.SparkContext$$anonf

[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532145#comment-16532145
 ] 

Saisai Shao commented on SPARK-24535:
-

Hi [~felixcheung], what's the current status of this JIRA? Do you have an ETA 
for it?

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532143#comment-16532143
 ] 

Saisai Shao commented on SPARK-24530:
-

Hi [~hyukjin.kwon], what is the current status of this JIRA? Do you have an 
ETA for it?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to 
> be broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529352#comment-16529352
 ] 

Saisai Shao commented on SPARK-24530:
-

[~hyukjin.kwon], Spark 2.1.3 and 2.2.2 are under vote; can you please fix the 
issue and leave comments in the related threads?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to 
> be broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24530:

Target Version/s: 2.1.3, 2.2.2, 2.3.2, 2.4.0  (was: 2.4.0)

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated python docs from master locally using `make html`. However, the 
> generated html doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to 
> be broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-479) Enable livy.rsc.launcher.address configuration

2018-06-27 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16525775#comment-16525775
 ] 

Saisai Shao commented on LIVY-479:
--

Yes, I've left the comments on the PR.

> Enable livy.rsc.launcher.address configuration
> --
>
> Key: LIVY-479
> URL: https://issues.apache.org/jira/browse/LIVY-479
> Project: Livy
>  Issue Type: Improvement
>  Components: RSC
>Affects Versions: 0.5.0
>Reporter: Tao Li
>Priority: Major
>
> The current behavior is that we set livy.rsc.launcher.address to the RPC 
> server and ignore whatever is specified in the configs. However, in some 
> scenarios the user needs to be able to configure it explicitly. 
> For example, the IP address of an active Livy server might change over 
> time, so rather than using a fixed IP address, we need to set this to a 
> more generic host name. For this reason, we want to enable user-side 
> configuration. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-25 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24418.
-
Resolution: Fixed

Issue resolved by pull request 21495
[https://github.com/apache/spark/pull/21495]

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-06-25 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523022#comment-16523022
 ] 

Saisai Shao commented on SPARK-24646:
-

Hi [~vanzin], here is a specific example:

We have a customized {{ServiceCredentialProvider}}, an HS2 
{{ServiceCredentialProvider}}, which lives in our own jar {{foo}}. So when the 
SparkSubmit process launches the YARN client, this {{foo}} jar has to be on 
the SparkSubmit process classpath.

If this {{foo}} jar is a local resource added by {{--jars}}, then it will be 
on the classpath; but if it is a remote jar (for example on HDFS), then only 
yarn-client mode will download it and add it to the classpath. In yarn-cluster 
mode it will not be downloaded, so this specific HS2 
{{ServiceCredentialProvider}} will not be loaded.

When using the spark-submit script, the user can decide whether to add a 
remote jar or a local jar. But in the Livy scenario, Livy only supports remote 
jars (jars on HDFS), and we only configure it to support yarn-cluster mode. So 
in this scenario, we cannot load this customized {{ServiceCredentialProvider}} 
in yarn-cluster mode.

So the fix is to force the jars to be downloaded to the local SparkSubmit 
process with the configuration "spark.yarn.dist.forceDownloadSchemes". To make 
it easy to use, I propose to add wildcard '*' support, which will download all 
remote resources without checking the scheme.
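
A rough sketch of the intended usage once the wildcard is supported (host 
names, paths and the {{foo}} jar below are placeholders):
{noformat}
# force-download all remote resources (any scheme) into the local spark-submit
# process before the YARN client starts, so the credential provider in foo.jar
# is on the local classpath even in yarn-cluster mode
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.yarn.dist.forceDownloadSchemes=* \
  --jars hdfs:///libs/foo.jar \
  ...
{noformat}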

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> When obtaining tokens via a customized {{ServiceCredentialProvider}}, the 
> {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to be downloaded locally.
> To make this configuration easier to use, we propose to add wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and to clarify the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-06-24 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24646:
---

 Summary: Support wildcard '*' for to 
spark.yarn.dist.forceDownloadSchemes
 Key: SPARK-24646
 URL: https://issues.apache.org/jira/browse/SPARK-24646
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Saisai Shao


When obtaining tokens via a customized {{ServiceCredentialProvider}}, the 
{{ServiceCredentialProvider}} must be available on the local spark-submit 
process classpath. In this case, all the configured remote sources should be 
forced to be downloaded locally.

To make this configuration easier to use, we propose to add wildcard '*' 
support to {{spark.yarn.dist.forceDownloadSchemes}} and to clarify the usage 
of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519934#comment-16519934
 ] 

Saisai Shao commented on SPARK-24374:
-

Hi [~mengxr] [~jiangxb1987], SPARK-24615 would be a part of Project Hydrogen. 
I wrote a high-level design doc about the implementation; would you please 
help review and comment on it, to see whether the solution is feasible? Thanks 
a lot.

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16519931#comment-16519931
 ] 

Saisai Shao commented on HIVE-16391:


Gentle ping [~hagleitn]: would you please help review the currently proposed 
patch and suggest the next step? Thanks a lot.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Labels: SPIP  (was: )

> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>  Labels: SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This will introduce some problems when 
> scheduling tasks that require accelerators.
>  # CPU cores are usually more numerous than accelerators on one node, so 
> using CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we always assume that CPUs are equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the 
> implementation details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Description: 
In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
predominant compared to CPUs. To make the current Spark architecture work with 
accelerator cards, Spark itself should understand the existence of 
accelerators and know how to schedule tasks onto the executors where 
accelerators are equipped.

Current Spark’s scheduler schedules tasks based on the locality of the data 
plus the availability of CPUs. This will introduce some problems when 
scheduling tasks that require accelerators.
 # CPU cores are usually more numerous than accelerators on one node, so using 
CPU cores to schedule accelerator-required tasks introduces a mismatch.
 # In one cluster, we always assume that CPUs are equipped in each node, but 
this is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator required or not). This can be part of the work of Project 
Hydrogen.

Details are attached in a Google doc. It doesn't cover all the implementation 
details, just highlights the parts that should be changed.

 

CC [~yanboliang] [~merlintang]

  was:
In the machine learning area, accelerator card (GPU, FPGA, TPU) is predominant 
compared to CPUs. To make the current Spark architecture to work with 
accelerator cards, Spark itself should understand the existence of accelerators 
and know how to schedule task onto the executors where accelerators are 
equipped.

Current Spark’s scheduler schedules tasks based on the locality of the data 
plus the available of CPUs. This will introduce some problems when scheduling 
tasks with accelerators required.
 # CPU cores are usually more than accelerators on one node, using CPU cores to 
schedule accelerator required tasks will introduce the mismatch.
 # In one cluster, we always assume that CPU is equipped in each node, but this 
is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator required or not) requires 
scheduler to schedule tasks with a smart way.

So here propose to improve the current scheduler to support heterogeneous tasks 
(accelerator requires or not). This can be part of the work of Project hydrogen.

Details is attached in google doc.

 

CC [~yanboliang] [~merlintang]


> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc. It doesn't cover all the implementation 
> details, just highlight the parts should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Description: 
In the machine learning area, accelerator card (GPU, FPGA, TPU) is predominant 
compared to CPUs. To make the current Spark architecture to work with 
accelerator cards, Spark itself should understand the existence of accelerators 
and know how to schedule task onto the executors where accelerators are 
equipped.

Current Spark’s scheduler schedules tasks based on the locality of the data 
plus the available of CPUs. This will introduce some problems when scheduling 
tasks with accelerators required.
 # CPU cores are usually more than accelerators on one node, using CPU cores to 
schedule accelerator required tasks will introduce the mismatch.
 # In one cluster, we always assume that CPU is equipped in each node, but this 
is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator required or not) requires 
scheduler to schedule tasks with a smart way.

So here propose to improve the current scheduler to support heterogeneous tasks 
(accelerator requires or not). This can be part of the work of Project hydrogen.

Details is attached in google doc.

 

CC [~yanboliang] [~merlintang]

  was:
In the machine learning area, accelerator card (GPU, FPGA, TPU) is predominant 
compared to CPUs. To make the current Spark architecture to work with 
accelerator cards, Spark itself should understand the existence of accelerators 
and know how to schedule task onto the executors where accelerators are 
equipped.

Current Spark’s scheduler schedules tasks based on the locality of the data 
plus the available of CPUs. This will introduce some problems when scheduling 
tasks with accelerators required.
 # CPU cores are usually more than accelerators on one node, using CPU cores to 
schedule accelerator required tasks will introduce the mismatch.
 # In one cluster, we always assume that CPU is equipped in each node, but this 
is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator required or not) requires 
scheduler to schedule tasks with a smart way.

So here propose to improve the current scheduler to support heterogeneous tasks 
(accelerator requires or not). Details is attached in google doc.


> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24615:
---

 Summary: Accelerator aware task scheduling for Spark
 Key: SPARK-24615
 URL: https://issues.apache.org/jira/browse/SPARK-24615
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Saisai Shao


In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
predominant compared to CPUs. To make the current Spark architecture work with 
accelerator cards, Spark itself should understand the existence of 
accelerators and know how to schedule tasks onto the executors where 
accelerators are equipped.

Current Spark’s scheduler schedules tasks based on the locality of the data 
plus the availability of CPUs. This will introduce some problems when 
scheduling tasks that require accelerators.
 # CPU cores are usually more numerous than accelerators on one node, so using 
CPU cores to schedule accelerator-required tasks introduces a mismatch.
 # In one cluster, we always assume that CPUs are equipped in each node, but 
this is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator required or not). Details are attached in a Google doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16517720#comment-16517720
 ] 

Saisai Shao commented on SPARK-24493:
-

This bug is fixed by HDFS-12670; Spark should bump its supported Hadoop 3 
version to 3.1.1 once it is released.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on a long-running Spark job. I have added 
> the 2 Kerberos properties below to the HDFS configuration and ran a Spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-13 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated HIVE-16391:
---
Attachment: HIVE-16391.2.patch

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-13 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511851#comment-16511851
 ] 

Saisai Shao commented on HIVE-16391:


I see. I can keep the "core" classifier and use another name. Will update the 
patch.

[~owen.omalley], would you please help review this patch, since you created a 
Spark JIRA? Or could you point to someone in the Hive community who can help 
review it? Thanks a lot.

 

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (LIVY-477) Upgrade Livy Scala version to 2.11.12

2018-06-12 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved LIVY-477.
--
   Resolution: Fixed
Fix Version/s: 0.6.0

> Upgrade Livy Scala version to 2.11.12
> -
>
> Key: LIVY-477
> URL: https://issues.apache.org/jira/browse/LIVY-477
> Project: Livy
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.5.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 0.6.0
>
>
> Scala versions below 2.11.12 have a CVE 
> ([https://scala-lang.org/news/security-update-nov17.html]), and Spark will 
> also upgrade its supported version to 2.11.12. 
> So this upgrades Livy's Scala version as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LIVY-477) Upgrade Livy Scala version to 2.11.12

2018-06-12 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509600#comment-16509600
 ] 

Saisai Shao commented on LIVY-477:
--

Issue resolved by pull request 100
[https://github.com/apache/incubator-livy/pull/100|https://github.com/apache/incubator-livy/pull/100]

> Upgrade Livy Scala version to 2.11.12
> -
>
> Key: LIVY-477
> URL: https://issues.apache.org/jira/browse/LIVY-477
> Project: Livy
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.5.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
>
> Scala versions below 2.11.12 have a CVE 
> ([https://scala-lang.org/news/security-update-nov17.html]), and Spark will 
> also upgrade its supported version to 2.11.12. 
> So this upgrades Livy's Scala version as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24518:

Description: 
Currently Spark configures passwords in plaintext, either in the configuration 
file or as launch arguments. Some of these settings, such as SSL passwords, are 
configured by the cluster admin and should not be visible to users, yet today 
these passwords are world readable by all users.

The Hadoop credential provider API supports storing passwords in a secure way, 
and Spark could read them back securely through the same API, so the proposal 
here is to add support for using the credential provider API to get passwords.
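
For illustration only, here is a minimal Scala sketch of the read side under 
this proposal (the jceks path and the alias name are made-up examples, not 
names defined by Spark or by this patch):

{code}
import org.apache.hadoop.conf.Configuration

// Assume the admin has stored the secret out of band, e.g. with the Hadoop
// CLI: hadoop credential create spark.ssl.keyStorePassword \
//        -provider jceks://file/etc/security/spark.jceks
val conf = new Configuration()
conf.set("hadoop.security.credential.provider.path",
  "jceks://file/etc/security/spark.jceks")

// Configuration.getPassword consults the configured providers first and
// falls back to the plain config value, so plaintext configs keep working.
val password: Array[Char] =
  Option(conf.getPassword("spark.ssl.keyStorePassword")).getOrElse(Array.empty)
{code}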

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark configures passwords in plaintext, either in the 
> configuration file or as launch arguments. Some of these settings, such as 
> SSL passwords, are configured by the cluster admin and should not be visible 
> to users, yet today these passwords are world readable by all users.
> The Hadoop credential provider API supports storing passwords in a secure 
> way, and Spark could read them back securely through the same API, so the 
> proposal here is to add support for using the credential provider API to get 
> passwords.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24518:

Environment: (was: Current Spark configs password in a plaintext way, 
like putting in the configuration file or adding as a launch arguments, 
sometimes such configurations like SSL password is configured by cluster admin, 
which should not be seen by user, but now this passwords are world readable to 
all the users.

Hadoop credential provider API support storing password in a secure way, in 
which Spark could read it in a secure way, so here propose to add support of 
using credential provider API to get password.)

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24518:
---

 Summary: Using Hadoop credential provider API to store password
 Key: SPARK-24518
 URL: https://issues.apache.org/jira/browse/SPARK-24518
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: Currently Spark configures passwords in plaintext, either in the 
configuration file or as launch arguments. Some of these settings, such as SSL 
passwords, are configured by the cluster admin and should not be visible to 
users, yet today these passwords are world readable by all users.

The Hadoop credential provider API supports storing passwords in a secure way, 
and Spark could read them back securely through the same API, so the proposal 
here is to add support for using the credential provider API to get passwords.
Reporter: Saisai Shao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (LIVY-477) Upgrade Livy Scala version to 2.11.12

2018-06-10 Thread Saisai Shao (JIRA)
Saisai Shao created LIVY-477:


 Summary: Upgrade Livy Scala version to 2.11.12
 Key: LIVY-477
 URL: https://issues.apache.org/jira/browse/LIVY-477
 Project: Livy
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.5.0
Reporter: Saisai Shao
Assignee: Saisai Shao


Scala versions below 2.11.12 have a CVE 
([https://scala-lang.org/news/security-update-nov17.html]), and Spark will also 
upgrade its supported version to 2.11.12. 

So this upgrades Livy's Scala version as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506023#comment-16506023
 ] 

Saisai Shao commented on SPARK-23534:
-

Yes, it is a Hadoop 2.8+ issue, but we don't have a 2.8 profile for Spark; 
instead we have a 3.1 profile, so it will affect our Hadoop 3 build. That's why 
I also added a link here.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 
> 3.0, so we should also make sure Spark can run with Hadoop 3.0. This JIRA 
> tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test whether there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24487.
-
Resolution: Won't Fix

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505786#comment-16505786
 ] 

Saisai Shao commented on SPARK-24487:
-

I'm sure the Spark community will not merge this into the Spark code base. You 
can either maintain a package yourself or contribute it to Apache Bahir.

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23151:

Issue Type: Sub-task  (was: Dependency upgrade)
Parent: SPARK-23534

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The 
> implication is that using up-to-date Kinesis libraries alongside S3 causes a 
> clash w.r.t. aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505783#comment-16505783
 ] 

Saisai Shao commented on SPARK-23151:
-

I will convert this to a subtask of SPARK-23534.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package only supports Hadoop 2.7, i.e. spark-2.2.1-bin-hadoop2.7.tgz. The 
> implication is that using up-to-date Kinesis libraries alongside S3 causes a 
> clash w.r.t. aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505771#comment-16505771
 ] 

Saisai Shao commented on SPARK-23534:
-

Spark built with Hadoop 3 will fail at token renewal in the long-running case 
(with keytab and principal).

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped, or will soon step, into Hadoop 
> 3.0, so we should also make sure Spark can run with Hadoop 3.0. This JIRA 
> tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test whether there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505768#comment-16505768
 ] 

Saisai Shao edited comment on SPARK-24493 at 6/8/18 6:56 AM:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to 
get the renew interval for the HDFS token, but due to a missing service loader 
file, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which 
leads to an unexpected renew interval (Long.MaxValue).
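
To make the failure mode concrete, here is a rough Scala sketch of that pattern 
(simplified; the real Spark code differs in detail and this is not the actual 
method): when the service loader entry is missing, {{decodeIdentifier}} yields 
null and the caller effectively disables renewal.

{code}
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

// Simplified sketch of the pattern described above, not the actual Spark code.
def renewIntervalFor(token: Token[_ <: TokenIdentifier],
                     configuredInterval: Long): Long =
  token.decodeIdentifier() match {
    // Normal case: ServiceLoader found the identifier class, so the token
    // can be decoded and a sensible renew interval derived.
    case _: AbstractDelegationTokenIdentifier => configuredInterval
    // With only hadoop-hdfs-client on the classpath, decodeIdentifier()
    // returns null, so no interval can be derived and renewal never happens.
    case _ => Long.MaxValue
  }
{code}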

The related code in Hadoop is:

{code}
  private static Class<? extends TokenIdentifier>
      getClassForIdentifier(Text kind) {
    Class<? extends TokenIdentifier> cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }
{code}

The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, 
but the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark local submit process / driver process (depending on 
client or cluster mode) relies only on the hadoop-hdfs-client jar, not the 
hadoop-hdfs jar, so the ServiceLoader fails to find the HDFS 
"DelegationTokenIdentifier" class and returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so there was no issue. But now we have a build 
profile for Hadoop 3.1, so this will break token renewal in the Hadoop 3.1 
build.

This is a Hadoop issue; I am creating a Spark JIRA to track it and will bump 
the version once the Hadoop side is fixed.


was (Author: jerryshao):
Adding more background, the issue is happened when building Spark with Hadoop 
2.8+.

In Spark, we use {{TokenIdentifer}} return by {{decodeIdentifier}} to get renew 
interval for HDFS token, but due to missing service loader file, Hadoop failed 
to create {{TokenIdentifier}} and returns *null*, which leads to an unexpected 
renew interval (Long.MaxValue).

The related code in Hadoop is:
  private static Class
  getClassForIdentifier(Text kind) {Class 
cls = null;synchronized (Token.class) {  if (tokenKindMap == null) {
tokenKindMap = Maps.newHashMap();for (TokenIdentifier id : 
ServiceLoader.load(TokenIdentifier.class)) \{
  tokenKindMap.put(id.getKind(), id.getClass());
}
  }
  cls = tokenKindMap.get(kind);
}if (cls == null) {
  LOG.debug("Cannot find class for token kind " + kind);  return null;
}return cls;
  }


The problem is:

The HDFS "DelegationTokenIdentifier" class is in hadoop-hdfs-client jar, but 
the service loader description file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in 
hadoop-hdfs jar. Spark local submit process/driver process (depends on client 
or cluster mode) only relies on hadoop-hdfs-client jar, but not hadoop-hdfs 
jar. So the ServiceLoader will be failed to find HDFS 
"DelegationTokenIdentifier" class and return null.

The issue is due to the change in HADOOP-6200.  Previously we only have 
building profile for Hadoop 2.6 and 2.7, so there's no issue here. But 
currently we has a building profile for Hadoop 3.1, so this will fail the token 
renew in Hadoop 3.1.

The is a Hadoop issue, creating a Spark Jira to track this issue and bump the 
version when Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on a long-running Spark job. I have added 
> the 2 Kerberos properties below to the HDFS configuration and ran a Spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TI

[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505768#comment-16505768
 ] 

Saisai Shao commented on SPARK-24493:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to 
get the renew interval for the HDFS token, but due to a missing service loader 
file, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which 
leads to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:
{code}
  private static Class<? extends TokenIdentifier>
      getClassForIdentifier(Text kind) {
    Class<? extends TokenIdentifier> cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }
{code}


The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, 
but the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark local submit process / driver process (depending on 
client or cluster mode) relies only on the hadoop-hdfs-client jar, not the 
hadoop-hdfs jar, so the ServiceLoader fails to find the HDFS 
"DelegationTokenIdentifier" class and returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so there was no issue. But now we have a build 
profile for Hadoop 3.1, so this will break token renewal in the Hadoop 3.1 
build.

This is a Hadoop issue; I am creating a Spark JIRA to track it and will bump 
the version once the Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on a long-running Spark job. I have added 
> the 2 Kerberos properties below to the HDFS configuration and ran a Spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInv

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-07 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3  
(was: Kerberos Ticket Renewal is failing in long running Spark job)

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on a long-running Spark job. I have added 
> the 2 Kerberos properties below to the HDFS configuration and ran a Spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.r

[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505760#comment-16505760
 ] 

Saisai Shao commented on SPARK-24487:
-

You can add it in your own package using the DataSource API; there's no reason 
to add it to the Spark code base.
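
As a hedged illustration of the external-package route (the "rabbitmq" format 
name and the options below are made up for the example; a real third-party 
connector would define its own), such a package plugs into Spark through the 
standard DataSource read path rather than Spark's own code base:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-source-example")
  .getOrCreate()

// "rabbitmq" is a placeholder format name that a hypothetical third-party
// connector would register; the options are likewise illustrative only.
val stream = spark.readStream
  .format("rabbitmq")
  .option("host", "localhost")
  .option("queue", "events")
  .load()

// From here on, the result is an ordinary streaming DataFrame.
stream.printSchema()
{code}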

 

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


