[jira] [Updated] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24964:

Target Version/s:   (was: 2.3.2, 2.4.0, 3.0.0, 2.3.3)

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to the MITRE Common Vulnerabilities & 
> Exposures (CVE) database to perform a lookup for each dependent .jar and list 
> any/all known vulnerabilities for each jar. This step is needed because a 
> manual MITRE CVE lookup/check on the main component does not include checking 
> for vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build tools (ant, maven, ivy, gradle). Also, add the appropriate command 
> to the nightly build to generate a report of all known vulnerabilities in 
> any/all third-party libraries/dependencies that get pulled in. Example: mvn 
> -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.
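For reference, a hedged sketch of how the plugin could be wired into a Maven 
profile; the profile id matches the -Powasp flag above, while the version 
property and the default-phase goal binding are assumptions, not values taken 
from this ticket:

{code}
<!-- Hedged sketch: an "owasp" profile in the parent pom.xml wiring in the
     org.owasp:dependency-check-maven plugin; the version property is a
     placeholder, not a value taken from this ticket. -->
<profile>
  <id>owasp</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.owasp</groupId>
        <artifactId>dependency-check-maven</artifactId>
        <version>${dependency-check.version}</version>
        <executions>
          <execution>
            <goals>
              <!-- "aggregate" scans the whole multi-module build; the -Powasp
                   flag in the command above would activate this profile -->
              <goal>aggregate</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
{code}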






[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)

2018-07-29 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561327#comment-16561327
 ] 

Saisai Shao commented on SPARK-24964:
-

I'm going to remove the target version; usually we don't set this for a 
feature request. Committers will set a fix version when the change is merged.

>  Please add OWASP Dependency Check to all component builds (pom.xml)
> --
>
> Key: SPARK-24964
> URL: https://issues.apache.org/jira/browse/SPARK-24964
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, MLlib, Spark Core, SparkR
>Affects Versions: 2.3.1
> Environment: All development, build, test, environments.
> ~/workspace/spark-2.3.1/pom.xml
> ~/workspace/spark-2.3.1/assembly/pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/kvstore/pom.xml
> ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-common/pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml
> ~/workspace/spark-2.3.1/common/network-yarn/pom.xml
> ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/sketch/pom.xml
> ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml
> ~/workspace/spark-2.3.1/common/tags/pom.xml
> ~/workspace/spark-2.3.1/common/unsafe/pom.xml
> ~/workspace/spark-2.3.1/core/pom.xml
> ~/workspace/spark-2.3.1/examples/pom.xml
> ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml
> ~/workspace/spark-2.3.1/external/flume/pom.xml
> ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/flume-sink/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml
> ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml
> ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml
> ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml
> ~/workspace/spark-2.3.1/graphx/pom.xml
> ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml
> ~/workspace/spark-2.3.1/launcher/pom.xml
> ~/workspace/spark-2.3.1/mllib/pom.xml
> ~/workspace/spark-2.3.1/mllib-local/pom.xml
> ~/workspace/spark-2.3.1/repl/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml
> ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml
> ~/workspace/spark-2.3.1/sql/catalyst/pom.xml
> ~/workspace/spark-2.3.1/sql/core/pom.xml
> ~/workspace/spark-2.3.1/sql/hive/pom.xml
> ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml
> ~/workspace/spark-2.3.1/streaming/pom.xml
> ~/workspace/spark-2.3.1/tools/pom.xml
>Reporter: Albert Baker
>Priority: Major
>  Labels: build, easy-fix, security
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> OWASP DC makes an outbound REST call to the MITRE Common Vulnerabilities & 
> Exposures (CVE) database to perform a lookup for each dependent .jar and list 
> any/all known vulnerabilities for each jar. This step is needed because a 
> manual MITRE CVE lookup/check on the main component does not include checking 
> for vulnerabilities in dependent libraries.
> OWASP Dependency Check 
> (https://www.owasp.org/index.php/OWASP_Dependency_Check) has plug-ins for most 
> Java build tools (ant, maven, ivy, gradle). Also, add the appropriate command 
> to the nightly build to generate a report of all known vulnerabilities in 
> any/all third-party libraries/dependencies that get pulled in. Example: mvn 
> -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate
> Generating this report nightly/weekly will help inform the project's 
> development team if any dependent libraries have a reported known 
> vulnerability. Project teams that keep up with removing vulnerabilities on a 
> weekly basis will help protect businesses that rely on these open source 
> components.






[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-25 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556760#comment-16556760
 ] 

Saisai Shao commented on SPARK-24867:
-

I see, thanks! Please let me know when the JIRA is opened.

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.2
>
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2.
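For readers who want to check whether a given DataFrame's plan actually picks 
up the cache, a hedged sketch; it assumes Spark 2.3.x internals such as 
{{InMemoryRelation}} and is only a diagnostic aid, not the fix itself:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryRelation

// Returns true if the optimizer substituted cached data into df's plan.
def usesCache(df: DataFrame): Boolean =
  df.queryExecution.optimizedPlan
    .collectFirst { case r: InMemoryRelation => r }
    .nonEmpty
{code}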






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555080#comment-16555080
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves] [~irashid], thanks a lot for your comments.

Currently my design does not insert a new stage boundary for different 
resources; the stage boundary stays the same (by shuffle or by result). So 
{{withResources}} is not an eager action that triggers a stage. Instead, it 
just adds a resource hint to the RDD.

This means RDDs with different resource requirements in one stage may 
conflict. For example, in {{rdd1.withResources.mapPartitions \{ xxx 
\}.withResources.mapPartitions \{ xxx \}.collect}}, the resources of rdd1 may 
differ from those of the mapped RDD, so the options I can think of are:

1. Always pick the latter, with a warning log saying that multiple different 
resource requirements in one stage are illegal.
2. Fail the stage, with a warning log saying that multiple different resource 
requirements in one stage are illegal.
3. Merge conflicts using the maximum resource need. For example, if rdd1 
requires 3 GPUs per task and rdd2 requires 4 GPUs per task, the merged 
requirement would be 4 GPUs per task. (This is the high-level description; the 
details will be per-partition merging.) [chosen]

Take join for example, where rdd1 and rdd2 may have different resource 
requirements, and the joined RDD will potentially have yet another resource 
requirement:

{code}
val rddA = rdd.mapPartitions().withResources
val rddB = rdd.mapPartitions().withResources
val rddC = rddA.join(rddB).withResources
rddC.collect()
{code}

Here the three {{withResources}} calls may have different requirements. Since 
{{rddC}} runs in a different stage, there is no need to merge its resource 
requirements with the others. But {{rddA}} and {{rddB}} run in the same stage 
with different tasks (partitions). So the merging strategy I'm thinking of is 
partition-based: tasks running on partitions from {{rddA}} will use the 
resources specified by {{rddA}}, and likewise for {{rddB}}.
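Purely to illustrate option 3 above, a hedged sketch of a maximum-based merge 
of conflicting per-task GPU requirements; the {{ResourceRequest}} type and its 
field are hypothetical, not part of the proposed API:

{code}
// Hypothetical type for illustration only; the real SPIP may model this differently.
case class ResourceRequest(gpusPerTask: Int)

// Merge two conflicting requests within one stage by taking the maximum need,
// e.g. 3 GPUs/task vs 4 GPUs/task => 4 GPUs/task.
def merge(a: ResourceRequest, b: ResourceRequest): ResourceRequest =
  ResourceRequest(math.max(a.gpusPerTask, b.gpusPerTask))

// Collapse a chain of requests from the same lineage; per-partition merging
// would only apply this to requests that actually conflict in one stage.
def mergeAll(requests: Seq[ResourceRequest]): ResourceRequest =
  requests.reduce(merge)
{code}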





> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554986#comment-16554986
 ] 

Saisai Shao commented on SPARK-24867:
-

[~smilegator] what's the ETA of this issue?

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> Cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2.






[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24297:
---

Assignee: Imran Rashid

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that sends over 2GB without using stream-to-disk is 
> doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.
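For anyone who wants to lower the threshold explicitly before the default 
changes, a hedged sketch; the 200 MB value is an arbitrary example, not the 
default chosen by this ticket:

{code}
import org.apache.spark.SparkConf

// Fetch remote blocks larger than ~200 MB to disk instead of memory;
// the exact value here is illustrative, not the default picked in SPARK-24297.
val conf = new SparkConf()
  .set("spark.maxRemoteBlockSizeFetchToMem", (200L * 1024 * 1024).toString)
{code}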






[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24297.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21474
[https://github.com/apache/spark/pull/21474]

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that sends over 2GB without using stream-to-disk is 
> doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-23 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553679#comment-16553679
 ] 

Saisai Shao commented on SPARK-24615:
-

[~Tagar] yes!

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Resolved] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24594.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21635
[https://github.com/apache/spark/pull/21635]

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 the idea came up to introduce metrics for YARN executor 
> allocation problems.






[jira] [Assigned] (SPARK-24594) Introduce metrics for YARN executor allocation problems

2018-07-23 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24594:
---

Assignee: Attila Zsolt Piros

> Introduce metrics for YARN executor allocation problems 
> 
>
> Key: SPARK-24594
> URL: https://issues.apache.org/jira/browse/SPARK-24594
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
>
> Within SPARK-16630 the idea came up to introduce metrics for YARN executor 
> allocation problems.






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-22 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552233#comment-16552233
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], what you mentioned above is also what we are thinking about and 
trying to figure out a way to solve (this problem also exists in barrier 
execution).

From the user's point of view, specifying resources through the RDD is the only 
feasible way I can currently think of, though the resource is bound to the 
stage/task, not to a particular RDD. This means a user could specify resources 
for different RDDs in a single stage, while Spark can only use one resource 
requirement within that stage. This brings out several problems, as you 
mentioned:

*Which RDD the resources are specified on*

For example, {{rddA.withResource.mapPartition \{ xxx \}.collect()}} is no 
different from {{rddA.mapPartition \{ xxx \}.withResource.collect}}, since all 
the RDDs are executed in the same stage. So in the current design, no matter 
whether the resource is specified on {{rddA}} or on the mapped RDD, the result 
is the same.

*One-to-one dependency RDDs with different resources*

For example {{rddA.withResource.mapPartition \{ xxx \}.withResource.collect()}}: 
assuming the resource requests for {{rddA}} and the mapped RDD are different, 
since they run in a single stage, we should resolve such a conflict.

*RDDs with multiple dependencies and different resources*

For example:

{code}
val rddA = rdd.withResources.mapPartitions()
val rddB = rdd.withResources.mapPartitions()
val rddC = rddA.join(rddB)
{code}

If the resources of {{rddA}} differ from those of {{rddB}}, then we should also 
resolve such conflicts.

Previously I proposed using the largest resource requirement to satisfy all the 
needs, but that may also waste resources; [~mengxr] suggested setting/merging 
resources per partition to avoid the waste. Meanwhile, if there were an API to 
set resources at the stage level, this problem would not exist, but Spark 
doesn't expose such an API to users; the only thing a user can specify is at 
the RDD level. I'm still thinking of a good way to fix it.
 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550266#comment-16550266
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm still not sure how to handle memory per stage. Unlike MR, 
Spark shares the task runtime in a single JVM, and I'm not sure how to control 
memory usage within the JVM. Are you suggesting an approach similar to the GPU 
one: when the memory requirement cannot be satisfied, release the current 
executors and request new executors through dynamic resource allocation?

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I don't think YARN has a feature to configure password-less 
SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment 
(Ambari), we don't use password-less SSH.
{quote}And does the container by default run sshd? If not, which process is 
responsible for starting/terminating the daemon?
{quote}
If the container is not dockerized, it shares the system's sshd, and it is the 
system's responsibility to start/terminate this daemon.

If the container is dockerized, I think the docker container should be 
responsible for starting sshd (IIUC).

Maybe we should check whether sshd is running before starting the MPI job; if 
sshd is not running, we simply cannot run the MPI job, no matter who is 
responsible for the sshd daemon.

[~leftnoteasy] might have some thoughts, since he is the originator of 
mpich2-yarn.
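To make the pre-flight check concrete, a minimal hedged sketch (not taken from 
any design doc) of testing whether sshd is reachable on a host before launching 
an MPI job; the host, port, and timeout parameters are illustrative:

{code}
import java.net.{InetSocketAddress, Socket}

// Returns true if something is accepting connections on the SSH port.
def sshdReachable(host: String, port: Int = 22, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}
{code}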

 

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Saisai Shao
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.
>  
> Requirements:
>  * understand how to set up YARN to run MPI job as a YARN application
>  * figure out how to do it with Spark w/ Barrier






[jira] [Assigned] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24195:
---

Assignee: Li Yuanjian

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Assignee: Li Yuanjian
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config
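A hedged sketch of the general idea (illustrative only, not necessarily what 
the eventual fix does): honor an explicit scheme such as local: before falling 
back to java.io.File canonicalization:

{code}
import java.io.File
import java.net.URI

// If the path already carries a scheme (local:, hdfs:, file:, ...), keep it
// as-is; only bare filesystem paths go through File canonicalization.
def resolveURI(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri
  else new File(path).getCanonicalFile.toURI
}
{code}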






[jira] [Resolved] (SPARK-24195) sc.addFile for local:/ path is broken

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24195.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21533
[https://github.com/apache/spark/pull/21533]

> sc.addFile for local:/ path is broken
> -
>
> Key: SPARK-24195
> URL: https://issues.apache.org/jira/browse/SPARK-24195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In changing SPARK-6300
> https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2
> essentially the change to
> new File(path).getCanonicalFile.toURI.toString
> breaks when path is local:, as java.io.File doesn't handle it.
> eg.
> new 
> File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString
> res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config






[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550167#comment-16550167
 ] 

Saisai Shao commented on SPARK-24307:
-

Issue resolved by pull request 21440
[https://github.com/apache/spark/pull/21440]

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.   However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, eg.  cached 
> RDD blocks.
> eg. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in 

[jira] [Resolved] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24307.
-
Resolution: Fixed
  Assignee: Imran Rashid

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.   However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, eg.  cached 
> RDD blocks.
> eg. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached 

[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24307:

Fix Version/s: 2.4.0

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.   However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, eg.  cached 
> RDD blocks.
> eg. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB 
> in memory). 
>  A drawback to this approach is that blocks that are cached in memory as 
> deserialized values would need to have the 

[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: (was: 2.4.0)

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036






[jira] [Updated] (SPARK-24037) stateful operators

2018-07-19 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24037:

Fix Version/s: 2.4.0

> stateful operators
> --
>
> Key: SPARK-24037
> URL: https://issues.apache.org/jira/browse/SPARK-24037
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> pointer to https://issues.apache.org/jira/browse/SPARK-24036






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549276#comment-16549276
 ] 

Saisai Shao commented on SPARK-24615:
-

Thanks [~tgraves] for the suggestion. 
{quote}Once I get to the point I want to do the ML I want to ask for the gpu's 
as well as ask for more memory during that stage because I didn't need more 
before this stage for all the etl work.  I realize you already have executors, 
but ideally spark with the cluster manager could potentially release the 
existing ones and ask for new ones with those requirements.
{quote}
Yes, I already discussed this with my colleague offline; it is a valid scenario, 
but I think to achieve it we would have to change the current dynamic resource 
allocation mechanism. Currently I mark this as a non-goal in this proposal and 
only focus on static resource requesting (--executor-cores, 
--executor-gpus). I think we should support it later.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-18 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548662#comment-16548662
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves], I'm rewriting the design doc based on the comments mentioned 
above, so I temporarily made it inaccessible; sorry about that, I will reopen it.

I think it is hard to control memory usage per stage/task, because tasks run 
inside an executor that is shared within a single JVM. For CPU, yes, I think we 
can do it, but I'm not sure about the usage scenario for it.

For the requirement of using different types of machines, what I can think of 
is leveraging dynamic resource allocation. For example, if a user wants to run 
some MPI jobs with barrier mode enabled, Spark could allocate new executors 
with accelerator resources via the cluster manager (for example using node 
labels if it is running on YARN). But I will not target this as a goal in this 
design, since a) it is currently a non-goal for the barrier scheduler; b) it 
makes the design too complex, and it would be better to separate it into 
another piece of work.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-17 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547263#comment-16547263
 ] 

Saisai Shao commented on SPARK-24615:
-

Sure, I will also add it, as Xiangrui raised the same concern.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that CPUs are present on each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-17 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546590#comment-16546590
 ] 

Saisai Shao commented on SPARK-24615:
-

Yes, currently the user is responsible for asking YARN for resources like GPUs, 
for example via something like {{--num-gpus}}. 

Yes, I agree with you; the concerns you mentioned above are valid. But currently 
the design in this JIRA only targets task-level scheduling with accelerator 
resources that are already available. It assumes that accelerator resources have 
already been acquired by the executors and reported back to the driver, and the 
driver then schedules tasks based on the available resources.

Having Spark communicate with the cluster manager to request other resources 
such as GPUs is currently not covered in this design doc.

Xiangrui mentioned that Spark-to-cluster-manager communication should also be 
covered in this SPIP, so I'm still drafting that part.
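
To make the assumed flow concrete, here is a minimal, purely illustrative sketch 
(the configuration key {{spark.task.accelerator.amount}} is a hypothetical 
placeholder invented for this example, not an existing or proposed setting; only 
the scheduling behaviour described above comes from the design):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object AcceleratorAwareSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("accelerator-aware-sketch")
      .set("spark.executor.instances", "4")
      // Hypothetical, illustrative key: a per-task accelerator demand that an
      // accelerator-aware scheduler would match against resources the executors
      // have reported back to the driver.
      .set("spark.task.accelerator.amount", "1")

    val sc = new SparkContext(conf)

    // With an accelerator-aware scheduler, each task below would only be placed
    // on an executor with a free accelerator; today the scheduler only considers
    // data locality and free CPU cores.
    val doubled = sc.parallelize(1 to 100, 8)
      .mapPartitions(iter => iter.map(_ * 2)) // accelerator work would happen here
      .collect()

    println(s"processed ${doubled.length} records")
    sc.stop()
  }
}
{code}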

 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. For the current Spark architecture to work 
> with accelerator cards, Spark itself should be aware of accelerators and 
> know how to schedule tasks onto the executors that are equipped with them.
> Spark's current scheduler assigns tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # One node usually has more CPU cores than accelerators, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that every node has CPUs, but the 
> same is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to place tasks in a smarter way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-16 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545962#comment-16545962
 ] 

Saisai Shao commented on SPARK-24615:
-

Sorry [~tgraves] for the late response. Yes, when requesting executors, the user 
should know whether accelerators are required. If no accelerators satisfy the 
request, the job will stay pending or will not be launched. 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. For the current Spark architecture to work 
> with accelerator cards, Spark itself should be aware of accelerators and 
> know how to schedule tasks onto the executors that are equipped with them.
> Spark's current scheduler assigns tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # One node usually has more CPU cores than accelerators, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that every node has CPUs, but the 
> same is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to place tasks in a smarter way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544386#comment-16544386
 ] 

Saisai Shao edited comment on SPARK-24804 at 7/15/18 1:29 AM:
--

Please don't set the target version or fix version. Committers will set them 
when this issue is resolved.

I'm removing the target version of this JIRA to avoid blocking the release of 
2.3.2.


was (Author: jerryshao):
Please don't set the target version or fix version. Committers will set them 
when this issue is resolved.

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544386#comment-16544386
 ] 

Saisai Shao commented on SPARK-24804:
-

Please don't set the target version or fix version. Committers will set them 
when this issue is resolved.

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Priority: Trivial  (was: Minor)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Target Version/s:   (was: 2.3.2)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite

2018-07-14 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24804:

Fix Version/s: (was: 2.3.2)

> There are duplicate words in the title in the DatasetSuite
> --
>
> Key: SPARK-24804
> URL: https://issues.apache.org/jira/browse/SPARK-24804
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: hantiantian
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.

2018-07-11 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539635#comment-16539635
 ] 

Saisai Shao commented on SPARK-24781:
-

I see. I will wait for this before cutting a new 2.3.2 RC release.

> Using a reference from Dataset in Filter/Sort might not work.
> -
>
> Key: SPARK-24781
> URL: https://issues.apache.org/jira/browse/SPARK-24781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> When we use a reference from a {{Dataset}} in {{filter}} or {{sort}} that was 
> not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g.,
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(df("name")).filter(df("id") === 0).show()
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing 
> from name#5 in operator !Filter (id#6 = 0).;;
> !Filter (id#6 = 0)
>+- AnalysisBarrier
>   +- Project [name#5]
>  +- Project [_1#2 AS name#5, _2#3 AS id#6]
> +- LocalRelation [_1#2, _2#3]
> {noformat}
> If we use {{col}} instead, it works:
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(col("name")).filter(col("id") === 0).show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.

2018-07-11 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539630#comment-16539630
 ] 

Saisai Shao commented on SPARK-24781:
-

Thanks Felix. Does this have to be in 2.3.2? [~ueshin]

> Using a reference from Dataset in Filter/Sort might not work.
> -
>
> Key: SPARK-24781
> URL: https://issues.apache.org/jira/browse/SPARK-24781
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Takuya Ueshin
>Priority: Blocker
>
> When we use a reference from a {{Dataset}} in {{filter}} or {{sort}} that was 
> not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g.,
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(df("name")).filter(df("id") === 0).show()
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing 
> from name#5 in operator !Filter (id#6 = 0).;;
> !Filter (id#6 = 0)
>+- AnalysisBarrier
>   +- Project [name#5]
>  +- Project [_1#2 AS name#5, _2#3 AS id#6]
> +- LocalRelation [_1#2, _2#3]
> {noformat}
> If we use {{col}} instead, it works:
> {code:scala}
> val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
> df.select(col("name")).filter(col("id") === 0).show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-07-10 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24678:
---

Assignee: sharkd tu

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Assignee: sharkd tu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, `BlockRDD.getPreferredLocations` only returns the host info of 
> blocks, so the resulting scheduling level is never better than 'NODE_LOCAL'. 
> With a small change, the scheduling level can be improved to 'PROCESS_LOCAL'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming

2018-07-10 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24678.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21658
[https://github.com/apache/spark/pull/21658]

> We should use 'PROCESS_LOCAL' first for Spark-Streaming
> ---
>
> Key: SPARK-24678
> URL: https://issues.apache.org/jira/browse/SPARK-24678
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: sharkd tu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, `BlockRDD.getPreferredLocations` only returns the host info of 
> blocks, so the resulting scheduling level is never better than 'NODE_LOCAL'. 
> With a small change, the scheduling level can be improved to 'PROCESS_LOCAL'.
>  
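
The idea in the description above can be sketched as follows (a hedged 
illustration only; the actual change merged in the pull request may differ). 
Spark's scheduler understands executor-aware location strings of the form 
{{executor_<host>_<executorId>}}, which is what allows PROCESS_LOCAL placement:

{code:scala}
import org.apache.spark.storage.BlockManagerId

// Hedged sketch of the idea only (not the actual patch): express each block
// location in the executor-aware "executor_<host>_<executorId>" form so the
// scheduler can reach PROCESS_LOCAL, instead of returning bare host names,
// which caps the locality level at NODE_LOCAL.
def preferredLocationsFor(blockLocations: Seq[BlockManagerId]): Seq[String] =
  blockLocations.map(id => s"executor_${id.host}_${id.executorId}")
{code}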



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-07-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24646.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21633
[https://github.com/apache/spark/pull/21633]

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> In the case of getting tokens via a customized {{ServiceCredentialProvider}}, 
> the {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to download locally.
> To make this configuration easier to use, we propose adding wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and clarifying the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-07-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24646:
---

Assignee: Saisai Shao

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.4.0
>
>
> In the case of getting tokens via a customized {{ServiceCredentialProvider}}, 
> the {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to download locally.
> To make this configuration easier to use, we propose adding wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and clarifying the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565
 ] 

Saisai Shao edited comment on SPARK-24723 at 7/6/18 8:25 AM:
-

[~mengxr] [~jiangxb1987]

There's one solution that handles the password-less SSH problem for all cluster 
managers in a programmatic way. It is borrowed from the MPI-on-YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In that framework, before launching the MPI job, the application master (master) 
generates an SSH private/public key pair and then propagates the public key to 
all the containers (workers). During container start, each container writes the 
public key to its local authorized_keys file, so the MPI job started from the 
master node can then SSH into all the containers without a password. After the 
MPI job finishes, all the containers delete this public key from their 
authorized_keys files to revert the environment.

In our case, we could do something similar: before launching the MPI job, the 
0-th task could also generate an SSH private/public key pair and then propagate 
the public key to all the barrier tasks (maybe through BarrierTaskContext). The 
other tasks could receive the public key from the 0-th task and write it to 
their authorized_keys files (maybe via BarrierTaskContext). After this, 
password-less SSH is set up and mpirun can be started from the 0-th task 
without a password. After the MPI job finishes, all the barrier tasks could 
delete this public key from their authorized_keys files to revert the 
environment.

The example code is like below:
{code:java}
 rdd.barrier().mapPartitions { (iter, context) => 
  // Write iter to disk. ??? 
  // Wait until all tasks finished writing. 
  context.barrier() 
  // The 0-th task launches an MPI job. 
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos. ???
    // Launch the MPI job by calling mpirun. ???
  } else {
    // get and set up the public key
    // notify the 0-th task that the public key is set up.
  }

  // Wait until the MPI job finished. 
  context.barrier()

  // Delete SSH key and revert the environment. 
  // Collect output and return. ??? 
 }
{code}
 

What is your opinion about this solution?


was (Author: jerryshao):
[~mengxr] [~jiangxb1987]

There's one solution that handles the password-less SSH problem for all cluster 
managers in a programmatic way. It is borrowed from the MPI-on-YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In that framework, before launching the MPI job, the application master (master) 
generates an SSH private/public key pair and then propagates the public key to 
all the containers (workers). During container start, each container writes the 
public key to its local authorized_keys file, so the MPI job started from the 
master node can then SSH into all the containers without a password. After the 
MPI job finishes, all the containers delete this public key from their 
authorized_keys files to revert the environment.

In our case, we could do something similar: before launching the MPI job, the 
0-th task could also generate an SSH private/public key pair and then propagate 
the public key to all the barrier tasks (maybe through BarrierTaskContext). The 
other tasks could receive the public key from the 0-th task and write it to 
their authorized_keys files (maybe via BarrierTaskContext). After this, 
password-less SSH is set up and mpirun can be started from the 0-th task 
without a password. After the MPI job finishes, all the barrier tasks could 
delete this public key from their authorized_keys files to revert the 
environment.

The example code is like below:

 
{code:java}
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk. ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos. ???
    // Launch the MPI job by calling mpirun. ???
  } else {
    // get and set up the public key
    // notify the 0-th task that the public key is set up.
  }

  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.
  // Collect output and return. ???
}
{code}
 

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 

[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565
 ] 

Saisai Shao commented on SPARK-24723:
-

[~mengxr] [~jiangxb1987]

There's one solution that handles the password-less SSH problem for all cluster 
managers in a programmatic way. It is borrowed from the MPI-on-YARN framework 
[https://github.com/alibaba/mpich2-yarn]

In that framework, before launching the MPI job, the application master (master) 
generates an SSH private/public key pair and then propagates the public key to 
all the containers (workers). During container start, each container writes the 
public key to its local authorized_keys file, so the MPI job started from the 
master node can then SSH into all the containers without a password. After the 
MPI job finishes, all the containers delete this public key from their 
authorized_keys files to revert the environment.

In our case, we could do something similar: before launching the MPI job, the 
0-th task could also generate an SSH private/public key pair and then propagate 
the public key to all the barrier tasks (maybe through BarrierTaskContext). The 
other tasks could receive the public key from the 0-th task and write it to 
their authorized_keys files (maybe via BarrierTaskContext). After this, 
password-less SSH is set up and mpirun can be started from the 0-th task 
without a password. After the MPI job finishes, all the barrier tasks could 
delete this public key from their authorized_keys files to revert the 
environment.

The example code is like below:

 
{code:java}
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk. ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos. ???
    // Launch the MPI job by calling mpirun. ???
  } else {
    // get and set up the public key
    // notify the 0-th task that the public key is set up.
  }

  // Wait until the MPI job finished.
  context.barrier()

  // Delete SSH key and revert the environment.
  // Collect output and return. ???
}
{code}
 

What is your opinion about this solution?

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534529#comment-16534529
 ] 

Saisai Shao commented on SPARK-24723:
-

After discussing with Xiangrui offline: resource reservation is not the key 
focus here. The main problem is how to provide the necessary information for 
barrier tasks to start an MPI job in a password-less manner.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (LIVY-480) Apache livy and Ajax

2018-07-04 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved LIVY-480.
--
Resolution: Invalid

> Apache livy and Ajax
> 
>
> Key: LIVY-480
> URL: https://issues.apache.org/jira/browse/LIVY-480
> Project: Livy
>  Issue Type: Request
>  Components: API, Interpreter, REPL, Server
>Affects Versions: 0.5.0
>Reporter: Melchicédec NDUWAYO
>Priority: Major
>
> Good morning everyone,
> can someone help me with how to send a POST request via Ajax to the Apache 
> Livy server? I tried my best but I don't see how to do that. I have a 
> textarea in which the user can put his Spark code. I want to send that code to 
> Apache Livy via Ajax and get the result. I don't know if my question is clear. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533205#comment-16533205
 ] 

Saisai Shao commented on SPARK-24723:
-

Hi [~mengxr], I would like to understand the goal of this ticket. The goal of 
the barrier scheduler is to offer gang semantics at the task-scheduling level, 
whereas gang semantics at the YARN level are more about resources.

I discussed with [~leftnoteasy] the feasibility of supporting gang semantics on 
YARN. YARN has a Reservation System that supports gang-like semantics (reserving 
requested resources), but it is not designed for gang scheduling. Here are some 
thoughts about supporting it on YARN 
[https://docs.google.com/document/d/1OA-iVwuHB8wlzwwlrEHOK6Q2SlKy3-5QEB5AmwMVEUU/edit?usp=sharing],
 I'm not sure whether it aligns with the goal of this ticket.

> Discuss necessary info and access in barrier mode + YARN
> 
>
> Key: SPARK-24723
> URL: https://issues.apache.org/jira/browse/SPARK-24723
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to 
> provide users sufficient info and access so they can set up a hybrid 
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + YARN. There were some 
> past attempts from the Hadoop community. So we should find someone with good 
> knowledge to lead the discussion here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24739) PySpark does not work with Python 3.7.0

2018-07-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533189#comment-16533189
 ] 

Saisai Shao commented on SPARK-24739:
-

Do we have to fix it in 2.3.2? I don't think it is even critical. For example, 
Java 9 has been out for a long time, but we still don't support it. So this 
doesn't seem like a big problem. 

> PySpark does not work with Python 3.7.0
> ---
>
> Key: SPARK-24739
> URL: https://issues.apache.org/jira/browse/SPARK-24739
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.3, 2.2.2, 2.3.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Python 3.7 was released a few days ago and our PySpark does not work with it. 
> For example
> {code}
> sc.parallelize([1, 2]).take(1)
> {code}
> {code}
> File "/.../spark/python/pyspark/rdd.py", line 1343, in __main__.RDD.take
> Failed example:
> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
> Exception raised:
> Traceback (most recent call last):
>   File "/.../3.7/lib/python3.7/doctest.py", line 1329, in __run
> compileflags, 1), test.globs)
>   File "", line 1, in 
> sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3)
>   File "/.../spark/python/pyspark/rdd.py", line 1377, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "/.../spark/python/pyspark/context.py", line 1013, in runJob
> sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
> mappedRDD._jrdd, partitions)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0 in stage 143.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 143.0 (TID 688, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/.../spark/python/pyspark/rdd.py", line 1373, in takeUpToNumLeft
> yield next(iterator)
> StopIteration
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
>   File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 320, 
> in main
> process()
>   File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 315, 
> in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 378, in dump_stream
> vs = list(itertools.islice(iterator, batch))
> RuntimeError: generator raised StopIteration
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:309)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:449)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:432)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:263)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at 
> org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 
> org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at 
> org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
>   at 
> 

[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR on Windows

2018-07-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532145#comment-16532145
 ] 

Saisai Shao commented on SPARK-24535:
-

Hi [~felixcheung], what's the current status of this JIRA? Do you have an ETA 
for it?

> Fix java version parsing in SparkR on Windows
> -
>
> Key: SPARK-24535
> URL: https://issues.apache.org/jira/browse/SPARK-24535
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
>Priority: Blocker
>
> We see errors on CRAN of the form 
> {code:java}
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>   Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
>   -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
> --
>   subscript out of bounds
>   1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
> sparkConfig = sparkRTestConfig) at 
> D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
>   2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
> sparkExecutorEnvMap, 
>  sparkJars, sparkPackages)
>   3: checkJavaVersion()
>   4: strsplit(javaVersionFilter[[1]], "[\"]")
> {code}
> The complete log file is at 
> http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532143#comment-16532143
 ] 

Saisai Shao commented on SPARK-24530:
-

Hi [~hyukjin.kwon], what is the current status of this JIRA? Do you have an ETA 
for it?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, 
> the generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my local 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529352#comment-16529352
 ] 

Saisai Shao commented on SPARK-24530:
-

[~hyukjin.kwon], Spark 2.1.3 and 2.2.2 are up for a vote; can you please fix the 
issue and leave comments in the related threads?

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, 
> the generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my local 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken

2018-07-01 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24530:

Target Version/s: 2.1.3, 2.2.2, 2.3.2, 2.4.0  (was: 2.4.0)

> Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) 
> and pyspark.ml docs are broken
> ---
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, 
> the generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and the master docs generated on my local 
> machine. Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?
>  
> The following is the status of our released docs. Some recent docs seem to be 
> broken.
> *2.1.x*
> (O) 
> [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (O) 
> [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.2.x*
> (O) 
> [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> *2.3.x*
> (O) 
> [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]
> (X) 
> [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-479) Enable livy.rsc.launcher.address configuration

2018-06-27 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525775#comment-16525775
 ] 

Saisai Shao commented on LIVY-479:
--

Yes, I've left the comments on the PR.

> Enable livy.rsc.launcher.address configuration
> --
>
> Key: LIVY-479
> URL: https://issues.apache.org/jira/browse/LIVY-479
> Project: Livy
>  Issue Type: Improvement
>  Components: RSC
>Affects Versions: 0.5.0
>Reporter: Tao Li
>Priority: Major
>
> The current behavior is that we set livy.rsc.launcher.address to the RPC 
> server and ignore whatever is specified in the configs. However, in some 
> scenarios the user needs to be able to configure it explicitly. For example, 
> the IP address of an active Livy server might change over time, so rather 
> than using a fixed IP address, we need to set this to a more generic host 
> name. For this reason, we want to enable user-side configuration. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (SPARK-24418) Upgrade to Scala 2.11.12

2018-06-25 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24418.
-
Resolution: Fixed

Issue resolved by pull request 21495
[https://github.com/apache/spark/pull/21495]

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK 9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed with the Scala community:
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-06-25 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523022#comment-16523022
 ] 

Saisai Shao commented on SPARK-24646:
-

Hi [~vanzin], here is a specific example:

We have a customized {{ServiceCredentialProvider}}, an HS2 
{{ServiceCredentialProvider}}, which is located in our own jar {{foo}}. When the 
SparkSubmit process launches the YARN client, this {{foo}} jar needs to be on 
the SparkSubmit process classpath. 

If this {{foo}} jar is a local resource added by {{--jars}}, it will be on the 
classpath. But if it is a remote jar (for example on HDFS), only yarn-client 
mode will download it and add it to the classpath; in yarn-cluster mode it will 
not be downloaded, so this specific HS2 {{ServiceCredentialProvider}} will not 
be loaded.

When using the spark-submit script, the user can decide whether to add a remote 
or local jar. But in the Livy scenario, Livy only supports remote jars (jars on 
HDFS), and we only configure it to support yarn-cluster mode. So in this 
scenario, we cannot load this customized {{ServiceCredentialProvider}} in 
yarn-cluster mode.

So the fix is to force downloading the jars to the local SparkSubmit process via 
the configuration "spark.yarn.dist.forceDownloadSchemes". To make it easy to 
use, I propose to add wildcard '*' support, which will download all the remote 
resources without checking the scheme.
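
Assuming the wildcard support lands as proposed, the usage would look like the 
following hedged example (the same key can also be passed via {{--conf}} on 
spark-submit):

{code:scala}
import org.apache.spark.SparkConf

// Hedged example of the proposed usage: force all remote --jars/--files
// resources, regardless of scheme, to be downloaded into the local spark-submit
// process so that a custom ServiceCredentialProvider hosted on HDFS is visible
// on the local classpath before the YARN client is launched.
val conf = new SparkConf()
  .set("spark.yarn.dist.forceDownloadSchemes", "*")
{code}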

> Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
> 
>
> Key: SPARK-24646
> URL: https://issues.apache.org/jira/browse/SPARK-24646
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> In the case of getting tokens via a customized {{ServiceCredentialProvider}}, 
> the {{ServiceCredentialProvider}} must be available on the local spark-submit 
> process classpath. In this case, all the configured remote sources should be 
> forced to download locally.
> To make this configuration easier to use, we propose adding wildcard '*' 
> support to {{spark.yarn.dist.forceDownloadSchemes}} and clarifying the usage 
> of this configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes

2018-06-25 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24646:
---

 Summary: Support wildcard '*' for to 
spark.yarn.dist.forceDownloadSchemes
 Key: SPARK-24646
 URL: https://issues.apache.org/jira/browse/SPARK-24646
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Saisai Shao


In the case of getting tokens via a customized {{ServiceCredentialProvider}}, 
the {{ServiceCredentialProvider}} must be available on the local spark-submit 
process classpath. In this case, all the configured remote sources should be 
forced to download locally.

To make this configuration easier to use, we propose adding wildcard '*' support 
to {{spark.yarn.dist.forceDownloadSchemes}} and clarifying the usage of this 
configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark

2018-06-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519934#comment-16519934
 ] 

Saisai Shao commented on SPARK-24374:
-

Hi [~mengxr] [~jiangxb1987], SPARK-24615 would be a part of Project Hydrogen. I 
wrote a high-level design doc about the implementation; would you please help 
review and comment, to see whether the solution is feasible? Thanks a lot.

> SPIP: Support Barrier Scheduling in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-21 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519931#comment-16519931
 ] 

Saisai Shao commented on HIVE-16391:


Gently pinging [~hagleitn]: would you please help review the currently proposed 
patch and suggest the next step? Thanks a lot.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Labels: SPIP  (was: )

> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>  Labels: SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. For the current Spark architecture to work 
> with accelerator cards, Spark itself should be aware of accelerators and 
> know how to schedule tasks onto the executors that are equipped with them.
> Spark's current scheduler assigns tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # One node usually has more CPU cores than accelerators, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that every node has CPUs, but the 
> same is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to place tasks in a smarter way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Description: 
In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant 
compared to CPUs. For the current Spark architecture to work with accelerator 
cards, Spark itself should be aware of accelerators and know how to schedule 
tasks onto the executors that are equipped with them.

Spark's current scheduler assigns tasks based on the locality of the data plus 
the availability of CPUs. This introduces some problems when scheduling tasks 
that require accelerators.
 # One node usually has more CPU cores than accelerators, so using CPU cores to 
schedule accelerator-required tasks introduces a mismatch.
 # In one cluster, we can always assume that every node has CPUs, but the same 
is not true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator-required or not) requires 
the scheduler to place tasks in a smarter way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator-required or not). This can be part of the work of Project 
Hydrogen.

Details are attached in a Google doc. It doesn't cover all the implementation 
details, just highlights the parts that should be changed.

 

CC [~yanboliang] [~merlintang]

  was:
In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant 
compared to CPUs. To make the current Spark architecture work with accelerator 
cards, Spark itself should understand the existence of accelerators and know how 
to schedule tasks onto executors that are equipped with accelerators.

Spark's current scheduler schedules tasks based on data locality plus the 
availability of CPUs. This introduces some problems when scheduling tasks that 
require accelerators.
 # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
schedule accelerator-required tasks introduces a mismatch.
 # In a cluster we can always assume that every node has CPUs, but this is not 
true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator-required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator-required or not). This can be part of the work of Project 
Hydrogen.

Details are attached in a Google doc.

 

CC [~yanboliang] [~merlintang]


> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Spark's current scheduler schedules tasks based on data locality plus the 
> availability of CPUs. This introduces some problems when scheduling tasks 
> that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster we can always assume that every node has CPUs, but this is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24615:

Description: 
In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant 
compared to CPUs. To make the current Spark architecture work with accelerator 
cards, Spark itself should understand the existence of accelerators and know how 
to schedule tasks onto executors that are equipped with accelerators.

Spark's current scheduler schedules tasks based on data locality plus the 
availability of CPUs. This introduces some problems when scheduling tasks that 
require accelerators.
 # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
schedule accelerator-required tasks introduces a mismatch.
 # In a cluster we can always assume that every node has CPUs, but this is not 
true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator-required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator-required or not). This can be part of the work of Project 
Hydrogen.

Details are attached in a Google doc.

 

CC [~yanboliang] [~merlintang]

  was:
In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant 
compared to CPUs. To make the current Spark architecture work with accelerator 
cards, Spark itself should understand the existence of accelerators and know how 
to schedule tasks onto executors that are equipped with accelerators.

Spark's current scheduler schedules tasks based on data locality plus the 
availability of CPUs. This introduces some problems when scheduling tasks that 
require accelerators.
 # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
schedule accelerator-required tasks introduces a mismatch.
 # In a cluster we can always assume that every node has CPUs, but this is not 
true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator-required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator-required or not). Details are attached in a Google doc.


> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Spark's current scheduler schedules tasks based on data locality plus the 
> availability of CPUs. This introduces some problems when scheduling tasks 
> that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster we can always assume that every node has CPUs, but this is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-06-21 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24615:
---

 Summary: Accelerator aware task scheduling for Spark
 Key: SPARK-24615
 URL: https://issues.apache.org/jira/browse/SPARK-24615
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Saisai Shao


In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant 
compared to CPUs. To make the current Spark architecture work with accelerator 
cards, Spark itself should understand the existence of accelerators and know how 
to schedule tasks onto executors that are equipped with accelerators.

Spark's current scheduler schedules tasks based on data locality plus the 
availability of CPUs. This introduces some problems when scheduling tasks that 
require accelerators.
 # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
schedule accelerator-required tasks introduces a mismatch.
 # In a cluster we can always assume that every node has CPUs, but this is not 
true of accelerator cards.
 # The existence of heterogeneous tasks (accelerator-required or not) requires 
the scheduler to schedule tasks in a smart way.

So here we propose to improve the current scheduler to support heterogeneous 
tasks (accelerator-required or not). Details are attached in a Google doc.
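To make the scheduling mismatch concrete, here is a minimal, purely illustrative 
Scala sketch (the types below are invented for this example and are not the 
proposed Spark API): a scheduler must check accelerator availability in addition 
to CPU availability, otherwise accelerator-required tasks can land on nodes 
without accelerators.

{code}
// Illustrative only: hypothetical types, not part of Spark or of the design doc.
case class TaskResourceRequest(cpus: Int, gpus: Int)
case class ExecutorResources(freeCpus: Int, freeGpus: Int)

object AcceleratorAwareSchedulingSketch {
  // A task fits on an executor only if both CPU and accelerator demands are met;
  // scheduling on CPU availability alone would place GPU tasks on GPU-less nodes.
  def fits(task: TaskResourceRequest, exec: ExecutorResources): Boolean =
    exec.freeCpus >= task.cpus && exec.freeGpus >= task.gpus

  def main(args: Array[String]): Unit = {
    val gpuTask = TaskResourceRequest(cpus = 1, gpus = 1)
    val cpuOnlyNode = ExecutorResources(freeCpus = 16, freeGpus = 0)
    val gpuNode = ExecutorResources(freeCpus = 4, freeGpus = 2)
    println(fits(gpuTask, cpuOnlyNode)) // false: plenty of cores, no accelerator
    println(fits(gpuTask, gpuNode))     // true
  }
}
{code}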



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-19 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517720#comment-16517720
 ] 

Saisai Shao commented on SPARK-24493:
-

This bug is fixed by HDFS-12670; Spark should bump its supported Hadoop 3 
version to 3.1.1 once it is released.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-13 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated HIVE-16391:
---
Attachment: HIVE-16391.2.patch

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-13 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511851#comment-16511851
 ] 

Saisai Shao commented on HIVE-16391:


I see. I can keep the "core" classifier and use another name. I will update the 
patch.

[~owen.omalley], would you please help review this patch, since you created a 
Spark JIRA, or could you point to someone in the Hive community who could help 
review it? Thanks a lot.

 

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (LIVY-477) Upgrade Livy Scala version to 2.11.12

2018-06-12 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved LIVY-477.
--
   Resolution: Fixed
Fix Version/s: 0.6.0

> Upgrade Livy Scala version to 2.11.12
> -
>
> Key: LIVY-477
> URL: https://issues.apache.org/jira/browse/LIVY-477
> Project: Livy
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.5.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 0.6.0
>
>
> Scala versions below 2.11.12 have a CVE 
> ([https://scala-lang.org/news/security-update-nov17.html]), and Spark will 
> also upgrade its supported version to 2.11.12. 
> So here we upgrade Livy's Scala version as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24518:

Description: 
Currently Spark configures passwords in plaintext, for example in the 
configuration file or as launch arguments. Sometimes such configurations, like 
the SSL password, are set by the cluster admin and should not be visible to 
users, but today these passwords are world readable to all users.

The Hadoop credential provider API supports storing passwords in a secure way 
that Spark could read securely, so here we propose to add support for using the 
credential provider API to get passwords.
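As a rough illustration of the proposed direction, below is a minimal Scala 
sketch of reading a password through the Hadoop credential provider API instead 
of a plaintext config value. It assumes a JCEKS store was created beforehand 
(e.g. with {{hadoop credential create my.ssl.password -provider 
jceks://file/tmp/creds.jceks}}); the alias and store path are placeholders, not 
actual Spark configuration keys.

{code}
import org.apache.hadoop.conf.Configuration

object CredentialProviderSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Point Hadoop at the credential store; no password appears in the config itself.
    conf.set("hadoop.security.credential.provider.path", "jceks://file/tmp/creds.jceks")
    // getPassword() consults the configured credential providers first and only
    // then falls back to a plain (world-readable) configuration entry.
    val password: Option[Array[Char]] = Option(conf.getPassword("my.ssl.password"))
    println(s"Password resolved: ${password.isDefined}")
  }
}
{code}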

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently Spark configures passwords in plaintext, for example in the 
> configuration file or as launch arguments. Sometimes such configurations, 
> like the SSL password, are set by the cluster admin and should not be visible 
> to users, but today these passwords are world readable to all users.
> The Hadoop credential provider API supports storing passwords in a secure 
> way that Spark could read securely, so here we propose to add support for 
> using the credential provider API to get passwords.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24518:

Environment: (was: Currently Spark configures passwords in plaintext, for 
example in the configuration file or as launch arguments. Sometimes such 
configurations, like the SSL password, are set by the cluster admin and should 
not be visible to users, but today these passwords are world readable to all 
users.

The Hadoop credential provider API supports storing passwords in a secure way 
that Spark could read securely, so here we propose to add support for using the 
credential provider API to get passwords.)

> Using Hadoop credential provider API to store password
> --
>
> Key: SPARK-24518
> URL: https://issues.apache.org/jira/browse/SPARK-24518
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24518) Using Hadoop credential provider API to store password

2018-06-11 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24518:
---

 Summary: Using Hadoop credential provider API to store password
 Key: SPARK-24518
 URL: https://issues.apache.org/jira/browse/SPARK-24518
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: Currently Spark configures passwords in plaintext, for example 
in the configuration file or as launch arguments. Sometimes such configurations, 
like the SSL password, are set by the cluster admin and should not be visible to 
users, but today these passwords are world readable to all users.

The Hadoop credential provider API supports storing passwords in a secure way 
that Spark could read securely, so here we propose to add support for using the 
credential provider API to get passwords.
Reporter: Saisai Shao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506023#comment-16506023
 ] 

Saisai Shao commented on SPARK-23534:
-

Yes, it is a Hadoop 2.8+ issue, but we don't have a 2.8 profile for Spark; 
instead we have a 3.1 profile, so it will affect our Hadoop 3 build. That's why 
I also added a link here.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped into, or will soon step into, 
> Hadoop 3.0, so we should also make sure Spark can run with Hadoop 3.0. This 
> JIRA tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test to see if there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24487.
-
Resolution: Won't Fix

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505786#comment-16505786
 ] 

Saisai Shao commented on SPARK-24487:
-

I'm sure the Spark community will not merge this into the Spark code base. You 
can either maintain a package yourself or contribute it to Apache Bahir.

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-23151:

Issue Type: Sub-task  (was: Dependency upgrade)
Parent: SPARK-23534

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505783#comment-16505783
 ] 

Saisai Shao commented on SPARK-23151:
-

I will convert this as a subtask of SPARK-23534.

> Provide a distribution of Spark with Hadoop 3.0
> ---
>
> Key: SPARK-23151
> URL: https://issues.apache.org/jira/browse/SPARK-23151
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Louis Burke
>Priority: Major
>
> Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark 
> package
> only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication 
> is
> that using up to date Kinesis libraries alongside s3 causes a clash w.r.t
> aws-java-sdk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505771#comment-16505771
 ] 

Saisai Shao commented on SPARK-23534:
-

Spark built with Hadoop 3 will fail at token renewal in the long-running case 
(with keytab and principal).

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors have already stepped into, or will soon step into, 
> Hadoop 3.0, so we should also make sure Spark can run with Hadoop 3.0. This 
> JIRA tracks the work to make Spark run on Hadoop 3.0.
> The work includes:
>  # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
>  # Test to see if there are dependency issues with Hadoop 3.0.
>  # Investigate the feasibility of using shaded client jars (HADOOP-11804).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768
 ] 

Saisai Shao edited comment on SPARK-24493 at 6/8/18 6:56 AM:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get 
the renew interval for the HDFS token, but due to a missing service loader file, 
Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads 
to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:

{code}
  private static Class<? extends TokenIdentifier>
      getClassForIdentifier(Text kind) {
    Class<? extends TokenIdentifier> cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }
{code}

The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but 
the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark local submit process/driver process (depending on 
client or cluster mode) only relies on the hadoop-hdfs-client jar, not on the 
hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS 
"DelegationTokenIdentifier" class and returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so there was no issue. But now we have a build 
profile for Hadoop 3.1, so this will break token renewal in Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark JIRA to track it and to bump the 
version once the Hadoop side is fixed.
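A short Scala sketch of the failure mode (illustrative only, not Spark's actual 
token-renewal code): when {{decodeIdentifier}} returns null, the caller silently 
falls back to an effectively infinite renew interval.

{code}
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier

object RenewIntervalSketch {
  // Derive a renew interval from the decoded identifier, or "never" if decoding fails.
  def renewInterval(token: Token[_ <: TokenIdentifier]): Long =
    token.decodeIdentifier() match {
      // Works when hadoop-hdfs (with its service loader descriptor) is on the classpath.
      case id: AbstractDelegationTokenIdentifier => id.getMaxDate - id.getIssueDate
      // decodeIdentifier() returned null (or an unknown kind): no renewal is ever scheduled.
      case _ => Long.MaxValue
    }
}
{code}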


was (Author: jerryshao):
Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get 
the renew interval for the HDFS token, but due to a missing service loader file, 
Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads 
to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:
{code}
  private static Class<? extends TokenIdentifier>
      getClassForIdentifier(Text kind) {
    Class<? extends TokenIdentifier> cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }
{code}


The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but 
the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark local submit process/driver process (depending on 
client or cluster mode) only relies on the hadoop-hdfs-client jar, not on the 
hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS 
"DelegationTokenIdentifier" class and returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so there was no issue. But now we have a build 
profile for Hadoop 3.1, so this will break token renewal in Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark JIRA to track it and to bump the 
version once the Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , 

[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768
 ] 

Saisai Shao commented on SPARK-24493:
-

Adding more background: the issue happens when building Spark with Hadoop 2.8+.

In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get 
the renew interval for the HDFS token, but due to a missing service loader file, 
Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads 
to an unexpected renew interval (Long.MaxValue).

The related code in Hadoop is:
{code}
  private static Class<? extends TokenIdentifier>
      getClassForIdentifier(Text kind) {
    Class<? extends TokenIdentifier> cls = null;
    synchronized (Token.class) {
      if (tokenKindMap == null) {
        tokenKindMap = Maps.newHashMap();
        for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
          tokenKindMap.put(id.getKind(), id.getClass());
        }
      }
      cls = tokenKindMap.get(kind);
    }
    if (cls == null) {
      LOG.debug("Cannot find class for token kind " + kind);
      return null;
    }
    return cls;
  }
{code}


The problem is:

The HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but 
the service loader descriptor file 
"META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the 
hadoop-hdfs jar. The Spark local submit process/driver process (depending on 
client or cluster mode) only relies on the hadoop-hdfs-client jar, not on the 
hadoop-hdfs jar. So the ServiceLoader fails to find the HDFS 
"DelegationTokenIdentifier" class and returns null.

The issue is due to the change in HADOOP-6200. Previously we only had build 
profiles for Hadoop 2.6 and 2.7, so there was no issue. But now we have a build 
profile for Hadoop 3.1, so this will break token renewal in Hadoop 3.1.

This is a Hadoop issue; I am creating a Spark JIRA to track it and to bump the 
version once the Hadoop side is fixed.

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> 

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3  
(was: Kerberos Ticket Renewal is failing in long running Spark job)

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-08 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505760#comment-16505760
 ] 

Saisai Shao commented on SPARK-24487:
-

You can add it in your own package using the DataSource API. There is no need 
to add it to the Spark code base.

 

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Component/s: Spark Core

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at 

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Component/s: (was: Spark Core)
 YARN

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job

2018-06-08 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24493:

Priority: Major  (was: Blocker)

> Kerberos Ticket Renewal is failing in long running Spark job
> 
>
> Key: SPARK-24493
> URL: https://issues.apache.org/jira/browse/SPARK-24493
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Asif M
>Priority: Major
>
> Kerberos Ticket Renewal is failing on long running spark job. I have added 
> below 2 kerberos properties in the HDFS configuration and ran a spark 
> streaming job 
> ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py])
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=180 (30min)
> dfs.namenode.delegation.token.renew-interval=90 (15min)
> {noformat}
>  
> Spark Job failed at 15min with below error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at 
> /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381)
>  failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 
> 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 
> (TID 7290, , executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, 
> renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, 
> sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 
> 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
> at org.apache.hadoop.ipc.Client.call(Client.java:1445)
> at org.apache.hadoop.ipc.Client.call(Client.java:1355)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
> at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
> at 
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
> at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
> at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
> at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at 
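
For context (not part of the original report): long-running Spark-on-YARN jobs 
that must outlive the delegation-token lifetime are usually given a principal 
and keytab so Spark can re-obtain tokens itself, typically via spark-submit's 
--principal/--keytab flags. A minimal sketch, with placeholder principal, 
keytab path, and batch interval:

{code}
// Sketch only: supplying a principal and keytab so delegation tokens can be
// re-obtained for a long-running streaming job. Values below are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("hdfs_wordcount")
  .set("spark.yarn.principal", "user@EXAMPLE.COM")   // placeholder principal
  .set("spark.yarn.keytab", "/path/to/user.keytab")  // placeholder keytab path
// Equivalent submit-time flags: --principal user@EXAMPLE.COM --keytab /path/to/user.keytab

val ssc = new StreamingContext(conf, Seconds(60))    // placeholder batch interval
{code}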

[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505608#comment-16505608
 ] 

Saisai Shao commented on SPARK-24487:
-

What is the use case and purpose of integrating RabbitMQ with Spark?

> Add support for RabbitMQ.
> -
>
> Key: SPARK-24487
> URL: https://issues.apache.org/jira/browse/SPARK-24487
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michał Jurkiewicz
>Priority: Major
>
> Add support for RabbitMQ.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-07 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505607#comment-16505607
 ] 

Saisai Shao commented on HIVE-16391:


Any comment [~vanzin] [~ste...@apache.org]?

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503188#comment-16503188
 ] 

Saisai Shao commented on HIVE-16391:


Uploaded a new patch [^HIVE-16391.1.patch] to use the solution mentioned by 
Marcelo.

It simply adds two new Maven modules and renames the original "hive-exec" 
module. One added module is the new "hive-exec", which stays compatible with 
the existing Hive artifact; the other, "hive-exec-spark", is built specifically 
for Spark.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-06 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated HIVE-16391:
---
Attachment: HIVE-16391.1.patch

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.1.patch, HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-06 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502976#comment-16502976
 ] 

Saisai Shao commented on HIVE-16391:


[~vanzin], one problem with the proposed solution: the hive-exec test jar is no 
longer valid, because we changed the artifact name for the current "hive-exec" 
pom. This might affect users who rely on that test jar.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502756#comment-16502756
 ] 

Saisai Shao commented on HIVE-16391:


{quote}The problem with that is that it changes the meaning of Hive's 
artifacts, so anybody currently importing hive-exec would see a breakage, and 
that's probably not desired.
{quote}
 
 This might not be acceptable to the Hive community, because it would break 
current users, as you mentioned.

As [~joshrosen] mentioned, Spark wants a hive-exec jar that shades kryo and 
protobuf-java, not a pure non-shaded jar.
{quote}Another option is to change the artifact name of the current "hive-exec" 
pom. Then you'd publish the normal jar under the new artifact name, then have a 
separate module that imports that jar, shades dependencies, and publishes the 
result as "hive-exec". That would maintain compatibility with existing 
artifacts.
{quote}
I can try this approach, but it is not a small change for Hive, and I'm not sure 
whether the Hive community will accept it (at least for branch 1.2).

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated HIVE-16391:
---
Fix Version/s: 1.2.3
Affects Version/s: 1.2.2
   Attachment: HIVE-16391.patch
   Status: Patch Available  (was: Open)

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Affects Versions: 1.2.2
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.2.3
>
> Attachments: HIVE-16391.patch
>
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501667#comment-16501667
 ] 

Saisai Shao commented on HIVE-16391:


I see, thanks. Will upload the patch.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Assignee: Saisai Shao
>Priority: Major
>  Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501561#comment-16501561
 ] 

Saisai Shao commented on HIVE-16391:


It seems I don't have permission to upload a file.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Priority: Major
>  Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501561#comment-16501561
 ] 

Saisai Shao edited comment on HIVE-16391 at 6/5/18 10:15 AM:
-

It seems I don't have permission to upload a file; there's no such button.


was (Author: jerryshao):
Seems there's no permission for me to upload a file.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Priority: Major
>  Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-05 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501415#comment-16501415
 ] 

Saisai Shao commented on HIVE-16391:


I'm not sure whether submitting a PR is the right way to get a review in the 
Hive community; waiting for feedback.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Priority: Major
>  Labels: pull-request-available
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)

2018-06-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501285#comment-16501285
 ] 

Saisai Shao commented on HIVE-16391:


Hi [~joshrosen], I'm trying to make the Hive changes you mentioned above using 
the new classifier {{core-spark}}. I found one problem with releasing two shaded 
jars (one is hive-exec, the other is hive-exec-core-spark): the published pom 
file is still the reduced pom for hive-exec, so when Spark uses the 
hive-exec-core-spark jar it has to explicitly declare all the transitive 
dependencies of hive-exec.

I'm not sure whether there is a way to publish two pom files mapped to two 
different shaded jars, or whether it is acceptable for Spark to explicitly 
declare all the transitive dependencies, like the {{core}} classifier you used 
before.

> Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
> -
>
> Key: HIVE-16391
> URL: https://issues.apache.org/jira/browse/HIVE-16391
> Project: Hive
>  Issue Type: Task
>  Components: Build Infrastructure
>Reporter: Reynold Xin
>Priority: Major
>
> Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the 
> only change in the fork is to work around the issue that Hive publishes only 
> two sets of jars: one set with no dependency declared, and another with all 
> the dependencies included in the published uber jar. That is to say, Hive 
> doesn't publish a set of jars with the proper dependencies declared.
> There is general consensus on both sides that we should remove the forked 
> Hive.
> The change in the forked version is recorded here 
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
> Note that the fork in the past included other fixes but those have all become 
> unnecessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-04 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501225#comment-16501225
 ] 

Saisai Shao commented on SPARK-20202:
-

OK, for the first one I've already started working on it locally. It looks like 
a small change; some POM changes should be enough. I will submit a patch to the 
Hive community.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-06-03 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499707#comment-16499707
 ] 

Saisai Shao commented on SPARK-20202:
-

What is our plan to fix this issue: are we going to use a new Hive version, or 
are we still sticking to 1.2?

If we're still sticking to 1.2, [~ste...@apache.org] and I will take this issue 
and get the ball rolling in the Hive community.

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-06-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497931#comment-16497931
 ] 

Saisai Shao commented on SPARK-18673:
-

Thanks Steve, looking forward to your input.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-06-01 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497626#comment-16497626
 ] 

Saisai Shao commented on SPARK-24355:
-

Do we have a test result before and after?

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of 
> YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients, either when registering an executor with the 
> local shuffle server or when establishing a connection to a remote shuffle 
> server.
> Example of a timeout when establishing a connection with a remote shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout when registering an executor with the local shuffle 
> server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help alleviate this type of problem, 
> they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> with very high values. Further investigation revealed the following:
> Right now, the default number of server-side netty handler threads is 2 * the 
> number of cores, and it can be configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server netty handler 
> thread.
> However, when the server netty handler threads start to process 
> ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when the shuffle server is serving many concurrent 
> 
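
As a concrete illustration of the knob named above (added here, not part of the 
original report), the handler-thread count can be raised through 
spark.shuffle.io.serverThreads; the value below is arbitrary, and for the YARN 
NM-hosted external shuffle service the key would presumably have to be supplied 
through the NodeManager-side configuration rather than a SparkConf.

{code}
// Sketch only: overriding the default of 2 * cores for netty server threads.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-server-threads-example")
  .set("spark.shuffle.io.serverThreads", "128")  // arbitrary illustrative value
{code}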

[jira] [Commented] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster

2018-05-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497548#comment-16497548
 ] 

Saisai Shao commented on SPARK-24448:
-

Does it only happen in standalone cluster mode? Have you tried client mode?

> File not found on the address SparkFiles.get returns on standalone cluster
> --
>
> Key: SPARK-24448
> URL: https://issues.apache.org/jira/browse/SPARK-24448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Pritpal Singh
>Priority: Major
>
> I want to upload a file to all worker nodes in a standalone cluster and 
> retrieve the location of the file. Here is my code:
>  
> val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks"
> val file = new File(tempKeyStoreLoc)
> sparkContext.addFile(file.getAbsolutePath)
> val keyLoc = SparkFiles.get("keystore.jks")
>  
> SparkFiles.get returns a seemingly random location where keystore.jks does not 
> exist. I submit the job in cluster mode. In fact, the location SparkFiles.get 
> returns does not exist on any of the worker nodes (including the driver node).
> I observed that Spark does load the keystore.jks file on worker nodes at 
> /work///keystore.jks. The partition_id 
> changes from one worker node to another.
> My requirement is to upload a file to all nodes of a cluster and retrieve its 
> location. I'm expecting the location to be common across all worker nodes.
>  
>  
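
For context (an illustration added here, not a resolution of the ticket): 
SparkFiles.get resolves against the local file root of whichever JVM calls it, 
so the driver-side path is generally not valid on the workers. A minimal sketch 
of resolving the path inside a task, on each executor:

{code}
// Sketch only: resolve the distributed file's path per executor, inside a task,
// rather than expecting one path that is common across the whole cluster.
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("sparkfiles-example"))
sc.addFile("/tmp/keystore.jks")  // placeholder path on the submitting machine

val executorPaths = sc.parallelize(1 to 4, 4).map { _ =>
  SparkFiles.get("keystore.jks")  // each executor sees its own local copy
}.collect()

executorPaths.foreach(println)
{code}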



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-31 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496187#comment-16496187
 ] 

Saisai Shao commented on SPARK-18673:
-

I created a PR for this issue [https://github.com/JoshRosen/hive/pull/2]

Actually, a one-line fix is enough; most of the other changes in Hive are 
related to HBase, which we don't need.

I'm not sure what our plan is for Hive support: are we still planning to use 
1.2.1spark2 as the built-in version for Spark going forward, or do we plan to 
upgrade to the latest Hive version? If we plan to upgrade to the latest one, 
this PR is not necessary.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23991) data loss when allocateBlocksToBatch

2018-05-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23991.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21430
[https://github.com/apache/spark/pull/21430]

> data loss when allocateBlocksToBatch
> 
>
> Key: SPARK-23991
> URL: https://issues.apache.org/jira/browse/SPARK-23991
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.2.0
> Environment: spark 2.11
>Reporter: kevin fu
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> With checkpointing and the WAL enabled, the driver writes the allocation of 
> blocks to a batch into HDFS. However, if that write fails as shown below, the 
> blocks of this batch cannot be computed by the DAG, because they have already 
> been dequeued from the receivedBlockQueue and are lost.
> {panel:title=error log}
> 18/04/15 11:11:25 WARN ReceivedBlockTracker: Exception thrown while writing 
> record: BatchAllocationEvent(152376548 ms,AllocatedBlocks(Map(0 -> 
> ArrayBuffer( to the WriteAheadLog. org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
>  at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused 
> by: java.util.concurrent.TimeoutException: Futures timed out after [5000 
> milliseconds] at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at 
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>  at scala.concurrent.Await$.result(package.scala:190) at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 12 
> more 18/04/15 11:11:25 INFO ReceivedBlockTracker: Possibly processed batch 
> 152376548 ms needs to be processed again in WAL recovery{panel}
> The relevant code is shown below:
> {code}
>   /**
>* Allocate all unallocated blocks to the given batch.
>* This event will get written to the write ahead log (if enabled).
>*/
>   def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
> if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) 
> {
>   val streamIdToBlocks = streamIds.map { streamId =>
>   (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
>   }.toMap
>   val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
>   if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
> timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
> lastAllocatedBatchTime = batchTime
>   } else {
> logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
>   }
> } else {
>   // This situation occurs when:
>   // 1. WAL is ended with BatchAllocationEvent, but without 
> BatchCleanupEvent,
>   // possibly processed batch job or half-processed batch job need to be 
> processed again,
>   // so the batchTime will be equal to lastAllocatedBatchTime.
>   // 2. Slow checkpointing makes recovered batch time older than WAL 
> recovered
>   // lastAllocatedBatchTime.
>   // This situation will only occurs in recovery time.
>   logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
> }
>   }
> {code}
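
To make the failure mode concrete, here is an illustration added for this 
write-up (it is not the change from pull request 21430): because dequeueAll 
removes the blocks before the WAL write is known to have succeeded, a defensive 
variant would push the blocks back when the write fails instead of dropping 
them.

{code}
// Illustrative sketch only, not the actual fix: requeue dequeued blocks when
// the WAL write fails so they are not silently lost.
import scala.collection.mutable

case class Block(id: Long)

// Stand-in for the real WAL write, which can time out and report failure.
def writeToLog(blocks: Map[Int, Seq[Block]]): Boolean = false

val receivedBlockQueues: Map[Int, mutable.Queue[Block]] =
  Map(0 -> mutable.Queue(Block(1), Block(2)))

val streamIdToBlocks = receivedBlockQueues.map { case (id, q) =>
  id -> q.dequeueAll(_ => true).toSeq
}

if (!writeToLog(streamIdToBlocks)) {
  // On failure, put the blocks back so a later batch can still allocate them.
  streamIdToBlocks.foreach { case (id, blocks) =>
    blocks.foreach(b => receivedBlockQueues(id).enqueue(b))
  }
}
{code}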



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Assigned] (SPARK-23991) data loss when allocateBlocksToBatch

2018-05-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-23991:
---

Assignee: Gabor Somogyi

> data loss when allocateBlocksToBatch
> 
>
> Key: SPARK-23991
> URL: https://issues.apache.org/jira/browse/SPARK-23991
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.2.0
> Environment: spark 2.11
>Reporter: kevin fu
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> With checkpointing and the WAL enabled, the driver writes the allocation of 
> blocks to a batch into HDFS. However, if that write fails as shown below, the 
> blocks of this batch cannot be computed by the DAG, because they have already 
> been dequeued from the receivedBlockQueue and are lost.
> {panel:title=error log}
> 18/04/15 11:11:25 WARN ReceivedBlockTracker: Exception thrown while writing 
> record: BatchAllocationEvent(152376548 ms,AllocatedBlocks(Map(0 -> 
> ArrayBuffer( to the WriteAheadLog. org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234)
>  at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
>  at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>  at scala.util.Try$.apply(Try.scala:192) at 
> org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
>  at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused 
> by: java.util.concurrent.TimeoutException: Futures timed out after [5000 
> milliseconds] at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at 
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>  at scala.concurrent.Await$.result(package.scala:190) at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 12 
> more 18/04/15 11:11:25 INFO ReceivedBlockTracker: Possibly processed batch 
> 152376548 ms needs to be processed again in WAL recovery{panel}
> The relevant code is shown below:
> {code}
>   /**
>* Allocate all unallocated blocks to the given batch.
>* This event will get written to the write ahead log (if enabled).
>*/
>   def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
> if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) 
> {
>   val streamIdToBlocks = streamIds.map { streamId =>
>   (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
>   }.toMap
>   val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
>   if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
> timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
> lastAllocatedBatchTime = batchTime
>   } else {
> logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
>   }
> } else {
>   // This situation occurs when:
>   // 1. WAL is ended with BatchAllocationEvent, but without 
> BatchCleanupEvent,
>   // possibly processed batch job or half-processed batch job need to be 
> processed again,
>   // so the batchTime will be equal to lastAllocatedBatchTime.
>   // 2. Slow checkpointing makes recovered batch time older than WAL 
> recovered
>   // lastAllocatedBatchTime.
>   // This situation will only occurs in recovery time.
>   logInfo(s"Possibly processed batch $batchTime needs to be processed 
> again in WAL recovery")
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Resolved] (LIVY-472) Improve the logs for fail-to-create session

2018-05-24 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/LIVY-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved LIVY-472.
--
   Resolution: Fixed
Fix Version/s: 0.6.0
   0.5.1

Issue resolved by pull request 96
[https://github.com/apache/incubator-livy/pull/96]

> Improve the logs for fail-to-create session
> ---
>
> Key: LIVY-472
> URL: https://issues.apache.org/jira/browse/LIVY-472
> Project: Livy
>  Issue Type: Improvement
>  Components: Server
>Affects Versions: 0.5.0
>Reporter: Saisai Shao
>Priority: Minor
> Fix For: 0.5.1, 0.6.0
>
>
> Livy currently doesn't give a very clear log message when session creation 
> fails; it only says that the session-related app tag cannot be found in the 
> RM, but doesn't tell the user how to find the true root cause. This change 
> makes the logs clearer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (SPARK-24377) Make --py-files work in non pyspark application

2018-05-24 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-24377:
---

 Summary: Make --py-files work in non pyspark application
 Key: SPARK-24377
 URL: https://issues.apache.org/jira/browse/SPARK-24377
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Saisai Shao


Some Spark applications, even though they are Java programs, require not only 
jar dependencies but also Python dependencies. One example is Livy's remote 
SparkContext application, which is essentially an embedded REPL for 
Scala/Python/R, so it loads not only jar dependencies but also Python and R 
dependencies.

Currently --py-files only works for a PySpark application, so it does not work 
in the case above. This proposes removing that restriction (see the sketch 
after this description).

We also found that "spark.submit.pyFiles" only supports a quite limited scenario 
(client mode with local dependencies), so this also expands 
"spark.submit.pyFiles" to be an alternative to --py-files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-471) New session creation API set to support resource uploading

2018-05-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/LIVY-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481375#comment-16481375
 ] 

Saisai Shao commented on LIVY-471:
--

I have a local PR for solution 2, which is simpler and more straightforward. It 
only adds new APIs while keeping the old API consistent, and it handles both 
batch and interactive sessions. I can submit a WIP PR so you can see the 
implementation more clearly.

> New session creation API set to support resource uploading
> --
>
> Key: LIVY-471
> URL: https://issues.apache.org/jira/browse/LIVY-471
> Project: Livy
>  Issue Type: Improvement
>  Components: Server
>Affects Versions: 0.5.0
>Reporter: Saisai Shao
>Priority: Major
>
> Already posted on the mailing list.
> In our current API design for creating interactive / batch sessions, we assume 
> the end user uploads jars, pyFiles, and related dependencies to HDFS before 
> creating the session, and we use one POST request to create the session. But 
> often the end user does not have permission to access HDFS from their 
> submission machine, which makes it hard to create new sessions. So the 
> requirement here is for Livy to offer APIs to upload resources during 
> session creation. One implementation is proposed in
> https://github.com/apache/incubator-livy/pull/91
> That PR adds a field to the session creation request to delay session 
> creation, then adds a set of APIs to support resource upload, and finally adds 
> an API to start creating the session. This seems like a feasible solution, but 
> it is also a little hacky for this scenario. So I was thinking we could add a 
> set of new APIs to support such scenarios, rather than hacking the existing 
> APIs.
> To borrow the concept from YARN application submission, we could have 3 APIs 
> to create a session (sketched after this description):
>  * requesting a new session id from the Livy server,
>  * uploading resources associated with this session id,
>  * submitting the request to create the session.
> This is similar to YARN's process for submitting an application, and we can 
> bump the supported API version for the newly added APIs.
>  
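
Purely to illustrate the proposed three-step flow described above; the endpoint 
paths, payloads, and port below are hypothetical, not an agreed Livy API.

{code}
// Hypothetical three-step session creation flow; endpoint paths, JSON fields,
// and upload format are illustrative only, not an agreed Livy API.
import java.io.OutputStream
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

def post(url: String, contentType: String, body: Array[Byte]): String = {
  val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", contentType)
  val out: OutputStream = conn.getOutputStream
  try out.write(body) finally out.close()
  scala.io.Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
}

val livy = "http://livy-server:8998"  // placeholder host and port

// 1. Request a new session id (hypothetical endpoint).
val idResponse = post(s"$livy/sessions/id", "application/json",
  "{}".getBytes(StandardCharsets.UTF_8))

// 2. Upload a resource associated with that session id (hypothetical endpoint;
//    a real implementation would likely use multipart/form-data).
val sessionId = 0  // would be parsed from idResponse
post(s"$livy/sessions/$sessionId/resources", "application/octet-stream",
  Files.readAllBytes(Paths.get("/path/to/deps.jar")))  // placeholder path

// 3. Submit the request that actually creates the session.
post(s"$livy/sessions/$sessionId/create", "application/json",
  """{"kind": "spark"}""".getBytes(StandardCharsets.UTF_8))
{code}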



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24241) Do not fail fast when dynamic resource allocation enabled with 0 executor

2018-05-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475480#comment-16475480
 ] 

Saisai Shao commented on SPARK-24241:
-

Issue resolved by pull request 21290
https://github.com/apache/spark/pull/21290

> Do not fail fast when dynamic resource allocation enabled with 0 executor
> -
>
> Key: SPARK-24241
> URL: https://issues.apache.org/jira/browse/SPARK-24241
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.4.0
>
>
> {code:java}
> ~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf 
> spark.dynamicAllocation.enabled=true
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; 
> support was removed in 8.0
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; 
> support was removed in 8.0
> Error: Number of executors must be a positive number
> Run with --help for usage help or --verbose for debug output
> {code}
> Actually, we could start up with min executor number with 0 before 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


