[jira] [Updated] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24964: Target Version/s: (was: 2.3.2, 2.4.0, 3.0.0, 2.3.3) > Please add OWASP Dependency Check to all comonent builds(pom.xml) > -- > > Key: SPARK-24964 > URL: https://issues.apache.org/jira/browse/SPARK-24964 > Project: Spark > Issue Type: New Feature > Components: Build, MLlib, Spark Core, SparkR >Affects Versions: 2.3.1 > Environment: All development, build, test, environments. > ~/workspace/spark-2.3.1/pom.xml > ~/workspace/spark-2.3.1/assembly/pom.xml > ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/kvstore/pom.xml > ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-common/pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml > ~/workspace/spark-2.3.1/common/network-yarn/pom.xml > ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/sketch/pom.xml > ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/tags/pom.xml > ~/workspace/spark-2.3.1/common/unsafe/pom.xml > ~/workspace/spark-2.3.1/core/pom.xml > ~/workspace/spark-2.3.1/examples/pom.xml > ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml > ~/workspace/spark-2.3.1/external/flume/pom.xml > ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml > ~/workspace/spark-2.3.1/external/flume-sink/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml > ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml > ~/workspace/spark-2.3.1/graphx/pom.xml > ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml > ~/workspace/spark-2.3.1/launcher/pom.xml > ~/workspace/spark-2.3.1/mllib/pom.xml > ~/workspace/spark-2.3.1/mllib-local/pom.xml > ~/workspace/spark-2.3.1/repl/pom.xml > ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml > ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml > ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml > ~/workspace/spark-2.3.1/sql/catalyst/pom.xml > ~/workspace/spark-2.3.1/sql/core/pom.xml > ~/workspace/spark-2.3.1/sql/hive/pom.xml > ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml > ~/workspace/spark-2.3.1/streaming/pom.xml > ~/workspace/spark-2.3.1/tools/pom.xml >Reporter: Albert Baker >Priority: Major > Labels: build, easy-fix, security > Original Estimate: 3h > Remaining Estimate: 3h > > OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & > Exposures (CVE) to perform a lookup for each dependant .jar to list any/all > known vulnerabilities for each jar. This step is needed because a manual > MITRE CVE lookup/check on the main component does not include checking for > vulnerabilities in dependant libraries. > OWASP Dependency check : > https://www.owasp.org/index.php/OWASP_Dependency_Check has plug-ins for most > Java build/make types (ant, maven, ivy, gradle). 
Also, add the appropriate > command to the nightly build to generate a report of all known > vulnerabilities in any/all third-party libraries/dependencies that get pulled > in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate > Generating this report nightly/weekly will help inform the project's > development team if any dependent libraries have a reported known > vulnerability. Project teams that keep up with removing vulnerabilities on a > weekly basis will help protect businesses that rely on these open source > components. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24964) Please add OWASP Dependency Check to all component builds (pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-24964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561327#comment-16561327 ] Saisai Shao commented on SPARK-24964: - I'm going to remove the set target version, usually we don't set this for a feature, committers will set a fix version when merged. > Please add OWASP Dependency Check to all comonent builds(pom.xml) > -- > > Key: SPARK-24964 > URL: https://issues.apache.org/jira/browse/SPARK-24964 > Project: Spark > Issue Type: New Feature > Components: Build, MLlib, Spark Core, SparkR >Affects Versions: 2.3.1 > Environment: All development, build, test, environments. > ~/workspace/spark-2.3.1/pom.xml > ~/workspace/spark-2.3.1/assembly/pom.xml > ~/workspace/spark-2.3.1/common/kvstore/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/kvstore/pom.xml > ~/workspace/spark-2.3.1/common/network-common/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-common/pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/network-shuffle/pom.xml > ~/workspace/spark-2.3.1/common/network-yarn/pom.xml > ~/workspace/spark-2.3.1/common/sketch/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/sketch/pom.xml > ~/workspace/spark-2.3.1/common/tags/dependency-reduced-pom.xml > ~/workspace/spark-2.3.1/common/tags/pom.xml > ~/workspace/spark-2.3.1/common/unsafe/pom.xml > ~/workspace/spark-2.3.1/core/pom.xml > ~/workspace/spark-2.3.1/examples/pom.xml > ~/workspace/spark-2.3.1/external/docker-integration-tests/pom.xml > ~/workspace/spark-2.3.1/external/flume/pom.xml > ~/workspace/spark-2.3.1/external/flume-assembly/pom.xml > ~/workspace/spark-2.3.1/external/flume-sink/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-10-sql/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8/pom.xml > ~/workspace/spark-2.3.1/external/kafka-0-8-assembly/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl/pom.xml > ~/workspace/spark-2.3.1/external/kinesis-asl-assembly/pom.xml > ~/workspace/spark-2.3.1/external/spark-ganglia-lgpl/pom.xml > ~/workspace/spark-2.3.1/graphx/pom.xml > ~/workspace/spark-2.3.1/hadoop-cloud/pom.xml > ~/workspace/spark-2.3.1/launcher/pom.xml > ~/workspace/spark-2.3.1/mllib/pom.xml > ~/workspace/spark-2.3.1/mllib-local/pom.xml > ~/workspace/spark-2.3.1/repl/pom.xml > ~/workspace/spark-2.3.1/resource-managers/kubernetes/core/pom.xml > ~/workspace/spark-2.3.1/resource-managers/mesos/pom.xml > ~/workspace/spark-2.3.1/resource-managers/yarn/pom.xml > ~/workspace/spark-2.3.1/sql/catalyst/pom.xml > ~/workspace/spark-2.3.1/sql/core/pom.xml > ~/workspace/spark-2.3.1/sql/hive/pom.xml > ~/workspace/spark-2.3.1/sql/hive-thriftserver/pom.xml > ~/workspace/spark-2.3.1/streaming/pom.xml > ~/workspace/spark-2.3.1/tools/pom.xml >Reporter: Albert Baker >Priority: Major > Labels: build, easy-fix, security > Original Estimate: 3h > Remaining Estimate: 3h > > OWASP DC makes an outbound REST call to MITRE Common Vulnerabilities & > Exposures (CVE) to perform a lookup for each dependant .jar to list any/all > known vulnerabilities for each jar. This step is needed because a manual > MITRE CVE lookup/check on the main component does not include checking for > vulnerabilities in dependant libraries. 
> OWASP Dependency Check: > https://www.owasp.org/index.php/OWASP_Dependency_Check has plug-ins for most > Java build/make types (ant, maven, ivy, gradle). Also, add the appropriate > command to the nightly build to generate a report of all known > vulnerabilities in any/all third-party libraries/dependencies that get pulled > in. Example: mvn -Powasp -Dtest=false -DfailIfNoTests=false clean aggregate > Generating this report nightly/weekly will help inform the project's > development team if any dependent libraries have a reported known > vulnerability. Project teams that keep up with removing vulnerabilities on a > weekly basis will help protect businesses that rely on these open source > components. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16556760#comment-16556760 ] Saisai Shao commented on SPARK-24867: - I see, thanks! Please let me know when the JIRA is opened. > Add AnalysisBarrier to DataFrameWriter > --- > > Key: SPARK-24867 > URL: https://issues.apache.org/jira/browse/SPARK-24867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.2 > > > {code} > val udf1 = udf({(x: Int, y: Int) => x + y}) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1($"a", udf1($"a", lit(10)))) > df.cache() > df.write.saveAsTable("t") > df.write.saveAsTable("t1") > {code} > Cache is not being used because the plans do not match the cached plan. > This is a regression caused by the changes we made in AnalysisBarrier, since > not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
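As an illustrative aside (not part of the fix in this ticket): one rough way to sanity-check the caching behaviour described above is to look at the physical plan. The SparkSession setup below is a placeholder, and inspecting explain() output is just one quick check; when the cached plan is picked up, the plan contains an InMemoryRelation/InMemoryTableScan node, and when the regression hits, the UDF calls appear again instead.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().appName("cache-check").master("local[*]").getOrCreate()
import spark.implicits._

val udf1 = udf((x: Int, y: Int) => x + y)
val df = spark.range(0, 3).toDF("a")
  .withColumn("b", udf1($"a", udf1($"a", lit(10))))
df.cache()
df.count()    // materialize the cached data

// Look for InMemoryRelation / InMemoryTableScan in the printed plan.
df.explain()
{code}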
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555080#comment-16555080 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves] [~irashid], thanks a lot for your comments. Currently in my design I don't insert a specific stage boundary for different resources; the stage boundary is still the same (by shuffle or by result), so {{withResources}} is not an eval() action which triggers a stage. Instead, it just adds a resource hint to the RDD. This means RDDs with different resource requirements in one stage may have conflicts. For example, in {{rdd1.withResources.mapPartitions \{ xxx \}.withResources.mapPartitions \{ xxx \}.collect}}, the resources of rdd1 may differ from those of the mapped RDD, so currently what I can think of is: 1. always pick the latter, with a warning log to say that multiple different resources in one stage is illegal. 2. fail the stage, with a warning log to say that multiple different resources in one stage is illegal. 3. merge conflicts with the maximum resource needs. For example, if rdd1 requires 3 gpus per task and rdd2 requires 4 gpus per task, then the merged requirement would be 4 gpus per task. (This is the high-level description; the details will be per-partition based merging) [chosen]. Take join for example, where rdd1 and rdd2 may have different resource requirements, and the joined RDD will potentially have other resource requirements. For example: {code} val rddA = rdd.mapPartitions().withResources val rddB = rdd.mapPartitions().withResources val rddC = rddA.join(rddB).withResources rddC.collect() {code} Here these 3 {{withResources}} may have different requirements. Since {{rddC}} runs in a different stage, there's no need to merge its resource requirements with the others. But {{rddA}} and {{rddB}} run in the same stage with different tasks (partitions). So the merging strategy I'm thinking of is per partition: tasks running on partitions from {{rddA}} will use the resources specified by {{rddA}}, and likewise for {{rddB}}. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. 
> > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
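As an illustrative aside, option 3 above ("merge conflicts with the maximum resource needs") can be modelled with a tiny standalone sketch. The ResourceRequest type and mergeRequests function below are invented for illustration only; withResources itself is part of the proposal under discussion and does not exist in released Spark versions.
{code}
// Toy model of the "take the maximum requirement" merge strategy: if rdd1 asks
// for 3 GPUs per task and the mapped RDD asks for 4, the stage-level request is 4.
case class ResourceRequest(gpusPerTask: Int)

def mergeRequests(reqs: Seq[ResourceRequest]): ResourceRequest =
  ResourceRequest(reqs.map(_.gpusPerTask).max)

assert(mergeRequests(Seq(ResourceRequest(3), ResourceRequest(4))) == ResourceRequest(4))
{code}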
[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554986#comment-16554986 ] Saisai Shao commented on SPARK-24867: - [~smilegator] what's the ETA of this issue? > Add AnalysisBarrier to DataFrameWriter > --- > > Key: SPARK-24867 > URL: https://issues.apache.org/jira/browse/SPARK-24867 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > {code} > val udf1 = udf({(x: Int, y: Int) => x + y}) > val df = spark.range(0, 3).toDF("a") > .withColumn("b", udf1($"a", udf1($"a", lit(10)))) > df.cache() > df.write.saveAsTable("t") > df.write.saveAsTable("t1") > {code} > Cache is not being used because the plans do not match the cached plan. > This is a regression caused by the changes we made in AnalysisBarrier, since > not all the Analyzer rules are idempotent. We need to fix it in Spark 2.3.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
[ https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24297: --- Assignee: Imran Rashid > Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB > --- > > Key: SPARK-24297 > URL: https://issues.apache.org/jira/browse/SPARK-24297 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Any network request which does not use stream-to-disk that is sending over > 2GB is doomed to fail, so we might as well at least set the default value of > spark.maxRemoteBlockSizeFetchToMem to something < 2GB. > It probably makes sense to set it to something even lower still, but that > might require more careful testing; this is a totally safe first step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
[ https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24297. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21474 [https://github.com/apache/spark/pull/21474] > Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB > --- > > Key: SPARK-24297 > URL: https://issues.apache.org/jira/browse/SPARK-24297 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Shuffle, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Any network request which does not use stream-to-disk that is sending over > 2GB is doomed to fail, so we might as well at least set the default value of > spark.maxRemoteBlockSizeFetchToMem to something < 2GB. > It probably makes sense to set it to something even lower still, but that > might require more careful testing; this is a totally safe first step. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
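For operators who prefer to pin this threshold explicitly rather than rely on the default, it can be supplied like any other Spark configuration. The value below is only an illustrative example, not the default chosen by this change:
{code}
import org.apache.spark.SparkConf

// Fetch remote blocks larger than ~200 MB to disk instead of memory
// (illustrative value; choose a threshold that suits your environment).
val conf = new SparkConf()
  .set("spark.maxRemoteBlockSizeFetchToMem", "200m")
{code}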
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553679#comment-16553679 ] Saisai Shao commented on SPARK-24615: - [~Tagar] yes! > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24594) Introduce metrics for YARN executor allocation problems
[ https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24594. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21635 [https://github.com/apache/spark/pull/21635] > Introduce metrics for YARN executor allocation problems > > > Key: SPARK-24594 > URL: https://issues.apache.org/jira/browse/SPARK-24594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > > Within SPARK-16630 it come up to introduce metrics for YARN executor > allocation problems. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24594) Introduce metrics for YARN executor allocation problems
[ https://issues.apache.org/jira/browse/SPARK-24594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24594: --- Assignee: Attila Zsolt Piros > Introduce metrics for YARN executor allocation problems > > > Key: SPARK-24594 > URL: https://issues.apache.org/jira/browse/SPARK-24594 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 2.4.0 > > > Within SPARK-16630 it come up to introduce metrics for YARN executor > allocation problems. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552233#comment-16552233 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves], what you mentioned above is also what we have been thinking about and trying to figure out a way to solve (this problem also exists in barrier execution). From the user's point of view, specifying resources through the RDD is the only feasible way I can currently think of, though the resource is bound to the stage/task, not a particular RDD. This means a user could specify resources for different RDDs in a single stage, while Spark can only use one resource requirement within that stage. This will bring out several problems, as you mentioned: *Specify resources to which RDD* For example, {{rddA.withResources.mapPartitions \{ xxx \}.collect()}} is no different from {{rddA.mapPartitions \{ xxx \}.withResources.collect()}}, since all the RDDs are executed in the same stage. So in the current design, no matter whether the resource is specified on {{rddA}} or on the mapped RDD, the result is the same. *One-to-one dependency RDDs with different resources* For example {{rddA.withResources.mapPartitions \{ xxx \}.withResources.collect()}}: here, assuming the resource requests for {{rddA}} and the mapped RDD are different, since they're running in a single stage we should fix such a conflict. *Multiple-dependency RDDs with different resources* For example: {code} val rddA = rdd.withResources.mapPartitions() val rddB = rdd.withResources.mapPartitions() val rddC = rddA.join(rddB) {code} If the resources of {{rddA}} are different from those of {{rddB}}, then we should also fix such conflicts. Previously I proposed to use the largest resource requirement to satisfy all the needs, but it may also cause resource waste; [~mengxr] mentioned to set/merge resources per partition to avoid the waste. In the meantime, if there were an API exposed to set resources at the stage level, then this problem would not exist, but Spark doesn't expose such APIs to the user; the only thing a user can specify is at the RDD level. I'm still thinking of a good way to fix it. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. 
It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
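To make the per-partition idea in the comment above concrete, here is a small standalone sketch in plain Scala (not Spark internals; ResourceRequest and TaskSpec are invented for illustration). Each task simply carries the request of the RDD its partition belongs to, so the requests of rddA and rddB never have to be merged even though their tasks run in the same joined stage.
{code}
case class ResourceRequest(gpusPerTask: Int)
case class TaskSpec(rddName: String, partition: Int, resources: ResourceRequest)

// Hypothetical per-RDD requests for the join example above.
val requests = Map("rddA" -> ResourceRequest(3), "rddB" -> ResourceRequest(4))

// Build one task per partition, keeping the request of the owning RDD.
val tasks = for {
  (rdd, req) <- requests.toSeq
  p          <- 0 until 2
} yield TaskSpec(rdd, p, req)

tasks.foreach(t => println(s"${t.rddName} partition ${t.partition} -> ${t.resources.gpusPerTask} GPU(s)"))
{code}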
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550266#comment-16550266 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves] I'm still not sure how to handle memory per stage. Unlike MR, Spark shares the task runtime in a single JVM, I'm not sure how to control the memory usage within the JVM. Are you suggesting the similar approach like using GPU, when memory requirement cannot be satisfied, release the current executors and requesting new executors by dynamic resource allocation? > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550182#comment-16550182 ] Saisai Shao commented on SPARK-24723: - Hi [~mengxr], I don't think YARN has such a feature to configure password-less SSH on all containers. YARN itself doesn't rely on SSH, and in our deployment (Ambari), we don't use password-less SSH. {quote}And does container by default run sshd? If not, which process is responsible for starting/terminating the daemon? {quote} If the container is not dockerized, it will share the system's sshd, and it is the system's responsibility to start/terminate this daemon. If the container is dockerized, I think the docker container should be responsible for starting sshd (IIUC). Maybe we should check whether sshd is started before starting the MPI job; if sshd is not started, we simply cannot run the MPI job, no matter who is responsible for the sshd daemon. [~leftnoteasy] might have some thoughts, since he is the originator of mpich2-yarn. > Discuss necessary info and access in barrier mode + YARN > > > Key: SPARK-24723 > URL: https://issues.apache.org/jira/browse/SPARK-24723 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Saisai Shao >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + YARN. There were some > past attempts from the Hadoop community. So we should find someone with good > knowledge to lead the discussion here. > > Requirements: > * understand how to set up YARN to run MPI job as a YARN application > * figure out how to do it with Spark w/ Barrier -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
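The "check if sshd is started before starting the MPI job" idea from the comment above could, for instance, boil down to a reachability probe like the sketch below. This is purely illustrative (a real check would have to run on every host participating in the barrier stage), not an implementation proposed in this ticket.
{code}
import java.net.{InetSocketAddress, Socket}

// Return true if something is listening on the SSH port of the given host.
def sshdReachable(host: String, timeoutMs: Int = 2000): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, 22), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}

println(s"sshd reachable on localhost: ${sshdReachable("localhost")}")
{code}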
[jira] [Assigned] (SPARK-24195) sc.addFile for local:/ path is broken
[ https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24195: --- Assignee: Li Yuanjian > sc.addFile for local:/ path is broken > - > > Key: SPARK-24195 > URL: https://issues.apache.org/jira/browse/SPARK-24195 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung >Assignee: Li Yuanjian >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > In changing SPARK-6300 > https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2 > essentially the change to > new File(path).getCanonicalFile.toURI.toString > breaks when path is local:, as java.io.File doesn't handle it. > eg. > new > File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString > res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24195) sc.addFile for local:/ path is broken
[ https://issues.apache.org/jira/browse/SPARK-24195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24195. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21533 [https://github.com/apache/spark/pull/21533] > sc.addFile for local:/ path is broken > - > > Key: SPARK-24195 > URL: https://issues.apache.org/jira/browse/SPARK-24195 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Felix Cheung >Priority: Minor > Labels: starter > Fix For: 2.4.0 > > > In changing SPARK-6300 > https://github.com/apache/spark/commit/00e730b94cba1202a73af1e2476ff5a44af4b6b2 > essentially the change to > new File(path).getCanonicalFile.toURI.toString > breaks when path is local:, as java.io.File doesn't handle it. > eg. > new > File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString > res1: String = file:/user/anotheruser/local:/home/user/demo/logger.config -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
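To illustrate the failure mode quoted above and the scheme-aware direction such a fix has to take, the standalone sketch below shows why java.io.File mangles a local: URI while a simple scheme check leaves it alone. It is illustrative only, not the code from pull request 21533.
{code}
import java.io.File
import java.net.URI

// Only canonicalise with java.io.File when the path has no URI scheme;
// otherwise keep the URI (e.g. local:, hdfs:) untouched.
def toCanonicalUriString(path: String): String = {
  val uri = new URI(path)
  if (uri.getScheme == null) new File(path).getCanonicalFile.toURI.toString
  else path
}

// java.io.File treats "local:" as part of a relative path and prepends the cwd:
println(new File("local:///home/user/demo/logger.config").getCanonicalFile.toURI.toString)
// The scheme-aware version keeps the local: URI as-is:
println(toCanonicalUriString("local:///home/user/demo/logger.config"))
{code}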
[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16550167#comment-16550167 ] Saisai Shao commented on SPARK-24307: - Issue resolved by pull request 21440 [https://github.com/apache/spark/pull/21440] > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at 
io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in
[jira] [Resolved] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24307. - Resolution: Fixed Assignee: Imran Rashid > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > 
io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in memory). > A drawback to this approach is that blocks that are cached
[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24307: Fix Version/s: 2.4.0 > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at 
io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in memory). > A drawback to this approach is that blocks that are cached in memory as > deserialized values would need to have the
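As a back-of-the-envelope illustration of why a chunked representation sidesteps the 2 GB limit described above, the standalone sketch below just splits a large logical payload into pieces that each fit in a single ByteBuf (whose indices are Ints). It is not Spark's actual ChunkedByteBuffer or FileRegion implementation.
{code}
// Split a payload of `total` bytes into chunk sizes below the Int ceiling.
def chunkSizes(total: Long, maxChunk: Long = Int.MaxValue.toLong): Seq[Long] = {
  require(maxChunk > 0)
  Iterator.iterate(total)(_ - maxChunk)
    .takeWhile(_ > 0)
    .map(remaining => math.min(remaining, maxChunk))
    .toSeq
}

// A 5 GB in-memory block becomes three chunks, each under 2 GB.
println(chunkSizes(5L * 1024 * 1024 * 1024).mkString(", "))
{code}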
[jira] [Updated] (SPARK-24037) stateful operators
[ https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24037: Fix Version/s: (was: 2.4.0) > stateful operators > -- > > Key: SPARK-24037 > URL: https://issues.apache.org/jira/browse/SPARK-24037 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > pointer to https://issues.apache.org/jira/browse/SPARK-24036 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24037) stateful operators
[ https://issues.apache.org/jira/browse/SPARK-24037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24037: Fix Version/s: 2.4.0 > stateful operators > -- > > Key: SPARK-24037 > URL: https://issues.apache.org/jira/browse/SPARK-24037 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > pointer to https://issues.apache.org/jira/browse/SPARK-24036 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549276#comment-16549276 ] Saisai Shao commented on SPARK-24615: - Thanks [~tgraves] for the suggestion. {quote}Once I get to the point I want to do the ML I want to ask for the gpu's as well as ask for more memory during that stage because I didn't need more before this stage for all the etl work. I realize you already have executors, but ideally spark with the cluster manager could potentially release the existing ones and ask for new ones with those requirements. {quote} Yes, I already discussed this with my colleague offline; this is a valid scenario, but I think to achieve it we would have to change the current dynamic resource allocation mechanism. Currently I have marked this as a Non-Goal in this proposal and only focus on static resource requesting (--executor-cores, --executor-gpus). I think we should support it later. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
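For context, the "static resource requesting" mentioned in the comment could look roughly like the snippet below. Note that spark.executor.gpus (like the --executor-gpus flag) is a hypothetical name from the proposal discussion, not an existing Spark 2.x configuration.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "4")  // existing setting
  .set("spark.executor.gpus", "1")   // hypothetical key, for illustration only
{code}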
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548662#comment-16548662 ] Saisai Shao commented on SPARK-24615: - Hi [~tgraves], I'm rewriting the design doc based on the comments mentioned above, so I temporarily made it inaccessible; sorry about that, I will reopen it. I think it is hard to control memory usage per stage/task, because tasks run in an executor that is shared within a single JVM. For CPU, yes, I think we can do it, but I'm not sure about its usage scenario. For the requirement of using different types of machines, what I can think of is leveraging dynamic resource allocation. For example, if a user wants to run some MPI jobs with barrier enabled, then Spark could allocate some new executors with accelerator resources via the cluster manager (for example using node labels if it is running on YARN). But I will not target this as a goal in this design, since a) it is currently a non-goal for the barrier scheduler; b) it makes the design too complex, and it would be better to separate it into another piece of work. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547263#comment-16547263 ] Saisai Shao commented on SPARK-24615: - Sure, I will also add it as Xiangrui also suggested the same concern. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546590#comment-16546590 ] Saisai Shao commented on SPARK-24615: - Yes, currently the user is responsible for asking yarn to get resources like GPU, for example like {{--num-gpus}}. Yes, I agree with you, the concerns you mentioned above is valid. But currently the design of this Jira only targets to the task level scheduling with accelerator resources already available. It assumes that accelerator resources is already got by executor and reported back to driver. Driver will schedule the tasks based on the available resources. For Spark to communicate with cluster manager to request other resources like GPU, currently it is not covered in this design doc. Xiangrui mentioned that Spark to communicate with cluster manager should also be covered in this SPIP, so I'm still under drafting. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545962#comment-16545962 ] Saisai Shao commented on SPARK-24615: - Sorry [~tgraves] for the late response. Yes, when requesting executors, the user should specify whether accelerators are required. If no satisfying accelerators are available, the job will stay pending or fail to launch. > Accelerator-aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Labels: Hydrogen, SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544386#comment-16544386 ] Saisai Shao edited comment on SPARK-24804 at 7/15/18 1:29 AM: -- Please don't set target version or fix version. Committers will help to set when this issue is resolved. I'm removing the target version of this Jira to avoid blocking the release of 2.3.2 was (Author: jerryshao): Please don't set target version or fix version. Committers will help to set when this issue is resolved. > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544386#comment-16544386 ] Saisai Shao commented on SPARK-24804: - Please don't set target version or fix version. Committers will help to set when this issue is resolved. > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24804: Priority: Trivial (was: Minor) > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24804: Target Version/s: (was: 2.3.2) > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24804) There are duplicate words in the title in the DatasetSuite
[ https://issues.apache.org/jira/browse/SPARK-24804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24804: Fix Version/s: (was: 2.3.2) > There are duplicate words in the title in the DatasetSuite > -- > > Key: SPARK-24804 > URL: https://issues.apache.org/jira/browse/SPARK-24804 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: hantiantian >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.
[ https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539635#comment-16539635 ] Saisai Shao commented on SPARK-24781: - I see. I will wait for this before cutting a new 2.3.2 RC release > Using a reference from Dataset in Filter/Sort might not work. > - > > Key: SPARK-24781 > URL: https://issues.apache.org/jira/browse/SPARK-24781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takuya Ueshin >Priority: Blocker > > When we use a reference from {{Dataset}} in {{filter}} or {{sort}}, which was > not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g., > {code:scala} > val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") > df.select(df("name")).filter(df("id") === 0).show() > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing > from name#5 in operator !Filter (id#6 = 0).;; > !Filter (id#6 = 0) >+- AnalysisBarrier > +- Project [name#5] > +- Project [_1#2 AS name#5, _2#3 AS id#6] > +- LocalRelation [_1#2, _2#3] > {noformat} > If we use {{col}} instead, it works: > {code:scala} > val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") > df.select(col("name")).filter(col("id") === 0).show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24781) Using a reference from Dataset in Filter/Sort might not work.
[ https://issues.apache.org/jira/browse/SPARK-24781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539630#comment-16539630 ] Saisai Shao commented on SPARK-24781: - Thanks Felix. Does this have to be in 2.3.2? [~ueshin] > Using a reference from Dataset in Filter/Sort might not work. > - > > Key: SPARK-24781 > URL: https://issues.apache.org/jira/browse/SPARK-24781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takuya Ueshin >Priority: Blocker > > When we use a reference from {{Dataset}} in {{filter}} or {{sort}}, which was > not used in the prior {{select}}, an {{AnalysisException}} occurs, e.g., > {code:scala} > val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") > df.select(df("name")).filter(df("id") === 0).show() > {code} > {noformat} > org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#6 missing > from name#5 in operator !Filter (id#6 = 0).;; > !Filter (id#6 = 0) >+- AnalysisBarrier > +- Project [name#5] > +- Project [_1#2 AS name#5, _2#3 AS id#6] > +- LocalRelation [_1#2, _2#3] > {noformat} > If we use {{col}} instead, it works: > {code:scala} > val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id") > df.select(col("name")).filter(col("id") === 0).show() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
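Until the analyzer issue above is fixed, one possible workaround (an illustration only, not something proposed in the ticket) is to keep the referenced column in the projection and drop it afterwards:
{code:scala}
// Workaround sketch (not from the ticket): keep `id` in the projection so the filter
// can resolve it, then drop it from the final result. Assumes a spark-shell context
// with implicits in scope, as in the snippets above.
val df = Seq(("test1", 0), ("test2", 1)).toDF("name", "id")
df.select(df("name"), df("id"))
  .filter(df("id") === 0)
  .drop("id")
  .show()
{code}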
[jira] [Assigned] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24678: --- Assignee: sharkd tu > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Assignee: sharkd tu >Priority: Major > Fix For: 2.4.0 > > > Currently, `BlockRDD.getPreferredLocations` only get hosts info of blocks, > which results in subsequent schedule level is not better than 'NODE_LOCAL'. > We can just make a small changes, the schedule level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24678) We should use 'PROCESS_LOCAL' first for Spark-Streaming
[ https://issues.apache.org/jira/browse/SPARK-24678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24678. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21658 [https://github.com/apache/spark/pull/21658] > We should use 'PROCESS_LOCAL' first for Spark-Streaming > --- > > Key: SPARK-24678 > URL: https://issues.apache.org/jira/browse/SPARK-24678 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.3.1 >Reporter: sharkd tu >Priority: Major > Fix For: 2.4.0 > > > Currently, `BlockRDD.getPreferredLocations` only get hosts info of blocks, > which results in subsequent schedule level is not better than 'NODE_LOCAL'. > We can just make a small changes, the schedule level can be improved to > 'PROCESS_LOCAL' > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
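For readers unfamiliar with the locality levels mentioned above, the following sketch (assumed helper names, not the exact change in the pull request) shows the difference between host-only preferred locations and executor-scoped ones:
{code:scala}
// Illustration of the idea (not the exact patch): a location string that names only
// the host can at best give NODE_LOCAL, while a string of the form
// "executor_<host>_<executorId>" is treated by the scheduler as an executor-scoped
// location and allows PROCESS_LOCAL when the task lands on that same executor.
case class BlockLocation(host: String, executorId: String)

def hostOnlyLocations(locs: Seq[BlockLocation]): Seq[String] =
  locs.map(_.host)                                      // e.g. "host1"

def executorAwareLocations(locs: Seq[BlockLocation]): Seq[String] =
  locs.map(l => s"executor_${l.host}_${l.executorId}")  // e.g. "executor_host1_3"
{code}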
[jira] [Resolved] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
[ https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24646. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21633 [https://github.com/apache/spark/pull/21633] > Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes > > > Key: SPARK-24646 > URL: https://issues.apache.org/jira/browse/SPARK-24646 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.4.0 > > > In the case of getting tokens via customized {{ServiceCredentialProvider}}, > it is required that {{ServiceCredentialProvider}} be available in local > spark-submit process classpath. In this case, all the configured remote > sources should be forced to download to local. > For the ease of using this configuration, here propose to add wildcard '*' > support to {{spark.yarn.dist.forceDownloadSchemes}}, also clarify the usage > of this configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
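A usage illustration of the setting discussed above, with assumed values rather than anything prescribed by the ticket:
{code:scala}
import org.apache.spark.SparkConf

// Hypothetical usage: with the wildcard, every remote resource (hdfs://, etc.) is
// downloaded to the local spark-submit process before the YARN client starts, so
// classes such as a custom ServiceCredentialProvider are visible on its classpath.
val conf = new SparkConf()
  .set("spark.yarn.dist.forceDownloadSchemes", "*")
// Without the wildcard, schemes have to be listed explicitly, e.g. "hdfs,http,https".
{code}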
[jira] [Assigned] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
[ https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24646: --- Assignee: Saisai Shao > Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes > > > Key: SPARK-24646 > URL: https://issues.apache.org/jira/browse/SPARK-24646 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.4.0 > > > In the case of getting tokens via customized {{ServiceCredentialProvider}}, > it is required that {{ServiceCredentialProvider}} be available in local > spark-submit process classpath. In this case, all the configured remote > sources should be forced to download to local. > For the ease of using this configuration, here propose to add wildcard '*' > support to {{spark.yarn.dist.forceDownloadSchemes}}, also clarify the usage > of this configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565 ] Saisai Shao edited comment on SPARK-24723 at 7/6/18 8:25 AM: - [~mengxr] [~jiangxb1987] There is one solution that handles the password-less SSH problem for all cluster managers in a programmatic way. It is borrowed from the MPI-on-YARN framework [https://github.com/alibaba/mpich2-yarn]. In that framework, before launching the MPI job, the application master (master) generates an SSH private/public key pair and propagates the public key to all the containers (workers); during container start, each container writes the public key to its local authorized_keys file, so the MPI job started from the master node can SSH to all the containers without a password. After the MPI job finishes, all the containers delete this public key from authorized_keys to revert the environment. In our case, we could do something similar: before launching the MPI job, the 0-th task would generate an SSH private/public key pair and propagate the public key to all the barrier tasks (maybe through BarrierTaskContext). The other tasks would receive the public key from the 0-th task and write it to their authorized_keys files (maybe via BarrierTaskContext). After this, password-less SSH is set up and mpirun can be started from the 0-th task without a password. After the MPI job finishes, all the barrier tasks would delete this public key from authorized_keys to revert the environment. The example code is like below:
{code:java}
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk.
  ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // Generate and propagate ssh keys.
    // Wait for keys to be set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos.
    ???
    // Launch the MPI job by calling mpirun.
    ???
  } else {
    // Get and set up the public key.
    // Notify the 0-th task that the public key is set up.
  }
  // Wait until the MPI job finished.
  context.barrier()
  // Delete SSH key and revert the environment.
  // Collect output and return.
  ???
}
{code}
What is your opinion about this solution? was (Author: jerryshao): [~mengxr] [~jiangxb1987] There's one solution to handle password-less SSH problem for all cluster manager in a programming way. This is referred from MPI on YARN framework [https://github.com/alibaba/mpich2-yarn] In this MPI on YARN framework, before launching MPI job, application master (master) will generate ssh private key and public key and then propagate the public key to all the containers (worker), during container start, it will write public key to local authorized_keys file, so after that, MPI job started from master node can ssh with all the containers in password-less manner. After MPI job is finished, all the containers would delete this public key from authorized_keys file to revert the environment. In our case, we could do this in a similar way, before launching MPI job, 0-th task could also generate ssh private key and public key, and then propagate the public keys to all the barrier task (maybe through BarrierTaskContext). For other tasks, they could receive public key from 0-th task and write public key to authorized_keys file (maybe by BarrierTaskContext). After this, password-less ssh is set up, mpirun from 0-th task could be started without password. 
After MPI job is finished, all the barrier tasks could delete this public key from authorized_keys file to revert the environment. The example code is like below: rdd.barrier().mapPartitions { (iter, context) => // Write iter to disk.??? // Wait until all tasks finished writing. context.barrier() // The 0-th task launches an MPI job. if (context.partitionId() == 0) { // generate and propagate ssh keys. // Wait for keys to set up in other tasks. val hosts = context.getTaskInfos().map(_.host) // Set up MPI machine file using host infos. ??? // Launch the MPI job by calling mpirun. ??? } else { // get and setup public key // notify 0-th task that pubic key is setup. } // Wait until the MPI job finished. context.barrier() // Delete SSH key and revert the environment. // Collect output and return.??? } What is your opinion about this solution? > Discuss necessary info and access in barrier mode + YARN > > > Key: SPARK-24723 > URL: https://issues.apache.org/jira/browse/SPARK-24723 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions:
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534565#comment-16534565 ] Saisai Shao commented on SPARK-24723: - [~mengxr] [~jiangxb1987] There's one solution to handle password-less SSH problem for all cluster manager in a programming way. This is referred from MPI on YARN framework [https://github.com/alibaba/mpich2-yarn] In this MPI on YARN framework, before launching MPI job, application master (master) will generate ssh private key and public key and then propagate the public key to all the containers (worker), during container start, it will write public key to local authorized_keys file, so after that, MPI job started from master node can ssh with all the containers in password-less manner. After MPI job is finished, all the containers would delete this public key from authorized_keys file to revert the environment. In our case, we could do this in a similar way, before launching MPI job, 0-th task could also generate ssh private key and public key, and then propagate the public keys to all the barrier task (maybe through BarrierTaskContext). For other tasks, they could receive public key from 0-th task and write public key to authorized_keys file (maybe by BarrierTaskContext). After this, password-less ssh is set up, mpirun from 0-th task could be started without password. After MPI job is finished, all the barrier tasks could delete this public key from authorized_keys file to revert the environment. The example code is like below:
rdd.barrier().mapPartitions { (iter, context) =>
  // Write iter to disk.
  ???
  // Wait until all tasks finished writing.
  context.barrier()
  // The 0-th task launches an MPI job.
  if (context.partitionId() == 0) {
    // generate and propagate ssh keys.
    // Wait for keys to set up in other tasks.
    val hosts = context.getTaskInfos().map(_.host)
    // Set up MPI machine file using host infos.
    ???
    // Launch the MPI job by calling mpirun.
    ???
  } else {
    // get and setup public key
    // notify 0-th task that pubic key is setup.
  }
  // Wait until the MPI job finished.
  context.barrier()
  // Delete SSH key and revert the environment.
  // Collect output and return.
  ???
}
What is your opinion about this solution? > Discuss necessary info and access in barrier mode + YARN > > > Key: SPARK-24723 > URL: https://issues.apache.org/jira/browse/SPARK-24723 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + YARN. There were some > past attempts from the Hadoop community. So we should find someone with good > knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
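To make the placeholder comments in the sketch above a little more concrete, here is one possible shape of the SSH key helper steps. The helper names and approach are illustrative assumptions, not part of any Spark or Hadoop API; it simply shells out to ssh-keygen and edits authorized_keys:
{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import scala.sys.process._

// Illustration of the "generate ssh keys" / "setup public key" placeholders above.
object SshKeyHelper {
  // 0-th task: generate a throwaway key pair and return the public key text.
  def generateKeyPair(keyPath: String): String = {
    Seq("ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", keyPath).!
    new String(Files.readAllBytes(Paths.get(keyPath + ".pub")), StandardCharsets.UTF_8)
  }

  // Other tasks: trust the 0-th task's key for the duration of the MPI job.
  def addAuthorizedKey(pubKey: String, authorizedKeys: String): Unit =
    Files.write(Paths.get(authorizedKeys), pubKey.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)

  // After the job: remove the key again to revert the environment.
  def removeAuthorizedKey(pubKey: String, authorizedKeys: String): Unit = {
    val remaining = scala.io.Source.fromFile(authorizedKeys).getLines()
      .filterNot(_.trim == pubKey.trim).mkString("\n") + "\n"
    Files.write(Paths.get(authorizedKeys), remaining.getBytes(StandardCharsets.UTF_8))
  }
}
{code}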
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16534529#comment-16534529 ] Saisai Shao commented on SPARK-24723: - After discussed with Xiangrui offline, resource reservation is not the key focus here. Here the main problem is how to provide necessary information for barrier tasks to start MPI job in a password-less manner. > Discuss necessary info and access in barrier mode + YARN > > > Key: SPARK-24723 > URL: https://issues.apache.org/jira/browse/SPARK-24723 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + YARN. There were some > past attempts from the Hadoop community. So we should find someone with good > knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (LIVY-480) Apache livy and Ajax
[ https://issues.apache.org/jira/browse/LIVY-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved LIVY-480. -- Resolution: Invalid > Apache livy and Ajax > > > Key: LIVY-480 > URL: https://issues.apache.org/jira/browse/LIVY-480 > Project: Livy > Issue Type: Request > Components: API, Interpreter, REPL, Server >Affects Versions: 0.5.0 >Reporter: Melchicédec NDUWAYO >Priority: Major > > Good morning every one, > can someone help me with how to send a post request by ajax to the apache > livy server? I tried my best but I don't see how to do that. I have a > textarea in which user can put his spark code. I want to send that code to > the apache livy by ajax and get the result. I don't if my question is clear. > Thank you. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
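For reference on the question above: Livy accepts code through its REST API, so any HTTP client (including browser-side Ajax, CORS permitting) can drive it. A minimal sketch, shown in Scala rather than JavaScript and assuming a Livy server at localhost:8998 with an already-created interactive session 0:
{code:scala}
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Assumes a Livy server on localhost:8998 and an existing interactive session 0.
val payload = """{"code": "sc.parallelize(1 to 10).sum()"}"""
val conn = new URL("http://localhost:8998/sessions/0/statements")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
println(s"HTTP ${conn.getResponseCode}")
// Poll GET /sessions/0/statements/<statementId> afterwards to read the result.
{code}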
[jira] [Commented] (SPARK-24723) Discuss necessary info and access in barrier mode + YARN
[ https://issues.apache.org/jira/browse/SPARK-24723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533205#comment-16533205 ] Saisai Shao commented on SPARK-24723: - Hi [~mengxr], I would like to know the goal of this ticket? The goal of barrier scheduler is to offer gang semantics in the task scheduling level, whereas the gang semantics in the YARN level is more regarding to resource level. I discussed with [~leftnoteasy] about the feasibility of supporting gang semantics on YARN. YARN has Reservation System which support gang like semantics (reserve requested resources), but it is not designed for gang. Here is some thoughts about supporting it on YARN [https://docs.google.com/document/d/1OA-iVwuHB8wlzwwlrEHOK6Q2SlKy3-5QEB5AmwMVEUU/edit?usp=sharing], I'm not sure if it aligns your goal of this ticket. > Discuss necessary info and access in barrier mode + YARN > > > Key: SPARK-24723 > URL: https://issues.apache.org/jira/browse/SPARK-24723 > Project: Spark > Issue Type: Story > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > In barrier mode, to run hybrid distributed DL training jobs, we need to > provide users sufficient info and access so they can set up a hybrid > distributed training job, e.g., using MPI. > This ticket limits the scope of discussion to Spark + YARN. There were some > past attempts from the Hadoop community. So we should find someone with good > knowledge to lead the discussion here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24739) PySpark does not work with Python 3.7.0
[ https://issues.apache.org/jira/browse/SPARK-24739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533189#comment-16533189 ] Saisai Shao commented on SPARK-24739: - Do we have to fix it in 2.3.2? I don't think it is even critical. For example Java 9 is out for a long time, but we still don't support Java 9. So this seems not a big problem. > PySpark does not work with Python 3.7.0 > --- > > Key: SPARK-24739 > URL: https://issues.apache.org/jira/browse/SPARK-24739 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.3, 2.2.2, 2.3.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Python 3.7 is released in few days ago and our PySpark does not work. For > example > {code} > sc.parallelize([1, 2]).take(1) > {code} > {code} > File "/.../spark/python/pyspark/rdd.py", line 1343, in __main__.RDD.take > Failed example: > sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3) > Exception raised: > Traceback (most recent call last): > File "/.../3.7/lib/python3.7/doctest.py", line 1329, in __run > compileflags, 1), test.globs) > File "", line 1, in > sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3) > File "/.../spark/python/pyspark/rdd.py", line 1377, in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "/.../spark/python/pyspark/context.py", line 1013, in runJob > sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), > mappedRDD._jrdd, partitions) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task > 0 in stage 143.0 failed 1 times, most recent failure: Lost task 0.0 in stage > 143.0 (TID 688, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/.../spark/python/pyspark/rdd.py", line 1373, in takeUpToNumLeft > yield next(iterator) > StopIteration > The above exception was the direct cause of the following exception: > Traceback (most recent call last): > File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 320, > in main > process() > File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 315, > in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line > 378, in dump_stream > vs = list(itertools.islice(iterator, batch)) > RuntimeError: generator raised StopIteration > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:309) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:449) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:432) > at > org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:263) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149) > at > org.apache.spark.api.python.PythonRDD$$anonfun$3.apply(PythonRDD.scala:149) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071) > at >
[jira] [Commented] (SPARK-24535) Fix java version parsing in SparkR on Windows
[ https://issues.apache.org/jira/browse/SPARK-24535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532145#comment-16532145 ] Saisai Shao commented on SPARK-24535: - Hi [~felixcheung] what's the current status of this JIRA, do you have an ETA about it? > Fix java version parsing in SparkR on Windows > - > > Key: SPARK-24535 > URL: https://issues.apache.org/jira/browse/SPARK-24535 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1, 2.4.0 >Reporter: Shivaram Venkataraman >Assignee: Felix Cheung >Priority: Blocker > > We see errors on CRAN of the form > {code:java} > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > Picked up _JAVA_OPTIONS: -XX:-UsePerfData > -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21) > -- > subscript out of bounds > 1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, > sparkConfig = sparkRTestConfig) at > D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21 > 2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, > sparkExecutorEnvMap, > sparkJars, sparkPackages) > 3: checkJavaVersion() > 4: strsplit(javaVersionFilter[[1]], "[\"]") > {code} > The complete log file is at > http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532143#comment-16532143 ] Saisai Shao commented on SPARK-24530: - Hi [~hyukjin.kwon] what is the current status of this JIRA, do you have an ETA about it? > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529352#comment-16529352 ] Saisai Shao commented on SPARK-24530: - [~hyukjin.kwon], Spark 2.1.3 and 2.2.2 are on vote, can you please fix the issue and leave the comments in the related threads. > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24530) Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[ https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24530: Target Version/s: 2.1.3, 2.2.2, 2.3.2, 2.4.0 (was: 2.4.0) > Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) > and pyspark.ml docs are broken > --- > > Key: SPARK-24530 > URL: https://issues.apache.org/jira/browse/SPARK-24530 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Hyukjin Kwon >Priority: Blocker > Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot > 2018-06-12 at 8.23.29 AM.png, image-2018-06-13-15-15-51-025.png, > pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png > > > I generated python docs from master locally using `make html`. However, the > generated html doc doesn't render class docs correctly. I attached the > screenshot from Spark 2.3 docs and master docs generated on my local. Not > sure if this is because my local setup. > cc: [~dongjoon] Could you help verify? > > The followings are our released doc status. Some recent docs seems to be > broken. > *2.1.x* > (O) > [https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (O) > [https://spark.apache.org/docs/2.1.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.1.2/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.2.x* > (O) > [https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.2.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > *2.3.x* > (O) > [https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] > (X) > [https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (LIVY-479) Enable livy.rsc.launcher.address configuration
[ https://issues.apache.org/jira/browse/LIVY-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525775#comment-16525775 ] Saisai Shao commented on LIVY-479: -- Yes, I've left the comments on the PR. > Enable livy.rsc.launcher.address configuration > -- > > Key: LIVY-479 > URL: https://issues.apache.org/jira/browse/LIVY-479 > Project: Livy > Issue Type: Improvement > Components: RSC >Affects Versions: 0.5.0 >Reporter: Tao Li >Priority: Major > > The current behavior is that we are setting livy.rsc.launcher.address to RPC > server and ignoring whatever is specified in configs. However in some > scenarios there is a need for the user to be able to explicitly configure it. > For example, the IP address for an active Livy server might change over the > time, so rather than using the fixed IP address, we need to specify this > setting to a more generic host name. For this reason, we want to enable the > capability of user side configurations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24418. - Resolution: Fixed Issue resolved by pull request 21495 [https://github.com/apache/spark/pull/21495] > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
[ https://issues.apache.org/jira/browse/SPARK-24646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523022#comment-16523022 ] Saisai Shao commented on SPARK-24646: - Hi [~vanzin], here is a specific example: we have a customized {{ServiceCredentialProvider}} (an HS2 {{ServiceCredentialProvider}}) that lives in our own jar {{foo}}. When the SparkSubmit process launches the YARN client, this {{foo}} jar must be on the SparkSubmit process classpath. If the {{foo}} jar is a local resource added via {{--jars}}, it will be on the classpath; but if it is a remote jar (for example on HDFS), only yarn-client mode downloads it and adds it to the classpath. In yarn-cluster mode it is not downloaded, so this specific HS2 {{ServiceCredentialProvider}} cannot be loaded. When using the spark-submit script, the user can decide whether to pass a remote or a local jar. But in the Livy scenario, Livy only supports remote jars (jars on HDFS), and we only configure it for yarn-cluster mode, so we cannot load this customized {{ServiceCredentialProvider}} there. The fix is to force the jars to be downloaded to the local SparkSubmit process via the "spark.yarn.dist.forceDownloadSchemes" configuration; to make that easier to use, I propose adding wildcard '*' support, which downloads all remote resources without checking the scheme. > Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes > > > Key: SPARK-24646 > URL: https://issues.apache.org/jira/browse/SPARK-24646 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > In the case of getting tokens via customized {{ServiceCredentialProvider}}, > it is required that {{ServiceCredentialProvider}} be available in local > spark-submit process classpath. In this case, all the configured remote > sources should be forced to download to local. > For the ease of using this configuration, here propose to add wildcard '*' > support to {{spark.yarn.dist.forceDownloadSchemes}}, also clarify the usage > of this configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
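A small sketch of the scheme check this wildcard implies (illustrative assumption, not the literal patch):
{code:scala}
// Illustrative only, not the literal patch: a resource is forced to be downloaded
// locally either when '*' is configured or when its scheme is listed explicitly.
// (The real check also skips resources that are already local.)
def shouldDownloadToLocal(scheme: String, forceDownloadSchemes: Seq[String]): Boolean =
  forceDownloadSchemes.contains("*") || forceDownloadSchemes.contains(scheme)

shouldDownloadToLocal("hdfs", Seq("*"))             // true
shouldDownloadToLocal("s3a", Seq("http", "https"))  // false
{code}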
[jira] [Created] (SPARK-24646) Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes
Saisai Shao created SPARK-24646: --- Summary: Support wildcard '*' for to spark.yarn.dist.forceDownloadSchemes Key: SPARK-24646 URL: https://issues.apache.org/jira/browse/SPARK-24646 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Saisai Shao In the case of getting tokens via customized {{ServiceCredentialProvider}}, it is required that {{ServiceCredentialProvider}} be available in local spark-submit process classpath. In this case, all the configured remote sources should be forced to download to local. For the ease of using this configuration, here propose to add wildcard '*' support to {{spark.yarn.dist.forceDownloadSchemes}}, also clarify the usage of this configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Scheduling in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519934#comment-16519934 ] Saisai Shao commented on SPARK-24374: - Hi [~mengxr] [~jiangxb1987] SPARK-24615 would be a part of project hydrogen, I wrote a high level design doc about the implementation, would you please help to review and comment, to see the solution is feasible or not. Thanks a lot. > SPIP: Support Barrier Scheduling in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519931#comment-16519931 ] Saisai Shao commented on HIVE-16391: Gently ping [~hagleitn], would you please help to review the current proposed patch and suggest the next step. Thanks a lot. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24615: Labels: SPIP (was: ) > Accelerator aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Priority: Major > Labels: SPIP > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24615: Description: In the machine learning area, accelerator card (GPU, FPGA, TPU) is predominant compared to CPUs. To make the current Spark architecture to work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule task onto the executors where accelerators are equipped. Current Spark’s scheduler schedules tasks based on the locality of the data plus the available of CPUs. This will introduce some problems when scheduling tasks with accelerators required. # CPU cores are usually more than accelerators on one node, using CPU cores to schedule accelerator required tasks will introduce the mismatch. # In one cluster, we always assume that CPU is equipped in each node, but this is not true of accelerator cards. # The existence of heterogeneous tasks (accelerator required or not) requires scheduler to schedule tasks with a smart way. So here propose to improve the current scheduler to support heterogeneous tasks (accelerator requires or not). This can be part of the work of Project hydrogen. Details is attached in google doc. It doesn't cover all the implementation details, just highlight the parts should be changed. CC [~yanboliang] [~merlintang] was: In the machine learning area, accelerator card (GPU, FPGA, TPU) is predominant compared to CPUs. To make the current Spark architecture to work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule task onto the executors where accelerators are equipped. Current Spark’s scheduler schedules tasks based on the locality of the data plus the available of CPUs. This will introduce some problems when scheduling tasks with accelerators required. # CPU cores are usually more than accelerators on one node, using CPU cores to schedule accelerator required tasks will introduce the mismatch. # In one cluster, we always assume that CPU is equipped in each node, but this is not true of accelerator cards. # The existence of heterogeneous tasks (accelerator required or not) requires scheduler to schedule tasks with a smart way. So here propose to improve the current scheduler to support heterogeneous tasks (accelerator requires or not). This can be part of the work of Project hydrogen. Details is attached in google doc. CC [~yanboliang] [~merlintang] > Accelerator aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Priority: Major > > In the machine learning area, accelerator card (GPU, FPGA, TPU) is > predominant compared to CPUs. To make the current Spark architecture to work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule task onto the executors where > accelerators are equipped. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the available of CPUs. This will introduce some problems when scheduling > tasks with accelerators required. > # CPU cores are usually more than accelerators on one node, using CPU cores > to schedule accelerator required tasks will introduce the mismatch. > # In one cluster, we always assume that CPU is equipped in each node, but > this is not true of accelerator cards. 
> # The existence of heterogeneous tasks (accelerator required or not) > requires scheduler to schedule tasks with a smart way. > So here propose to improve the current scheduler to support heterogeneous > tasks (accelerator requires or not). This can be part of the work of Project > hydrogen. > Details is attached in google doc. It doesn't cover all the implementation > details, just highlight the parts should be changed. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24615) Accelerator aware task scheduling for Spark
[ https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24615: Description: In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant compared to CPUs. To make the current Spark architecture work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule tasks onto the executors that are equipped with accelerators. Current Spark’s scheduler schedules tasks based on the locality of the data plus the availability of CPUs. This will introduce some problems when scheduling tasks that require accelerators. # CPU cores usually outnumber accelerators on one node, so using CPU cores to schedule accelerator-required tasks will introduce a mismatch. # In one cluster, we can always assume that each node is equipped with CPUs, but this is not true of accelerator cards. # The existence of heterogeneous tasks (accelerator required or not) requires the scheduler to schedule tasks in a smart way. So here we propose to improve the current scheduler to support heterogeneous tasks (accelerator required or not). This can be part of the work of Project Hydrogen. Details are attached in the Google doc. CC [~yanboliang] [~merlintang] was: In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant compared to CPUs. To make the current Spark architecture work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule tasks onto the executors that are equipped with accelerators. Current Spark’s scheduler schedules tasks based on the locality of the data plus the availability of CPUs. This will introduce some problems when scheduling tasks that require accelerators. # CPU cores usually outnumber accelerators on one node, so using CPU cores to schedule accelerator-required tasks will introduce a mismatch. # In one cluster, we can always assume that each node is equipped with CPUs, but this is not true of accelerator cards. # The existence of heterogeneous tasks (accelerator required or not) requires the scheduler to schedule tasks in a smart way. So here we propose to improve the current scheduler to support heterogeneous tasks (accelerator required or not). Details are attached in the Google doc. > Accelerator aware task scheduling for Spark > --- > > Key: SPARK-24615 > URL: https://issues.apache.org/jira/browse/SPARK-24615 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Priority: Major > > In the machine learning area, accelerator cards (GPU, FPGA, TPU) are > predominant compared to CPUs. To make the current Spark architecture work > with accelerator cards, Spark itself should understand the existence of > accelerators and know how to schedule tasks onto the executors that are > equipped with accelerators. > Current Spark’s scheduler schedules tasks based on the locality of the data > plus the availability of CPUs. This will introduce some problems when scheduling > tasks that require accelerators. > # CPU cores usually outnumber accelerators on one node, so using CPU cores > to schedule accelerator-required tasks will introduce a mismatch. > # In one cluster, we can always assume that each node is equipped with CPUs, > but this is not true of accelerator cards. > # The existence of heterogeneous tasks (accelerator required or not) > requires the scheduler to schedule tasks in a smart way. 
> So here we propose to improve the current scheduler to support heterogeneous > tasks (accelerator required or not). This can be part of the work of Project > Hydrogen. > Details are attached in the Google doc. > > CC [~yanboliang] [~merlintang] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24615) Accelerator aware task scheduling for Spark
Saisai Shao created SPARK-24615: --- Summary: Accelerator aware task scheduling for Spark Key: SPARK-24615 URL: https://issues.apache.org/jira/browse/SPARK-24615 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Saisai Shao In the machine learning area, accelerator cards (GPU, FPGA, TPU) are predominant compared to CPUs. To make the current Spark architecture work with accelerator cards, Spark itself should understand the existence of accelerators and know how to schedule tasks onto the executors that are equipped with accelerators. Current Spark’s scheduler schedules tasks based on the locality of the data plus the availability of CPUs. This will introduce some problems when scheduling tasks that require accelerators. # CPU cores usually outnumber accelerators on one node, so using CPU cores to schedule accelerator-required tasks will introduce a mismatch. # In one cluster, we can always assume that each node is equipped with CPUs, but this is not true of accelerator cards. # The existence of heterogeneous tasks (accelerator required or not) requires the scheduler to schedule tasks in a smart way. So here we propose to improve the current scheduler to support heterogeneous tasks (accelerator required or not). Details are attached in the Google doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
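[Editor's note] A minimal sketch of the scheduling contract the proposal implies is below: the executor advertises how many accelerators it owns, each task declares how many it needs, and the scheduler places accelerator tasks only where free accelerator slots exist. The configuration keys used here are hypothetical placeholders for illustration only; they are not defined by this issue or by the attached design doc.
{code}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class GpuAwareSubmitSketch {
  public static void main(String[] args) {
    // Hypothetical keys: executors advertise accelerator slots, tasks declare demand.
    SparkConf conf = new SparkConf()
        .setAppName("gpu-aware-sketch")
        .setMaster("local[2]")                            // local mode just to run the sketch
        .set("spark.executor.resource.gpu.amount", "2")   // hypothetical key
        .set("spark.task.resource.gpu.amount", "1");      // hypothetical key

    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Under such a scheme, a stage whose tasks need a GPU would be placed only
      // onto executors with a free GPU slot, independent of how many CPU cores remain.
      long n = sc.parallelize(java.util.Arrays.asList(1, 2, 3, 4)).count();
      System.out.println("count = " + n);
    }
  }
}
{code}
The point of the sketch is the separation of the two numbers: CPU-only tasks keep being scheduled against cores, while accelerator tasks are additionally constrained by the advertised accelerator count, which addresses the mismatch described in point 1 above.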
[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517720#comment-16517720 ] Saisai Shao commented on SPARK-24493: - This BUG is fixed in HDFS-12670, Spark should bump its supported Hadoop3 version to 3.1.1 when released. > Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated HIVE-16391: --- Attachment: HIVE-16391.2.patch > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.2.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511851#comment-16511851 ] Saisai Shao commented on HIVE-16391: I see. I can keep the "core" classifier and use another name. Will update the patch. [~owen.omalley] would you please help to review this patch, since you created a Spark JIRA, or can you please point someone in Hive community to help to review? Thanks a lot. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (LIVY-477) Upgrade Livy Scala version to 2.11.12
[ https://issues.apache.org/jira/browse/LIVY-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved LIVY-477. -- Resolution: Fixed Fix Version/s: 0.6.0 > Upgrade Livy Scala version to 2.11.12 > - > > Key: LIVY-477 > URL: https://issues.apache.org/jira/browse/LIVY-477 > Project: Livy > Issue Type: Improvement > Components: Build >Affects Versions: 0.5.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 0.6.0 > > > Scala versions below 2.11.12 have a CVE > ([https://scala-lang.org/news/security-update-nov17.html]), and Spark will > also upgrade its supported version to 2.11.12. > So here we upgrade Livy's Scala version as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password
[ https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24518: Description: Currently Spark configures passwords in a plaintext way, e.g. putting them in the configuration file or adding them as launch arguments. Sometimes such configurations, like the SSL password, are set by the cluster admin and should not be seen by users, but right now these passwords are world readable to all users. The Hadoop credential provider API supports storing passwords in a secure way, from which Spark could read them securely, so here we propose to add support for using the credential provider API to get passwords. > Using Hadoop credential provider API to store password > -- > > Key: SPARK-24518 > URL: https://issues.apache.org/jira/browse/SPARK-24518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > Currently Spark configures passwords in a plaintext way, e.g. putting them in the > configuration file or adding them as launch arguments. Sometimes such > configurations, like the SSL password, are set by the cluster admin and should > not be seen by users, but right now these passwords are world readable to all > users. > The Hadoop credential provider API supports storing passwords in a secure way, > from which Spark could read them securely, so here we propose to add support for > using the credential provider API to get passwords. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24518) Using Hadoop credential provider API to store password
[ https://issues.apache.org/jira/browse/SPARK-24518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24518: Environment: (was: Current Spark configs password in a plaintext way, like putting in the configuration file or adding as a launch arguments, sometimes such configurations like SSL password is configured by cluster admin, which should not be seen by user, but now this passwords are world readable to all the users. Hadoop credential provider API support storing password in a secure way, in which Spark could read it in a secure way, so here propose to add support of using credential provider API to get password.) > Using Hadoop credential provider API to store password > -- > > Key: SPARK-24518 > URL: https://issues.apache.org/jira/browse/SPARK-24518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24518) Using Hadoop credential provider API to store password
Saisai Shao created SPARK-24518: --- Summary: Using Hadoop credential provider API to store password Key: SPARK-24518 URL: https://issues.apache.org/jira/browse/SPARK-24518 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Environment: Currently Spark configures passwords in a plaintext way, e.g. putting them in the configuration file or adding them as launch arguments. Sometimes such configurations, like the SSL password, are set by the cluster admin and should not be seen by users, but right now these passwords are world readable to all users. The Hadoop credential provider API supports storing passwords in a secure way, from which Spark could read them securely, so here we propose to add support for using the credential provider API to get passwords. Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
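[Editor's note] For context, a minimal sketch of what reading a secret through the Hadoop credential provider API looks like, assuming a JCEKS store has already been created with the {{hadoop credential create}} command. The alias {{spark.ssl.keyStorePassword}} and the store path used here are examples for illustration, not names defined by this issue.
{code}
import org.apache.hadoop.conf.Configuration;

public class CredentialProviderReadSketch {
  public static void main(String[] args) throws Exception {
    // Point Hadoop at a credential store created beforehand with, for example:
    //   hadoop credential create spark.ssl.keyStorePassword \
    //     -provider jceks://file/tmp/spark-secrets.jceks
    // (alias and store path are illustrative only)
    Configuration conf = new Configuration();
    conf.set("hadoop.security.credential.provider.path",
        "jceks://file/tmp/spark-secrets.jceks");

    // getPassword() consults the configured credential providers first and only
    // falls back to a plain config entry, so the secret never has to appear in
    // clear text in a config file or on the command line.
    char[] password = conf.getPassword("spark.ssl.keyStorePassword");
    System.out.println(password == null
        ? "no credential found for alias"
        : "resolved a password of length " + password.length);
  }
}
{code}
Under this proposal, Spark would resolve such settings through {{Configuration.getPassword}} (or an equivalent lookup) instead of reading the raw plaintext configuration value.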
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506023#comment-16506023 ] Saisai Shao commented on SPARK-23534: - Yes, it is a Hadoop 2.8+ issue, but we don't have a 2.8 profile for Spark; instead we have a 3.1 profile, so it will affect our Hadoop 3 build. That's why I also added a link here. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already stepped, or will soon step, into Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24487. - Resolution: Won't Fix > Add support for RabbitMQ. > - > > Key: SPARK-24487 > URL: https://issues.apache.org/jira/browse/SPARK-24487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Michał Jurkiewicz >Priority: Major > > Add support for RabbitMQ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505786#comment-16505786 ] Saisai Shao commented on SPARK-24487: - I'm sure Spark community will not merge this into Spark code base. You can either maintain a package yourself, or contribute to Apache Bahir. > Add support for RabbitMQ. > - > > Key: SPARK-24487 > URL: https://issues.apache.org/jira/browse/SPARK-24487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Michał Jurkiewicz >Priority: Major > > Add support for RabbitMQ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23151: Issue Type: Sub-task (was: Dependency upgrade) Parent: SPARK-23534 > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505783#comment-16505783 ] Saisai Shao commented on SPARK-23151: - I will convert this as a subtask of SPARK-23534. > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505771#comment-16505771 ] Saisai Shao commented on SPARK-23534: - Spark with Hadoop 3 will fail in token renewal for the long-running case (with keytab and principal). > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already stepped, or will soon step, into Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768 ] Saisai Shao edited comment on SPARK-24493 at 6/8/18 6:56 AM: - Adding more background: the issue happens when building Spark with Hadoop 2.8+. In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the renew interval for the HDFS token, but due to the missing service loader file, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue). The related code in Hadoop is:
{code}
private static Class getClassForIdentifier(Text kind) {
  Class cls = null;
  synchronized (Token.class) {
    if (tokenKindMap == null) {
      tokenKindMap = Maps.newHashMap();
      for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
        tokenKindMap.put(id.getKind(), id.getClass());
      }
    }
    cls = tokenKindMap.get(kind);
  }
  if (cls == null) {
    LOG.debug("Cannot find class for token kind " + kind);
    return null;
  }
  return cls;
}
{code}
The problem is: the HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but the service loader descriptor file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the hadoop-hdfs jar. Spark's local submit process/driver process (depending on client or cluster mode) only relies on the hadoop-hdfs-client jar, not the hadoop-hdfs jar. So the ServiceLoader will fail to find the HDFS "DelegationTokenIdentifier" class and return null. The issue is due to the change in HADOOP-6200. Previously we only had build profiles for Hadoop 2.6 and 2.7, so there was no issue. But currently we have a build profile for Hadoop 3.1, so this will fail the token renewal in Hadoop 3.1. This is a Hadoop issue; creating a Spark JIRA to track it and bump the version when the Hadoop side is fixed. was (Author: jerryshao): Adding more background, the issue is happened when building Spark with Hadoop 2.8+. In Spark, we use {{TokenIdentifer}} return by {{decodeIdentifier}} to get renew interval for HDFS token, but due to missing service loader file, Hadoop failed to create {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue). The related code in Hadoop is: private static Class getClassForIdentifier(Text kind) {Class cls = null;synchronized (Token.class) { if (tokenKindMap == null) { tokenKindMap = Maps.newHashMap();for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) \{ tokenKindMap.put(id.getKind(), id.getClass()); } } cls = tokenKindMap.get(kind); }if (cls == null) { LOG.debug("Cannot find class for token kind " + kind); return null; }return cls; } The problem is: The HDFS "DelegationTokenIdentifier" class is in hadoop-hdfs-client jar, but the service loader description file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in hadoop-hdfs jar. Spark local submit process/driver process (depends on client or cluster mode) only relies on hadoop-hdfs-client jar, but not hadoop-hdfs jar. So the ServiceLoader will be failed to find HDFS "DelegationTokenIdentifier" class and return null. The issue is due to the change in HADOOP-6200. Previously we only have building profile for Hadoop 2.6 and 2.7, so there's no issue here. But currently we has a building profile for Hadoop 3.1, so this will fail the token renew in Hadoop 3.1. The is a Hadoop issue, creating a Spark Jira to track this issue and bump the version when Hadoop side is fixed. 
> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, ,
[jira] [Commented] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505768#comment-16505768 ] Saisai Shao commented on SPARK-24493: - Adding more background: the issue happens when building Spark with Hadoop 2.8+. In Spark, we use the {{TokenIdentifier}} returned by {{decodeIdentifier}} to get the renew interval for the HDFS token, but due to the missing service loader file, Hadoop fails to create the {{TokenIdentifier}} and returns *null*, which leads to an unexpected renew interval (Long.MaxValue). The related code in Hadoop is:
{code}
private static Class getClassForIdentifier(Text kind) {
  Class cls = null;
  synchronized (Token.class) {
    if (tokenKindMap == null) {
      tokenKindMap = Maps.newHashMap();
      for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
        tokenKindMap.put(id.getKind(), id.getClass());
      }
    }
    cls = tokenKindMap.get(kind);
  }
  if (cls == null) {
    LOG.debug("Cannot find class for token kind " + kind);
    return null;
  }
  return cls;
}
{code}
The problem is: the HDFS "DelegationTokenIdentifier" class is in the hadoop-hdfs-client jar, but the service loader descriptor file "META-INF/services/org.apache.hadoop.security.token.TokenIdentifier" is in the hadoop-hdfs jar. Spark's local submit process/driver process (depending on client or cluster mode) only relies on the hadoop-hdfs-client jar, not the hadoop-hdfs jar. So the ServiceLoader will fail to find the HDFS "DelegationTokenIdentifier" class and return null (see the standalone probe sketched after this message). The issue is due to the change in HADOOP-6200. Previously we only had build profiles for Hadoop 2.6 and 2.7, so there was no issue. But currently we have a build profile for Hadoop 3.1, so this will fail the token renewal in Hadoop 3.1. This is a Hadoop issue; creating a Spark JIRA to track it and bump the version when the Hadoop side is fixed. > Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. 
I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at >
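[Editor's note] To make the failure mode described above concrete, here is a small standalone probe, not part of Spark or Hadoop, that performs the same ServiceLoader lookup that {{Token.decodeIdentifier}} relies on. Run it with only hadoop-common and hadoop-hdfs-client on the classpath and the HDFS token kind is not found; add hadoop-hdfs (which ships the META-INF/services entry) and it is. The class and kind names come from the discussion above; everything else is illustrative.
{code}
import java.util.ServiceLoader;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.token.TokenIdentifier;

public class TokenIdentifierProbe {
  public static void main(String[] args) {
    // Token.decodeIdentifier() resolves the identifier class through this same
    // ServiceLoader scan. If no jar on the classpath provides a
    // META-INF/services/org.apache.hadoop.security.token.TokenIdentifier entry
    // for the kind, the scan finds nothing, decodeIdentifier() returns null, and
    // the renew-interval calculation described above silently goes wrong.
    Text hdfsKind = new Text("HDFS_DELEGATION_TOKEN");
    boolean found = false;
    for (TokenIdentifier id : ServiceLoader.load(TokenIdentifier.class)) {
      if (id.getKind().equals(hdfsKind)) {
        found = true;
        System.out.println("registered identifier: " + id.getClass().getName());
      }
    }
    if (!found) {
      // With only hadoop-hdfs-client on the classpath this branch is taken:
      // the DelegationTokenIdentifier class exists, but no service file registers it.
      System.out.println("no TokenIdentifier registered for kind " + hdfsKind);
    }
  }
}
{code}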
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 (was: Kerberos Ticket Renewal is failing in long running Spark job) > Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3 > -- > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505760#comment-16505760 ] Saisai Shao commented on SPARK-24487: - You can add it in your own package with the DataSource API. There's no need to add it to the Spark code base. > Add support for RabbitMQ. > - > > Key: SPARK-24487 > URL: https://issues.apache.org/jira/browse/SPARK-24487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Michał Jurkiewicz >Priority: Major > > Add support for RabbitMQ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Component/s: Spark Core > Kerberos Ticket Renewal is failing in long running Spark job > > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Component/s: (was: Spark Core) YARN > Kerberos Ticket Renewal is failing in long running Spark job > > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
[jira] [Updated] (SPARK-24493) Kerberos Ticket Renewal is failing in long running Spark job
[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-24493: Priority: Major (was: Blocker) > Kerberos Ticket Renewal is failing in long running Spark job > > > Key: SPARK-24493 > URL: https://issues.apache.org/jira/browse/SPARK-24493 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.3.0 >Reporter: Asif M >Priority: Major > > Kerberos Ticket Renewal is failing on long running spark job. I have added > below 2 kerberos properties in the HDFS configuration and ran a spark > streaming job > ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]) > {noformat} > dfs.namenode.delegation.token.max-lifetime=180 (30min) > dfs.namenode.delegation.token.renew-interval=90 (15min) > {noformat} > > Spark Job failed at 15min with below error: > {noformat} > 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at > /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) > failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage > 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 > (TID 7290, , executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, > renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, > sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 > 18:56:51,276+ expected renewal time: 2018-06-04 18:56:13,875+ > at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499) > at org.apache.hadoop.ipc.Client.call(Client.java:1445) > at org.apache.hadoop.ipc.Client.call(Client.java:1355) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845) > at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998) > at > org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950) > at > org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189) > at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:186) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141) > at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at
[jira] [Commented] (SPARK-24487) Add support for RabbitMQ.
[ https://issues.apache.org/jira/browse/SPARK-24487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505608#comment-16505608 ] Saisai Shao commented on SPARK-24487: - What's the use case and purpose of integrating RabbitMQ with Spark? > Add support for RabbitMQ. > - > > Key: SPARK-24487 > URL: https://issues.apache.org/jira/browse/SPARK-24487 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Michał Jurkiewicz >Priority: Major > > Add support for RabbitMQ. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505607#comment-16505607 ] Saisai Shao commented on HIVE-16391: Any comment [~vanzin] [~ste...@apache.org]? > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503188#comment-16503188 ] Saisai Shao commented on HIVE-16391: Uploaded a new patch [^HIVE-16391.1.patch] to use the solution mentioned by Marcelo, simply by adding two new Maven modules and renaming the original "hive-exec" module. One added module is the new "hive-exec", which is compatible with existing Hive; the other added module, "hive-exec-spark", is specifically for Spark. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated HIVE-16391: --- Attachment: HIVE-16391.1.patch > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.1.patch, HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502976#comment-16502976 ] Saisai Shao commented on HIVE-16391: [~vanzin] one problem about your proposed solution: hive-exec test jar is not valid anymore, because we changed the artifact name for the current "hive-exec" pom. This might affect the user who relies on this test jar. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502756#comment-16502756 ] Saisai Shao commented on HIVE-16391: {quote}The problem with that is that it changes the meaning of Hive's artifacts, so anybody currently importing hive-exec would see a breakage, and that's probably not desired. {quote} This might not be acceptable to the Hive community, because it would break current users as you mentioned. As [~joshrosen] mentioned, Spark wants the hive-exec jar which shades kryo and protobuf-java, not a pure non-shaded jar. {quote}Another option is to change the artifact name of the current "hive-exec" pom. Then you'd publish the normal jar under the new artifact name, then have a separate module that imports that jar, shades dependencies, and publishes the result as "hive-exec". That would maintain compatibility with existing artifacts. {quote} I can try this approach, but it is not a small change for Hive, and I'm not sure the Hive community will accept such an approach (at least for branch 1.2). > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated HIVE-16391: --- Fix Version/s: 1.2.3 Affects Version/s: 1.2.2 Attachment: HIVE-16391.patch Status: Patch Available (was: Open) > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Affects Versions: 1.2.2 >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > Fix For: 1.2.3 > > Attachments: HIVE-16391.patch > > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501667#comment-16501667 ] Saisai Shao commented on HIVE-16391: I see, thanks. Will upload the patch. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Reporter: Reynold Xin >Assignee: Saisai Shao >Priority: Major > Labels: pull-request-available > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501561#comment-16501561 ] Saisai Shao commented on HIVE-16391: Seems there's no permission for me to upload a file. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Reporter: Reynold Xin >Priority: Major > Labels: pull-request-available > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501561#comment-16501561 ] Saisai Shao edited comment on HIVE-16391 at 6/5/18 10:15 AM: - Seems there's no permission for me to upload a file. There's no such button. was (Author: jerryshao): Seems there's no permission for me to upload a file. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Reporter: Reynold Xin >Priority: Major > Labels: pull-request-available > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501415#comment-16501415 ] Saisai Shao commented on HIVE-16391: I'm not sure whether submitting a PR is the right way to get a review in the Hive community; waiting for feedback. > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Reporter: Reynold Xin >Priority: Major > Labels: pull-request-available > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-16391) Publish proper Hive 1.2 jars (without including all dependencies in uber jar)
[ https://issues.apache.org/jira/browse/HIVE-16391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501285#comment-16501285 ] Saisai Shao commented on HIVE-16391: Hi [~joshrosen], I'm trying to make the Hive changes you mentioned above using the new classifier {{core-spark}}. I found one problem with releasing two shaded jars (one is hive-exec, the other is hive-exec-core-spark): the published pom file is still the reduced pom, which corresponds to hive-exec, so when Spark uses the hive-exec-core-spark jar it has to explicitly declare all the transitive dependencies of hive-exec. I'm not sure whether there's a way to publish two pom files mapping to the two different shaded jars, or whether it is acceptable for Spark to explicitly declare all the transitive dependencies, like with the {{core}} classifier you used before? > Publish proper Hive 1.2 jars (without including all dependencies in uber jar) > - > > Key: HIVE-16391 > URL: https://issues.apache.org/jira/browse/HIVE-16391 > Project: Hive > Issue Type: Task > Components: Build Infrastructure >Reporter: Reynold Xin >Priority: Major > > Apache Spark currently depends on a forked version of Apache Hive. AFAIK, the > only change in the fork is to work around the issue that Hive publishes only > two sets of jars: one set with no dependency declared, and another with all > the dependencies included in the published uber jar. That is to say, Hive > doesn't publish a set of jars with the proper dependencies declared. > There is general consensus on both sides that we should remove the forked > Hive. > The change in the forked version is recorded here > https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2 > Note that the fork in the past included other fixes but those have all become > unnecessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501225#comment-16501225 ] Saisai Shao commented on SPARK-20202: - OK, for the 1st, I've already started working on it locally. It looks like it is not a big change; only some POM changes are needed. I will submit a patch to the Hive community. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499707#comment-16499707 ] Saisai Shao commented on SPARK-20202: - What is our plan to fix this issue: are we going to use a new Hive version, or are we still sticking to 1.2? If we're still sticking to 1.2, [~ste...@apache.org] and I will take this issue and get the ball rolling in the Hive community. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497931#comment-16497931 ] Saisai Shao commented on SPARK-18673: - Thanks Steve, looking forward to your inputs. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests
[ https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497626#comment-16497626 ] Saisai Shao commented on SPARK-24355: - Do we have a test result before and after? > Improve Spark shuffle server responsiveness to non-ChunkFetch requests > -- > > Key: SPARK-24355 > URL: https://issues.apache.org/jira/browse/SPARK-24355 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 > Environment: Hadoop-2.7.4 > Spark-2.3.0 >Reporter: Min Shen >Priority: Major > > We run Spark on YARN, and deploy Spark external shuffle service as part of > YARN NM aux service. > One issue we saw with Spark external shuffle service is the various timeout > experienced by the clients on either registering executor with local shuffle > server or establish connection to remote shuffle server. > Example of a timeout for establishing connection with remote shuffle server: > {code:java} > java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout > waiting for task. > at > org.spark_project.guava.base.Throwables.propagate(Throwables.java:160) > at > org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288) > at > org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57) > {code} > Example of a timeout for registering executor with local shuffle server: > {code:java} > ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout > waiting for task. 
> at > org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) > at > org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) > at > org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141) > at > org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) > {code} > While patches such as SPARK-20640 and config parameters such as > spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when > spark.authenticate is set to true) could help to alleviate this type of > problems, it does not solve the fundamental issue. > We have observed that, when the shuffle workload gets very busy in peak > hours, the client requests could timeout even after configuring these > parameters to very high values. Further investigating this issue revealed the > following issue: > Right now, the default server side netty handler threads is 2 * # cores, and > can be further configured with parameter spark.shuffle.io.serverThreads. > In order to process a client request, it would require one available server > netty handler thread. > However, when the server netty handler threads start to process > ChunkFetchRequests, they will be blocked on disk I/O, mostly due to disk > contentions from the random read operations initiated by all the > ChunkFetchRequests received from clients. > As a result, when the shuffle server is serving many concurrent >
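For reference, the configuration keys named in the issue above (spark.shuffle.registration.timeout, spark.shuffle.sasl.timeout and spark.shuffle.io.serverThreads) can be set through SparkConf; a minimal, purely illustrative sketch follows. The values are placeholders, spark.shuffle.io.serverThreads applies where the shuffle service itself runs, and, as the issue already notes, raising these settings only alleviates the timeouts rather than removing the blocking on disk I/O.
{code}
import org.apache.spark.SparkConf

// Illustrative tuning only: the keys come from the issue text above,
// the values are arbitrary examples, not recommendations.
val conf = new SparkConf()
  .set("spark.shuffle.registration.timeout", "60000") // executor registration timeout, in ms
  .set("spark.shuffle.sasl.timeout", "60s")           // SASL handshake timeout (when spark.authenticate=true)
  .set("spark.shuffle.io.serverThreads", "128")       // server-side netty handler threads
{code}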
[jira] [Commented] (SPARK-24448) File not found on the address SparkFiles.get returns on standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-24448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497548#comment-16497548 ] Saisai Shao commented on SPARK-24448: - Does it only happen in standalone cluster mode? Have you tried client mode? > File not found on the address SparkFiles.get returns on standalone cluster > -- > > Key: SPARK-24448 > URL: https://issues.apache.org/jira/browse/SPARK-24448 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Pritpal Singh >Priority: Major > > I want to upload a file on all worker nodes in a standalone cluster and > retrieve the location of file. Here is my code > > val tempKeyStoreLoc = System.getProperty("java.io.tmpdir") + "/keystore.jks" > val file = new File(tempKeyStoreLoc) > sparkContext.addFile(file.getAbsolutePath) > val keyLoc = SparkFiles.get("keystore.jks") > > SparkFiles.get returns a random location where keystore.jks does not exist. I > submit the job in cluster mode. In fact the location Spark.Files returns does > not exist on any of the worker nodes (including the driver node). > I observed that Spark does load keystore.jks files on worker nodes at > /work///keystore.jks. The partition_id > changes from one worker node to another. > My requirement is to upload a file on all nodes of a cluster and retrieve its > location. I'm expecting the location to be common across all worker nodes. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
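For reference, the APIs in the report above (SparkContext.addFile and SparkFiles.get) are usually paired so that the path is resolved inside a task on the executors rather than on the driver; a minimal sketch is below. The file path is a placeholder, and this by itself does not produce one path that is identical on every node, which is what the reporter is asking for.
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Minimal sketch: ship a file with addFile, then resolve the local copy
// on the executors, inside a task, instead of on the driver.
val sc = new SparkContext(new SparkConf().setAppName("addFile-sketch"))
sc.addFile("/tmp/keystore.jks") // placeholder path

val executorPaths = sc.parallelize(1 to 2).map { _ =>
  SparkFiles.get("keystore.jks") // resolves to the executor-local copy of the file
}.collect()

executorPaths.foreach(println)
{code}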
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496187#comment-16496187 ] Saisai Shao commented on SPARK-18673: - I created a PR for this issue: [https://github.com/JoshRosen/hive/pull/2] Actually a one-line fix is enough; most of the other changes in Hive are related to HBase, which is not required for us. I'm not sure what our plan for Hive support is: are we still planning to use 1.2.1spark2 as the built-in version for Spark in the future, or do we plan to upgrade to the latest Hive version? If we plan to upgrade to the latest one, then this PR is not necessary. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23991) data loss when allocateBlocksToBatch
[ https://issues.apache.org/jira/browse/SPARK-23991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-23991. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21430 [https://github.com/apache/spark/pull/21430] > data loss when allocateBlocksToBatch > > > Key: SPARK-23991 > URL: https://issues.apache.org/jira/browse/SPARK-23991 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.2.0 > Environment: spark 2.11 >Reporter: kevin fu >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > with checkpoint and WAL enabled, driver will write the allocation of blocks > to batch into hdfs. however, if it fails as following, the blocks of this > batch cannot be computed by the DAG. Because the blocks have been dequeued > from the receivedBlockQueue and get lost. > {panel:title=error log} > 18/04/15 11:11:25 WARN ReceivedBlockTracker: Exception thrown while writing > record: BatchAllocationEvent(152376548 ms,AllocatedBlocks(Map(0 -> > ArrayBuffer( to the WriteAheadLog. org.apache.spark.SparkException: > Exception thrown in awaitResult: at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) at > org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) > at > org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused > by: java.util.concurrent.TimeoutException: Futures timed out after [5000 > milliseconds] at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 12 > more 18/04/15 11:11:25 INFO ReceivedBlockTracker: Possibly processed batch > 152376548 ms needs to be processed again in WAL recovery{panel} > the concerning codes are showed below: > {code} > /** >* Allocate all unallocated blocks to the given batch. >* This event will get written to the write ahead log (if enabled). 
>*/ > def allocateBlocksToBatch(batchTime: Time): Unit = synchronized { > if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) > { > val streamIdToBlocks = streamIds.map { streamId => > (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true)) > }.toMap > val allocatedBlocks = AllocatedBlocks(streamIdToBlocks) > if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) { > timeToAllocatedBlocks.put(batchTime, allocatedBlocks) > lastAllocatedBatchTime = batchTime > } else { > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } else { > // This situation occurs when: > // 1. WAL is ended with BatchAllocationEvent, but without > BatchCleanupEvent, > // possibly processed batch job or half-processed batch job need to be > processed again, > // so the batchTime will be equal to lastAllocatedBatchTime. > // 2. Slow checkpointing makes recovered batch time older than WAL > recovered > // lastAllocatedBatchTime. > // This situation will only occurs in recovery time. > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To
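To make the failure mode described above easier to see, here is a simplified, self-contained sketch of the ordering problem. It uses plain Scala collections rather than Spark's classes, and it is not necessarily the approach taken in the linked pull request: the point is only that dequeuing the blocks before the write-ahead-log write is known to have succeeded loses them when the write fails, unless they are put back.
{code}
import scala.collection.mutable

// Stand-ins for the receiver's block queue and the WAL write.
val receivedBlocks = mutable.Queue("block-1", "block-2", "block-3")
def writeToLog(event: Seq[String]): Boolean = false // pretend the WAL write timed out

// Hazard: the blocks are dequeued before the write outcome is known.
val dequeued = receivedBlocks.dequeueAll(_ => true)
if (!writeToLog(dequeued)) {
  // One way to avoid losing them: put the blocks back so a later batch can allocate them.
  receivedBlocks.enqueue(dequeued: _*)
}
{code}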
[jira] [Assigned] (SPARK-23991) data loss when allocateBlocksToBatch
[ https://issues.apache.org/jira/browse/SPARK-23991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-23991: --- Assignee: Gabor Somogyi > data loss when allocateBlocksToBatch > > > Key: SPARK-23991 > URL: https://issues.apache.org/jira/browse/SPARK-23991 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.2.0 > Environment: spark 2.11 >Reporter: kevin fu >Assignee: Gabor Somogyi >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > with checkpoint and WAL enabled, driver will write the allocation of blocks > to batch into hdfs. however, if it fails as following, the blocks of this > batch cannot be computed by the DAG. Because the blocks have been dequeued > from the receivedBlockQueue and get lost. > {panel:title=error log} > 18/04/15 11:11:25 WARN ReceivedBlockTracker: Exception thrown while writing > record: BatchAllocationEvent(152376548 ms,AllocatedBlocks(Map(0 -> > ArrayBuffer( to the WriteAheadLog. org.apache.spark.SparkException: > Exception thrown in awaitResult: at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194) at > org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:83) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:234) > at > org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118) > at > org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:213) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) at > org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247) > at > org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Caused > by: java.util.concurrent.TimeoutException: Futures timed out after [5000 > milliseconds] at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) ... 12 > more 18/04/15 11:11:25 INFO ReceivedBlockTracker: Possibly processed batch > 152376548 ms needs to be processed again in WAL recovery{panel} > the concerning codes are showed below: > {code} > /** >* Allocate all unallocated blocks to the given batch. >* This event will get written to the write ahead log (if enabled). 
>*/ > def allocateBlocksToBatch(batchTime: Time): Unit = synchronized { > if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) > { > val streamIdToBlocks = streamIds.map { streamId => > (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true)) > }.toMap > val allocatedBlocks = AllocatedBlocks(streamIdToBlocks) > if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) { > timeToAllocatedBlocks.put(batchTime, allocatedBlocks) > lastAllocatedBatchTime = batchTime > } else { > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } else { > // This situation occurs when: > // 1. WAL is ended with BatchAllocationEvent, but without > BatchCleanupEvent, > // possibly processed batch job or half-processed batch job need to be > processed again, > // so the batchTime will be equal to lastAllocatedBatchTime. > // 2. Slow checkpointing makes recovered batch time older than WAL > recovered > // lastAllocatedBatchTime. > // This situation will only occurs in recovery time. > logInfo(s"Possibly processed batch $batchTime needs to be processed > again in WAL recovery") > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Resolved] (LIVY-472) Improve the logs for fail-to-create session
[ https://issues.apache.org/jira/browse/LIVY-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved LIVY-472. -- Resolution: Fixed Fix Version/s: 0.6.0 0.5.1 Issue resolved by pull request 96 [https://github.com/apache/incubator-livy/pull/96] > Improve the logs for fail-to-create session > --- > > Key: LIVY-472 > URL: https://issues.apache.org/jira/browse/LIVY-472 > Project: Livy > Issue Type: Improvement > Components: Server >Affects Versions: 0.5.0 >Reporter: Saisai Shao >Priority: Minor > Fix For: 0.5.1, 0.6.0 > > > Livy currently doesn't give a very clear log about the fail-to-create > session, it only says that session related app tag cannot be found in RM, but > doesn't tell user how to search and get the true root cause. So here change > the logs to make it more clear. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SPARK-24377) Make --py-files work in non pyspark application
Saisai Shao created SPARK-24377: --- Summary: Make --py-files work in non pyspark application Key: SPARK-24377 URL: https://issues.apache.org/jira/browse/SPARK-24377 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Reporter: Saisai Shao Some Spark applications, although they are Java programs, require not only jar dependencies but also Python dependencies. One example is Livy's remote SparkContext application: it is actually an embedded REPL for Scala/Python/R, so it loads not only jar dependencies but also Python and R deps. Currently, for a Spark application, --py-files only works for a pyspark application, so it does not work in the above case. So here we propose to remove that restriction. Also, we tested that "spark.submit.pyFiles" only supports a quite limited scenario (client mode with local deps), so here we also expand the usage of "spark.submit.pyFiles" to be an alternative to --py-files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
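As a hedged illustration of the scenario described above: a JVM application launched programmatically that still needs Python files shipped, as Livy's remote SparkContext does. The sketch uses the SparkLauncher API; the jar path, main class and zip file below are placeholders, and whether the Python files are actually honoured for a non-pyspark main class is exactly what this issue proposes to fix.
{code}
import org.apache.spark.launcher.SparkLauncher

// A Java/Scala driver that also ships Python dependencies (placeholder names throughout).
val app = new SparkLauncher()
  .setAppResource("/path/to/my-jvm-app.jar")
  .setMainClass("com.example.MyDriver")
  .setMaster("yarn")
  .setDeployMode("cluster")
  .addPyFile("/path/to/python-deps.zip") // launcher counterpart of --py-files
  .launch()

app.waitFor()
{code}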
[jira] [Commented] (LIVY-471) New session creation API set to support resource uploading
[ https://issues.apache.org/jira/browse/LIVY-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481375#comment-16481375 ] Saisai Shao commented on LIVY-471: -- I have a local PR for solution 2, which is simpler and more straightforward. It only adds new APIs while keeping the old API consistent. It also handles both batch and interactive sessions. I can submit a WIP PR, in which you can see the implementation more clearly. > New session creation API set to support resource uploading > -- > > Key: LIVY-471 > URL: https://issues.apache.org/jira/browse/LIVY-471 > Project: Livy > Issue Type: Improvement > Components: Server >Affects Versions: 0.5.0 >Reporter: Saisai Shao >Priority: Major > > Already posted to the mailing list. > In our current API design for creating an interactive / batch session, we assume the > end user has uploaded jars, pyFiles and related dependencies to HDFS before > creating the session, and we use one POST request to create the session. But > usually the end user may not have permission to access HDFS from their > submission machine, which makes it hard for them to create new sessions. So the > requirement here is for Livy to offer APIs to upload resources during > session creation. One implementation is proposed in > [https://github.com/apache/incubator-livy/pull/91|https://github.com/apache/incubator-livy/pull/91.] > It adds a field in the session creation request to delay the session creation, > then adds a set of APIs to support resource upload, and finally adds an API > to start creating the session. This seems a feasible solution, but also a > little hacky for such a scenario. So I was thinking we could add a set of > new APIs to support such scenarios, rather than hack the existing APIs. > To borrow the concept from YARN application submission, we could have 3 APIs to > create a session: > * requesting a new session id from the Livy Server. > * uploading resources associated with this session id. > * submitting the request to create the session. > This is similar to YARN's process for submitting an application, and we can bump the > supported API version for the newly added APIs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-24241) Do not fail fast when dynamic resource allocation enabled with 0 executor
[ https://issues.apache.org/jira/browse/SPARK-24241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475480#comment-16475480 ] Saisai Shao commented on SPARK-24241: - Issue resolved by pull request 21290 https://github.com/apache/spark/pull/21290 > Do not fail fast when dynamic resource allocation enabled with 0 executor > - > > Key: SPARK-24241 > URL: https://issues.apache.org/jira/browse/SPARK-24241 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 2.4.0 > > > {code:java} > ~/spark-2.3.0-bin-hadoop2.7$ bin/spark-sql --num-executors 0 --conf > spark.dynamicAllocation.enabled=true > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=1024m; > support was removed in 8.0 > Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; > support was removed in 8.0 > Error: Number of executors must be a positive number > Run with --help for usage help or --verbose for debug output > {code} > Actually, we could start up with min executor number with 0 before -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
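For reference, a minimal sketch of the configuration combination discussed above: dynamic allocation enabled with a floor of zero executors. The keys are standard Spark configuration properties, the values are illustrative, and the submit-time validation that this issue relaxes is separate from these settings.
{code}
import org.apache.spark.SparkConf

// Dynamic allocation that can scale down to zero executors.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.initialExecutors", "0")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service, needed for dynamic allocation
{code}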