[jira] [Created] (SPARK-6478) new RDD.pipeWithPartition method
Maxim Ivanov created SPARK-6478:
---
Summary: new RDD.pipeWithPartition method
Key: SPARK-6478
URL: https://issues.apache.org/jira/browse/SPARK-6478
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Maxim Ivanov
Priority: Minor

This method allows building the command-line args and the process environment map using the partition as an argument. The use case for this feature is to provide additional information about the partition to the spawned application in cases where the partitioner provides it (as with the Cassandra connector, or when a custom partitioner/RDD is used). It also provides a simpler and more intuitive alternative to the printPipeContext function.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
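A minimal sketch of the idea behind the proposal: the caller supplies functions that derive the command line and the environment from the partition index, instead of printing a header line via printPipeContext. The class and method names below are hypothetical illustrations, not the actual proposed Spark API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipeWithPartitionSketch {
    // Hypothetical shape of the proposal: derive the spawned process's
    // command line from the partition index (e.g. a custom partitioner
    // could expose a token range here instead of a bare index).
    static List<String> buildCommand(int partitionIndex) {
        return Arrays.asList("my-tool", "--partition", String.valueOf(partitionIndex));
    }

    // Likewise, derive the process environment map from the partition.
    static Map<String, String> buildEnv(int partitionIndex) {
        Map<String, String> env = new HashMap<>();
        env.put("PARTITION_ID", String.valueOf(partitionIndex));
        return env;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand(2));
        System.out.println(buildEnv(2));
    }
}
```

Compared with printPipeContext, which pushes extra context through the child's stdin, this passes partition information where command-line tools normally expect it: argv and the environment.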
[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies
[ https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098774#comment-14098774 ]

Maxim Ivanov commented on SPARK-1828:
-

I don't have a pull request at hand, if that is what you are asking ;) But IMHO the proper solution is to tinker with the maven-shade-plugin to drop the classes pulled in by the hive dependency in favor of those specified in the Spark POM. If it is done that way, then it would be possible to specify the hive version with a "-D" param, the same way we can specify the hadoop version, and be sure (to some extent, of course :) ) that if it builds, it works.

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.0.0
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
> Priority: Blocker
> Fix For: 1.0.0
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to Hive
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec
> 0.12.0 that behaves correctly. While we are doing this, we might as well
> re-write the protobuf dependency to use the shaded version of protobuf 2.4.1
> that we already have for Akka.
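The suggestion above could look roughly like the following maven-shade-plugin filter, which drops selected classes bundled inside hive-exec so the versions declared in Spark's own POM win. This is a sketch only; the exclude patterns are illustrative assumptions, not a tested list.

```xml
<!-- Sketch: filter out dependency classes that hive-exec bundles
     (see HIVE-5733), so Spark's own POM-managed versions are used. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <filter>
        <artifact>org.apache.hive:hive-exec</artifact>
        <excludes>
          <!-- illustrative: bundled protobuf and guava classes -->
          <exclude>com/google/protobuf/**</exclude>
          <exclude>com/google/common/**</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```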
[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies
[ https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098656#comment-14098656 ]

Maxim Ivanov commented on SPARK-1828:
-

Because of this change, any incompatibilities with Hive in Hadoop distros are hidden until you run a job on an actual cluster, unless you are willing to keep your fork up to date with every major Hadoop distro, of course. Right now we see an incompatibility with CDH 5.0.2 Hive, but I'd rather have it fail to compile than see problems at runtime.
[jira] [Created] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
Maxim Ivanov created SPARK-2669:
---
Summary: Hadoop configuration is not localised when submitting job in yarn-cluster mode
Key: SPARK-2669
URL: https://issues.apache.org/jira/browse/SPARK-2669
Project: Spark
Issue Type: Bug
Reporter: Maxim Ivanov

I'd like to propose a fix for a problem where the Hadoop configuration is not localized when a job is submitted in yarn-cluster mode. Here is the description from the GitHub pull request https://github.com/apache/spark/pull/1574

This patch fixes a problem where a Spark driver run in a container managed by the YARN ResourceManager inherits its configuration from the NodeManager process, which can differ from the Hadoop configuration present on the client (the submitting machine). The problem is most visible when the fs.defaultFS property differs between the two.

Hadoop MR solves this by serializing the client's Hadoop configuration into a job.xml file in the application staging directory and then making the Application Master use it. That guarantees that, regardless of the execution nodes' configurations, all application containers use the same config, identical to the one on the client side.

This patch uses a similar approach. YARN ClientBase serializes the configuration and adds it to ClientDistributedCacheManager under the "job.xml" link name. ClientDistributedCacheManager then uses the Hadoop localizer to deliver it to whatever container is started by this application, including the one running the Spark driver. YARN ClientBase also adds a "SPARK_LOCAL_HADOOPCONF" env variable to the AM container request, which is then used by SparkHadoopUtil.newConfiguration to trigger the new behavior, in which the machine-wide Hadoop configuration is merged with the application-specific job.xml (exactly as it is done in Hadoop MR). SparkContext then follows the same approach, adding the SPARK_LOCAL_HADOOPCONF env variable to all spawned containers to make them use the client-side Hadoop configuration.
Also, all references to "new Configuration()" that might be executed on the YARN cluster side are changed to use SparkHadoopUtil.get.conf.

Please note that this fixes only core Spark, the part which I am comfortable testing and verifying. I didn't descend into the streaming/shark directories, so things might need to be changed there too.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
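The overlay described above can be sketched in miniature. A real implementation layers Hadoop's Configuration resources (later resources override earlier ones); the plain-map version below only models that precedence rule. All names and values here are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class JobXmlOverlaySketch {
    // Models the job.xml overlay: the NodeManager's machine-wide Hadoop
    // configuration is taken as the base, then the client's serialized
    // job.xml is applied on top, so client-side values win. This mirrors
    // how later-added Configuration resources override earlier ones.
    static Map<String, String> merge(Map<String, String> nodeConf,
                                     Map<String, String> clientJobXml) {
        Map<String, String> merged = new HashMap<>(nodeConf);
        merged.putAll(clientJobXml); // client entries override node entries
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> nodeConf = new HashMap<>();
        nodeConf.put("fs.defaultFS", "hdfs://node-local-ns"); // node-side value
        nodeConf.put("io.file.buffer.size", "4096");          // untouched by client

        Map<String, String> jobXml = new HashMap<>();
        jobXml.put("fs.defaultFS", "hdfs://client-ns");       // client-side value

        // The container sees the client's fs.defaultFS, not the node's.
        System.out.println(merge(nodeConf, jobXml).get("fs.defaultFS"));
    }
}
```

The key property is exactly the one the description relies on: settings absent from the client's job.xml fall through to the machine-wide config, while any conflicting key (such as fs.defaultFS) resolves to the client's value in every container, including the one running the driver.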