[jira] [Created] (SPARK-6478) new RDD.pipeWithPartition method

2015-03-23 Thread Maxim Ivanov (JIRA)
Maxim Ivanov created SPARK-6478:
---

 Summary: new RDD.pipeWithPartition method
 Key: SPARK-6478
 URL: https://issues.apache.org/jira/browse/SPARK-6478
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Maxim Ivanov
Priority: Minor




This method allows building the command line args and the process environment
map using the partition as an argument.

The use case for this feature is to provide additional information about the
partition to the spawned application in cases where the partitioner provides it
(like in the Cassandra connector, or when a custom partitioner/RDD is used).

It also provides a simpler and more intuitive alternative to the
printPipeContext function.
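
To make the idea concrete, here is a rough sketch of one possible shape for
the API. The name, signature and body below are illustrative only, not a
finished design:

import scala.collection.JavaConverters._
import scala.io.Source

import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// Sketch only: shows the intended call shape, not a production
// implementation (no error handling, no stderr draining).
object PipeWithPartitionSketch {
  def pipeWithPartition[T](
      rdd: RDD[T],
      command: Partition => Seq[String],
      env: Partition => Map[String, String]): RDD[String] = {
    // Partition metadata is resolved on the driver and shipped with the tasks.
    val parts = rdd.partitions
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      val part = parts(idx)
      val builder = new ProcessBuilder(command(part).asJava)
      builder.environment().putAll(env(part).asJava)
      val proc = builder.start()
      // Feed the partition's elements to the child's stdin in the background.
      new Thread("pipe-stdin-" + idx) {
        override def run(): Unit = {
          val out = proc.getOutputStream
          iter.foreach(x => out.write((x.toString + "\n").getBytes("UTF-8")))
          out.close()
        }
      }.start()
      // The child's stdout becomes the output partition, one element per line.
      Source.fromInputStream(proc.getInputStream, "UTF-8").getLines()
    }
  }
}

With a partition-aware source (e.g. the Cassandra connector), command(part)
and env(part) could then pass partition-specific details such as a token
range to the external program through its arguments or environment.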







[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies

2014-08-15 Thread Maxim Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098774#comment-14098774
 ] 

Maxim Ivanov commented on SPARK-1828:
-

I don't have a pull request at hand if you are asking that ;) But IMHO the
proper solution is to tinker with the Maven Shade plugin, so that classes
pulled in by the Hive dependency are dropped in favor of those specified in
the Spark POM.

If it is done that way, it would be possible to specify the Hive version with
a "-D" param in the same way we can specify the Hadoop version, and be sure
(to some extent, of course :) ) that if it builds, it works.
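
For example, the build invocation could then look something like this (the
-Dhive.version flag is the hypothetical part; -Dhadoop.version already works
this way, and the exact version strings are only for illustration):

mvn -Phive -Dhadoop.version=2.3.0-cdh5.0.2 -Dhive.version=0.12.0-cdh5.0.2 -DskipTests clean package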

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to Hive 
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in 
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec 
> 0.12.0 that behaves correctly. While we are doing this, we might as well 
> rewrite the protobuf dependency to use the shaded version of protobuf 2.4.1 
> that we already have for Akka.






[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies

2014-08-15 Thread Maxim Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098656#comment-14098656
 ] 

Maxim Ivanov commented on SPARK-1828:
-

Because of this change, any incompatibilities with Hive in Hadoop distros are
hidden until you run the job on an actual cluster, unless you are willing to
keep your fork up to date with every major Hadoop distro, of course.

Right now we see an incompatibility with CDH 5.0.2 Hive, but I'd rather have
it fail to compile than see problems at runtime.

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to Hive 
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in 
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec 
> 0.12.0 that behaves correctly. While we are doing this, we might as well 
> rewrite the protobuf dependency to use the shaded version of protobuf 2.4.1 
> that we already have for Akka.






[jira] [Created] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2014-07-24 Thread Maxim Ivanov (JIRA)
Maxim Ivanov created SPARK-2669:
---

 Summary: Hadoop configuration is not localised when submitting job 
in yarn-cluster mode
 Key: SPARK-2669
 URL: https://issues.apache.org/jira/browse/SPARK-2669
 Project: Spark
  Issue Type: Bug
Reporter: Maxim Ivanov


I'd like to propose a fix for a problem where the Hadoop configuration is not
localized when a job is submitted in yarn-cluster mode. Here is the description
from the GitHub pull request https://github.com/apache/spark/pull/1574:

This patch fixes a problem where a Spark driver run in a container managed by
the YARN ResourceManager inherits its configuration from the NodeManager
process, which can differ from the Hadoop configuration present on the client
(the submitting machine). The problem is most visible when the fs.defaultFS
property differs between the two.

Hadoop MR solves this by serializing the client's Hadoop configuration into
job.xml in the application staging directory and then making the
ApplicationMaster use it. That guarantees that, regardless of the execution
nodes' configurations, all application containers use a config identical to
the one on the client side.

This patch uses a similar approach. YARN ClientBase serializes the
configuration and adds it to ClientDistributedCacheManager under the "job.xml"
link name. ClientDistributedCacheManager then uses the Hadoop localizer to
deliver it to every container started by this application, including the one
running the Spark driver.

YARN ClientBase also adds the "SPARK_LOCAL_HADOOPCONF" env variable to the AM
container request, which is then used by SparkHadoopUtil.newConfiguration to
trigger the new behavior in which the machine-wide Hadoop configuration is
merged with the application-specific job.xml (exactly as it is done in
Hadoop MR).
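
A minimal sketch of that merge step, assuming the localized file shows up as
"job.xml" in the container's working directory (names below are illustrative,
not the actual patch):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object HadoopConfMergeSketch {
  def newConfiguration(): Configuration = {
    // Start from the machine-wide Hadoop configuration of this node.
    val conf = new Configuration()
    // If the client shipped its configuration via the distributed cache,
    // layer it on top so client-side settings (e.g. fs.defaultFS) win.
    if (sys.env.contains("SPARK_LOCAL_HADOOPCONF")) {
      conf.addResource(new Path("job.xml"))
    }
    conf
  }
}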

SparkContext then follows the same approach, adding the SPARK_LOCAL_HADOOPCONF
env variable to all spawned containers to make them use the client-side Hadoop
configuration.

Also, all references to "new Configuration()" that might be executed on the
YARN cluster side are changed to use SparkHadoopUtil.get.conf.
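
For example, cluster-side code that previously constructed a bare Hadoop
Configuration would instead go through the Spark helper (a minimal
illustration, assuming Spark is on the classpath):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.deploy.SparkHadoopUtil

object ClusterSideConf {
  // Before: sees only the node-local defaults of the NodeManager host.
  def before(): Configuration = new Configuration()
  // After: also picks up the client-side settings merged from job.xml.
  def after(): Configuration = SparkHadoopUtil.get.conf
}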

Please note that it fixes only core Spark, the part which I am comfortable
testing and verifying. I didn't descend into the streaming/shark directories,
so things might need to be changed there too.



