GraphX partitioning and threading details

2014-08-04 Thread Larry Xiao
Hi all, a question about GraphX partitioning details and possible optimizations. * Can you tell how partitions are distributed to nodes? And inside a worker, how do partitions get allocated to threads? o Is it possible to make a manual configuration, like partition A => node 1, thread 1? * How
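
For context on what the API does let you control: GraphX exposes the edge partitioning strategy through Graph.partitionBy, but not partition-to-node or partition-to-thread placement, which is left to the Spark scheduler. A minimal sketch (the edge-list path and the strategy choice are illustrative assumptions, not from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    object PartitionDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "partition-demo")
        // The edge-list path below is a placeholder.
        val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
        // Choose how edges are assigned to partitions; where a partition
        // runs (which node or thread) is decided by the scheduler, not the API.
        val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
        println(partitioned.edges.partitions.length)
        sc.stop()
      }
    }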

Re: "log" overloaded in SparkContext/ Spark 1.0.x

2014-08-04 Thread Matei Zaharia
Hah, weird. "log" should actually be protected (look at trait Logging). Is your class extending SparkContext or somehow being placed in the org.apache.spark package? Or maybe the Scala compiler looks at it anyway... in that case we can rename it. Please open a JIRA for it if that's the case. On

"log" overloaded in SparkContext/ Spark 1.0.x

2014-08-04 Thread Dmitriy Lyubimov
It would seem that code like import o.a.spark.SparkContext._ import math._ a = log(b) no longer compiles with Spark 1.0.x, since SparkContext._ also exposes a `log` function. Which happens a lot to a guy like me. The obvious workaround is to use something like import o.a.spark.Sp
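
One possible workaround, sketched here (the alias name mathLog is arbitrary), is to rename math.log at the import site so the two `log` symbols no longer clash:

    import org.apache.spark.SparkContext._  // brings the conflicting `log` into scope
    import math.{log => mathLog}            // rename scala.math.log on import

    object LogDemo {
      def main(args: Array[String]): Unit = {
        val b = 10.0
        val a = mathLog(b)  // unambiguous call to scala.math.log
        println(a)
      }
    }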

Re: Low Level Kafka Consumer for Spark

2014-08-04 Thread Yan Fang
Another suggestion that may help: you can consider using Kafka to store the latest offset instead of ZooKeeper. There are at least two benefits: 1) it lowers the workload on ZK, and 2) it supports replay from a certain offset. This is how Samza deals with the Kafka offse
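
To make the replay idea concrete, a rough sketch (the OffsetStore trait and both types here are hypothetical illustrations, not APIs from Kafka, Samza, or the consumer under discussion):

    // Hypothetical sketch: remember the last processed offset per
    // topic/partition so a consumer can replay from that position
    // after a restart.
    case class TopicPartition(topic: String, partition: Int)

    trait OffsetStore {
      def commit(tp: TopicPartition, offset: Long): Unit
      def lastCommitted(tp: TopicPartition): Option[Long]
    }

    // In-memory stand-in; a durable store would write to a Kafka
    // offsets topic (or ZooKeeper) so state survives restarts.
    class InMemoryOffsetStore extends OffsetStore {
      private val offsets = scala.collection.mutable.Map.empty[TopicPartition, Long]
      def commit(tp: TopicPartition, offset: Long): Unit = offsets(tp) = offset
      def lastCommitted(tp: TopicPartition): Option[Long] = offsets.get(tp)
    }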

Re: Interested in contributing to GraphX in Python

2014-08-04 Thread Reynold Xin
Thanks for your interest. I think the main challenge is that if we have to call Python functions per record, it can be pretty expensive to serialize/deserialize across the boundary between the Python process and the JVM process. I don't know if there is a good way to solve this problem yet. On Fri, Aug 1, 2

Re: Scala 2.11 external dependencies

2014-08-04 Thread Anand Avati
On Sun, Aug 3, 2014 at 9:09 PM, Patrick Wendell wrote: > Hey Anand, > > Thanks for looking into this - it's great to see momentum towards Scala > 2.11 and I'd love it if this landed in Spark 1.2. > > For the external dependencies, it would be good to create a sub-task of > SPARK-1812 to track our effo

Re: Problems running modified spark version on ec2 cluster

2014-08-04 Thread Matt Forbes
After rummaging through the worker instances I noticed they were using the assembly jar (which I hadn't noticed before). Now, instead of using the core and mllib jars individually, I'm just overwriting the assembly jar on the master and using spark-ec2/copy-dir. For posterity, my run script is: MAS

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Hmm. Fair enough. I hadn't given that answer much thought and on reflection think you're right in that a profile would just be a bad hack. On 8/4/14, 10:35, "Sean Owen" wrote: >What would such a profile do though? In general building for a >specific vendor version means setting hadoop.version

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
What would such a profile do though? In general, building for a specific vendor version means setting hadoop.version and/or yarn.version. Any hard-coded value is unlikely to match what a particular user needs. Setting protobuf versions and so on is already done by the generic profiles. In a similar

Problems running modified spark version on ec2 cluster

2014-08-04 Thread Matt Forbes
I'm trying to run a forked version of mllib where I am experimenting with a boosted trees implementation. Here is what I've tried, but can't seem to get working properly: *Directory layout:* src/spark-dev (spark github fork) pom.xml - I've tried changing the version to 1.2 arbitrarily in core

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
I don’t think there is an hwx profile, but there probably should be. - Steve From: Patrick Wendell Date: Monday, August 4, 2014 at 10:08 To: Ron's Yahoo! Cc: Ron's Yahoo!, Steve Nunez, "dev@spark.apache.org" Subject: Re: Issues with HDP 2.4.0.2.1.3.0-563 Ah I see, yeah you might need

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
Ah I see, yeah you might need to set hadoop.version and yarn.version. I thought the profile set this automatically. On Mon, Aug 4, 2014 at 10:02 AM, Ron's Yahoo! wrote: > I meant yarn and hadoop defaulted to 1.0.4, so the yarn build fails since > 1.0.4 doesn't exist for yarn... > > Thanks, > Ron

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
Can you try building without any of the special `hadoop.version` flags, just with -Phadoop-2.4? In the past users have reported issues trying to build random spot versions... I think HW is supposed to be compatible with the normal 2.4.0 build. On Mon, Aug 4, 2014 at 8:35 AM, Ron'

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Provided you've got the HWX repo in your pom.xml, you can build with this line: mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0.2.1.1.0-385 -DskipTests clean package I haven't tried building a distro, but it should be similar. - SteveN On 8/4/14, 1:25, "Sean Owen" wrote: >For a

Re: Compiling Spark master (6ba6c3eb) with sbt/sbt assembly

2014-08-04 Thread Larry Xiao
Sorry, I meant: I tried the command ./sbt/sbt clean and now it works. Is it because cached components were not recompiled? On 8/4/14, 4:44 PM, Larry Xiao wrote: I guessed ./sbt/sbt clean and it works fine now. On 8/4/14, 11:48 AM, Larry Xiao wrote: On the latest pull today (6ba6c3ebfe9a473

Re: Compiling Spark master (6ba6c3eb) with sbt/sbt assembly

2014-08-04 Thread Larry Xiao
I guessed ./sbt/sbt clean and it works fine now. On 8/4/14, 11:48 AM, Larry Xiao wrote: On the latest pull today (6ba6c3ebfe9a47351a50e45271e241140b09bf10) I met an assembly problem. $ ./sbt/sbt assembly Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME. Note, this will be overridden by -j

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
For any Hadoop 2.4 distro, yes, set hadoop.version but also set -Phadoop-2.4. http://spark.apache.org/docs/latest/building-with-maven.html On Mon, Aug 4, 2014 at 9:15 AM, Patrick Wendell wrote: > For hortonworks, I believe it should work to just link against the > corresponding upstream version.

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For Hortonworks, I believe it should work to just link against the corresponding upstream version. I.e., just set the Hadoop version to "2.4.0". Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! wrote: > Hi, > Not sure whose issue this is, but if I run make-distribution