Re: Hadoop's Configuration object isn't threadsafe

2014-07-17 Thread Patrick Wendell
Hey Andrew, I think you are correct and a follow-up to SPARK-2521 will end up fixing this. The design of SPARK-2521 automatically broadcasts RDD data in tasks and the approach creates a new copy of the RDD and associated data for each task. A natural follow-up to that patch is to stop handling
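
A rough sketch of the copy-per-task idea described above, not the SPARK-2521 implementation. SketchConf is a hypothetical stand-in for a mutable configuration object that is unsafe to share, and PerTaskCopy.runTasks and the "task.id" key are made up for illustration; nothing here uses Hadoop or Spark classes.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a mutable configuration object that is
// not safe to share between threads (assumption, not Hadoop's class).
final class SketchConf(initial: Map[String, String]) {
  private val props = mutable.HashMap.empty[String, String] ++= initial
  def set(key: String, value: String): Unit = props(key) = value
  def get(key: String): Option[String] = props.get(key)
  def copy(): SketchConf = new SketchConf(props.toMap)
}

object PerTaskCopy {
  // All copies are made on the calling thread before any task starts,
  // so no two threads ever touch the same SketchConf instance.
  def runTasks(shared: SketchConf, tasks: Int): Seq[Option[String]] = {
    val copies = Seq.fill(tasks)(shared.copy())
    val results = new Array[Option[String]](tasks)
    val threads = copies.zipWithIndex.map { case (conf, i) =>
      new Thread(new Runnable {
        def run(): Unit = {
          conf.set("task.id", i.toString) // mutates only this task's copy
          results(i) = conf.get("task.id")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join()) // join makes each task's write visible here
    results.toSeq
  }
}
```

Each task sees only its own writes, and the shared object is never mutated after the copies are taken. The cost is one copy per task, which is the trade-off the message above alludes to.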

Re: Hadoop's Configuration object isn't threadsafe

2014-07-17 Thread Andrew Ash
Sounds good -- I added comments to the ticket. Since SPARK-2521 is scheduled for a 1.1.0 release and we can work around with spark.speculation, I don't personally see a need for a 1.0.2 backport. Thanks for looking through this issue! On Thu, Jul 17, 2014 at 2:14 AM, Patrick Wendell

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by projects, shows it as a module. You should only build yarn-stable *or* yarn-alpha at any given time. I don't remember the modules changing in a while. 'yarn-alpha' is for YARN before it stabilized, circa early Hadoop 2.0.x.

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sandy Ryza
To add, we've made some effort to get yarn-alpha to work with the 2.0.x line, but this was a time when YARN went through wild API changes. The only line that the yarn-alpha profile is guaranteed to work against is the 0.23 line. On Thu, Jul 17, 2014 at 12:40 AM, Sean Owen so...@cloudera.com wrote:

[VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
Please vote on releasing the following candidate as Apache Spark version 0.9.2! The tag to be voted on is v0.9.2-rc1 (commit 4322c0ba): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=4322c0ba7f411cf9a2483895091440011742246b The release files, including signatures, digests, etc.

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
@Sean and @Sandy Thanks for the reply. I used to be able to see the yarn-alpha and yarn directories, which correspond to the modules. I guess due to the recent SparkBuild.scala changes, I did not see yarn-alpha (by default) and I thought yarn-alpha was renamed to yarn and yarn-stable is the

Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
I'm trying to compile the latest code, with the hadoop-version set for 2.0.0-mr1-cdh4.6.0. I'm getting the following error, which I don't get when I don't set the hadoop version: [error]

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
This looks like a Jetty version problem actually. Are you bringing in something that might be changing the version of Jetty used by Spark? It depends a lot on how you are building things. Good to specify exactly how you're building here. On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Looks like a real problem. I see it too. I think the same workaround found in ClientBase.scala needs to be used here. There, the fact that this field can be a String or String[] is handled explicitly. In fact I think you can just call to ClientBase for this? PR it, I say. On Thu, Jul 17, 2014 at
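
A minimal sketch of handling a field that can be a String or a String[] explicitly, as described above. This is not the code from ClientBase.scala: the object and method names are hypothetical, and splitting the String form on commas is an assumption about how the value is encoded.

```scala
object ClasspathFieldSketch {
  // Normalizes a value that may arrive as a single String or as a String[].
  // Assumption: the String form is a comma-separated list of entries.
  def asEntries(value: Any): Seq[String] = value match {
    case null             => Seq.empty
    case s: String        => s.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    case a: Array[String] => a.toSeq
    case other =>
      throw new IllegalArgumentException(s"Unexpected field type: ${other.getClass}")
  }
}
```

Matching on the runtime type keeps the caller independent of which form a given Hadoop version exposes, which is presumably why calling one shared helper is preferable to repeating the check.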

Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
My full build command is: ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly I've changed one line in RDD.scala, nothing else. On Thu, Jul 17, 2014 at 10:56 AM, Sean Owen so...@cloudera.com wrote: This looks like a Jetty version problem actually. Are you bringing in something

Re: Compile error when compiling for cloudera

2014-07-17 Thread Nathan Kronenfeld
er, that line being in toDebugString, where it really shouldn't affect anything (no signature changes or the like) On Thu, Jul 17, 2014 at 10:58 AM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: My full build command is: ./sbt/sbt -Dhadoop.version=2.0.0-mr1-cdh4.6.0 clean assembly

Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-17 Thread Yan Fang
Thank you, TD ! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 16, 2014 at 6:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: After every checkpointing interval, the latest state RDD is stored to HDFS in its entirety. Along with that, the series of DStream

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
CC tmalaska since he touched the line in question. This is a fun one. So, here's the line of code added last week: val channelFactory = new NioServerSocketChannelFactory (Executors.newCachedThreadPool(), Executors.newCachedThreadPool()); Scala parses this as two statements, one invoking a
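
A small self-contained sketch of the parsing behaviour described above, assuming the line break in the original source fell just before the opening parenthesis. SketchFactory is a hypothetical class, not the Netty one; it only needs both a two-argument and a no-argument constructor for the effect to show.

```scala
// Hypothetical class with a two-argument and a no-argument constructor.
class SketchFactory(val boss: String, val worker: String) {
  def this() = this("default", "default")
}

object SemicolonInferenceSketch {
  // The newline before "(" ends the statement, so this calls the
  // no-argument constructor...
  val split = new SketchFactory
  ("boss-pool", "worker-pool") // ...and this is a separate, discarded tuple.

  // Keeping "(" on the same line as the class name passes the arguments.
  val joined = new SketchFactory(
    "boss-pool", "worker-pool")
}
```

The compiler typically only warns that the tuple is a pure expression in statement position, so the mistake compiles whenever a no-argument constructor exists and fails to compile when it does not.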

Re: Compile error when compiling for cloudera

2014-07-17 Thread Ted Malaska
Don't make this change yet. I have a 1642 that needs to get through around the same code. I can make this change after 1642 is through. On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen so...@cloudera.com wrote: CC tmalaska since he touched the line in question. This is a fun one. So, here's the

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Chester Chen
OK, I will create a PR. Thanks. On Thu, Jul 17, 2014 at 7:58 AM, Sean Owen so...@cloudera.com wrote: Looks like a real problem. I see it too. I think the same workaround found in ClientBase.scala needs to be used here. There, the fact that this field can be a String or String[] is handled

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
Should be an easy rebase for your PR, so I went ahead just to get this fixed up: https://github.com/apache/spark/pull/1466 On Thu, Jul 17, 2014 at 5:32 PM, Ted Malaska ted.mala...@cloudera.com wrote: Don't make this change yet. I have a 1642 that needs to get through around the same code. I

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-17 Thread Nicholas Chammas
On Thu, Jul 17, 2014 at 1:23 AM, Stephen Haberman stephen.haber...@gmail.com wrote: I'd be ecstatic if more major changes were this well/succinctly explained Ditto on that. The summary of user impact was very nice. It would be good to repeat that on the user list or release notes when this

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Xiangrui Meng
I start the voting with a +1. Ran tests on the release candidates and some basic operations in spark-shell and pyspark (local and standalone). -Xiangrui On Thu, Jul 17, 2014 at 3:16 AM, Xiangrui Meng men...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark

Current way to include hive in a build

2014-07-17 Thread Stephen Boesch
Having looked at make-distribution.sh on trunk, the --with-hive and --with-yarn options are now deprecated. Here is the way I have built it: Added to pom.xml: <profile> <id>cdh5</id> <activation> <activeByDefault>false</activeByDefault> </activation> <properties>

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-17 Thread Jeremy Freeman
Hi all, Cool discussion! I agree that a more standardized API for clustering, and easy access to underlying routines, would be useful (we've also been discussing this when trying to develop streaming clustering algorithms, similar to https://github.com/apache/spark/pull/1361) For divisive,

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread DB Tsai
+1 Tested with my Ubuntu Linux. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Jul 17, 2014 at 6:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac, verified

preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Will Benton
Hi all, What's the preferred environment for generating golden test outputs for new Hive tests? In particular: * what Hadoop version and Hive version should I be using, * are there particular distributions people have run successfully, and * are there any system properties or environment

Re: [VOTE] Release Apache Spark 0.9.2 (RC1)

2014-07-17 Thread Reynold Xin
+1 On Thursday, July 17, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: +1 Tested on Mac, verified CHANGES.txt is good, verified several of the bug fixes. Matei On Jul 17, 2014, at 11:12 AM, Xiangrui Meng men...@gmail.com wrote: I start the voting with a +1. Ran

Re: preferred Hive/Hadoop environment for generating golden test outputs

2014-07-17 Thread Zongheng Yang
Hi Will, These three environment variables are needed [1]. I have had success with Hive 0.12 and Hadoop 1.0.4. For Hive, getting the source distribution seems to be required. Docs contribution will be much appreciated! [1]

Re: Current way to include hive in a build

2014-07-17 Thread Patrick Wendell
Hey Stephen, The only change to the build was that we ask users to run -Phive and -Pyarn instead of --with-hive and --with-yarn (which internally just set -Phive and -Pyarn). I don't think this should affect the dependency graph. Just to test this, what happens if you run *without* the CDH profile and