Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-26 Thread Mark Hamstra
, 2014 at 10:54 AM, Mark Hamstra m...@clearstorydata.com wrote: Evan, Have you actually tried to build Spark using its POM file and sbt-pom-reader? I just made a first, naive attempt, and I'm still sorting through just what this did and didn't produce. It looks like the basic jar files

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-02-28 Thread Mark Hamstra
Couple of comments: 1) Whether the Spark POM is produced by SBT or Maven shouldn't matter for those who just need to link against published artifacts, but right now SBT and Maven do not produce equivalent POMs for Spark -- I think. 2) Incremental builds using Maven are trivially more difficult

Re: [GitHub] spark pull request: MLI-2: Start adding k-fold cross validation to...

2014-03-03 Thread Mark Hamstra
'to' is an exception to the usual rule, so (1 to folds).map {... } would be the best form. On Mon, Mar 3, 2014 at 1:02 AM, holdenk g...@git.apache.org wrote: Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/18#discussion_r10203849
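The rule of thumb in this reply -- prefer `(1 to folds).map { ... }` -- can be sketched with plain Scala collections; `kFoldSplits` is a hypothetical helper name here, and no Spark or MLI API is assumed:

```scala
// Sketch of k-fold index assignment using plain Scala collections.
// `kFoldSplits` is a hypothetical name, not part of Spark or MLI.
def kFoldSplits(n: Int, folds: Int): Seq[(Seq[Int], Seq[Int])] =
  (1 to folds).map { k =>
    // Indices congruent to k - 1 (mod folds) form the held-out fold.
    val (test, train) = (0 until n).partition(i => i % folds == k - 1)
    (train, test)
  }

val splits = kFoldSplits(10, 5)
// Every fold holds out 2 of the 10 indices; train and test never overlap.
```

Each of the `folds` pairs covers all `n` indices exactly once between its train and test halves.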

Re: sbt-package-bin

2014-04-01 Thread Mark Hamstra
A basic Debian package can already be created from the Maven build: mvn -Pdeb ... On Tue, Apr 1, 2014 at 11:24 AM, Evan Chan e...@ooyala.com wrote: Also, I understand this is the last week / merge window for 1.0, so if folks are interested I'd like to get in a PR quickly. thanks, Evan

Re: sbt-package-bin

2014-04-01 Thread Mark Hamstra
wrote: Ya there is already some fragmentation here. Maven has some dist targets and there is also ./make-distribution.sh. On Tue, Apr 1, 2014 at 11:31 AM, Mark Hamstra m...@clearstorydata.com wrote: A basic Debian package can already be created from the Maven build: mvn

Re: sbt-package-bin

2014-04-01 Thread Mark Hamstra
...or at least you could do that if the Maven build wasn't broken right now. On Tue, Apr 1, 2014 at 6:01 PM, Mark Hamstra m...@clearstorydata.com wrote: What the ... is kind of depends on what you're trying to accomplish. You could be setting Hadoop version and other stuff

Re: sbt-package-bin

2014-04-01 Thread Mark Hamstra
Whoops! Looks like it was just my brain that was broken. On Tue, Apr 1, 2014 at 6:03 PM, Mark Hamstra m...@clearstorydata.com wrote: ...or at least you could do that if the Maven build wasn't broken right now. On Tue, Apr 1, 2014 at 6:01 PM, Mark Hamstra m...@clearstorydata.com wrote

Any ideas on SPARK-1021?

2014-05-12 Thread Mark Hamstra
I'm trying to decide whether attacking the underlying issue of RangePartitioner running eager jobs in rangeBounds (i.e. SPARK-1021) is a better option than a messy workaround for some async job-handling stuff that I am working on. It looks like there have been a couple of aborted attempts to

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-13 Thread Mark Hamstra
There were a few early/test RCs this cycle that were never put to a vote. On Tue, May 13, 2014 at 8:07 AM, Nan Zhu zhunanmcg...@gmail.com wrote: just curious, where is rc4 VOTE? I searched my gmail but didn't find that? On Tue, May 13, 2014 at 9:49 AM, Sean Owen so...@cloudera.com

Re: Scala examples for Spark do not work as written in documentation

2014-05-16 Thread Mark Hamstra
Sorry, looks like an extra line got inserted in there. One more try: val count = spark.parallelize(1 to NUM_SAMPLES).map { _ => val x = Math.random(); val y = Math.random(); if (x*x + y*y < 1) 1 else 0 }.reduce(_ + _) On Fri, May 16, 2014 at 12:36 PM, Mark Hamstra m
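Restored, the snippet is the standard Monte Carlo estimate of Pi. A self-contained version follows, with a plain Scala range standing in for `spark.parallelize` so it runs without a SparkContext:

```scala
// Monte Carlo estimate of Pi, as in the snippet above, but with a plain
// Scala range in place of spark.parallelize (no SparkContext required).
val NUM_SAMPLES = 100000
val count = (1 to NUM_SAMPLES).map { _ =>
  val x = Math.random()
  val y = Math.random()
  // 1 if the random point falls inside the unit quarter-circle, else 0
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
val piEstimate = 4.0 * count / NUM_SAMPLES  // area ratio * 4 approximates Pi
```

With 100,000 samples the estimate typically lands within a few hundredths of Pi.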

Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Mark Hamstra
Sorry for the duplication, but I think this is the current VOTE candidate -- we're not voting on rc8 yet? +1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-16 Thread Mark Hamstra
+1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high priority relative to 1.1.0.

Re: [VOTE] Release Apache Spark 1.0.0 (rc8)

2014-05-16 Thread Mark Hamstra
+1 On Fri, May 16, 2014 at 2:16 AM, Patrick Wendell pwend...@gmail.com wrote: [Due to ASF e-mail outage, I'm not sure if anyone will actually receive this.] Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has only minor changes on top of rc7. The tag to be

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread. On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra m...@clearstorydata.com wrote: +1, but just barely. We've got quite

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
opinion known but left it to the wisdom of larger group of committers to decide ... I did not think it was critical enough to do a binding -1 on. Regards Mridul On 17-May-2014 9:43 pm, Mark Hamstra m...@clearstorydata.com wrote: Which of the unresolved bugs in spark-core do you think

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Mark Hamstra
, then I'm listening. On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan mri...@gmail.com wrote: On 17-May-2014 11:40 pm, Mark Hamstra m...@clearstorydata.com wrote: That is a past issue that we don't need to be re-opening now. The present Huh ? If we need to revisit based on changed

Re: spark 1.0 standalone application

2014-05-19 Thread Mark Hamstra
That's the crude way to do it. If you run `sbt/sbt publishLocal`, then you can resolve the artifact from your local cache in the same way that you would resolve it if it were deployed to a remote cache. That's just the build step. Actually running the application will require the necessary jars

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Mark Hamstra
That's all very old functionality in Spark terms, so it shouldn't have anything to do with your installation being out-of-date. There is also no need to cast as long as the relevant implicit conversions are in scope: import org.apache.spark.SparkContext._ On Tue, May 20, 2014 at 1:00 PM,

Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-21 Thread Mark Hamstra
+1 On Tue, May 20, 2014 at 11:09 PM, Henry Saputra henry.sapu...@gmail.com wrote: Signature and hash for source looks good No external executable package with source - good Compiled with git and maven - good Ran examples and sample programs locally and standalone - good +1 - Henry On

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-27 Thread Mark Hamstra
+1 On Tue, May 27, 2014 at 9:26 AM, Ankur Dave ankurd...@gmail.com wrote: 0 OK, I withdraw my downvote. Ankur http://www.ankurdave.com/

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-05 Thread Mark Hamstra
+1 On Fri, Jul 4, 2014 at 12:40 PM, Patrick Wendell pwend...@gmail.com wrote: I'll start the voting with a +1 - ran tests on the release candidate and ran some basic programs. RC1 passed our performance regression suite, and there are no major changes from that RC. On Fri, Jul 4, 2014 at

Re: ExecutorState.LOADING?

2014-07-09 Thread Mark Hamstra
) by the same Mr. Zaharia: https://github.com/apache/spark/commit/bb1bce79240da22c2677d9f8159683cdf73158c2#diff-776a630ac2b2ec5fe85c07ca20a58fc0 So I'd say it's safe to delete it. On Wed, Jul 9, 2014 at 2:36 PM, Mark Hamstra m...@clearstorydata.com wrote: Doesn't look to me like this is used

Re: Master compilation with sbt

2014-07-19 Thread Mark Hamstra
project mllib . . . clean . . . compile . . . test ...all works fine for me @2a732110d46712c535b75dd4f5a73761b6463aa8 On Sat, Jul 19, 2014 at 11:10 AM, Debasish Das debasish.da...@gmail.com wrote: I am at the reservoir sampling commit: commit 586e716e47305cd7c2c3ff35c0e828b63ef2f6a8

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Sure, drop() would be useful, but breaking the "transformations are lazy; only actions launch jobs" model is abhorrent -- which is not to say that we haven't already broken that model for useful operations (cf. RangePartitioner, which is used for sorted RDDs), but rather that each such exception to

Re: RFC: Supporting the Scala drop Method for Spark RDDs

2014-07-21 Thread Mark Hamstra
Rather than embrace non-lazy transformations and add more of them, I'd rather we 1) try to fully characterize the needs that are driving their creation/usage; and 2) design and implement new Spark abstractions that will allow us to meet those needs and eliminate existing non-lazy transformations.

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Mark Hamstra
Where and how is that fork being maintained? I'm not seeing an obviously correct branch or tag in the main asf hive repo github mirror. On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend...@gmail.com wrote: It would be great if the hive team can fix that issue. If not, we'll have to

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Mark Hamstra
. - Patrick On Mon, Jul 28, 2014 at 10:02 AM, Mark Hamstra m...@clearstorydata.com wrote: Where and how is that fork being maintained? I'm not seeing an obviously correct branch or tag in the main asf hive repo github mirror. On Mon, Jul 28, 2014 at 9:55 AM, Patrick Wendell pwend

JIRA content request

2014-07-29 Thread Mark Hamstra
Of late, I've been coming across quite a few pull requests and associated JIRA issues that contain nothing indicating their purpose beyond a pretty minimal description of what the pull request does. On the pull request itself, a reference to the corresponding JIRA in the title combined with a

Re: Workflow Scheduler for Spark

2014-09-17 Thread Mark Hamstra
See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com wrote: I have problems using Oozie.

Re: Moving PR Builder to mvn

2014-10-24 Thread Mark Hamstra
Yours are in the same ballpark as mine, where Maven builds with zinc take about 1.4x the time of SBT builds. On Fri, Oct 24, 2014 at 4:24 PM, Sean Owen so...@cloudera.com wrote: Here's a crude benchmark on a Linux box (GCE n1-standard-4). zinc gets the assembly build in range of SBT's

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Mark Hamstra
+1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 on this proposal. On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Will these maintainers have a cleanup for those pending PRs upon we start to apply this model? I

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
The console mode of sbt (just run sbt/sbt and then a long running console session is started that will accept further commands) is great for building individual subprojects or running single test suites. In addition to being faster since it's a long running JVM, it's got a lot of nice

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
Ok, strictly speaking, that's equivalent to your second class of examples, development console, not the first sbt console On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra m...@clearstorydata.com wrote: The console mode of sbt (just run sbt/sbt and then a long running console session is started

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
More or less correct, but I'd add that there are an awful lot of software systems out there that use Maven. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as "legacy", Michael :) What that

Re: Spurious test failures, testing best practices

2014-11-30 Thread Mark Hamstra
- Start the SBT interactive console with sbt/sbt
- Build your assembly by running the assembly target in the assembly project: assembly/assembly
- Run all the tests in one module: core/test
- Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab

Re: drop table if exists throws exception

2014-12-05 Thread Mark Hamstra
And that is no different from how Hive has worked for a long time. On Fri, Dec 5, 2014 at 11:42 AM, Michael Armbrust mich...@databricks.com wrote: The command run fine for me on master. Note that Hive does print an exception in the logs, but that exception does not propagate to user code.

Re: Adding RDD function to segment an RDD (like substring)

2014-12-09 Thread Mark Hamstra
`zipWithIndex` is both compute intensive and breaks Spark's "transformations are lazy" model, so it is probably not appropriate to add this to the public RDD API. If `zipWithIndex` weren't already what I consider to be broken, I'd be much friendlier to building something more on top of it, but I

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Mark Hamstra
+1 On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen rosenvi...@gmail.com wrote: +1. Tested using spark-perf and the Spark EC2 scripts. I didn’t notice any performance regressions that could not be attributed to changes of default configurations. To be more specific, when running Spark 1.2.0

Re: What RDD transformations trigger computations?

2014-12-18 Thread Mark Hamstra
SPARK-2992 is a good start, but it's not exhaustive. For example, zipWithIndex is also an eager transformation, and we occasionally see PRs suggesting additional eager transformations. On Thu, Dec 18, 2014 at 12:14 PM, Reynold Xin r...@databricks.com wrote: Alessandro was probably referring to

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Mark Hamstra
In master, Reynold has already taken care of moving Row into org.apache.spark.sql; so, even though the implementation of Row (and GenericRow et al.) is in Catalyst (which is more optimizer than parser), that needn't be of concern to users of the API in its most recent state. On Tue, Jan 27, 2015

Re: Job priority

2015-01-11 Thread Mark Hamstra
policy as can be: a priority queue. Alex On Sat, Jan 10, 2015 at 5:00 PM, Mark Hamstra m...@clearstorydata.com wrote: -dev, +user http://spark.apache.org/docs/latest/job-scheduling.html On Sat, Jan 10, 2015 at 4:40 PM, Alessandro Baretta alexbare...@gmail.com wrote: Is it possible

Re: Keep or remove Debian packaging in Spark?

2015-02-09 Thread Mark Hamstra
"it sounds like nobody intends these to be used to actually deploy Spark" I wouldn't go quite that far. What we have now can serve as useful input to a deployment tool like Chef, but the user is then going to need to add some customization or configuration within the context of that tooling to

Re: What is the meaning to of 'STATE' in a worker/ an executor?

2015-03-29 Thread Mark Hamstra
A LOADING Executor is on the way to RUNNING, but hasn't yet been registered with the Master, so it isn't quite ready to do useful work. On Mar 29, 2015, at 9:09 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, I have noticed in the Spark UI, workers and executors run on several

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Mark Hamstra
, but I haven't run it to ground yet.) On Mon, Feb 23, 2015 at 12:18 PM, Michael Armbrust mich...@databricks.com wrote: On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com wrote: So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too good to me. Working from f97b0d4, the example build command works: mvn -Pyarn

Re: Should we let everyone set Assignee?

2015-04-22 Thread Mark Hamstra
Agreed. The Spark project and community that Vinod describes do not resemble the ones with which I am familiar. On Wed, Apr 22, 2015 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote: Hi Vinod, Thanks for you thoughts - However, I do not agree with your sentiment and implications. Spark

Re: Speeding up Spark build during development

2015-05-03 Thread Mark Hamstra
https://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn On Sun, May 3, 2015 at 2:54 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: This is great. I didn't know about the mvn script in the build directory. Pramod On Fri, May 1, 2015 at 9:51 AM, York, Brennon

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-12 Thread Mark Hamstra
+1 On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 3e83913):

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
+1 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):

Re: [VOTE] Release Apache Spark 1.4.0 (RC4)

2015-06-06 Thread Mark Hamstra
+1 On Tue, Jun 2, 2015 at 8:53 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc3 (commit 22596c5): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=

Re: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Mark Hamstra
This discussion belongs on the dev list. Please post any replies there. On Sat, May 23, 2015 at 10:19 PM, Cheolsoo Park piaozhe...@gmail.com wrote: Hi, I've been testing SparkSQL in 1.4 rc and found two issues. I wanted to confirm whether these are bugs or not before opening a jira. *1)*

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Mark Hamstra
Please keep in mind that you are also ASF people, as is the entire Spark community (users and all)[4]. Phrasing things in terms of "us" and "them" by drawing a distinction on "[they] get in a fight on our mailing list" is not helpful. <whine>But they started it!</whine> A bit more seriously, my

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Mark Hamstra
Should 1.5.2 wait for Josh's fix of SPARK-11293? On Sun, Oct 25, 2015 at 2:25 PM, Sean Owen wrote: > The signatures and licenses are fine. I continue to get failures in > these tests though, with "-Pyarn -Phadoop-2.6 -Phive > -Phive-thriftserver" on Ubuntu 15 / Java 7. > > -

Re: Ready to talk about Spark 2.0?

2015-11-08 Thread Mark Hamstra
Yes, that's clearer -- at least to me. But before going any further, let me note that we are already sliding past Sean's opening question of "Should we start talking about Spark 2.0?" to actually start talking about Spark 2.0. I'll try to keep the rest of this post at a higher- or meta-level in

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
The place of the RDD API in 2.0 is also something I've been wondering about. I think it may be going too far to deprecate it, but changing emphasis is something that we might consider. The RDD API came well before DataFrames and DataSets, so programming guides, introductory how-to articles and

Re: A proposal for Spark 2.0

2015-11-12 Thread Mark Hamstra
n we know that the source relation > (/RDD) is already partitioned on the grouping expressions. AFAIK the spark > sql still does not allow that knowledge to be applied to the optimizer - so > a full shuffle will be performed. However in the native RDD we can use > preservesPartitioning=tru

Re: A proposal for Spark 2.0

2015-11-13 Thread Mark Hamstra
d in DF/DS. >> >> >> >> I mean, we need to think about what kind of RDD APIs we have to provide >> to developer, maybe the fundamental API is enough, like, the ShuffledRDD >> etc.. But PairRDDFunctions probably not in this category, as we can do the >> sam

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Mark Hamstra
FiloDB is also closely related. https://github.com/tuplejump/FiloDB On Mon, Nov 16, 2015 at 12:24 AM, Nick Pentreath wrote: > Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop > input/output format support: >

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-11 Thread Mark Hamstra
For more than a small number of files, you'd be better off using SparkContext#union instead of RDD#union. That will avoid building up a lengthy lineage. On Wed, Nov 11, 2015 at 10:21 AM, Jakob Odersky wrote: > Hey Jeff, > Do you mean reading from multiple text files? In

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
Really, Sandy? "Extra consideration" even for already-deprecated API? If we're not going to remove these with a major version change, then just when will we remove them? On Tue, Nov 10, 2015 at 4:53 PM, Sandy Ryza wrote: > Another +1 to Reynold's proposal. > > Maybe

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
I'm liking the way this is shaping up, and I'd summarize it this way (let me know if I'm misunderstanding or misrepresenting anything):
- New features are not at all the focus of Spark 2.0 -- in fact, a release with no new features might even be best.
- Remove deprecated API that we

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
think we are in agreement, although I wouldn't go to the extreme and say > "a release with no new features might even be best." > > Can you elaborate "anticipatory changes"? A concrete example or so would > be helpful. > > On Tue, Nov 10, 2015 at 5:19 PM, Mark Ham

Re: A proposal for Spark 2.0

2015-11-10 Thread Mark Hamstra
at 7:04 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > Heh... ok, I was intentionally pushing those bullet points to be extreme > to find where people would start pushing back, and I'll agree that we do > probably want some new features in 2.0 -- but I think we've got good >

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Mark Hamstra
+1 On Tue, Nov 3, 2015 at 3:22 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a > majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release

Re: State of the Build

2015-11-05 Thread Mark Hamstra
There was a lot of discussion that preceded our arriving at this statement in the Spark documentation: "Maven is the official build tool recommended for packaging Spark, and is the build of reference." https://spark.apache.org/docs/latest/building-spark.html#building-with-sbt I'm not aware of

Re: [VOTE] Release Apache Spark 1.4.1 (RC3)

2015-07-08 Thread Mark Hamstra
HiveSparkSubmitSuite is fine for me, but I do see the same issue with DataFrameStatSuite -- OSX 10.10.4, java 1.7.0_75, -Phive -Phive-thriftserver -Phadoop-2.4 -Pyarn On Wed, Jul 8, 2015 at 4:18 AM, Sean Owen so...@cloudera.com wrote: The POM issue is resolved and the build succeeds. The

Re: [VOTE] Release Apache Spark 1.4.1 (RC4)

2015-07-09 Thread Mark Hamstra
+1 On Wed, Jul 8, 2015 at 10:55 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark 1.4.0, listed here: http://s.apache.org/spark-1.4.1 The tag to be voted on is

Re: New Spark json endpoints

2015-09-17 Thread Mark Hamstra
While we're at it, adding endpoints that get results by jobGroup (cf. SparkContext#setJobGroup) instead of just for a single Job would also be very useful to some of us. On Thu, Sep 17, 2015 at 7:30 AM, Imran Rashid wrote: > Hi Kevin, > > I think it would be great if you

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
/spark/spark-core_2.10/ > > Mark Hamstra <m...@clearstorydata.com> wrote on Tuesday, September 22, 2015 at 12:55 PM: > >> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The >> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release >> candidates and then

Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Mark Hamstra
There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released. The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release candidates and then the 1.5.1 release. On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang wrote: > I'd like to use some important bug fixes

Re: Quick question regarding Maven and Spark Assembly jar

2015-12-03 Thread Mark Hamstra
Try to read this before Marcelo gets to you. https://issues.apache.org/jira/browse/SPARK-11157 On Thu, Dec 3, 2015 at 5:27 PM, Matt Cheah wrote: > Hi everyone, > > A very brief question out of curiosity – is there any particular reason > why we don’t publish the Spark

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Mark Hamstra
+1 On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Saturday, December 19, 2015 at 18:00 UTC and > passes if a majority of at least 3 +1 PMC votes are

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Mark Hamstra
I'm afraid you're correct, Krishna: core/src/main/scala/org/apache/spark/package.scala: val SPARK_VERSION = "1.6.0-SNAPSHOT" docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar wrote: > Guys, >The sc.version gives

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-12 Thread Mark Hamstra
+1 On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes > if a majority of at least 3 +1 PMC votes are

Re: RDD[Vector] Immutability issue

2015-12-29 Thread Mark Hamstra
You can, but you shouldn't. Using backdoors to mutate the data in an RDD is a good way to produce confusing and inconsistent results when, e.g., an RDD's lineage needs to be recomputed or a Task is resubmitted on fetch failure. On Tue, Dec 29, 2015 at 11:24 AM, ai he wrote:

Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Mark Hamstra
+1 On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.0! > > The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes > if a majority of at least 3 +1 PMC votes are

Re: A proposal for Spark 2.0

2015-11-18 Thread Mark Hamstra
; APIs but can't move to Spark 2.0 because of the backwards incompatible > changes, like removal of deprecated APIs, Scala 2.11 etc. > > Kostas > > > On Fri, Nov 13, 2015 at 12:26 PM, Mark Hamstra <m...@clearstorydata.com> > wrote: > >> Why does stabili

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
make this consistent. > > But, I think the resolution is simple: it's not 'dangerous' to release > this and I don't think people who say they think this really do. So > just finish this release normally, and we're done. Even if you think > there's an argument against it, weig

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-06 Thread Mark Hamstra
This is not a Databricks vs. The World situation, and the fact that some persist in forcing every issue into that frame is getting annoying. There are good engineering and project-management reasons not to populate the long-term, canonical repository of Maven artifacts with what are known to be

Re: Spark 2.0.0-preview artifacts still not available in Maven

2016-06-03 Thread Mark Hamstra
It's not a question of whether the preview artifacts can be made available on Maven central, but rather whether they must be or should be. I've got no problems leaving these unstable, transitory artifacts out of the more permanent, canonical repository. On Fri, Jun 3, 2016 at 1:53 AM, Steve

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
SPARK-15893 is resolved as a duplicate of SPARK-15899. SPARK-15899 is Unresolved. On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander wrote: > -1 > > Spark Unit tests fail on Windows. Still not resolved, though marked as > resolved. > >

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
It's also marked as Minor, not Blocker. On Wed, Jun 22, 2016 at 4:07 PM, Marcelo Vanzin wrote: > On Wed, Jun 22, 2016 at 4:04 PM, Ulanov, Alexander > wrote: > > -1 > > > > Spark Unit tests fail on Windows. Still not resolved, though marked as > >

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Mark Hamstra
e are people that develop for Spark on Windows. The > referenced issue is indeed Minor and has nothing to do with unit tests. > > > > *From:* Mark Hamstra [mailto:m...@clearstorydata.com] > *Sent:* Wednesday, June 22, 2016 4:09 PM > *To:* Marcelo Vanzin <van...@cloudera.com> &

Re: Spark 2.0.0 release plan

2016-01-29 Thread Mark Hamstra
https://github.com/apache/spark/pull/10608 On Fri, Jan 29, 2016 at 11:50 AM, Jakob Odersky wrote: > I'm not an authoritative source but I think it is indeed the plan to > move the default build to 2.11. > > See this discussion for more detail > >

Re: Latency due to driver fetching sizes of output statuses

2016-01-23 Thread Mark Hamstra
Do all of those thousands of Stages end up being actual Stages that need to be computed, or are the vast majority of them eventually "skipped" Stages? If the latter, then there is the potential to modify the DAGScheduler to avoid much of this behavior:

Re: Dataframe Partitioning

2016-03-01 Thread Mark Hamstra
I don't entirely agree. You're best off picking the right size :). That's almost impossible, though, since at the input end of the query processing you often want a large number of partitions to get sufficient parallelism for both performance and to avoid spilling or OOM, while at the output end

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-06 Thread Mark Hamstra
d about we can cleanup those warnings >>> once we get there. >>> >>> On Fri, Apr 1, 2016 at 10:00 PM, Raymond Honderdors < >>> raymond.honderd...@sizmek.com> wrote: >>> >>>> What about a seperate branch for scala 2.10? >>>>

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-06 Thread Mark Hamstra
I agree with your general logic and understanding of semver. That is why if we are going to violate the strictures of semver, I'd only be happy doing so if support for Java 7 and/or Scala 2.10 were clearly understood to be deprecated already in the 2.0.0 release -- i.e. from the outset not to be

Re: Executor shutdown hooks?

2016-04-06 Thread Mark Hamstra
Why would the Executors shutdown when the Job is terminated? Executors are bound to Applications, not Jobs. Furthermore, unless spark.job.interruptOnCancel is set to true, canceling the Job at the Application and DAGScheduler level won't actually interrupt the Tasks running on the Executors. If

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-04-11 Thread Mark Hamstra
r 6, 2016 at 2:57 PM, Mark Hamstra <m...@clearstorydata.com> > wrote: > > ... My concern is that either of those options will take more resources >> than some Spark users will have available in the ~3 months remaining before >> Spark 2.0.0, which will cause fragmentation

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Mark Hamstra
It's a pain in the ass. Especially if some of your transitive dependencies never upgraded from 2.10 to 2.11. On Thu, Mar 24, 2016 at 4:50 PM, Reynold Xin wrote: > If you want to go down that route, you should also ask somebody who has > had experience managing a large

Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Mark Hamstra
There aren't many such libraries, but there are a few. When faced with one of those dependencies that still doesn't go beyond 2.10, you essentially have the choice of taking on the maintenance burden to bring the library up to date, or you do what is potentially a fairly large refactoring to use

Re: [VOTE] Release Apache Spark 1.6.1 (RC1)

2016-03-02 Thread Mark Hamstra
+1 On Wed, Mar 2, 2016 at 2:45 PM, Michael Armbrust wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.1! > > The vote is open until Saturday, March 5, 2016 at 20:00 UTC and passes if > a majority of at least 3+1 PMC votes are cast. >

Re: Dynamic allocation availability on standalone mode. Misleading doc.

2016-03-07 Thread Mark Hamstra
Yes, it works in standalone mode. On Mon, Mar 7, 2016 at 4:25 PM, Eugene Morozov wrote: > Hi, the feature looks like the one I'd like to use, but there are two > different descriptions in the docs of whether it's available. > > I'm on a standalone deployment mode and

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
Dropping Scala 2.10 support has to happen at some point, so I'm not fundamentally opposed to the idea; but I've got questions about how we go about making the change and what degree of negative consequences we are willing to accept. Until now, we have been saying that 2.10 support will be

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
t for all of spark 2.x >> >> Regarding Koert's comment on akka, I thought all akka dependencies >> have been removed from spark after SPARK-7997 and the recent removal >> of external/akka >> >> On Wed, Mar 30, 2016 at 9:36 AM, Mark Hamstra <m...@clearstorydat

Re: Discuss: commit to Scala 2.10 support for Spark 2.x lifecycle

2016-03-30 Thread Mark Hamstra
ith Mark in that I don't see how supporting scala 2.10 for >> spark 2.0 implies supporting it for all of spark 2.x >> >> Regarding Koert's comment on akka, I thought all akka dependencies >> have been removed from spark after SPARK-7997 and the recent remo

Re: Coding style question (about extra anonymous closure within functional transformations)

2016-04-14 Thread Mark Hamstra
I don't believe the Scala compiler understands the difference between your two examples the same way that you do. Looking at a few similar cases, I've only found the bytecode produced to be the same regardless of which style is used. On Wed, Apr 13, 2016 at 7:46 PM, Hyukjin Kwon
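The two styles at issue can be compared on plain collections (a List standing in for an RDD); both forms produce the same result, consistent with the observation above that the compiler handles them equivalently:

```scala
// The two styles from the thread: an extra anonymous closure vs. passing
// the method directly. A plain List stands in for an RDD here.
def double(x: Int): Int = x * 2

val xs = List(1, 2, 3)
val viaClosure     = xs.map { x => double(x) }  // explicit anonymous closure
val viaMethodValue = xs.map(double)             // eta-expanded method value
// Both yield List(2, 4, 6).
```

Whether the emitted bytecode is byte-for-byte identical can depend on the compiler version, but the call sites are semantically the same.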

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
Yes, replicated and distributed shuffle materializations are a key requirement to maintain performance in a fully elastic cluster where Executors aren't just reallocated across an essentially fixed number of Worker nodes, but rather the number of Workers itself is dynamic. Retaining the file

Re: HDFS as Shuffle Service

2016-04-28 Thread Mark Hamstra
after a work-load burst your cluster dynamically changes from 1 > workers to 1000, will the typical HDFS replication factor be sufficient to > retain access to the shuffle files in HDFS > > HDFS isn't resizing. Spark is. HDFS files should be HA and durable. > > On Thu,
