RE: MIMA Compatibility Checks

2014-07-10 Thread Liu, Raymond
So how do I run the check locally? On the master tree, sbt mimaReportBinaryIssues seems to report a lot of errors. Do we need to modify SparkBuilder.scala etc. to run it locally? I could not figure out from the console output how Jenkins runs the check. Best Regards, Raymond Liu -Original

When inserting data into a table on Tachyon, how can I control the data position?

2014-07-10 Thread qingyang li
When inserting data (the data is small, so it will not be partitioned automatically) into a table that is on Tachyon, how can I control the data position? That is, how can I specify which machine the data should reside on? If we cannot control it, what is the data assignment strategy of Tachyon or Spark?

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
I went ahead and created JIRAs. JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429 JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430 Before submitting a PR for the standardized API, I want to implement a few clustering

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
Might be worth checking out scikit-learn and Mahout to get some broad ideas. On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote: I went ahead and created JIRAs. JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429 JIRA for

Feature selection interface

2014-07-10 Thread Ulanov, Alexander
Hi, I've implemented a class that does Chi-squared feature selection for RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, and other methods such as mutual information or information gain can easily be implemented. I would like to make a pull request. However, MLlib

Changes to sbt build have been merged

2014-07-10 Thread Patrick Wendell
Just a heads up: we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note for developers: going forward, the sbt build will use the same configuration style as the Maven build (-D for options and

Re: Changes to sbt build have been merged

2014-07-10 Thread Sandy Ryza
Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com wrote: Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA. One note here for developers, going

Re: Changes to sbt build have been merged

2014-07-10 Thread yao
Cool~ On Thu, Jul 10, 2014 at 1:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Woot! On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com wrote: Just a heads up, we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report

EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nicholas Chammas
Hi devs! Right now it takes a non-trivial amount of time to launch EC2 clusters. Part of this time is spent starting the EC2 instances, which is out of our control. Another part of this time is spent installing stuff on and configuring the instances. This, we can control. I’d like to explore

sparkSQL thread safe?

2014-07-10 Thread Ian O'Connell
Had a few quick questions... Just wondering whether Spark SQL is currently expected to be thread safe on master? Doing a simple Hadoop file -> RDD -> schema RDD -> write Parquet fails in reflection code if I run these in a thread pool. The SparkSqlSerializer seems to create a new Kryo instance

Re: sparkSQL thread safe?

2014-07-10 Thread Michael Armbrust
Hey Ian, Thanks for bringing these up! Responses in-line: Just wondering whether Spark SQL is currently expected to be thread safe on master? Doing a simple Hadoop file -> RDD -> schema RDD -> write Parquet fails in reflection code if I run these in a thread pool. You are probably hitting

RE: EC2 clusters ready in launch time + 30 seconds

2014-07-10 Thread Nate D'Amico
You are partially correct. It's not terribly complex, but also not easy to accomplish. Sounds like you want to manage some partially/fully baked AMIs with the core Spark libs and dependencies already on the image. Main issues that crop up are: 1) image sprawl, as libs/config/defaults/etc

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
-1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version built against 2.3.0-cdh5.0.2 on Mesos 0.18.2. All of our jobs with data above a few gigabytes

Re: [VOTE] Release Apache Spark 1.0.1 (RC2)

2014-07-10 Thread Gary Malouf
Just realized the deadline was Monday, my apologies. The issue nevertheless stands. On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf malouf.g...@gmail.com wrote: -1 I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility