RE: MIMA Compatibility Checks
So how do I run the check locally? On the master tree, sbt mimaReportBinaryIssues seems to report a lot of errors. Do we need to modify SparkBuild.scala etc. to run it locally? I could not figure out from the console output how Jenkins runs the check.

Best Regards,
Raymond Liu

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Monday, June 09, 2014 3:40 AM
To: dev@spark.apache.org
Subject: MIMA Compatibility Checks

Hey All,

Some people may have noticed PR failures due to binary compatibility checks. We've had these enabled in several of the sub-modules since the 0.9.0 release, but we've turned them on in Spark core post-1.0.0, which has much higher churn. The checks are based on the Migration Manager (MiMa) tool from Typesafe.

One issue is that the tool doesn't support package-private declarations of classes or methods. Prashant Sharma has built instrumentation that adds partial support for package-privacy (via a workaround), but since there isn't really native support for this in MiMa, we are still finding cases in which we trigger false positives. In the next week or two we'll make it a priority to handle more of these false-positive cases. In the meantime, users can add manual excludes to project/MimaExcludes.scala to avoid triggering warnings for certain issues.

This is definitely annoying - sorry about that. Unfortunately we are the first open source Scala project to ever do this, so we are dealing with uncharted territory. Longer term, I'd actually like to see us write our own sbt-based tool to do this in a better way (we've had trouble trying to extend MiMa itself; e.g., it has copy-pasted code in it from an old version of the Scala compiler). If someone in the community is a Scala fan and wants to take that on, I'm happy to give more details.

- Patrick
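For anyone who hasn't edited it before, an exclude entry in project/MimaExcludes.scala looks roughly like the sketch below. The MiMa filter API (ProblemFilters.exclude) is real; the org.apache.spark class and method names are placeholders, not actual Spark members.

import com.typesafe.tools.mima.core._

object MimaExcludes {
  // Placeholder entries illustrating the exclude format; substitute
  // the binary issue MiMa actually reports for your change.
  val excludes = Seq(
    // e.g. silence a false positive on a package-private method:
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.SomeClass.somePackagePrivateMethod"),
    // e.g. a removed class that was never part of the public API:
    ProblemFilters.exclude[MissingClassProblem](
      "org.apache.spark.util.SomeInternalHelper")
  )
}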
When inserting data into a table on Tachyon, how can I control the data placement?
When inserting data into a table that lives on Tachyon (the data is small, so it will not be partitioned automatically), how can I control where the data is placed? That is, how can I specify which machine the data should end up on? And if we cannot control it, what is the data placement strategy of Tachyon or Spark?
Re: Contributing to MLlib: Proposal for Clustering Algorithms
I went ahead and created JIRAs.

JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429
JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a few clustering algorithms for myself to get a good feel for how to structure such an API. If others with more experience want to dive into designing the API, though, that would allow us to get moving more quickly.

On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote:

Cool, seems like a good initiative. Adding a couple of extra high-quality clustering implementations will be great. I'd say it would make the most sense to submit a PR for the standardised API first, agree on that with everyone, and then build on it for the specific implementations.

— Sent from Mailbox

On Wed, Jul 9, 2014 at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote:

Thanks everyone for the input. So it seems what people want is:

* Implement MiniBatch KMeans and Hierarchical KMeans (divide-and-conquer approach; look at the DecisionTree implementation as a reference)
* Restructure the 3 k-means clustering algorithm implementations to prevent code duplication and conform to a consistent API where possible

If this is correct, I'll start work on that. How would it be best to structure it? Should I submit separate JIRAs/PRs for the refactoring of current code, MiniBatch KMeans, and Hierarchical KMeans, or keep my current JIRA and PR for MiniBatch KMeans open and submit a second JIRA and PR for Hierarchical KMeans that builds on top of that? Thanks!

On Tue, Jul 8, 2014 at 5:44 PM, Hector Yee hector@gmail.com wrote:

Yeah, if one were to replace the objective function in the decision tree with minimizing the variance of the leaf nodes, it would be a hierarchical clusterer.

On Tue, Jul 8, 2014 at 2:12 PM, Evan R. Sparks evan.spa...@gmail.com wrote:

If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass over the data, but fewer total passes and hopefully lower communication overheads than, say, shuffling around data that belongs to one cluster or another. Something like that could work here as well. I'm not super-familiar with hierarchical k-means, so perhaps there's a more efficient way to implement it, though.

On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee hector@gmail.com wrote:

No, I was thinking more top-down: assuming a distributed k-means system already exists, recursively apply the k-means algorithm to data already partitioned by the previous level of k-means. I haven't been much of a fan of bottom-up approaches like HAC, mainly because they assume there is already an item-to-item distance metric. This makes it hard to cluster new content. The distances between sibling clusters are also hard to compute (if you have thrown away the similarity matrix); do you count paths to the same parent node if you are computing distances between items in two adjacent nodes, for example? It is also a bit harder to distribute the computation for bottom-up approaches, as you have to already find the nearest neighbor to an item to begin the process.
On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote:

The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward

It's a bottom-up approach. The pair of clusters to merge is chosen to minimize variance. Their code is under a BSD license, so it can be used as a template. Is something like that what you were thinking, Hector?

On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Sure. The more interesting problem here is choosing k at each level. Kernel methods seem to be the most promising.

On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee hector@gmail.com wrote:

No idea, never looked it up. I've always just implemented it as running k-means again on each cluster. FWIW, standard k-means with Euclidean distance has problems too with some dimensionality reduction methods; swapping out the distance metric for negative dot product or cosine may help. Another useful kind of clustering would be hierarchical SVD. The reason I like hierarchical clustering is that it makes for faster inference, especially over billions of users.

On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:

Hector, could you share the references for hierarchical k-means? Thanks.

On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee hector@gmail.com wrote:

I would say for bigdata applications
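To make the top-down recursion concrete, here is a minimal sketch built on the existing MLlib KMeans. It is not a proposed implementation; the depth and cluster-size stopping rules are arbitrary choices, and the per-cluster filtering is deliberately naive.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Run k-means, then recurse on each cluster's subset of the data.
// Leaves are returned as a flat sequence of RDDs.
def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int): Seq[RDD[Vector]] = {
  if (depth == 0 || data.count() < k) {
    Seq(data)
  } else {
    val model = KMeans.train(data, k, 20) // 20 iterations per level
    (0 until k).flatMap { i =>
      // Filtering the parent RDD once per cluster is simple but makes
      // k passes per level; a real implementation would partition the
      // data by cluster assignment in a single pass.
      val subset = data.filter(v => model.predict(v) == i).cache()
      hierarchicalKMeans(subset, k, depth - 1)
    }
  }
}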
Re: Contributing to MLlib: Proposal for Clustering Algorithms
Might be worth checking out scikit-learn and Mahout to get some broad ideas.

— Sent from Mailbox

On Thu, Jul 10, 2014 at 4:25 PM, RJ Nowling rnowl...@gmail.com wrote:

I went ahead and created JIRAs.

JIRA for Hierarchical Clustering: https://issues.apache.org/jira/browse/SPARK-2429
JIRA for Standardized Clustering APIs: https://issues.apache.org/jira/browse/SPARK-2430

Before submitting a PR for the standardized API, I want to implement a few clustering algorithms for myself to get a good feel for how to structure such an API. If others with more experience want to dive into designing the API, though, that would allow us to get moving more quickly.

[...]
Feature selection interface
Hi,

I've implemented a class that does chi-squared feature selection for RDD[LabeledPoint]. It also computes basic class/feature occurrence statistics, and other methods like mutual information or information gain can be easily implemented. I would like to make a pull request. However, the MLlib master branch doesn't have any feature selection methods implemented, so I need to create a proper interface that my class will extend or mix in. It should be easy to use from a developer's and a user's perspective.

I was thinking that there should be a FeatureEvaluator that, for each feature from RDD[LabeledPoint], returns RDD[((featureIndex: Int, label: Double), value: Double)]. Then there should be a FeatureSelector that selects the top N features, or the top N features grouped by class, etc. And the simplest one, a FeatureFilter that filters the data based on a set of feature indices. Additionally, there should be an interface for FeatureEvaluators that don't use class labels, i.e. for RDD[Vector].

I am concerned that such a design looks rather disconnected, because there are three separate objects. In use, I would like to see something like val filteredData = Filter(data, ChiSquared(data).selectTop(100)). Any ideas or suggestions?

Best regards,
Alexander
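For discussion, one rough sketch of how the three pieces and the desired call site could fit together. All names and signatures here are illustrative only, transcribed from the description above, not an agreed API; a ChiSquared implementation would mix in both traits.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

trait FeatureEvaluator {
  // Score per (featureIndex, label) pair, e.g. a chi-squared statistic;
  // evaluators that ignore labels could instead take RDD[Vector].
  def evaluate(data: RDD[LabeledPoint]): RDD[((Int, Double), Double)]
}

trait FeatureSelector {
  // Pick feature indices from the evaluated scores, e.g. the top N
  // overall or the top N per class.
  def selectTop(n: Int): Set[Int]
}

// Filter(data, indices) keeps only the selected features in each point,
// so a call site can read: Filter(data, ChiSquared(data).selectTop(100)).
object Filter {
  def apply(data: RDD[LabeledPoint], indices: Set[Int]): RDD[LabeledPoint] = {
    val kept = indices.toSeq.sorted
    data.map { p =>
      LabeledPoint(p.label, Vectors.dense(kept.map(i => p.features(i)).toArray))
    }
  }
}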
Changes to sbt build have been merged
Just a heads up: we merged Prashant's work on having the sbt build read all dependencies from Maven. Please report any issues you find on the dev list or on JIRA.

One note for developers: going forward, the sbt build will use the same configuration style as the Maven build (-D for options and -P for Maven profiles). So this will be a change for developers:

sbt/sbt -Dhadoop.version=2.2.0 -Pyarn assembly

For now, we'll continue to support the old env-var options with a deprecation warning.

- Patrick
Re: Changes to sbt build have been merged
Woot!

On Thu, Jul 10, 2014 at 11:15 AM, Patrick Wendell patr...@databricks.com wrote: [...]
Re: Changes to sbt build have been merged
Cool~

On Thu, Jul 10, 2014 at 1:29 PM, Sandy Ryza sandy.r...@cloudera.com wrote: [...]
EC2 clusters ready in launch time + 30 seconds
Hi devs!

Right now it takes a non-trivial amount of time to launch EC2 clusters. Part of this time is spent starting the EC2 instances, which is out of our control. Another part of this time is spent installing stuff on and configuring the instances. This, we can control.

I'd like to explore approaches to upgrading spark-ec2 so that launching a cluster of any size generally takes only 30 seconds on top of the time to launch the base EC2 instances. Since Amazon can launch instances concurrently, I believe this means we should be able to launch a fully operational Spark cluster of any size in constant time.

Is that correct? Do we already have an idea of what it would take to get to that point?

Nick
sparkSQL thread safe?
Had a few quick questions...

Just wondering: is Spark SQL expected to be thread safe on master right now? Doing a simple hadoop file -> RDD -> schema RDD -> write parquet will fail in reflection code if I run these in a thread pool.

The SparkSqlSerializer seems to create a new Kryo instance each time it wants to serialize anything. I got a huge speedup whenever I had a non-primitive type in my SchemaRDD by using the ResourcePools from Chill to provide the KryoSerializer to it. (I can open an RB if there is some reason not to re-use them?)

With the Distinct Count operator there are no map-side operations, and a test to check for this. Is there any reason not to do a map-side combine into a set and then merge the sets later? (Similar to the approximate distinct count operator.)

===

Another thing while I'm mailing: the 1.0.1 docs have a section like:

// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.

Which sounds great; we have lots of data in thrift, so via scrooge (https://github.com/twitter/scrooge) we end up with, ultimately, instances of traits which implement Product. Though the reflection code appears to look for the constructor of the class and base the types on those parameters?

Ian.
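For context, the kind of instance reuse described above can be sketched with Chill's ResourcePool along these lines. The pool size and names are illustrative; this is a sketch, not the code from any actual PR.

import scala.reflect.ClassTag
import com.twitter.chill.ResourcePool
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoSerializer, SerializerInstance}

// A small pool of reusable serializer instances, so a new Kryo is not
// constructed on every serialization.
class SerializerPool(conf: SparkConf, size: Int)
    extends ResourcePool[SerializerInstance](size) {
  private val ser = new KryoSerializer(conf)
  def newInstance(): SerializerInstance = ser.newInstance()
}

// Borrow an instance, serialize, and always release it back to the pool.
def serialize[T: ClassTag](pool: SerializerPool, obj: T): Array[Byte] = {
  val instance = pool.borrow()
  try instance.serialize(obj).array()
  finally pool.release(instance)
}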
Re: sparkSQL thread safe?
Hey Ian,

Thanks for bringing these up! Responses in-line:

> Just wondering if right now spark sql is expected to be thread safe on master? [...]

You are probably hitting SPARK-2178 (https://issues.apache.org/jira/browse/SPARK-2178), which is caused by SI-6240 (https://issues.scala-lang.org/browse/SI-6240). We have a plan to fix this by moving the schema introspection to compile time, using macros.

> The SparkSqlSerializer seems to create a new Kryo instance each time it wants to serialize anything. [...] (I can open an RB if there is some reason not to re-use them?)

Sounds like SPARK-2102 (https://issues.apache.org/jira/browse/SPARK-2102). There is no reason AFAIK not to reuse the instance. A PR would be greatly appreciated!

> With the Distinct Count operator there are no map-side operations, and a test to check for this. Is there any reason not to do a map-side combine into a set and then merge the sets later? [...]

That's just not an optimization that we had implemented yet... but I've just done it here (https://github.com/apache/spark/pull/1366) and it'll be in master soon :)

> Another thing while I'm mailing: the 1.0.1 docs have a section like: [...] Though the reflection code appears to look for the constructor of the class and base the types on those parameters?

Yeah, it's true that we only look at the constructor at the moment, but I don't think there is a really good reason for that (other than, I guess, we will need to add code to make sure we skip built-in object methods). If you want to open a JIRA, we can try fixing this.

Michael
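On the 22-field note, the documented workaround looks roughly like the sketch below, with three fields standing in for a wider class. As discussed above, with the current reflection the typed fields must be constructor parameters to be picked up.

// Sketch of a non-case class implementing Product, as the docs suggest
// for records wider than 22 fields (only 3 fields shown for brevity).
class WideRecord(val name: String, val count: Int, val score: Double)
    extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => count
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}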
RE: EC2 clusters ready in launch time + 30 seconds
You are partially correct. It's not terribly complex, but also not easy to accomplish. Sounds like you want to manage some partially/fully baked AMIs with the core Spark libs and dependencies already on the image. The main issues that crop up are:

1) Image sprawl: as libs/config/defaults/etc. change, images need to be rebuilt and references updated.
2) Cross-region support (not too huge a deal now with the copy functionality, just more complex image management).

If you don't want to restrict which instance types/sizes one can use, you also have an uptick in image management complexity with:

3) Instance type (need both standard and HVM images).

We're starting to work through some automation/config stuff for the Spark stack on EC2 with a project. We will be focusing the work through the Apache Bigtop effort to start, and can then share with the Spark community directly as things progress, if people are interested.

Nate

-----Original Message-----
From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
Sent: Thursday, July 10, 2014 3:06 PM
To: dev
Subject: EC2 clusters ready in launch time + 30 seconds

[...]
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
-1

I honestly do not know the voting rules for the Spark community, so please excuse me if I am out of line or if Mesos compatibility is not a concern at this point. We just tried to run this version, built against 2.3.0-cdh5.0.2, on Mesos 0.18.2. All of our jobs with data above a few gigabytes hung indefinitely. Downgrading back to the 1.0.0 stable release of Spark, built the same way, worked for us.

On Mon, Jul 7, 2014 at 5:17 PM, Tom Graves tgraves...@yahoo.com.invalid wrote:

+1. Ran some Spark on YARN jobs on a Hadoop 2.4 cluster with authentication on.

Tom

On Friday, July 4, 2014 2:39 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.0.1!

The tag to be voted on is v1.0.1-rc1 (commit 7d1043c):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=7d1043c99303b87aef8ee19873629c2bfba4cc78

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1021/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.1-rc2-docs/

Please vote on releasing this package as Apache Spark 1.0.1! The vote is open until Monday, July 07, at 20:45 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

=== Differences from RC1 ===
This release includes only one blocking patch from rc1: https://github.com/apache/spark/pull/1255

There are also smaller fixes which came in over the last week.

=== About this release ===
This release fixes a few high-priority bugs in 1.0 and has a variety of smaller fixes. The full list is here: http://s.apache.org/b45. Some of the more visible patches are:

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
SPARK-2156 and SPARK-1112: Issues with jobs hanging due to Akka frame size
SPARK-1790: Support r3 instance types on EC2

This is the first maintenance release on the 1.0 line. We plan to make additional maintenance releases as new fixes come in.
Re: [VOTE] Release Apache Spark 1.0.1 (RC2)
Just realized the deadline was Monday, my apologies. The issue nevertheless stands.

On Thu, Jul 10, 2014 at 9:28 PM, Gary Malouf malouf.g...@gmail.com wrote: [...]