Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
Hi all, I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as: 1. Whether the record is good or bad (e.g., Either) 2. Originating file and lines Part of my motivation is to prevent errors with individual records from stopping the entire
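
A minimal sketch of the idea in this message, in plain Python rather than Spark: each parsed record carries a good/bad flag plus its originating file and line, so one bad line does not abort the whole run. The names TaggedRecord, parse_lines, and input.txt are hypothetical and used only for illustration. Integer parsing stands in for whatever per-record ETL logic applies.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class TaggedRecord:
    """A parsed value plus metadata (hypothetical type for illustration)."""
    source_file: str
    line_number: int
    is_good: bool
    value: Optional[int] = None
    error: Optional[str] = None


def parse_lines(source_file: str, lines: Iterable[str]) -> Iterator[TaggedRecord]:
    """Parse each line as an int; record failures instead of raising."""
    for line_number, line in enumerate(lines, start=1):
        try:
            parsed = int(line.strip())
        except ValueError as exc:
            # Keep the bad record, tagged with where it came from.
            yield TaggedRecord(source_file, line_number, False, error=str(exc))
        else:
            yield TaggedRecord(source_file, line_number, True, value=parsed)


records = list(parse_lines("input.txt", ["1", "oops", "3"]))
bad_records = [r for r in records if not r.is_good]
```

Bad records can then be filtered out or written to a separate location for inspection, since each one still knows its file and line.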

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
at 12:36 PM, Reynold Xin r...@databricks.com wrote: How about just using two fields, one boolean field to mark good/bad, and another to get the source file? On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I'm working on an ETL task with Spark. As part

Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
at 12:21 PM, RJ Nowling rnowl...@gmail.com wrote: That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to know from looking at a single key which doesn't fit my case, unfortunately. For my case, I need to compare successive pairs

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
prematurely.) On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: could you use a custom partitioner to preserve boundaries such that all related tuples end up on the same partition? On Jun 30, 2015, at 12:00 PM, RJ Nowling rnowl...@gmail.com wrote

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
by others? On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin r...@databricks.com wrote: Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have a RDD of elements
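
The iterator-in, iterator-out shape suggested here can be sketched in plain Python. The function group_runs and its same_run predicate are hypothetical names. A function of this shape is the kind of callable that PySpark's `RDD.mapPartitions` accepts, but this sketch only groups within a single iterator and does not address runs that span partition boundaries.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def group_runs(
    elements: Iterable[T],
    same_run: Callable[[T, T], bool],
) -> Iterator[List[T]]:
    """Yield lists of consecutive elements, comparing successive pairs."""
    run: List[T] = []
    for element in elements:
        # Start a new run when the previous element and this one do not belong together.
        if run and not same_run(run[-1], element):
            yield run
            run = []
        run.append(element)
    if run:
        yield run


# Example: group consecutive integers into runs.
runs = list(group_runs([1, 2, 3, 7, 8, 12], lambda a, b: b - a == 1))
```

Because the function consumes its input lazily and yields one run at a time, it holds only the current run in memory.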

Re: enum-like types in Spark

2015-03-11 Thread RJ Nowling
How do these proposals affect PySpark? I think compatibility with PySpark through Py4J should be considered. On Mon, Mar 9, 2015 at 8:39 PM, Patrick Wendell pwend...@gmail.com wrote: Does this matter for our own internal types in Spark? I don't think any of these types are designed to be used

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
-6382f8428b13fa6082fa688178f3dbcc On Wed, Jan 14, 2015 at 2:59 PM, RJ Nowling rnowl...@gmail.com wrote: Thanks, Sean. Yes, Spark is incorrectly copying the spark assembly jar to com/google/guava in the maven repository. This is for the 1.2.0 release, just to clarify. I reverted the patches that shade Guava

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
, instead of pom. On Wed, Jan 14, 2015 at 1:08 PM, RJ Nowling rnowl...@gmail.com wrote: Hi Sean, I confirmed that if I take the Spark 1.2.0 release (a428c446), undo the guava PR [1], and use -Dmaven.install.skip=false with the workflow above, the problem is fixed. RJ [1

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
is in com/google/guava? You can un-skip the install plugin with -Dmaven.install.skip=false On Wed, Jan 14, 2015 at 7:26 PM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I'm trying to upgrade some Spark RPMs from 1.1.0 to 1.2.0. As part of the RPM process, we build Spark with Maven

Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread RJ Nowling
Congratulations, Chris! I created a JIRA for dimensional RDDs that might be relevant: https://issues.apache.org/jira/browse/SPARK-4727 Jeremy Freeman pointed me to his lab's work for neuroscience, which has some related functionality: http://thefreemanlab.com/thunder/ On Wed, Jan 14, 2015 at

Re: Maintainer for Mesos

2015-01-08 Thread RJ Nowling
Hi Andrew, Patrick Wendell and Andrew Or have committed previous patches related to Mesos. Maybe they would be good committers to look at it? RJ On Mon, Jan 5, 2015 at 6:40 PM, Andrew Ash and...@andrewash.com wrote: Hi Spark devs, I'm interested in having a committer look at a PR [1] for

RDDs for dimensional (time series, spatial) data

2014-12-04 Thread RJ Nowling
that MLlib supports some operations for time series in 1.2.0-rc1, but I think that specialized RDDs could optimize the partitioning and algorithms better than a regular RDD. Or, for example, spatial data could be partitioned into a grid. Any feedback would be great! Thanks, RJ Nowling -- em rnowl

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread RJ Nowling
Matei, I saw that you're listed as a maintainer for ~6 different subcomponents, and on over half of those, you're only the 2nd person. My concern is that you would be stretched thin and maybe wouldn't be able to work as a back up on all of those subcomponents. Are you planning on adding more

Re: Surprising Spark SQL benchmark

2014-11-01 Thread RJ Nowling
Two thoughts here: 1. The real flaw with the sort benchmark was that Hadoop wasn't run on the same hardware. Given the advances in networking (availability of 10GB Ethernet) and disks (SSDs) since the Hadoop benchmarks it was compared to, it's an apples to oranges comparison. Without that, it

Re: PR for Hierarchical Clustering Needs Review

2014-10-24 Thread RJ Nowling
. I will ask other developers to help review the PR. Thanks for working with Yu and helping the code review! Best, Xiangrui On Thu, Oct 23, 2014 at 2:58 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, A few months ago, I collected feedback on what the community was looking

PR for Hierarchical Clustering Needs Review

2014-10-23 Thread RJ Nowling
Hi all, A few months ago, I collected feedback on what the community was looking for in clustering methods. A number of the community members requested a divisive hierarchical clustering method. Yu Ishikawa has stepped up to implement such a method. I've been working with him to communicate

Re: Spark on Mesos 0.20

2014-10-08 Thread RJ Nowling
On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling rnowl...@gmail.com wrote: I was able to reproduce it on a small 4 node cluster (1 mesos master and 3 mesos slaves) with relatively low-end specs. As I said, I just ran the log query examples with the fine-grained mesos mode. Spark 1.1.0 and mesos

Re: Spark on Mesos 0.20

2014-10-07 Thread RJ Nowling
at 9:20 AM, Timothy Chen tnac...@gmail.com wrote: Ok I created SPARK-3817 to track this, will try to repro it as well. Tim On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling rnowl...@gmail.com wrote: I've recently run into this issue as well. I get it from running Spark examples such as log query

Re: Spark on Mesos 0.20

2014-10-06 Thread RJ Nowling
I've recently run into this issue as well. I get it from running Spark examples such as log query. Maybe that'll help reproduce the issue. On Monday, October 6, 2014, Gurvinder Singh gurvinder.si...@uninett.no wrote: The issue does not occur if the task at hand has small number of map tasks.

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread RJ Nowling
I think it would be interesting to have a variety of matrix operations (multiplication, addition / subtraction, powers, scalar multiply, etc.) available in Spark. Diagonalization may be more difficult but iterative approximation approaches may be quite amenable. On Fri, Sep 5, 2014 at 5:26 AM,

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
to lapack, so it should be fine. In general concurrent modification isn't thread safe of course, but things that ought to be thread safe really should be. On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling rnowl...@gmail.com wrote: No, it's not in all cases. Since Breeze uses lapack under the hood

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
work arrays for each call to lapack, so it should be fine. In general concurrent modification isn't thread safe of course, but things that ought to be thread safe really should be. On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling rnowl...@gmail.com wrote: No, it's not in all cases. Since Breeze

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
be safe. On Wed, Sep 3, 2014 at 11:58 AM, RJ Nowling rnowl...@gmail.com wrote: David, Can you confirm that += is not thread safe but + is? I'm assuming + allocates a new object for the write, while += doesn't. Thanks! RJ On Wed, Sep 3, 2014 at 2:50 PM, David Hall d...@cs.berkeley.edu

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Hi Yu, A standardized API has not been implemented yet. I think it would be better to implement the other clustering algorithms then extract a common API. Others may feel differently. :) Just a note, there was a pre-existing JIRA for hierarchical KMeans SPARK-2429

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
On Aug 12, 2014, at 2:20 PM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I wanted to follow up. I have a prototype for an optimized version of hierarchical k-means. I wanted to get some feedback on my approach. Jeremy's implementation splits the largest cluster in each round. Is it better

[GraphX] JIRA / PR to fix breakage in GraphGenerator.logNormalGraph in PR #720

2014-08-27 Thread RJ Nowling
Hi all, PR #720 https://github.com/apache/spark/pull/720 made multiple changes to GraphGenerator.logNormalGraph including: - Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I've implemented back propagation algorithm using Gradient class and a simple update using Updater class. Then I run the algorithm with mllib's

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
/mllib/classification/NeuralNetwork.scala Unit tests are in the same branch. Alexander From: RJ Nowling [mailto:rnowl...@gmail.com] Sent: Tuesday, August 26, 2014 6:59 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient descent and runMiniBatchSGD Hi

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Also, another idea: many algorithms that use sampling tend to do so multiple times. It may be beneficial to allow a transformation to a representation that is more efficient for multiple rounds of sampling. On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling rnowl...@gmail.com wrote: Xiangrui, I

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling
? Are there any open-source examples that are being widely used in production? Thanks! On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote: Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- currently trying to get my PR for MiniBatch KMeans

Re: Examples have SparkContext improperly labeled?

2014-07-21 Thread RJ Nowling
. -Sandy On Mon, Jul 21, 2014 at 8:36 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, The examples listed here https://spark.apache.org/examples.html refer to the spark context as spark but when running Spark Shell uses sc for the SparkContext. Am I missing something? Thanks! RJ

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread RJ Nowling
Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- currently trying to get my PR for MiniBatch KMeans accepted. If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the Decision

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
at 2:15 PM, RJ Nowling rnowl...@gmail.com wrote: Thanks everyone for the input. So it seems what people want is: * Implement MiniBatch KMeans and Hierarchical KMeans (Divide and conquer approach, look at DecisionTree implementation as a reference) * Restructure 3 Kmeans clustering algorithm

Re: Contribution to MLlib

2014-07-09 Thread RJ Nowling
Hi Meethu, There is no code for a Gaussian Mixture Model clustering algorithm in the repository, but I don't know if anyone is working on it. RJ On Wednesday, July 9, 2014, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am interested in contributing a clustering algorithm towards MLlib

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
as you have to already find the nearest neighbor to an item to begin the process. On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling rnowl...@gmail.com wrote: The scikit-learn implementation may be of interest: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html

Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful. On Tuesday, July 8, 2014, Hector Yee hector@gmail.com wrote: I would say for bigdata applications the most useful would be hierarchical k-means with back tracking and the ability to support k nearest centroids. On Tue, Jul 8, 2014 at 10:54 AM, RJ

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
at 10:54 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering

Re: Contributing to MLlib

2014-07-02 Thread RJ Nowling
Hey Alex, I'm also a new contributor. I created a pull request for the KMeans MiniBatch implementation here: https://github.com/apache/spark/pull/1248 I also created a JIRA here: https://issues.apache.org/jira/browse/SPARK-2308 As part of my work, I started to refactor the common code to