Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
get what you want is to transform to another RDD. But you might look at
MutablePair (
https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
to see if the semantics meet your needs.
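
To illustrate the "transform to another RDD" route, here is a minimal sketch in plain Python, with no Spark involved. Lists stand in for RDDs, and the row layout, the `retag` helper, and the update rule are all hypothetical. Each iteration derives a new tagged dataset from the previous one instead of editing rows in place, much as a map over an RDD yields a new RDD.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]          # (row_id, payload); hypothetical row layout
Tagged = Tuple[int, str, int]  # (row_id, payload, tag)


def retag(rows: List[Tagged], update: Callable[[Tagged], int]) -> List[Tagged]:
    """Return a NEW list with updated tags; the input list is left untouched."""
    return [(row_id, payload, update((row_id, payload, tag)))
            for row_id, payload, tag in rows]


rows: List[Row] = [(0, "a"), (1, "bb"), (2, "ccc")]

# Iteration 0: attach an initial tag of 0 to every row.
tagged: List[Tagged] = [(row_id, payload, 0) for row_id, payload in rows]
history = [tagged]

for _ in range(3):
    # Each pass builds a fresh dataset from the previous one; earlier
    # versions stay intact, which is what keeps recomputation possible.
    tagged = retag(tagged, lambda r: r[2] + len(r[1]))
    history.append(tagged)
```

Since the tag is much smaller than the row, it could also live in its own dataset keyed by row ID, so that only the small part is rebuilt on each iteration.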

Alternatively you can consider:

   1. Build and provide a fast lookup service that stores and returns the
   mutable information keyed by the RDD row IDs, or
   2. Use DDF (Distributed DataFrame) which we'll make available in the
   near future, which will give you fully mutable-row table semantics.
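
Option 1 could look roughly like the sketch below, again in plain Python. `TagStore` is a hypothetical in-memory stand-in for the lookup service; a real one would be a separate, probably networked, system. The rows never change, and only the small per-row tag is overwritten on each iteration.

```python
from typing import Dict, Iterable, List, Tuple


class TagStore:
    """Hypothetical in-memory stand-in for a fast lookup service keyed by row ID."""

    def __init__(self) -> None:
        self._tags: Dict[int, float] = {}

    def put_many(self, updates: Iterable[Tuple[int, float]]) -> None:
        # Overwrite the tag for each row ID; the row data itself is never touched.
        self._tags.update(updates)

    def get(self, row_id: int, default: float = 0.0) -> float:
        return self._tags.get(row_id, default)


rows: List[Tuple[int, str]] = [(0, "a"), (1, "bb"), (2, "ccc")]  # immutable data
store = TagStore()

for iteration in range(1, 4):
    # Compute all new tags from the current ones, then write them back in bulk.
    updates = [(row_id, store.get(row_id) + iteration * len(payload))
               for row_id, payload in rows]
    store.put_many(updates)
```

The catch is that the store's contents sit outside the RDD lineage, so the service has to handle its own durability and recovery.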


--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Fri, Mar 28, 2014 at 5:16 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

 Hey guys,

 I need to tag individual RDD lines with some values. This tag value would
 change at every iteration. Is this possible with RDD (I suppose this is
 sort of like mutable RDD, but it's more) ?

 If not, what would be the best way to do something like this? Basically,
 we need to keep mutable information per data row (this would be something
 much smaller than actual data row, however).

 Thanks



Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if
you tried it, it would (mostly) work, and my uncertainty with respect to
guarantees on the semantics. Definitely there would be no fault tolerance
if the mutations depend on state that is not captured in the RDD lineage.
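
A toy model of the fault-tolerance point, in plain Python rather than Spark. The `lineage` list and the `recompute` helper are hypothetical stand-ins for how a lost partition gets rebuilt by replaying recorded transformations on the source data. An in-place edit that was never recorded in the lineage does not survive the replay.

```python
from typing import Callable, List

source = [1, 2, 3]  # stands in for the durable input data

# Lineage: the ordered list of pure functions that derive the current tags.
lineage: List[Callable[[int], int]] = [lambda x: x * 10, lambda x: x + 1]


def recompute(data: List[int], steps: List[Callable[[int], int]]) -> List[int]:
    """Rebuild a lost partition by replaying the recorded steps on the source."""
    out = list(data)
    for step in steps:
        out = [step(x) for x in out]
    return out


tags = recompute(source, lineage)       # [11, 21, 31]
tags[0] = 999                           # in-place mutation, NOT recorded in lineage

recovered = recompute(source, lineage)  # what a replay after a failure would give
lost_update = recovered[0] != tags[0]   # True: the 999 is gone after recovery
```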

DDF is to RDD as RDD is to HDFS. Not a perfect analogy, but the point
is that it's an abstraction layer above, with all the attendant implications,
pluses and minuses. With DDFs you get to think of everything as tables with
schemas, while the underlying complexity of mutability and data
representation is hidden away. You also get rich idioms to operate on those
tables like filtering, projection, subsetting, handling of missing data
(NA's), dummy-column generation, data mining statistics and machine
learning, etc. In some respects it replaces a lot of boilerplate analytics
that you don't want to reinvent over and over again, e.g., FiveNum or
XTabs. So instead of 100 lines of code, it's 4. In other respects it lets
you easily apply arbitrary machine learning algorithms without having
to think too hard about getting the data types just right.
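
For a sense of the boilerplate in question, here is a sketch of a five-number summary in plain Python. This is not the DDF API, which is not shown in this thread. The `fivenum` function is a hypothetical helper, and it assumes Tukey's hinge definition (min, lower hinge, median, upper hinge, max) as I understand it from R's `fivenum`.

```python
import math
from typing import List, Sequence


def fivenum(values: Sequence[float]) -> List[float]:
    """Tukey's five-number summary: min, lower hinge, median, upper hinge, max."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("fivenum needs at least one value")
    n4 = math.floor((n + 3) / 2) / 2
    # 1-based (possibly fractional) positions of the five statistics.
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # A fractional position averages the two neighbouring order statistics.
    return [0.5 * (xs[math.floor(p) - 1] + xs[math.ceil(p) - 1])
            for p in positions]


summary = fivenum([4, 1, 3, 2])
```

Even this small helper needs sorting, empty-input handling, and index arithmetic, which is the sort of code you would rather not rewrite for every table.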

But you would also find yourself wanting access to the underlying RDDs for
their full semantics and flexibility.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen



On Fri, Mar 28, 2014 at 8:46 PM, Sung Hwan Chung
<coded...@cs.stanford.edu> wrote:

 Thanks Chris,

 I'm not exactly sure what you mean with MutablePair, but are you saying
 that we could create RDD[MutablePair] and modify individual rows?

 If so, will that play nicely with RDD's lineage and fault tolerance?

 As for the alternatives, I don't think 1 is something we want to do, since
 that would require another complex system we'll have to implement. Is DDF
 going to be an alternative to RDD?

 Thanks again!


