Jon said I could follow up to the list in order to get others involved.

The context here is the solution of very large over-determined systems
which, interestingly enough, are not sparse.  They potentially have
billions of rows but only thousands of columns.

On Tue, Aug 16, 2011 at 5:14 PM, Jon Allen <j...@raveldata.com> wrote:

> ...If you have a moment, I have a couple of questions, mostly dealing with
> iterative methods... what have you found to be the cleanest way to pass
> around global data (if data is small, should I manage partitions and pass
> around the data to each partition... is there a clean way to implement a
> threshold so that when the data being passed around gets large I can write
> to a file / use distributed cache / etc)?


This is actually a key problem with current map-reduce frameworks.
 Pregel-like systems provide some nice mechanisms for this in terms of their
global aggregators, but there aren't really great answers there either.

With Hadoop, you can write things to HDFS, but that leads to real problems
with coordination because of the weak semantics of the file system.

HBase provides an interesting option if you have gobs of things to store.
 It is less good when you have small numbers of massive things.

My own, prejudiced recommendation is to use MapR.  This gives you a full
read-write file system with well-defined semantics as well as Hadoop-compatible
map-reduce computation.  Having the entire file system available via NFS, and
being able to create thousands to millions of files per second, is also key to
using the file store as a communication channel.

Overall, I expect that the map-reduce 2.0 work (aka MR-279 aka Hadoop
version 0.23) will have a big impact on this question since you won't have
to have world-wide agreement on a lowest common denominator for computation.
 Instead, different folks will be able to have different strokes.


> What would you suggest for cleanly implementing stopping criteria (that is,
> what's the best way to pull out of a loop of iterative MR jobs)?


Well, I really don't like iterative MR jobs in the first place.  The problem
is the overhead at each step.  I think that many of these algorithms have
very nice alternatives similar to the random projection tricks used with SVD.
 What would clearly be useful here is a global aggregator for accumulating
some sort of global stopping indicator.  But again, I prefer to avoid
iteration in map-reduce if possible.
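
For what it's worth, the random projection trick I have in mind looks roughly
like the following numpy sketch (purely illustrative; the function name and
parameters are mine, not anything in Mahout):

    import numpy as np

    def randomized_svd(A, k, oversample=10, power_iters=2, seed=0):
        # Project A onto a small random subspace, orthonormalize, and do the
        # expensive decomposition only on the small projected matrix.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        omega = rng.standard_normal((n, k + oversample))
        Y = A @ omega
        # A couple of power iterations sharpen the estimate of the dominant
        # subspace (in practice you re-orthonormalize between steps).
        for _ in range(power_iters):
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)
        # B = Q'A is only (k + oversample) x n, so an exact SVD of B is
        # cheap; lifting back through Q gives approximate factors of A.
        B = Q.T @ A
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return Q @ Ub[:, :k], s[:k], Vt[:k, :]

The point is that the only touches of the big matrix A are a handful of
simple products, which map onto a small, fixed number of map-reduce passes
rather than a data-dependent loop.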

If one must iterate, then I think it is always good practice to have a
combined stopping criterion.  Error dropping to near zero is nice, but you
also need to bound the amount of time spent on the problem.  This is often
how it goes in modeling applications: you spend as much computation as you
can afford and then just commit to the answer.
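
Concretely, the kind of combined criterion I mean looks something like this
(the threshold and budget values here are made up for illustration):

    import time

    def should_stop(error, iteration, start_time,
                    tol=1e-6, max_iterations=20, time_budget_s=3600):
        # Stop when the error is essentially zero, OR when the iteration or
        # wall-clock budget runs out, whichever comes first.
        if error < tol:
            return True
        if iteration >= max_iterations:
            return True
        if time.time() - start_time > time_budget_s:
            return True
        return False

In a map-reduce setting the error would come out of a counter or a tiny
summary file written by the reducers, but the shape of the test is the same.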


> ... And finally, for the SVD implementation how are you storing the data as
> writables?


The Mahout convention is one row per key/value pair in a sequence file.   This
seems to work pretty well.  The MatrixWritable provides pretty good support
for that.

> I'm interested because I'd like to investigate further the performance
> effects associated with grouping vectors into a single writable so that the
> entire chunk is loaded into memory simultaneously.


I doubt that this matters much.  It is pretty easy to read 100 rows as 100
values, and I doubt that you would save all that much by grouping them.  Where
you might get some savings would be in a more compact representation, but I
don't see much of that happening.


> ... Would also find it valuable if you have any suggestions for how to
> benchmark algorithms in the Hadoop world... it doesn't seem as though the
> methods I have used for MPI stuff will necessarily lend themselves to
> performance testing in a Hadoop environment, so what would be the best way
> to benchmark the scalability of algorithms I implement?


Actually, old school is pretty reasonable here.  You have three variables at
play: problem size, number of machines, and run-time.  Plotting those out
gives pretty clear results.  From there it is all about analyzing where the
bottlenecks are.  With machine learning applications, the bottleneck is
commonly pretty obvious: either I/O or CPU.
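
As a sketch of what I mean (the numbers here are invented, just to show the
shape of the analysis):

    import numpy as np

    machines = np.array([4, 8, 16, 32])
    # Hypothetical run-times in seconds for two problem sizes (rows of A).
    runtimes = {
        1_000_000: np.array([120.0, 65.0, 38.0, 25.0]),
        10_000_000: np.array([1150.0, 590.0, 310.0, 170.0]),
    }

    for n_rows, t in runtimes.items():
        # Parallel efficiency relative to the smallest cluster measured.
        efficiency = (t[0] * machines[0]) / (t * machines)
        print(n_rows, np.round(efficiency, 2))

Efficiency dropping off as machines are added usually points straight at the
bottleneck.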



> I suppose I could focus on weak scaling rather than strong scaling, since
> really the interest is in addressing large problems, but I'm not sure what
> the norm is within the Mahout community.
>

The norm is wildly variable needs, which leads to highly varied evaluations.
 Few people have huge variation in problem size, so most can focus on the
interplay between resources and run-time at a (nearly) constant problem
size.



>
> Thanks,
> Jon
>
> On Mon, Aug 15, 2011 at 4:53 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>
>> Sure.
>>
>> The attached outline shows some of the work I have been doing on large
>> out-of-core SVD.  The key trick is that if you are solving min ||A x - b||_F,
>> then you can solve the normal equations fairly trivially:
>>
>>      A' A x = A' b
>>
>> Commonly, QR is used for this, but A' A = R' R so Cholesky decomposition
>> of A'A suffices since this
>>
>>     R' R x = A' b
>>
>> can be solved using forward and backward substitution.  Moreover, if A is
>> tall and skinny, then A' A will fit into memory rather easily.  If A is
>> blocked into horizontal swathes A_i, then
>>
>>       A'A = sum_i A'_i A_i
>>
>> can be computed in a single map-reduce pass where A'A is accumulated using
>> a combiner and reducer in memory.
>>
>> If it is known that A has a rapidly decreasing set of singular values, then
>> the larger process described in the attached paper can be used to get the
>> first many singular values and vectors, and the solution can proceed in the
>> traditional way.
>>
>> I have code that implements these that I am in the process of preparing
>> for Mahout.  Holler and I can put it up on github in advance of a commit.
>>
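
To make the quoted recipe concrete, here is a minimal single-machine
numpy/scipy sketch of the blocked normal-equations / Cholesky approach (the
function and variable names are mine for illustration, not the Mahout code;
the loop over blocks stands in for what the mappers, combiner and reducer do):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def solve_tall_skinny_least_squares(blocks):
        # blocks yields (A_i, b_i) pairs, i.e. horizontal swathes of A and b.
        # Accumulate A'A = sum_i A'_i A_i and A'b = sum_i A'_i b_i; this is
        # the part a single map-reduce pass with a combiner would do.
        AtA, Atb = None, None
        for A_i, b_i in blocks:
            if AtA is None:
                n = A_i.shape[1]
                AtA, Atb = np.zeros((n, n)), np.zeros(n)
            AtA += A_i.T @ A_i
            Atb += A_i.T @ b_i
        # A'A = R'R, so a Cholesky factorization plus forward/backward
        # substitution solves R'R x = A'b.
        c, lower = cho_factor(AtA)
        return cho_solve((c, lower), Atb)

Since A'A is only (columns x columns), it fits in memory even when A has
billions of rows, which is exactly the tall-and-skinny case described above.
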
>> On Mon, Aug 15, 2011 at 12:45 PM, Zach Richardson <z...@raveldata.com> wrote:
>>
>>> Hi Ted,
>>>
>>> Wanted to say thanks for meeting me for lunch last week--it was fun and
>>> very informative.  It seems like MapR is quite solid, and solving some real
>>> problems.
>>>
>>> I also wanted to put you in touch with Rehmi and Jon in a little
>>> follow-up about the least-squares solution, and the general work they are
>>> doing on building their IPM solver.  They definitely follow much more of it
>>> than I do!
>>>
>>> Jon, Rehmi,
>>>
>>> Ted and I had some nice conversations about performing a Linear
>>> Regression in MapReduce--it would be great if you all could share some
>>> ideas!
>>>
>>> Thanks again,
>>>
>>> --
>>> Zach Richardson
>>> Ravel, Co-founder
>>> Austin, TX
>>> z...@raveldata.com
>>> 512.825.6031
>>>
>>>
>>>
>>
>
>
> --
> Jon Allen
> Ravel, Co-founder
> Austin, TX
> j...@raveldata.com
> +1.512.586.4401
>
>
