Re: Problem of dimensions

Anand Avati Thu, 17 Jul 2014 16:59:24 -0700

And I still really doubt that just fudging nrow is a "complete" fix. For
example, after fixing up nrow (either by mutating or by creating a new
CheckpointedDrm as I described in my previous mail), if you were to do:


  drmA = ... // somehow nrow is fudged

  drmB = drmA + 1 // invoke OpAewScalar operator

I don't see how this would return the correct answer in drmB. mapBlock() on
drmA is just not performed on those "invisible" rows for the "+ 1" to be
applied on the cells.
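
To make the concern concrete, here is a minimal sketch in plain Scala. It is a toy model under stated assumptions, not Mahout code: ToyDrm, plusScalar, fudgeNrow and rbindEmpty are hypothetical names, and plusScalar only imitates a block-wise map by touching the rows that physically exist.

```scala
// Toy model only: not Mahout's CheckpointedDrm or its Spark-backed DrmRdd.
object FudgedNrowSketch {
  // A "DRM" here is a declared row count plus the rows that physically exist.
  final case class ToyDrm(nrow: Int, ncol: Int, rows: Map[Int, Vector[Double]])

  // Element-wise "+ scalar", applied the way a block-wise map would apply it:
  // only to rows that are physically present.
  def plusScalar(a: ToyDrm, s: Double): ToyDrm =
    a.copy(rows = a.rows.map { case (k, v) => k -> v.map(_ + s) })

  // Fudge the cardinality without touching the stored rows.
  def fudgeNrow(a: ToyDrm, newNrow: Int): ToyDrm = a.copy(nrow = newNrow)

  // Append explicit all-zero rows so that every new row key really exists.
  def rbindEmpty(a: ToyDrm, extraRows: Int): ToyDrm = {
    val added = (a.nrow until a.nrow + extraRows).map(k => k -> Vector.fill(a.ncol)(0.0))
    ToyDrm(a.nrow + extraRows, a.ncol, a.rows ++ added)
  }

  // Sum of all cells that are physically present.
  def sum(a: ToyDrm): Double = a.rows.values.map(_.sum).sum

  def main(args: Array[String]): Unit = {
    val drmA = ToyDrm(2, 2, Map(0 -> Vector(1.0, 2.0), 1 -> Vector(3.0, 4.0)))
    val fudged = plusScalar(fudgeNrow(drmA, 3), 1.0)
    val bound = plusScalar(rbindEmpty(drmA, 1), 1.0)
    println(s"fudged: nrow=${fudged.nrow}, sum=${sum(fudged)}") // fudged: nrow=3, sum=14.0
    println(s"rbind:  nrow=${bound.nrow}, sum=${sum(bound)}")   // rbind:  nrow=3, sum=16.0
  }
}
```

In this toy, a true 3 x 2 result of "+ 1" should sum to 16, but the fudged version sums to 14 because the third row never receives the "+ 1".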

I think rbind() is the safest approach here. Also, I'm not sure why you
feel rbind() is only for "dense" matrices. If the B matrix for rbind
operator was created from drmParallelizeEmpty() (as shown in the example in
the commit), the DrmRdd will be holding only empty RandomAccessSparseVectors
and will be significantly less expensive than a dense operation.
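
As a rough illustration of why this stays cheap, here is a plain Scala sketch. It is a toy model, not the actual rbind operator from the commit: ToySparseDrm, emptyDrm and rbind are hypothetical stand-ins, with a Map from column to value playing the role of a sparse row vector.

```scala
// Toy model only: rows are sparse maps (column -> value), standing in for
// sparse row vectors; this is not Mahout's rbind implementation.
object SparseRbindSketch {
  final case class ToySparseDrm(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]])

  // Hypothetical stand-in for an "empty" DRM: every row key exists, no cells are stored.
  def emptyDrm(nrow: Int, ncol: Int): ToySparseDrm =
    ToySparseDrm(nrow, ncol, (0 until nrow).map(k => k -> Map.empty[Int, Double]).toMap)

  // Stack b under a, shifting b's row keys past a's last row.
  def rbind(a: ToySparseDrm, b: ToySparseDrm): ToySparseDrm = {
    require(a.ncol == b.ncol, "column counts must match")
    val shifted = b.rows.map { case (k, v) => (k + a.nrow) -> v }
    ToySparseDrm(a.nrow + b.nrow, a.ncol, a.rows ++ shifted)
  }

  // Number of cells actually stored across all rows.
  def storedCells(a: ToySparseDrm): Int = a.rows.values.map(_.size).sum

  def main(args: Array[String]): Unit = {
    val drmA = ToySparseDrm(2, 3, Map(0 -> Map(0 -> 1.0), 1 -> Map(2 -> 5.0)))
    val drmAnew = rbind(drmA, emptyDrm(2, 3))
    // Two more row keys exist, but the number of stored cells is unchanged.
    println(s"nrow=${drmAnew.nrow}, row keys=${drmAnew.rows.size}, stored cells=${storedCells(drmAnew)}")
  }
}
```

The added rows carry a key and an empty vector each, so the extra cost in this toy grows with the number of added rows, not with rows times columns.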

Thanks


On Thu, Jul 17, 2014 at 4:20 PM, Anand Avati <[email protected]> wrote:

> Pat,
> I had a look at the PR. The addToRowCardinality() method is clearly
> mutating the object. A CheckpointedDrm is a "computed" object itself, but
> it very much can be a leaf node of many *other* lazy evaluated graphs for
> computing other DRMs. Besides, mutating the object goes right against the
> principles of functional programming.
>
> Whether CheckpointedDrmSpark is tolerant to arbitrary nrow values (or
> higher nrow) or not, I am not very certain yet. Assuming it is tolerant,
> the way to exploit that tolerance by mutating the object is definitely not
> ideal. Instead please create a new operator which will end up creating a
> new CheckpointedDrmSpark object with the same DrmRdd of the original
> CheckpointedDrmSpark, but newer nrow/ncol values. This way it is
> inexpensive, compatible with functional programming principles, and we can
> be a LOT more confident about having no unforeseen side effects.
>
> Thanks
>
>
> On Thu, Jul 17, 2014 at 3:11 PM, Pat Ferrel <[email protected]> wrote:
>
>> comments on PR #31
>> https://github.com/apache/mahout/pull/31
>>
>>
>> On Jul 16, 2014, at 3:46 PM, Pat Ferrel <[email protected]> wrote:
>>
>> I didn’t do anything with nrow, it is still lazy. I added a method that
>> changes _nrow, which is a var anyway. This change is made after all
>> matrices are read in and before any math/optimizer stuff is done. When nrow
>> is calculated _nrow should already be the right value unless I’ve missed
>> something. Also this is a CheckpointedDrmSpark and I thought that
>> guaranteed that it was not mid optimization.
>>
>> Let me put a PR up so you can have a look.
>>
>> I’m now wondering if changing _nrow is
>>
>>
>> On Jul 15, 2014, at 11:10 AM, Ted Dunning <[email protected]> wrote:
>>
>> My worry about this is that I thought that DSL objects needed to remain
>> immutable due to the lazy evaluation.
>>
>> For instance, suppose that A and B can be multiplied because they are the
>> right shapes:
>>
>>    C = A.t %*% B
>>
>> Now I change number of rows in A
>>
>>    A.nrow = 2 * A.nrow
>>
>> and A now has the shape compatible with D
>>
>>    E = A.t %*% D
>>
>> And now I ask for some materialization:
>>
>>    x = C.sum + E.sum
>>
>> And this fails.  The reason is that the change in A's shape happened
>> before computation was started on multiplying A and B.
>>
>> Now, to my mind, lazy evaluation is pretty crucial because it allows the
>> optimizer to work.  That seems to me to say that mutation should not be
>> allowed.
>>
>> (and yes, I know that this isn't the correct notation ... that isn't the
>> point)
>>
>>
>>
>>
>>
>> On Tue, Jul 15, 2014 at 8:26 AM, Pat Ferrel <[email protected]> wrote:
>>
>> > Now you have me doubting. At least for cases I haven’t tested.
>> >
>> > Dmitriy: on a sparse drm (CheckpointedDrmSpark) will increasing _nrow do
>> > the equivalent of adding empty row vectors, with keys but no vector
>> > elements?
>> >
>> >
>> > On Jul 14, 2014, at 7:30 PM, Pat Ferrel <[email protected]> wrote:
>> >
>> > It’ll be in a PR for review before it goes in, and if you have already
>> > merged, it will be stubbed out in the h2o implementation if I can figure
>> > out where. The tests are all in the Spark-specific module. All I did was
>> > change _nrow in CheckpointedDrmSpark, which is a private var. I added a
>> > method to do that, not in core math but only in the Spark-specific code.
>> >
>> > Cooccurrence works at a lower level and does not need to worry about
>> > this because drm multiply and other ops just work. The place I change
>> > nrow is in ItemSimilarity, which is non-core.
>> >
>> > I would indeed make sure your sparse matrix code handles this, but I’m
>> > not sure where you’d test that; bug Dmitriy.
>> >
>> > On Jul 14, 2014, at 5:09 PM, Anand Avati <[email protected]> wrote:
>> >
>> >
>> > On Mon, Jul 14, 2014 at 3:02 PM, Pat Ferrel <[email protected]> wrote:
>> >>
>> >> On Jul 14, 2014, at 12:10 PM, Anand Avati <[email protected]> wrote:
>> >>
>> >> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <[email protected]> wrote:
>> >> In the application, the number of rows will always be increased, adding
>> >> blank rows. I don’t think a shuffle is necessary in this case because
>> >> there is no actual row and no data in the drm; it’s just needed to make
>> >> the cardinality match, and the IDs will take care of data matching. Maybe
>> >> calling it something else is a good idea to emphasize the special case
>> >> for its use. I went over this with Dmitriy and, though I haven’t checked
>> >> actual values on large datasets, it works.
>> >>
>> >>
>> >> Does that mean the cardinality is faked at the logical layer with no
>> >> changes at the engine level? Does that mean the physical operators need
>> >> to be prepared to handle non-matching matrix multiplication by assuming
>> >> the missing rows or columns are 0's? Does that really work with no
>> >> changes?
>> >
>> > Yes, Dmitriy recently confirmed this. But it is not faked; it is just not
>> > possible to calculate it from data in some cases, since “does not exist”
>> > may mean “= 0”.
>> >
>> > I’m no R expert but base R seems to assume a 0 value actually exists in
>> > the matrix, encoded in as much space as its type dictates, like dense
>> > things in Mahout. I think there are R packages that add support for
>> > sparse things (slam?), so I assume this is one place where some
>> > rethinking is required:
>> >
>> > http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/
>> >
>> > Sparse linear algebra with the added complication of foreign IDs makes
>> > for some odd cases. The number of external/foreign IDs for rows and
>> > columns defines the true cardinality, even though in a sparse matrix an
>> > empty row or column is just absent. In Mahout the IDs are row and column
>> > numbers, so there are cases where the real-world cardinality does not
>> > match the number of Mahout IDs or the DRM cardinality. The calculations
>> > should be fine with that as long as the real-world dimensions are
>> > supplied for cardinality checking and for the various values calculated
>> > from cardinality.
>> >
>> > I’m maintaining a mapping of external ID to/from Mahout ID. For
>> > instance, in item similarity it is used where the rows have a key = user
>> > ID. For a single application the row space is defined by all user IDs.
>> > In the cross similarity A’B, reading in A may not ID every user, so if B
>> > finds more or different ones then the union of the two is our best guess
>> > at the total. In fact more matrices could be added, and the user ID space
>> > is all users seen in all of the data. If we knew how many users are
>> > defined in the application we could also use that, but it’s not needed if
>> > there is no data at all for some users.
>> >
>> > BTW this mapping seems to be one of the biggest generators of questions
>> > on the lists. The above issue is one that would likely further trip up
>> > users generating their own ID mapping, which is why we are finally doing
>> > it for them.
>> >
>> >>
>> >> This sounds like a need to introduce a new R-like rbind() operator.
>> >> This way you could fix up row cardinality like:
>> >>
>> >> drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
>> >>
>> >
>> > true, add an empty slice.
>> >
>> >> You could already do this, though it is twisted:
>> >>
>> >> drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>> >>
>> >>
>> >
>> > Yes, and I can dig deeper and do another drmWrap, constructing a new
>> > larger matrix.
>> >
>> > Still, changing the number of rows on sparse matrices is so much simpler,
>> > but I think "drm.nrow = " may hide the special nature of what we are
>> > doing.
>> >
>> >
>> > So finally, how are you changing the cardinality? I just want to make
>> > sure the h2o engine "works" with that technique.
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>
