I didn’t do anything with nrow; it is still lazy. I added a method that changes _nrow, which is a var anyway. This change is made after all matrices are read in and before any math/optimizer work is done. By the time nrow is calculated, _nrow should already be the right value, unless I’ve missed something. Also, this is a CheckpointedDrmSpark, and I thought that guaranteed it was not mid-optimization.
Let me put a PR up so you can have a look. I’m now wondering if changing _nrow is…

On Jul 15, 2014, at 11:10 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

My worry about this is that I thought DSL objects needed to remain immutable due to lazy evaluation. For instance, suppose that A and B can be multiplied because they are the right shapes:

    C = A.t %*% B

Now I change the number of rows in A:

    A.nrow = 2 * A.nrow

and A now has a shape compatible with D:

    E = A.t %*% D

And now I ask for some materialization:

    x = C.sum + E.sum

And this fails. The reason is that the change in A’s shape happened before computation was started on multiplying A and B. Now, to my mind, lazy evaluation is pretty crucial because it allows the optimizer to work. That seems to me to say that mutation should not be allowed. (And yes, I know these aren’t the correct notation ... that isn’t the point.)

On Tue, Jul 15, 2014 at 8:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Now you have me doubting. At least for cases I haven’t tested.
>
> Dmitriy: on a sparse drm (CheckpointedDrmSpark), will increasing _nrow do the equivalent of adding empty row vectors, with keys but no vector elements?
>
> On Jul 14, 2014, at 7:30 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> It’ll be in a PR for review before it goes in, and if you have already merged, it will be stubbed out in the h2o implementation if I can figure out where. The tests are all in the Spark-specific module. All I did was change _nrow in CheckpointedDrmSpark, which is a private var. I added a method to do that, but not in core-math, only in Spark-specific code.
>
> cooccurrence works at a lower level and does not need to worry about this, because drm multiply and other ops just work. The place I change nrow is in ItemSimilarity, which is non-core.
>
> I would indeed make sure your sparse matrix code handles this, but I’m not sure where you’d test that—bug Dmitriy.
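Ted’s failure mode can be sketched in a few lines of plain Scala. This is a toy model, not Mahout’s actual DSL: LazyMatrix and AtB are hypothetical stand-ins for a logical-plan node that defers shape checking until materialization, which is the property that makes mutation hazardous.

```scala
// Toy model (NOT Mahout code) of a lazily evaluated product A.t %*% B.
// The plan node holds a reference to A, so a later mutation of A.nrow
// is seen at materialization time, not at plan-construction time.
class LazyMatrix(var nrow: Int, var ncol: Int)

case class AtB(a: LazyMatrix, b: LazyMatrix) {
  // Shapes are only validated when the result is actually computed.
  def materialize(): (Int, Int) = {
    require(a.nrow == b.nrow,
      s"shape mismatch: A is ${a.nrow}x${a.ncol}, B is ${b.nrow}x${b.ncol}")
    (a.ncol, b.ncol) // shape of A' %*% B
  }
}

val A = new LazyMatrix(100, 10)
val B = new LazyMatrix(100, 20)
val C = AtB(A, B)    // legal when built: A and B both have 100 rows

A.nrow = 2 * A.nrow  // the mutation Ted warns about

// C was valid when constructed, but the deferred plan now fails:
val failed =
  try { C.materialize(); false }
  catch { case _: IllegalArgumentException => true }
println(failed) // true: the lazy plan saw the mutated shape
```

The same hazard applies to any optimizer that rewrites a DAG of deferred operators: correctness of the plan depends on operand shapes being stable between plan construction and execution.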
> On Jul 14, 2014, at 5:09 PM, Anand Avati <av...@gluster.org> wrote:
>
> On Mon, Jul 14, 2014 at 3:02 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> On Jul 14, 2014, at 12:10 PM, Anand Avati <av...@gluster.org> wrote:
>>
>> On Mon, Jul 14, 2014 at 11:56 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> In the application, the number of rows will always be increased, adding blank rows. I don’t think a shuffle is necessary in this case, because there is no actual row, no data in the drm; it’s just needed to make the cardinality match, and the IDs will take care of data matching. Maybe calling it something else is a good idea, to emphasize the special case for its use. I went over this with Dmitriy and, though I haven’t checked actual values on large datasets, it works.
>>
>> Does that mean the cardinality is faked at the logical layer with no changes at the engine level? Does that mean the physical operators need to be prepared to handle non-matching matrix multiplication by assuming the missing rows or columns are 0’s? Does that really work with no changes?
>
> Yes, Dmitriy recently confirmed this. But it’s not faked; it is just not possible to calculate it from the data in some cases, since “does not exist” may mean “= 0”.
>
> I’m no R expert, but base R seems to assume a 0 value actually exists in the Matrix, encoded in as much space as its type dictates, like Dense things in Mahout. I think there are R packages that add support for sparse things (slam?), so I assume this is one place where some rethinking is required:
> http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/
>
> Sparse linear algebra with the added complication of foreign IDs makes for some odd cases. The number of external/foreign IDs for rows and columns defines the true cardinality, even though in a sparse matrix an empty row or column is just absent.
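The “does not exist may mean = 0” point is why the physical operators need no changes: in a sparse row-keyed product A' %*% B, a row key absent from either operand contributes nothing, exactly as an all-zero row would. A minimal sketch in plain Scala (atB and the Map-of-Maps representation are illustrative, not Mahout’s internals):

```scala
// Sketch of the "absent row means zero row" convention for A' %*% B
// over sparse row maps (rowKey -> (colIndex -> value)). Only row keys
// present in BOTH operands contribute, so padding either matrix with
// empty rows changes the declared cardinality but never the product.
type Row = Map[Int, Double]

def atB(a: Map[Int, Row], b: Map[Int, Row]): Map[(Int, Int), Double] = {
  val acc = scala.collection.mutable.Map[(Int, Int), Double]().withDefaultValue(0.0)
  for (k <- a.keySet intersect b.keySet; // absent rows drop out here
       (i, av) <- a(k); (j, bv) <- b(k))
    acc((i, j)) += av * bv // (A'B)(i,j) += A(k,i) * B(k,j)
  acc.toMap
}

val a = Map(0 -> Map(0 -> 2.0), 1 -> Map(0 -> 3.0))
val b = Map(0 -> Map(0 -> 5.0), 7 -> Map(0 -> 9.0)) // row 7 absent from a
println(atB(a, b)((0, 0))) // 10.0: row 7 behaved as a zero row
```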
> In Mahout the IDs are row and column numbers, so there are cases where the real-world cardinality does not match the number of Mahout IDs or the DRM cardinality. The calculations should be fine with that, as long as the real-world dimensions are supplied for cardinality checking and for the various values calculated from cardinality.
>
> I’m maintaining a mapping of external ID to/from Mahout ID. For instance, in item similarity it’s used where the rows have a key = user ID. For a single application the row space is defined by all user IDs. In the cross-similarity A’B, reading in A may not ID every user, so if B finds more or different ones, then the union of the two is our best guess at the total. In fact, more matrices could be added, and the user ID space is all users seen in all of the data. If we knew how many users are defined in the application we could also use that, but it’s not needed if there is no data at all for some users.
>
> BTW, this mapping seems to be one of the biggest generators of questions on the lists. The above issue is one that would likely further trip up users generating their own ID mapping, which is why we are finally doing it for them.
>
>> This sounds like a need to introduce a new R-like rbind() operator. This way you could fix up row cardinality like:
>>
>> drmAnew = drmA rbind drmParallelizeEmpty(extra_rows, drmA.ncol)
>
> True, add an empty slice.
>
>> You could already do this, though twisted:
>>
>> drmAnew = (drmA.t cbind drmParallelizeEmpty(drmA.ncol, extra_rows)).t
>
> Yes, and I can dig deeper and do another drmWrap, constructing a new larger matrix.
>
> Still, changing the number of rows on sparse matrices is so much simpler, but I think “drm.nrow =” may hide the special nature of what we are doing.
>
> So finally, how are you changing the cardinality? I just want to make the h2o engine “work” with that technique.
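The rbind fix-up discussed above amounts to growing the declared row cardinality without materializing any row data. A plain-Scala sketch (SparseDrm and rbindEmpty are illustrative names, not Mahout’s drmParallelizeEmpty or CheckpointedDrmSpark) shows why it is cheap on a sparse, row-keyed representation:

```scala
// Sketch of the rbind-with-empty-rows fix-up: pad a row-keyed sparse
// matrix so its logical nrow matches an externally known row-ID space.
// The extra rows carry keys but no elements, so only the declared
// cardinality changes; no data is created or shuffled.
type SparseRow = Map[Int, Double] // colIndex -> value
case class SparseDrm(rows: Map[Int, SparseRow], nrow: Int, ncol: Int)

def rbindEmpty(m: SparseDrm, extraRows: Int): SparseDrm =
  // "does not exist" already means "= 0" in this representation,
  // so growing nrow is all that is needed
  m.copy(nrow = m.nrow + extraRows)

val drm = SparseDrm(
  rows = Map(0 -> Map(1 -> 3.0), 2 -> Map(0 -> 1.0)),
  nrow = 3, ncol = 2)
val padded = rbindEmpty(drm, extraRows = 2)
println(padded.nrow)      // 5: logical cardinality grew
println(padded.rows.size) // 2: still only two materialized rows
```

This is also why returning a new padded matrix (as rbind would) sidesteps Ted’s immutability objection: the original operand’s shape is never mutated under an outstanding lazy plan.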