I should have said "don't forget which row is which".

On Mon, Nov 14, 2011 at 12:06 AM, Jake Mannix <[email protected]> wrote:
> The ordering *can* be chosen to be that. But nothing in our api documentation
> implies we will always do this, and in fact it completely depends on whether
> the MR job used to create the matrix had reducer outputs creating row numbers
> sequentially.
>
> -jake
>
> On Sun, Nov 13, 2011 at 11:28 PM, Lance Norskog <[email protected]> wrote:
>
> > So, a DRM is a set of one or more files, where each SequenceFile int/vector
> > pair is a row number and a fully wide vector? Then ordering is in the
> > IntWritable keys.
> >
> > On Sun, Nov 13, 2011 at 10:56 PM, Jake Mannix <[email protected]> wrote:
> >
> > > I don't think we currently make any guarantees about sort-order of the
> > > parts themselves, or among the various part-files, as they may be created
> > > by any number of map-reduce jobs, and are then consumed by map-reduce jobs
> > > which have no inter-process communication.
> > >
> > > What would ordering even *mean* among map-inputs? Or are you just
> > > referring to in each chunk itself? Or for non-MR use of the files?
> > >
> > > -jake
> > >
> > > On Sun, Nov 13, 2011 at 10:38 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Make sure that the files can be ordered, of course. Losing the ordering
> > > > can be really bad.
> > > >
> > > > On Sun, Nov 13, 2011 at 10:34 PM, Jake Mannix <[email protected]> wrote:
> > > >
> > > > > Yeah, in particular, DistributedRowMatrix "is" simply a
> > > > > SequenceFile<IntWritable,VectorWritable>, when in its serialized form.
> > > > > As such, this "file" can be (and typically is) a series of part-* files
> > > > > in a directory (typically on HDFS).
> > > > >
> > > > > -jake
> > > > >
> > > > > On Sun, Nov 13, 2011 at 10:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > It's my understanding drm can be multifile. In fact, stuff like
> > > > > > seq2sparse will produce multifile output, being a MR job itself.
> > > > > >
> > > > > > On Nov 12, 2011 3:23 PM, "Lance Norskog" <[email protected]> wrote:
> > > > > >
> > > > > > > Is there a convention for multi-file matrices? For example, the
> > > > > > > DistributedRowMatrix?
> > > > > > >
> > > > > > > --
> > > > > > > Lance Norskog
> > > > > > > [email protected]
> >
> > --
> > Lance Norskog
> > [email protected]
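
To make the "which row is which" point from the thread concrete, here is a
minimal sketch of the idea that a row's identity lives in its key and not in
its position within or across part files. It is plain Python, not Mahout or
Hadoop code: each "part file" is just a list of (row_index, vector) tuples
standing in for IntWritable/VectorWritable records, and the names
assemble_rows and parts are hypothetical, chosen only for illustration.

```python
# Hypothetical stand-in for a multi-file DRM: each "part file" is a list of
# (row_index, vector) pairs, mimicking IntWritable/VectorWritable records.
def assemble_rows(part_files):
    """Collect rows from all parts into a dict keyed by row index."""
    rows = {}
    for part in part_files:
        for row_index, vector in part:
            if row_index in rows:
                # Two records claiming the same row means the input is broken.
                raise ValueError(f"duplicate row index {row_index}")
            rows[row_index] = vector
    return rows


# Parts arrive in arbitrary order, and rows inside a part are not sorted.
parts = [
    [(3, [0.0, 1.0]), (0, [1.0, 0.0])],
    [(2, [0.5, 0.5]), (1, [2.0, 3.0])],
]
rows = assemble_rows(parts)

# Sort by key only when an explicit row order is actually needed.
ordered = [rows[i] for i in sorted(rows)]
```

Because every lookup goes through the row index, reading the parts in a
different order gives the same matrix, which is the property a consumer can
rely on when no sort order is guaranteed.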
