I don't think we currently make any guarantees about the sort order within the parts themselves, or among the various part-files, as they may be created by any number of map-reduce jobs, and are then consumed by map-reduce jobs which have no inter-process communication.
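If a consumer outside of MR does want a deterministic order among the part-files, about the only thing it can lean on is the file names. A minimal Python sketch of that idea follows. This is not Mahout code: the helper name `list_part_files` is hypothetical, it assumes a local directory rather than HDFS, and it assumes the usual Hadoop output naming (`part-00000`, `part-r-00000`, `part-m-00000`). It says nothing about row order inside each part.

```python
import os
import re
import tempfile

# Assumed naming convention for Hadoop job output: part-00000, part-r-00000, part-m-00000, ...
PART_RE = re.compile(r"^part-(?:[mr]-)?(\d+)$")


def list_part_files(directory):
    """Return the part-* file names in `directory`, sorted by numeric part index.

    Hypothetical helper for illustration only; files that do not match the
    assumed naming pattern (for example _SUCCESS) are skipped.
    """
    parts = []
    for name in os.listdir(directory):
        match = PART_RE.match(name)
        if match:
            parts.append((int(match.group(1)), name))
    # Sort on (index, name) so the result is deterministic.
    return [name for _, name in sorted(parts)]


if __name__ == "__main__":
    # Build a throwaway directory that mimics a multi-file matrix on disk.
    with tempfile.TemporaryDirectory() as d:
        for name in ("part-00002", "part-00000", "_SUCCESS", "part-00001"):
            open(os.path.join(d, name), "w").close()
        print(list_part_files(d))
```

Sorting by name only fixes the order in which files are read; whether that order corresponds to anything meaningful (such as ascending row index) depends entirely on the job that wrote them.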
What would ordering even *mean* among map-inputs? Or are you just referring to ordering within each chunk itself? Or to non-MR use of the files?

  -jake

On Sun, Nov 13, 2011 at 10:38 PM, Ted Dunning <[email protected]> wrote:

> Make sure that the files can be ordered, of course. Losing the ordering
> can be really bad.
>
> On Sun, Nov 13, 2011 at 10:34 PM, Jake Mannix <[email protected]> wrote:
>
> > Yeah, in particular, DistributedRowMatrix "is" simply a
> > SequenceFile<IntWritable,VectorWritable>, when in its serialized form.
> > As such, this "file" can be (and typically is) a series of part-* files
> > in a directory (typically on HDFS).
> >
> >   -jake
> >
> > On Sun, Nov 13, 2011 at 10:23 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > It's my understanding drm can be multifile. In fact, stuff like
> > > seq2sparse will produce multifile output, being a MR job itself.
> > >
> > > On Nov 12, 2011 3:23 PM, "Lance Norskog" <[email protected]> wrote:
> > >
> > > > Is there a convention for multi-file matrices? For example, the
> > > > DistributedRowMatrix?
> > > >
> > > > --
> > > > Lance Norskog
> > > > [email protected]
