>
> Does that make sense, and are there any suggestions for doing this?
>

Yep, seems fine. Just use Hama's I/O system and the ArrayWritable trick
for dense matrices in a SequenceFile; I guess that would be the best
solution.
There is a little overhead in SequenceFiles because they use zlib
compression by default, so text files may be just as fast, but then you
have to parse the strings.
You can turn off compression by setting "io.seqfile.compression.type" to
"NONE" in the configuration.
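For example, in a Hadoop-style configuration file (core-site.xml or, as an assumption, whatever site file your Hama setup reads), the property would look like this:

```xml
<property>
  <name>io.seqfile.compression.type</name>
  <value>NONE</value>
  <!-- valid values are NONE, RECORD, and BLOCK; the default applies zlib -->
</property>
```

You can of course also set the same key programmatically on the job's Configuration object before writing.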

If you need additional tips, don't hesitate to come back and ask ;)
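The chunked "hyperslab" layout Noah describes below can be sketched in plain Java. The class and method names here are mine, not part of Hama: each chunk is keyed by the corner of its hyperslab, and a row-major offset locates a value inside the chunk's flat payload (which you could then wrap in an ArrayWritable as the SequenceFile value).

```java
// Illustrative sketch: map a global coordinate in a dense N-d array to
// (a) the corner of the fixed-size chunk that contains it, and
// (b) its row-major offset inside that chunk's flat payload.
public class DenseChunker {

    /** Corner of the chunk containing `coord`, for a chunk edge length `edge`. */
    public static int[] chunkCorner(int[] coord, int edge) {
        int[] corner = new int[coord.length];
        for (int d = 0; d < coord.length; d++) {
            corner[d] = (coord[d] / edge) * edge;
        }
        return corner;
    }

    /** Row-major offset of `coord` inside its chunk's flat double[] payload. */
    public static int offsetInChunk(int[] coord, int edge) {
        int off = 0;
        for (int d = 0; d < coord.length; d++) {
            off = off * edge + (coord[d] % edge);
        }
        return off;
    }

    public static void main(String[] args) {
        int[] corner = chunkCorner(new int[]{123, 7, 999}, 100);
        System.out.println(corner[0] + "," + corner[1] + "," + corner[2]);
        System.out.println(offsetInChunk(new int[]{123, 7, 999}, 100));
    }
}
```

With 1000^3 arrays and a chunk edge of 100, this gives (1000/100)^3 = 1000 keys of 100^3 = 10^6 doubles each, instead of a billion coordinate keys.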

On 28 March 2012 16:06, Noah Watkins <[email protected]> wrote:

> Thanks for the feedback. I'll focus on the dense array challenge for now.
>
> We will be examining 1000^3 arrays, multiple of which will represent
> changes to a spatial environment over time. That being the case, I think
> (but could be wrong) that representing each individual coordinate value
> is overkill, and that an array should instead be stored in chunks. For
> example, rather than store each coordinate value as an HBase key or a
> value in a sequence file (resulting in N billion keys), an array should
> be decomposed and stored as contiguous array hyperslabs. Then a key
> becomes, for example, the corner of a hyperslab.
>
> Does that make sense, and are there any suggestions for doing this? I
> think as you said, simply using ArrayWritable as a SequenceFile value would
> work?
>
> As for our algorithms, currently we are interested in only structural
> manipulation, such as extracting hyperslabs. We will focus on analysis
> later, but the chunked solution should be OK for that, too.
>
>
> On Mar 27, 2012, at 11:20 PM, Thomas Jungblut wrote:
>
> > Hey, besides HBase you can use SequenceFiles, which have key/value
> > pairs. So normally you would use some kind of <VectorWritable,
> > NullWritable> pair; VectorWritable is found in Mahout, for example,
> > which has a good math package for sparse and dense vectors.
> >
> > If you don't want vector classes then you can use ArrayWritable for dense
> > and MapWritable for sparse data.
> > It depends also on what you're doing with your data, so if you have more
> > information about the algorithm, we can give you a better suggestion ;)
> >
> > On 28 March 2012 00:51, Edward J. Yoon <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I believe that HBase is the best way to store multi-dimensional
> >> arrays. HBase remains storage-efficient as the number of dimensions
> >> grows, provides ordering, and also allows you to record and access
> >> data corrections and updates directly via the HBase client library.
> >>
> >> Another option is to use SequenceFile and MapFile. Once the data is
> >> loaded into the program initially, your math operations can run
> >> directly in memory and be synchronized using the standard BSP APIs.
> >>
> >> Thanks.
> >>
> >> On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <[email protected]>
> >> wrote:
> >>> Hi Hama list,
> >>>
> >>> I'm interested in using Hama to process large multi-dimensional
> >>> arrays (sparse and dense). What is the best way to store and
> >>> represent this type of data for processing in Hama?
> >>>
> >>> Thanks,
> >>> Noah
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >>
> >
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
>
>


-- 
Thomas Jungblut
Berlin <[email protected]>
