Thanks for the feedback. I'll focus on the dense array challenge for now.

We will be examining 1000^3 arrays, many of which will represent changes to 
a spatial environment over time. That being the case, I think (but could be 
wrong) that representing each individual coordinate value is overkill, and 
that an array should instead be stored in chunks. For example, rather than 
storing each coordinate value as an HBase key or as a value in a SequenceFile 
(resulting in N billion keys), an array should be decomposed and stored as 
contiguous array hyperslabs. A key then becomes, for example, the corner of 
a hyperslab.

Does that make sense, and are there any suggestions for doing this? I think, 
as you said, simply using ArrayWritable as a SequenceFile value would work?
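To make the chunking idea concrete, here is a minimal sketch of the addressing math: mapping a global (x, y, z) coordinate of a 1000^3 array to the corner of its containing chunk (which would be serialized as the SequenceFile key, e.g. as a Text or custom Writable) and to a row-major offset inside the chunk's flat value array. The class name and the chunk edge length of 100 are my own assumptions, not an existing Hama API:

```java
// Sketch (assumed names, assumed chunk size): global coordinate ->
// chunk-corner key plus local offset, for a chunked 1000^3 array.
public class ChunkAddressing {
    static final int CHUNK = 100; // edge length of a cubic chunk (assumption)

    // Corner of the chunk containing (x, y, z). Serializing this triple
    // would give the SequenceFile (or HBase row) key for the chunk.
    static int[] chunkCorner(int x, int y, int z) {
        return new int[] {
            (x / CHUNK) * CHUNK, (y / CHUNK) * CHUNK, (z / CHUNK) * CHUNK
        };
    }

    // Row-major offset of (x, y, z) inside its chunk's flat value array
    // (e.g. the backing array of an ArrayWritable of DoubleWritables).
    static int localOffset(int x, int y, int z) {
        int lx = x % CHUNK, ly = y % CHUNK, lz = z % CHUNK;
        return (lx * CHUNK + ly) * CHUNK + lz;
    }

    public static void main(String[] args) {
        int[] corner = chunkCorner(250, 999, 0);
        System.out.println(corner[0] + "," + corner[1] + "," + corner[2]); // 200,900,0
        System.out.println(localOffset(250, 999, 0)); // (50*100 + 99)*100 + 0 = 509900
    }
}
```

With this scheme a 1000^3 array becomes only 1000 chunk records instead of a billion point records, and a range scan over keys recovers spatial locality.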

As for our algorithms, currently we are interested only in structural 
manipulation, such as extracting hyperslabs. We will focus on analysis later, 
but the chunked solution should work for that, too.
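For the extraction case, a minimal sketch of pulling a hyperslab (a sub-box) out of a single chunk stored as a flat row-major array might look like the following. The class, method names, and 3D layout are illustrative assumptions, not an existing Hama or Hadoop API:

```java
// Sketch (assumed names): extract the hyperslab
// [x0, x0+nx) x [y0, y0+ny) x [z0, z0+nz) from a cubic chunk of
// edge length `edge`, stored as a flat row-major double[].
public class HyperslabExtract {
    static double[] extract(double[] chunk, int edge,
                            int x0, int y0, int z0, int nx, int ny, int nz) {
        double[] out = new double[nx * ny * nz];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                // The z dimension is contiguous in row-major order,
                // so each z-run can be copied in one arraycopy.
                System.arraycopy(
                    chunk, ((x0 + x) * edge + (y0 + y)) * edge + z0,
                    out, (x * ny + y) * nz, nz);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int edge = 4;
        double[] chunk = new double[edge * edge * edge];
        for (int i = 0; i < chunk.length; i++) chunk[i] = i;
        double[] slab = extract(chunk, edge, 1, 2, 0, 2, 1, 3);
        // element (1,2,0) of the chunk sits at index (1*4 + 2)*4 + 0 = 24
        System.out.println(slab[0]); // 24.0
    }
}
```

A hyperslab spanning several chunks would just repeat this per intersecting chunk and stitch the pieces together by their corners.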


On Mar 27, 2012, at 11:20 PM, Thomas Jungblut wrote:

> Hey, besides HBase you can use SequenceFiles; they have key/value pairs.
> Normally you would use some kind of <VectorWritable, NullWritable> pair;
> VectorWritable can be found, for example, in Mahout, which has a good math
> package for sparse and dense vectors.
> 
> If you don't want vector classes, you can use ArrayWritable for dense
> data and MapWritable for sparse data.
> It also depends on what you're doing with your data, so if you have more
> information about the algorithm, we can give you a better suggestion ;)
> 
> On 28 March 2012 at 00:51, Edward J. Yoon <[email protected]> wrote:
> 
>> Hi,
>> 
>> I believe that HBase is the best way to store multi-dimensional
>> arrays. HBase provides storage efficiency as the number of dimensions
>> grows, ordering capability, and it also allows you to record and access
>> data corrections and updates directly via the HBase client library.
>> 
>> Another option is to use a SequenceFile or MapFile. Once the data is
>> initially loaded into the program, your math operations can run directly
>> in memory and be synchronized using the standard BSP APIs.
>> 
>> Thanks.
>> 
>> On Wed, Mar 28, 2012 at 12:46 AM, Noah Watkins <[email protected]>
>> wrote:
>>> Hi Hama list,
>>> 
>>> I'm interested in using Hama to process large multi-dimensional arrays
>> (sparse and dense). What is the best way to store and represent this type
>> of data for processing in Hama?
>>> 
>>> Thanks,
>>> Noah
>> 
>> 
>> 
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>> 
> 
> 
> 
> -- 
> Thomas Jungblut
> Berlin <[email protected]>
