If the data set doesn't fit in working memory but is still of a reasonable
size (let's say a few hundred gigabytes), then I'd probably use something
like this:

http://fallabs.com/tokyocabinet/

From reading the Hadoop docs (which I'm very new to), I might use
DistributedCache to replicate that database around.  My impression is that
this might be among the most efficient things one could do.
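
Concretely, I imagine the plumbing would look roughly like the untested
sketch below: ship a read-only Tokyo Cabinet file to every node with
DistributedCache and open it locally in each map task.  The file name and
paths are made up, and I've left the actual Tokyo Cabinet open/get calls as
comments, since I haven't tried its Java binding from inside Hadoop yet.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedLookupJob {

  public static class CachedLookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    private String localDbPath;

    @Override
    protected void setup(Context context) throws IOException {
      // DistributedCache has already copied the file onto this node's
      // local disk, once per node rather than once per task.
      Path[] cached =
          DistributedCache.getLocalCacheFiles(context.getConfiguration());
      localDbPath = cached[0].toString();
      // ... open localDbPath read-only with the Tokyo Cabinet Java binding
      //     and keep the handle for the lifetime of the task ...
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... look up keys derived from 'value' against the local database ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // lookup.tch was copied into HDFS beforehand; the path is made up.
    DistributedCache.addCacheFile(new URI("/cache/lookup.tch"), conf);
    Job job = new Job(conf, "lookup-via-distributed-cache");
    job.setJarByClass(CachedLookupJob.class);
    job.setMapperClass(CachedLookupMapper.class);
    // ... input/output paths, formats and key/value classes omitted ...
    job.waitForCompletion(true);
  }
}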

However, for my particular application, even using Tokyo Cabinet introduces
too much inefficiency, and plain old in-memory lookups are by far the most
efficient.  (Not to mention that some of the lookups I'm doing use
specialized trees that can't be done with Tokyo Cabinet or any typical DB,
but that's beside the point.)
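
For what it's worth, the way I was hoping to make that work under Hadoop is
roughly the untested sketch below: load the structure once per JVM into a
static field, run several map threads in that JVM with MultithreadedMapper,
and turn on JVM reuse so the table isn't reloaded for every task.  The table
type and the loading code are just placeholders for my real structures.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class SharedTableJob {

  public static class SharedTableMapper
      extends Mapper<LongWritable, Text, Text, Text> {

    // One copy per JVM, shared (read-only) by every map thread in that JVM.
    private static volatile Map<String, String> table;

    private static synchronized void loadTableOnce() throws IOException {
      if (table == null) {
        Map<String, String> t = new HashMap<String, String>();
        // ... load the lookup structure from local disk or HDFS ...
        table = t;
      }
    }

    @Override
    protected void setup(Context context) throws IOException {
      loadTableOnce();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Read-only lookups against the shared table are thread-safe.
      String hit = table.get(value.toString());
      if (hit != null) {
        context.write(value, new Text(hit));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reuse task JVMs so the table survives across tasks (old-API knob).
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    Job job = new Job(conf, "shared-in-memory-lookup");
    job.setJarByClass(SharedTableJob.class);
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, SharedTableMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 8);  // e.g. 8 threads per JVM
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // ... input/output paths and formats omitted ...
    job.waitForCompletion(true);
  }
}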

I'm having trouble understanding your more efficient method of using more
data and HDFS, and how it could possibly be more efficient than, say, the
approach above.

How does increasing the size of the data minimize the lookups?
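
Or is the idea to run a one-off join job that attaches the lookup fields to
every input record up front, so the main job never does any lookups at all?
Something like the rough sketch below (all class and field names are made
up, and a real version would presumably use a secondary sort instead of
buffering values in the reducer)?

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PreJoin {

  // Tags each lookup-table row (assumed "key<TAB>fields") with "L".
  public static class LookupMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      ctx.write(new Text(parts[0]), new Text("L\t" + parts[1]));
    }
  }

  // Tags each data record (assumed "key<TAB>record") with "D".
  public static class DataMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      ctx.write(new Text(parts[0]), new Text("D\t" + parts[1]));
    }
  }

  // Appends the lookup fields to every data record sharing the same key.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String lookup = null;
      List<String> records = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("L\t")) lookup = s.substring(2);
        else records.add(s.substring(2));
      }
      for (String r : records) {
        // lookup may be null if the key has no lookup entry; fine for a sketch.
        ctx.write(key, new Text(r + "\t" + lookup));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(PreJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, LookupMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, DataMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    job.waitForCompletion(true);
  }
}

I can see how that trades storage for lookup time; I'm just not yet
convinced it would beat the in-memory structures in my case.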

Ian

>I had the same problem before: a big lookup table, too large to load into
>memory.
>
>I tried and compared the following approaches: an in-memory MySQL DB, a
>dedicated central memcached server, a dedicated central MongoDB server, and
>a local-DB model (each node has its own MongoDB server).
>
>The local-DB model is the most efficient one.  I believe the dedicated-server
>approach could be improved if the number of servers were increased and the
>load distributed; I only tried a single server.
>
>But later I dropped the lookup-table approach. Instead, I attached the
>table information directly in HDFS (which can be thought of as an inner-join
>step), which significantly increases the size of the data sets but avoids
>the bottleneck of the table lookup. There is a trade-off: with no table
>lookup, the data to process is large (TB scale), whereas a lookup table
>could save 90% of the data storage.
>
>According to our experiments on a 30-node cluster, attaching the information
>in HDFS is even 20% faster than the local-DB model. When attaching the
>information in HDFS, it is also easier to tune the Map/Reduce configuration
>to further improve the efficiency.
>
>Shi
>
>On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:
>> Is the lookup table constant across each of the tasks? You could try putting 
>> it into memcached:
>>
>> http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
>>
>> Matt
>>
>> -----Original Message-----
>> From: Ian Upright [mailto:i...@upright.net]
>> Sent: Wednesday, June 15, 2011 3:42 PM
>> To: common-user@hadoop.apache.org
>> Subject: large memory tasks
>>
>> Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
>> something.
>>
>> Let's say I have a task that requires 16 GB of memory in order to execute.
>> Let's say, hypothetically, it's some sort of big lookup table that
>> needs that kind of memory.
>>
>> I could have 8 cores run the task in parallel (multithreaded), and all 8
>> cores can share that 16 GB lookup table.
>>
>> On another machine, I could have 4 cores run the same task, and they still
>> share that same 16 GB lookup table.
>>
>> Now, with my understanding of Hadoop, each task has its own memory.
>>
>> So if I have 4 tasks running on one machine and 8 tasks on another, then
>> the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB machine.
>> But really, let's say I only have two machines, one with 4 cores and one
>> with 8, each with only 24 GB.
>>
>> How can the work be evenly distributed among these machines?  Am I missing
>> something?  How else can this be configured so that it works properly?
>>
>> Thanks, Ian
>
