There is no lookup step. The work is done by the shuffle and sort (with a secondary sort for multiple keys) in Map/Reduce.

The key problem is to join your record files with the lookup tables:

record file        lookup table
K1  R1             K1  V1
K2  R2             K2  V2
...                ...

which gives you, after the join:

K1  R1  V1
K2  R2  V2

You can implement either a map-side join or a reduce-side join to do this. When a record carries multiple keys, you also need a secondary sort to preserve the order of the keys.
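To make this concrete, here is a minimal reduce-side join sketch. It assumes tab-separated inputs of the form K<TAB>R (record file) and K<TAB>V (lookup file); the class names and the "lookup" file-name prefix are only placeholders. The reducer buffers the record entries for a key; with a proper secondary sort you could arrange for the lookup value to arrive first and avoid the buffering.

// Reduce-side join sketch. The mapper tags each line with its source file
// ("L" = lookup table, "R" = record file); the reducer re-assembles
// "K  R  V" for every record whose key has a matching lookup value.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  public static class TaggingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Tag by input file name ("lookup*" vs. everything else).
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = file.startsWith("lookup") ? "L" : "R";
      String[] parts = line.toString().split("\t", 2);   // K <TAB> payload
      ctx.write(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
    }
  }

  public static class JoinReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String lookupValue = null;
      List<String> records = new ArrayList<String>();
      for (Text t : values) {
        String[] parts = t.toString().split("\t", 2);
        if ("L".equals(parts[0])) {
          lookupValue = parts[1];        // V for this key
        } else {
          records.add(parts[1]);         // records sharing this key
        }
      }
      if (lookupValue == null) return;   // inner join: drop unmatched keys
      for (String r : records) {
        ctx.write(key, new Text(r + "\t" + lookupValue));   // K  R  V
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setMapperClass(TaggingMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // args[0] holds both the record file(s) and the lookup file(s).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}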

There is nothing broken about this; moreover, it scales simply by tuning the Map/Reduce configuration parameters or adding more nodes. In contrast, to scale up a memcache lookup approach you would need to tune the memcache service itself (servers, bandwidth, connection pools, etc.), a bottleneck that cannot be addressed by Map/Reduce settings alone.


On 6/16/2011 12:19 PM, Ian Upright wrote:
Again... this is totally not making sense to me.  If your lookups are being
done ahead of time, then how are they being done?  What process is doing all
these lookups for R1 to get the values V1, V2, V3?  Is this lookup process
done in parallel, or on the cluster?

If this is indeed faster, then I would argue there is something very broken
about your lookup process when it is done inside the tasks.

Ian

Suppose you are looking up a value V for a key K, and V is required by an
upcoming process. Suppose the data in that process has the form

R1  K1 K2 K3

where R1 is the record number and K1 to K3 are the keys occurring in the
record, which means in the lookup case you would query for V1, V2, V3.

Using an inner join you can attach all the V values for a single record
and prepare the data as

R1  K1 K2 K3  V1 V2 V3

so that each record carries the complete information for the next process.
You pay in storage for the efficiency. Even taking into account the time
required to prepare the data, it is still faster than the lookup approach.
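For the multi-key case, here is a sketch of the first pass only, assuming each record line looks like R1<TAB>K1 K2 K3 (the format and class name are placeholders). It emits one (key, record-id:position) pair per key so that each key can be joined with the lookup table as described above; a second job then groups by record id, with a secondary sort on the position, to rebuild R1 K1 K2 K3 V1 V2 V3.

// First pass for multi-key records: explode each record into one output
// pair per key, remembering the record id and the key's position so the
// original order can be restored after the join.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExplodeRecordMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 2);   // R1 <TAB> K1 K2 K3
    String recordId = parts[0];
    String[] keys = parts[1].split(" ");
    for (int i = 0; i < keys.length; i++) {
      // key = Ki, value = "recordId:position"
      ctx.write(new Text(keys[i]), new Text(recordId + ":" + i));
    }
  }
}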

I have also tried Tokyo Cabinet; you need to compile and install some
extensions to get it working, and getting the APIs to work can be painful.
If you don't need to update the lookup table, installing TC, memcache, or
MongoDB locally on each node is the most efficient of the lookup-based
solutions, because all the lookups stay local.


On 6/15/2011 5:56 PM, Ian Upright wrote:
If the data set doesn't fit in working memory but is still of a reasonable
size (let's say a few hundred gigabytes), then I'd probably use something
like this:

http://fallabs.com/tokyocabinet/

From reading the Hadoop docs (I'm very new to Hadoop), I might use
DistributedCache to replicate that database around.  My impression is that
this would be among the most efficient things one could do.
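For reference, my rough picture of that DistributedCache pattern is something like the sketch below (the paths and class name are just placeholders): the table gets shipped to every node and loaded once per task, so the map() calls never leave the process. Of course this only helps while the table fits in a task's heap, which is exactly my problem.

// Map-side lookup sketch: load the cached lookup file into a HashMap in
// setup(), then join against it locally in map().
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideLookupMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException {
    // The driver would have registered the file with something like:
    //   DistributedCache.addCacheFile(new URI("/lookup/table.tsv"),
    //                                 job.getConfiguration());
    Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] kv = line.split("\t", 2);                 // K <TAB> V
      lookup.put(kv[0], kv[1]);
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split("\t", 2);     // K <TAB> R
    String v = lookup.get(parts[0]);
    if (v != null) {                                     // inner-join semantics
      ctx.write(new Text(parts[0]), new Text(parts[1] + "\t" + v));
    }
  }
}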

However, for my particular application, even using Tokyo Cabinet introduces
too much inefficiency, and plain old in-memory lookups are by far the most
efficient.  (Not to mention that some of the lookups I'm doing use
specialized trees that can't be done with Tokyo Cabinet or any typical DB,
but that's beside the point.)

I'm having trouble understanding your more efficient method of using more
data and HDFS, and how it could possibly be more efficient than, say, the
above approach.

How does increasing the data size minimize the lookups?

Ian

I had the same problem before: a big lookup table too large to load into
memory.

I tried and compared the following approaches: an in-memory MySQL DB, a
dedicated central memcache server, a dedicated central MongoDB server, and a
local DB model (each node has its own MongoDB server).

The local DB model is the most efficient. I believe the dedicated-server
approach could be improved if the number of servers were increased and the
load distributed; I only tried a single server.

But later I dropped the lookup-table approach altogether. Instead, I
attached the table information to the data in HDFS (which can be thought of
as an inner-join step in a database), which significantly increases the size
of the data sets but avoids the bottleneck of table lookups. There is a
trade-off: with no table lookups, the data to process is large (TB scale),
whereas a lookup table could save 90% of the data storage.

According to our experiments on a 30-node cluster, attaching the information
in HDFS is even 20% faster than the local DB model. When the information is
attached in HDFS, it is also easier to iterate on the Map/Reduce
configuration to further improve efficiency.

Shi

On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:
Is the lookup table constant across each of the tasks? You could try putting it 
into memcached:

http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf

Matt

-----Original Message-----
From: Ian Upright [mailto:i...@upright.net]
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org
Subject: large memory tasks

Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
something.

Let's say I have a task that requires 16 GB of memory in order to execute.
Let's say, hypothetically, it's some sort of big lookup table that needs
that kind of memory.

I could have 8 cores run the task in parallel (multithreaded), and all 8
cores can share that 16 GB lookup table.

On another machine, I could have 4 cores run the same task, and they still
share that same 16 GB lookup table.

Now, with my understanding of Hadoop, each task has its own memory.

So if I have 4 tasks that run on one machine and 8 tasks on another, then
the 4 tasks need a 64 GB machine and the 8 tasks need a 128 GB machine. But
really, let's say I only have two machines, one with 4 cores and one with 8,
each machine having only 24 GB.

How can the work be evenly distributed among these machines?  Am I missing
something?  What other ways can this be configured such that this works
properly?

Thanks, Ian