I'm guessing you want to fix the width of the text to avoid the problem that, if you split by block, every split but the first starts at an unknown offset.

Most texts have natural divisions that I'm guessing you'll want to respect anyway. In the Bible these would be the books; in more recent works, the chapters. Could you instead set up your InputFormat to split on these divisions? Then you don't have to go through the single-threaded step, and in most cases each division will be small enough for a single mapper to handle (though the load won't necessarily be well balanced).
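
As a rough illustration of splitting on natural divisions, here is a minimal sketch in plain Python rather than a Hadoop InputFormat. The heading pattern and the name `split_on_headings` are hypothetical; a real job would need a custom InputFormat and RecordReader that find the same boundaries. Each division starts at its own known offset, so no global pass over the text is needed.

```python
import re

# Hypothetical heading pattern; real texts will need their own.
HEADING = re.compile(r"^CHAPTER \d+[ \t]*$", re.MULTILINE)

def split_on_headings(text):
    """Return a list of (heading, body) pairs, one per division.

    Any text before the first heading is ignored in this sketch.
    """
    matches = list(HEADING.finditer(text))
    divisions = []
    for i, m in enumerate(matches):
        # A division runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        divisions.append((m.group().strip(), text[m.end():end].strip()))
    return divisions

book = "CHAPTER 1\nIn the beginning.\nCHAPTER 2\nAnd then.\n"
for heading, body in split_on_headings(book):
    print(heading, "->", repr(body))
```

Each (heading, body) pair could then be handed to one mapper, which cleans and lays out its own division independently of the others.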

Alan.

On Jan 11, 2010, at 11:52 AM, Edward Capriolo wrote:

Hey all,
I saw a special on Discovery about the Bible code.
http://en.wikipedia.org/wiki/Bible_code

I am designing something in Hadoop to do Bible code on any text (not
just the Bible). I have a rough idea of how to make all the parts
efficient in MapReduce. I have a little challenge that I originally
thought I could solve with a custom InputFormat, but it seems I
may have to do this in a standalone program.

Let's assume your input looks like this:

Is there any
bible-code in this
text? I don't know.

The end result might look like this (assuming I take every 5th letter):

irbcn
tdn__
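
The example above can be reproduced with a short Python sketch. The function name `skip_code` is made up for illustration, and the cleaning rule is an assumption inferred from the sample output: everything except letters is dropped and case is folded, which is broader than the configurable strip list.

```python
def skip_code(text, skip=5, width=5, pad="_"):
    """Keep every `skip`-th letter of the cleaned text, in rows of `width`."""
    # Assumption: only letters survive cleaning, and case is folded.
    cleaned = [c.lower() for c in text if c.isalpha()]
    picked = "".join(cleaned[::skip])
    rows = [picked[i:i + width] for i in range(0, len(picked), width)]
    # Pad the final short row so that every row has the same width.
    if rows:
        rows[-1] = rows[-1].ljust(width, pad)
    return rows

sample = "Is there any\nbible-code in this\ntext? I don't know."
print("\n".join(skip_code(sample)))
# prints:
# irbcn
# tdn__
```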

The first part of the process: given an input text, we have to strip
out a user-configured list of characters ('\t', '-', '.', '?'). That I
have no problem with.

The second part of the process: I would like to get the data to be the
proper width, in this case 5 characters. This is a challenge because,
assuming a line is 5 characters, e.g. 'done?', once it is cleaned it
will be 4 characters: 'done'. This offset of -1 shifts the rest of the
data, the next line might add another offset, and so on.

Originally I was thinking I could create an NCharacterInputFormat, but
it seems this stage of the process cannot easily be done in
map/reduce. I guess I need to write a single-threaded program to read
through the data and produce the correct offsets (5 characters per
line), unless someone else has an idea.
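
A minimal sketch of such a single-threaded pass, in Python, might look like the following. The name `reflow` is hypothetical, and the strip set is the four characters listed earlier plus the newline, which is an added assumption so that line breaks in the input do not count towards the width.

```python
import io

# Assumed strip list: the four characters named earlier, plus newline.
STRIP = set("\t-.?\n")

def reflow(lines, width=5, strip=STRIP):
    """Yield cleaned text in rows of `width` characters.

    The last row may be shorter than `width`.
    """
    buf = []
    for line in lines:
        for ch in line:
            if ch in strip:
                continue  # dropped characters never advance the offset
            buf.append(ch)
            if len(buf) == width:
                yield "".join(buf)
                buf = []
    if buf:
        yield "".join(buf)

source = io.StringIO("done?\nwell.\n-ok-\n")
rows = list(reflow(source))
print(rows)
```

Because the buffer carries over from one input line to the next, the -1 shift left by 'done?' is absorbed by the following line instead of accumulating.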
