Neat! Please keep the list apprised when you have something to demo.

Kind regards,
Steve Watt

From: Edward Capriolo <edlinuxg...@gmail.com>
To: common-user@hadoop.apache.org
Date: 01/11/2010 01:55 PM
Subject: Bible Code and some input format ideas

Hey all,

I saw a special on Discovery about bible code.
http://en.wikipedia.org/wiki/Bible_code

I am designing something in Hadoop to do bible code on any text (not just the Bible). I have a rough idea of how to make all the parts efficient in map/reduce. I have a little challenge that I originally thought I could solve with a custom InputFormat, but it seems I may have to do this in a standalone program.

Let's assume your input looks like this:

    Is there any bible-code in this text?
    I don't know.

The end result might look like this (assuming I take every 5th letter):

    irbcn
    tdn__

The first part of the process: given an input text, we have to strip out a user-configured list of things: '\t' '-' '.' '?'. That I have no problem with.

The second part of the process: I would like to get the data to be the proper width, in this case 5 characters. This is a challenge because, assuming a line is 5 characters, e.g. 'done?', once it is cleaned it will be 4 characters, 'done'. This offset of -1 shifts the rest of the data, the next line might have another offset, and so on.

Originally I was thinking I could create an NCharacterInputFormat, but it seems like this stage of the process cannot easily be done in map/reduce. I guess I need to write a single-threaded program to read through the data and make the correct offsets (5 characters per line). Unless someone else has an idea.
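
For illustration only, the single-threaded cleanup and re-wrapping pass described above could be sketched roughly as follows. This is not code from the thread: the names `STRIP_CHARS`, `clean`, `fixed_width_rows`, and `column` are hypothetical. The sketch also assumes that spaces, newlines, and apostrophes are stripped and that letters are lowercased, since that is what reproduces the sample output in the message.

```python
from typing import Iterable, Iterator

# Assumption: the message lists '\t', '-', '.', '?'. Space, newline and
# apostrophe are added here so the sample sentence gives the sample output.
STRIP_CHARS = frozenset("\t-.?' \n")


def clean(chunks: Iterable[str], strip_chars=STRIP_CHARS) -> Iterator[str]:
    """Yield lowercased characters one at a time, skipping stripped ones."""
    for chunk in chunks:
        for ch in chunk:
            if ch not in strip_chars:
                yield ch.lower()


def fixed_width_rows(chunks: Iterable[str], width: int, pad: str = "_",
                     strip_chars=STRIP_CHARS) -> Iterator[str]:
    """Regroup the cleaned characters into rows of exactly `width` characters.

    Original line boundaries are ignored, so a line that shrinks after
    cleaning simply pulls characters forward from the next line.
    """
    row = []
    for ch in clean(chunks, strip_chars):
        row.append(ch)
        if len(row) == width:
            yield "".join(row)
            row = []
    if row:
        # Pad the final short row so every row has the same width.
        yield "".join(row).ljust(width, pad)


def column(rows: Iterable[str], index: int) -> str:
    """Read one column top to bottom, i.e. every `width`-th character."""
    return "".join(r[index] for r in rows)


if __name__ == "__main__":
    text = ["Is there any bible-code in this text?\n", "I don't know.\n"]
    rows = list(fixed_width_rows(text, 5))
    for r in rows:
        print(r)
    print(column(rows, 0))  # irbcntdn
```

Reading column 0 of the 5-wide rows gives `irbcntdn`, which matches the `irbcn tdn__` result in the message once it is regrouped into fives and padded. Because the pass streams characters and carries only one partial row, memory use stays constant regardless of input size, but it is inherently sequential, which is the difficulty the message raises for map/reduce.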