Re: Using MapReduce to do table comparing.

2008-07-24 Thread Amber
Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage won't begin until the map stage ends, by which time we have 
already scanned both tables, and the comparing will take almost the same time 
again, because about 90% of the intermediate <key, value> groups will have two 
values under the same key with different contents. If I could specify a number n 
so that the reduce tasks start once there are n intermediate pairs with the same 
key, that would be better. In my case I would set the magic number to 2.

2. I am not sure how Hadoop stores intermediate <key, value> pairs; we could not 
afford it, as the data volume grows, if they are kept in memory.

--
From: "James Moore" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 1:12 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The 
>> table is exported as a csv text file, about 30GB in size, and the csv file is 
>> then imported into an RDBMS table which is dropped and recreated every day. Now 
>> we want to find how many rows are updated during each export-import interval. 
>> The table has a primary key, so deletes and inserts can be found quickly using 
>> RDBMS joins, but we must do a column-to-column comparison to find the 
>> differences between rows (about 90%) with the same primary keys. Our goal is a 
>> comparison process that takes no more than 10 minutes on a 4-node cluster, each 
>> server having 4 4-core 3.0 GHz CPUs, 8GB of memory and a 300GB local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key, 
>> and the new data is imported into HDFS as the input file of our MapReduce job. 
>> Every map task connects to the RDBMS database and selects the old data for 
>> every row; map tasks generate output if differences are found, and there are 
>> no reduce tasks.
>>
>> As you can see, as the number of concurrent map tasks increases, the RDBMS 
>> database will become the bottleneck, so we want to take the RDBMS out of the 
>> picture, but we have no idea how to retrieve the old row for a given key 
>> quickly from HDFS files. Any suggestion is welcome.
> 
> Think of map/reduce as giving you a kind of key/value lookup for free
> - it just falls out of how the system works.
> 
> You don't care about the RDBMS.  It's a distraction - you're given a
> set of csv files with unique keys and dates, and you need to find the
> differences between them.
> 
> Say the data looks like this:
> 
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
> 
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
> 
> Preprocess the csv files to add dates to the values:
> 
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
> 
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
> 
> Feed two days' worth of these files into a Hadoop job.
> 
> The mapper splits these into k=0x1, v=20080710,stuff etc.
> 
> The reducer gets one or two v's per key, and each v has the date
> embedded in it - that's essentially your lookup step.
> 
> You'll end up with a system that can do compares for any two dates,
> and could easily be expanded to do all sorts of deltas across these
> files.
> 
> The preprocess-the-files-to-add-a-date step can probably be included as
> part of your mapper and isn't really a separate step - it just depends on
> how easy it is to use one of the off-the-shelf mappers with your data.
> If it turns out to be its own step, it can be a very simple
> Hadoop job.
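> 
> To make that concrete, here is a rough sketch of the two classes against the
> 0.17-era org.apache.hadoop.mapred API (the class names and the ONLY/CHANGED
> markers are only illustrative; the date could just as well be derived from
> the input file name inside the mapper):
> 
>   import java.io.IOException;
>   import java.util.Iterator;
> 
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.MapReduceBase;
>   import org.apache.hadoop.mapred.Mapper;
>   import org.apache.hadoop.mapred.OutputCollector;
>   import org.apache.hadoop.mapred.Reducer;
>   import org.apache.hadoop.mapred.Reporter;
> 
>   public class CompareJob {
> 
>     // Map: split a preprocessed "key,yyyymmdd,row" line into (key, "yyyymmdd,row").
>     public static class CompareMapper extends MapReduceBase
>         implements Mapper<LongWritable, Text, Text, Text> {
>       public void map(LongWritable offset, Text line,
>                       OutputCollector<Text, Text> out, Reporter reporter)
>           throws IOException {
>         String s = line.toString();
>         int firstComma = s.indexOf(',');
>         out.collect(new Text(s.substring(0, firstComma)),
>                     new Text(s.substring(firstComma + 1)));
>       }
>     }
> 
>     // Reduce: each key arrives with one or two "yyyymmdd,row" values.
>     // One value means an insert or a delete (the embedded date says which);
>     // two values whose row parts differ mean an update.
>     public static class CompareReducer extends MapReduceBase
>         implements Reducer<Text, Text, Text, Text> {
>       public void reduce(Text key, Iterator<Text> values,
>                          OutputCollector<Text, Text> out, Reporter reporter)
>           throws IOException {
>         String first = values.next().toString();
>         if (!values.hasNext()) {
>           out.collect(key, new Text("ONLY:" + first));
>         } else {
>           String second = values.next().toString();
>           // skip the 8-digit date and the comma before comparing the rows
>           if (!first.substring(9).equals(second.substring(9))) {
>             out.collect(key, new Text("CHANGED"));
>           }
>         }
>       }
>     }
>   }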
> 
> -- 
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
> 

Re: Using MapReduce to do table comparing.

2008-07-24 Thread Amber
I agree with you that this is an acceptable method if the time spent exporting 
data from the RDBMS, importing the file into HDFS and then importing data into 
the RDBMS again is considered as well, but it is a single-process/thread method. 
BTW, can you tell me how long your method takes to process those 130 million 
rows, how large the data volume is, and how powerful your physical computers 
are? Thanks a lot!
--
From: "Michael Lee" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 11:51 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> Amber wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The 
>> table is exported as a csv text file, about 30GB in size, and the csv file is 
>> then imported into an RDBMS table which is dropped and recreated every day. Now 
>> we want to find how many rows are updated during each export-import interval. 
>> The table has a primary key, so deletes and inserts can be found quickly using 
>> RDBMS joins, but we must do a column-to-column comparison to find the 
>> differences between rows (about 90%) with the same primary keys. Our goal is a 
>> comparison process that takes no more than 10 minutes on a 4-node cluster, each 
>> server having 4 4-core 3.0 GHz CPUs, 8GB of memory and a 300GB local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key, 
>> and the new data is imported into HDFS as the input file of our MapReduce job. 
>> Every map task connects to the RDBMS database and selects the old data for 
>> every row; map tasks generate output if differences are found, and there are 
>> no reduce tasks.
>>
>> As you can see, as the number of concurrent map tasks increases, the RDBMS 
>> database will become the bottleneck, so we want to take the RDBMS out of the 
>> picture, but we have no idea how to retrieve the old row for a given key 
>> quickly from HDFS files. Any suggestion is welcome.
> 10 million is not bad.  I do this all the time in UDB 8.1 - multiple key 
> columns and multiple value columns - and calculate deltas: insert, 
> delete and update.
> 
> What others have suggested works (I tried a very crude version of what 
> James Moore suggested in Hadoop with 70+ million records), but you have 
> to remember there are other costs (dumping out files, putting them into 
> HDFS, etc.).  It might be better to process straight in the database or 
> do straight file processing. Also, the key is avoiding transactions.
> 
> If you are doing it outside of the database...
> 
> You have 'old.csv' and 'new.csv', both sorted by the primary key (when you 
> extract, make sure you do ORDER BY).  In your application, you open two 
> file handles and read one line at a time.  Build the keys.  If the keys 
> are the same, you compare the two lines to see whether they are the same.  
> If the keys are not the same, the natural ordering tells you whether it is 
> an insert or a delete.  Once you decide, you read the next line (for an 
> insert/delete you only read one line, from one of the files).
> 
> Here is the pseudocode:
> 
> oldFile = File.new(oldFilename, "r")
> newFile = File.new(newFilename, "r")
> outFile = File.new(outFilename, "w")
> 
> oldLine = oldFile.gets
> newLine = newFile.gets
> 
> while true
>    oldKey = convertToKey(oldLine)
>    newKey = convertToKey(newLine)
> 
>    if oldKey < newKey
>       # the key has disappeared from the new file: it is a deletion
>       outFile.puts oldLine.chomp + "," + "DELETE"
>       oldLine = oldFile.gets
>    elsif oldKey > newKey
>       # the key was not in the old file: it is an insert
>       outFile.puts newLine.chomp + "," + "INSERT"
>       newLine = newFile.gets
>    else
>       # same key: compare the whole lines
>       outFile.puts newLine.chomp + "," + "UPDATE" if oldLine != newLine
> 
>       oldLine = oldFile.gets
>       newLine = newFile.gets
>    end
> end
> 
> Okay - I skipped the part where EOF is reached for each file, but you get 
> the point.
> 
> If both the old and new data are in the database, you can open two 
> database connections and just do the same processing without dumping files.
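> 
> For example, in rough Java terms the same merge can be driven by two ordered
> cursors (the JDBC URLs, table and column names are placeholders, and the
> string compare assumes the keys sort the same way the database orders them):
> 
>   import java.sql.Connection;
>   import java.sql.DriverManager;
>   import java.sql.ResultSet;
>   import java.sql.SQLException;
> 
>   public class DbDelta {
>     public static void main(String[] args) throws SQLException {
>       Connection oldConn = DriverManager.getConnection("jdbc:..."); // placeholder
>       Connection newConn = DriverManager.getConnection("jdbc:..."); // placeholder
>       // both cursors ordered by the primary key
>       ResultSet oldRs = oldConn.createStatement().executeQuery(
>           "SELECT pk, row_data FROM old_table ORDER BY pk");
>       ResultSet newRs = newConn.createStatement().executeQuery(
>           "SELECT pk, row_data FROM new_table ORDER BY pk");
>       boolean hasOld = oldRs.next(), hasNew = newRs.next();
>       while (hasOld || hasNew) {
>         int cmp = !hasOld ? 1 : !hasNew ? -1
>             : oldRs.getString(1).compareTo(newRs.getString(1));
>         if (cmp < 0) {                       // key only in old: DELETE
>           System.out.println(oldRs.getString(1) + ",DELETE");
>           hasOld = oldRs.next();
>         } else if (cmp > 0) {                // key only in new: INSERT
>           System.out.println(newRs.getString(1) + ",INSERT");
>           hasNew = newRs.next();
>         } else {                             // same key: compare the rows
>           if (!oldRs.getString(2).equals(newRs.getString(2))) {
>             System.out.println(newRs.getString(1) + ",UPDATE");
>           }
>           hasOld = oldRs.next();
>           hasNew = newRs.next();
>         }
>       }
>     }
>   }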
> 
> I journal about 130 million rows every day for a quant financial database...
> 

Using MapReduce to do table comparing.

2008-07-23 Thread Amber
We have a 10 million row table exported from an AS400 mainframe every day. The 
table is exported as a csv text file, about 30GB in size, and the csv file is 
then imported into an RDBMS table which is dropped and recreated every day. Now 
we want to find how many rows are updated during each export-import interval. 
The table has a primary key, so deletes and inserts can be found quickly using 
RDBMS joins, but we must do a column-to-column comparison to find the 
differences between rows (about 90%) with the same primary keys. Our goal is a 
comparison process that takes no more than 10 minutes on a 4-node cluster, each 
server having 4 4-core 3.0 GHz CPUs, 8GB of memory and a 300GB local RAID5 array.

Below is our current solution:
The old data is kept in the RDBMS with an index created on the primary key, 
and the new data is imported into HDFS as the input file of our MapReduce job. 
Every map task connects to the RDBMS database and selects the old data for 
every row; map tasks generate output if differences are found, and there are 
no reduce tasks.

As you can see, as the number of concurrent map tasks increases, the RDBMS 
database will become the bottleneck, so we want to take the RDBMS out of the 
picture, but we have no idea how to retrieve the old row for a given key 
quickly from HDFS files. Any suggestion is welcome.
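
For reference, here is a rough sketch of the map task in this current solution, 
against the 0.17 org.apache.hadoop.mapred API (the JDBC URL, table and column 
names are only placeholders):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class DbLookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private Connection conn;
      private PreparedStatement lookup;

      public void configure(JobConf job) {
        try {
          // one connection per map task - this is where the RDBMS becomes
          // the bottleneck as the number of concurrent tasks grows
          conn = DriverManager.getConnection("jdbc:...");       // placeholder URL
          lookup = conn.prepareStatement(
              "SELECT row_data FROM old_table WHERE pk = ?");   // placeholder SQL
        } catch (SQLException e) {
          throw new RuntimeException(e);
        }
      }

      // Input lines are "pk,rest-of-row" from the new csv file in HDFS.
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String s = line.toString();
        int comma = s.indexOf(',');
        String key = s.substring(0, comma);
        String newRow = s.substring(comma + 1);
        try {
          lookup.setString(1, key);
          ResultSet rs = lookup.executeQuery();
          // emit only when the old row exists and differs from the new one
          if (rs.next() && !rs.getString(1).equals(newRow)) {
            out.collect(new Text(key), new Text(newRow));
          }
          rs.close();
        } catch (SQLException e) {
          throw new IOException("lookup failed: " + e.getMessage());
        }
      }

      public void close() throws IOException {
        try {
          if (conn != null) conn.close();
        } catch (SQLException ignored) {
        }
      }
    }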

Hadoop 0.17.1 namenode service can't start on Windows XP.

2008-07-18 Thread Amber
Hi, I followed the instructions from 
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/ to install Hadoop 
0.17.1 on my Windows XP computer, whose computer name is AMBER; the current 
user name is User. I installed Cygwin on G:\. I have verified that ssh and 
bin/hadoop version work fine. But when trying to start the dfs service I found 
the following problems:

1. Hadoop can't create the logs directory automatically if it does not exist in 
the install directory.
2. The datanode service can automatically create the 
G:\tmp\hadoop-SYSTEM\dfs\data directory, but the namenode service can't 
automatically create the G:\tmp\hadoop-User directory and its subdirectories. 
Even when I manually created the G:\tmp\hadoop-User\dfs\name\image directory, 
the namenode service still couldn't start. I found the following exceptions in 
the namenode service's log file:
2008-07-18 22:11:46,578 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = amber/116.76.140.27
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; 
compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
/
2008-07-18 22:11:47,234 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=47110
2008-07-18 22:11:47,250 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: 
localhost/127.0.0.1:47110
2008-07-18 22:11:47,265 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-07-18 22:11:47,281 INFO org.apache.hadoop.dfs.NameNodeMetrics: 
Initializing NameNodeMeterics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
fsOwner=User,None,root,Administrators,Users,ORA_DBA
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
supergroup=supergroup
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
isPermissionEnabled=true
2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.Storage: Storage directory 
G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:11:48,359 INFO org.apache.hadoop.ipc.Server: Stopping server on 
47110
2008-07-18 22:11:48,359 ERROR org.apache.hadoop.dfs.NameNode: 
org.apache.hadoop.dfs.InconsistentFSStateException: Directory 
G:\tmp\hadoop-User\dfs\name is in an inconsistent state: storage directory does 
not exist or is not accessible.
 at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:154)
 at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
 at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
 at org.apache.hadoop.dfs.FSNamesystem.(FSNamesystem.java:255)
 at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
 at org.apache.hadoop.dfs.NameNode.(NameNode.java:178)
 at org.apache.hadoop.dfs.NameNode.(NameNode.java:164)
 at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
 at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG: 
/
SHUTDOWN_MSG: Shutting down NameNode at amber/116.76.140.27
/
2008-07-18 22:26:35,734 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = amber/116.76.140.27
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; 
compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
/
2008-07-18 22:26:36,046 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=47110
2008-07-18 22:26:36,062 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: 
localhost/127.0.0.1:47110
2008-07-18 22:26:36,062 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-07-18 22:26:36,093 INFO org.apache.hadoop.dfs.NameNodeMetrics: 
Initializing NameNodeMeterics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
fsOwner=User,None,root,Administrators,Users,ORA_DBA
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
supergroup=supergroup
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
isPermissionEnabled=true
2008-07-18 22:26:37,515 INFO org.apache.hadoop.dfs.Storage: Storage directory 
G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:26:37,515 INFO org.apache.hado

Is Hadoop compatible with the IBM JDK 1.5 64-bit for AIX 5?

2008-07-18 Thread Amber
The Hadoop documentation says "Sun's JDK must be used"; this message is posted 
to confirm whether there is an official statement about this.