On Mon, Jan 18, 2010 at 5:18 PM, Zaharije Pasalic <[email protected]> wrote:
> On Tue, Jan 19, 2010 at 12:13 AM, stack <[email protected]> wrote:
> > On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
> > [email protected]> wrote:
> >> The importing process is really a simple one: a small MapReduce program
> >> reads the CSV file, splits the lines and inserts them into the table
> >> (only Map, no Reduce part). We are using the default Hadoop
> >> configuration (on 7 nodes we can run 14 maps). We are also using 32MB
> >> for writeBufferSize on HBase, and we set setWriteToWAL to false.
> >>
> > The mapreduce tasks are running on the same nodes as hbase+datanodes?
> > With 8G of RAM only, that might be a bit of a stretch.  You have
> > monitoring on these machines?  Any swapping?  Or are they fine?
> >
>
> No, there is no swapping at all. Also CPU usage is really small.
>

OK.  Then it's unlikely MapReduce is robbing resources from the datanodes
(what's I/O like on these machines?  Load?).

> > Are you inserting one row only per map task or more than this?  Are you
> > reusing an HTable instance?  Or, failing that, passing the same
> > HBaseConfiguration each time?  If you make a new HTable with a new
> > HBaseConfiguration each time then it does not make use of the cache of
> > region locations; it has to go fetch them again.  This can make for
> > extra loading on the .META. table.
> >
>
> We have 500000 lines per single CSV file, ~518MB. Default
> splitting is used.

What's that?  A task per line?  Does the line have 100 columns on it?  Is
that an MR task per line of a CSV file?  Is the HTable being created per
task?

> We are using a slightly modified TableOutputFormat
> class (I added support for write buffer size).
>
> So, we are instantiating HBaseConfiguration only in the main method, and
> leaving the rest to (Custom)TableOutputFormat.
>

So, you have TOF hooked up as the MR Map output?  (Rough sketches of what I
mean are below, after my sign-off.)

> > Regarding logs, enable DEBUG if you can (see the FAQ for how).
> >
>
> Will provide logs soon ...

Thanks.

> >> The second manifestation is that I can create a new empty table and
> >> start importing data normally, but if I try to import more data into
> >> the same table (now having ~33 million rows) I get really bad
> >> performance and the hbase status page does not work at all (it will
> >> not load in the browser).
> >>
> > That's bad.  Can you tell how many regions you have on your cluster?
> > How many per server?
> >
>
> ~1800 regions on the cluster and ~250 per node. We are using a
> replication factor of 2 (there is no reason why we used 2 instead of
> the default 3).
>
> Also, if I leave the maps running I get the following errors in the
> datanode logs:
>
> 2010-01-18 23:15:15,795 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.177.88.209:50010,
> storageID=DS-515966566-10.177.88.209-50010-1263597214826,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_3350193476599136386_135159 is not valid.
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>         at java.lang.Thread.run(Thread.java:619)
>

But this does not show up in the regionserver, right?  My guess is that
HDFS deals with the broken block.

St.Ack
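For reference, here is a minimal sketch of the HTable-reuse pattern discussed
above, assuming the 0.20-era client API.  The CsvImportMapper name, the
csvimport.table key, the default "mytable" table and the single "d" column
family are made up for illustration; the 32MB buffer and the disabled WAL just
mirror what the thread describes.  The point is that the table is opened once
per map task in setup() and reused for every CSV line, so region locations
stay cached instead of being re-fetched from .META.:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only CSV importer that writes to HBase directly from the map task,
// reusing one HTable per task instead of opening a new one per line.
public class CsvImportMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("d");  // made-up family

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration(context.getConfiguration());
    table = new HTable(conf, conf.get("csvimport.table", "mytable"));  // made-up key/name
    table.setAutoFlush(false);                   // buffer puts client-side
    table.setWriteBufferSize(32 * 1024 * 1024);  // 32MB, as in the thread
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException {
    String[] fields = line.toString().split(",");
    if (fields.length < 2) return;               // skip malformed lines
    Put put = new Put(Bytes.toBytes(fields[0])); // first column as the row key
    put.setWriteToWAL(false);                    // skip the WAL, as described above
    for (int i = 1; i < fields.length; i++) {
      put.add(FAMILY, Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
    }
    table.put(put);                              // held until the 32MB buffer fills
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();                        // push whatever is still buffered
    table.close();
  }
}

A driver for a mapper like this would run it map-only with a NullOutputFormat,
since the puts go to HBase directly rather than through the job's output
format.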
> >> So my question is: what am I doing wrong?  Is the current cluster good
> >> enough to support 50 million records, or is my current 33 million the
> >> limit on the current configuration?  Any hints?  Also, I'm getting
> >> about 800 inserts per second; is this slow?  Any hint is appreciated.
> >>
> > An insert has 100 columns?  Is this 800/second across the whole cluster?
> >
> > St.Ack
>
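And a rough sketch of the other wiring discussed above, with TableOutputFormat
hooked up as the map output of a map-only job and the one HBaseConfiguration
created in main(), which is closer to what Zaharije describes.  CsvImport,
CsvToPutMapper, the "mytable" name and the "d" family are placeholders, the
stock org.apache.hadoop.hbase.mapreduce.TableOutputFormat stands in for the
custom write-buffer variant mentioned in the thread, and the OUTPUT_TABLE key
is assumed to be the one that output format reads:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Driver for a map-only CSV import that hands Puts to TableOutputFormat.
public class CsvImport {

  // Hypothetical mapper that turns one CSV line into one Put.
  public static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

    private static final byte[] FAMILY = Bytes.toBytes("d");  // made-up family

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length < 2) return;                // skip malformed lines
      Put put = new Put(Bytes.toBytes(fields[0]));  // first column as the row key
      put.setWriteToWAL(false);                     // as described in the thread
      for (int i = 1; i < fields.length; i++) {
        put.add(FAMILY, Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
      }
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    // The one and only HBaseConfiguration, created here in main().
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");  // placeholder table name

    Job job = new Job(conf, "csv-import");
    job.setJarByClass(CsvImport.class);
    job.setMapperClass(CsvToPutMapper.class);
    job.setNumReduceTasks(0);                           // map-only, as in the thread
    job.setOutputFormatClass(TableOutputFormat.class);  // or the custom buffer-size subclass
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In this layout the record writer inside the output format owns the HTable, so
there is still one instance per task; the write buffer size would be set
inside the custom TableOutputFormat rather than in the mapper.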
