On Mon, Jan 18, 2010 at 5:18 PM, Zaharije Pasalic <
[email protected]> wrote:

> On Tue, Jan 19, 2010 at 12:13 AM, stack <[email protected]> wrote:
> > On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
> > [email protected]> wrote:
> >> The importing process is a really simple one: a small MapReduce program
> >> reads the CSV file, splits the lines and inserts them into the table (Map
> >> only, no Reduce part). We are using the default Hadoop configuration (on 7
> >> nodes we can run 14 maps). We also use 32MB for writeBufferSize on HBase,
> >> and we set setWriteToWAL to false.
> >>
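
For reference, the two write-path settings mentioned above translate into
client calls roughly like the following (a sketch against the 0.20-era client
API, using org.apache.hadoop.hbase.client.HTable/Put and
org.apache.hadoop.hbase.util.Bytes; the table, family, column and row names
are placeholders, not the poster's schema, and exception handling is omitted):

  HTable table = new HTable(new HBaseConfiguration(), "mytable");
  table.setAutoFlush(false);                    // buffer puts on the client side
  table.setWriteBufferSize(32 * 1024 * 1024);   // 32MB write buffer

  Put put = new Put(Bytes.toBytes("row-00000001"));
  put.setWriteToWAL(false);                     // skip the write-ahead log for speed
  put.add(Bytes.toBytes("d"), Bytes.toBytes("c1"), Bytes.toBytes("value"));
  table.put(put);                               // held in the write buffer until it fills
  table.flushCommits();                         // or pushed out explicitly

Note that skipping the WAL trades durability for speed: puts that are only in
a regionserver's memstore when it crashes are lost.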
> >>
> > The MapReduce tasks are running on the same nodes as HBase + datanodes?
> > With only 8G of RAM, that might be a bit of a stretch.  Do you have
> > monitoring on these machines?  Any swapping?  Or are they fine?
> >
> >
>
> No, there is no swapping at all. Also, CPU usage is really low.
>
>
OK.  Then it's unlikely that MapReduce is robbing resources from the datanodes
(what's I/O like on these machines?  Load?).

> > Are you inserting only one row per map task, or more than that?  Are you
> > reusing an HTable instance?  Or, failing that, passing the same
> > HBaseConfiguration each time?  If you make a new HTable with a new
> > HBaseConfiguration each time then it does not make use of the cache of
> > region locations; it has to go fetch them again.  This can make for extra
> > load on the .META. table.
> >
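
The pattern being asked about looks roughly like this (a sketch of the
direct-write variant, not the poster's actual TableOutputFormat job; the
class, table, family and column names are made up): create the
HBaseConfiguration and HTable once per task in setup() and reuse them for
every record, so the client's cache of region locations is kept instead of
being refetched from .META. for each row.

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class DirectCsvMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;   // one instance per map task, reused for every record

    @Override
    protected void setup(Context context) throws IOException {
      // One configuration and one HTable per task, not per record, so region
      // locations fetched from .META. stay cached for the life of the task.
      HBaseConfiguration conf = new HBaseConfiguration(context.getConfiguration());
      table = new HTable(conf, "mytable");          // placeholder table name
      table.setAutoFlush(false);
      table.setWriteBufferSize(32 * 1024 * 1024);   // 32MB, as described earlier
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException {
      String[] fields = line.toString().split(",");
      Put put = new Put(Bytes.toBytes(fields[0]));  // first field as the row key
      put.setWriteToWAL(false);
      for (int i = 1; i < fields.length; i++) {     // one column per CSV field
        put.add(Bytes.toBytes("d"), Bytes.toBytes("c" + i), Bytes.toBytes(fields[i]));
      }
      table.put(put);                               // buffered client-side
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      table.flushCommits();                         // push any remaining buffered puts
    }
  }

With TableOutputFormat in the picture the HTable lives inside the output
format's record writer instead, but the same rule applies: one configuration
and one table handle per task.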
>
> We have 500000 lines per CSV file (~518MB). Default splitting is used.


What's that?  A task per line?  Does the line have 100 columns on it?  Is
that an MR task per line of the CSV file?  Is the HTable being created per
task?





> We are using a slightly modified TableOutputFormat class (I added support
> for the write buffer size).
>
> So, we are instantiating HBaseConfiguration only in the main method, and
> leaving the rest to the (Custom)TableOutputFormat.
>
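
A guess at how that wiring might look on the driver side (not the poster's
actual code: CsvImport, CsvImportMapper, CustomTableOutputFormat and the
buffer-size property name are made up for illustration; exception handling
and the surrounding main() are omitted). One HBaseConfiguration is built in
main(), the output table and buffer size are pushed into it, and a map-only
job writes Puts through the custom output format:

  // A single HBaseConfiguration created in main() carries the output table
  // name and the (hypothetical) write buffer size to every task.
  HBaseConfiguration conf = new HBaseConfiguration();
  conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");            // placeholder table
  conf.setLong("custom.tof.write.buffer.size", 32 * 1024 * 1024); // made-up property

  Job job = new Job(conf, "csv-import");
  job.setJarByClass(CsvImport.class);
  job.setMapperClass(CsvImportMapper.class);               // emits (ImmutableBytesWritable, Put)
  job.setOutputFormatClass(CustomTableOutputFormat.class); // TableOutputFormat subclass that
                                                           // reads the buffer-size property
  job.setOutputKeyClass(ImmutableBytesWritable.class);
  job.setOutputValueClass(Put.class);
  job.setNumReduceTasks(0);                                // map-only import
  FileInputFormat.addInputPath(job, new Path(args[0]));
  job.waitForCompletion(true);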

So, you have TOF hooked up as the MR Map output?



>
> > Regarding logs, enable DEBUG if you can (see the FAQ for how).
> >
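
Per the FAQ, enabling DEBUG usually amounts to editing conf/log4j.properties
on the cluster nodes and restarting the daemons, roughly:

  log4j.logger.org.apache.hadoop.hbase=DEBUG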
>
> Will provide logs soon ...
>


Thanks.



>
> >
> >> The second manifestation is that I can create a new empty table and start
> >> importing data normally, but if I try to import more data into the same
> >> table (now holding ~33 million rows) I get really bad performance and the
> >> HBase status page does not work at all (it will not load in the browser).
> >>
> > That's bad.  Can you tell how many regions you have on your cluster?  How
> > many per server?
> >
>
> ~1800 regions on the cluster and ~250 per node. We are using a replication
> factor of 2 (there is no particular reason why we used 2 instead of the
> default 3).
>
> Also, if I leave the maps running I get the following errors in the datanode logs:
>
> 2010-01-18 23:15:15,795 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.177.88.209:50010,
> storageID=DS-515966566-10.177.88.209-50010-1263597214826,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_3350193476599136386_135159 is not valid.
>        at
> org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>        at
> org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>        at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>        at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>        at java.lang.Thread.run(Thread.java:619)
>
>
But this does not show up in the regionserver logs, right?  My guess is that
HDFS deals with the broken block.

St.Ack


> >
> >
> >> So my question is: what am I doing wrong? Is the current cluster good
> >> enough to support 50 million records, or is my current 33 million the
> >> limit on the current configuration? Also, I'm getting about 800 inserts
> >> per second; is this slow?  Any hint is appreciated.
> >>
> > An insert has 100 columns?  Is this 800/second across the whole cluster?
> >
> > St.Ack
> >
>
