On Tue, Jan 19, 2010 at 2:38 AM, stack <[email protected]> wrote:
> On Mon, Jan 18, 2010 at 5:18 PM, Zaharije Pasalic <
> [email protected]> wrote:
>
>> On Tue, Jan 19, 2010 at 12:13 AM, stack <[email protected]> wrote:
>> > On Mon, Jan 18, 2010 at 8:47 AM, Zaharije Pasalic <
>> > [email protected]> wrote:
>> >> The importing process is really simple: a small MapReduce program
>> >> reads the CSV file, splits the lines and inserts them into the table (Map only,
>> >> no Reduce part). We are using the default Hadoop configuration (on 7 nodes
>> >> we can run 14 maps). Also we are using 32MB for writeBufferSize on
>> >> HBase, and we set setWriteToWAL to false.
>> >>
>> >>
>> > The mapreduce tasks are running on same nodes as hbase+datanodes?  With 8G
>> > of RAM only, that might be a bit of a stretch.  You have monitoring on these
>> > machines?  Any swapping?  Or are they fine?
>> >
>> >
>>
>> No, there is no swapping at all. Also cpu usage is really small.
>>
>>
> OK.  Then it's unlikely MapReduce is robbing resources from the datanodes (what's
> i/o like on these machines?  Load?).

We are using the Rackspace cloud, so I'm not sure about i/o (I will try to
check with their support). Currently there is no other load on those
servers except when I run MapReduce.

>
>> > Are you inserting one row only per map task or more than this?  You are
>> > reusing an HTable instance?  Or failing that passing the same
>> > HBaseConfiguration each time?  If you make a new HTable with a new
>> > HBaseConfiguration each time then it does not make use of the cache of region
>> > locations; it has to go fetch them again.  This can make for extra loading
>> > on the .META. table.
>> >
>>
>> We have 500,000 lines per CSV file (~518MB). Default
>> splitting is used.
>
>
> What's that?  A task per line?  Does the line have 100 columns on it?  Is
> that a MR task per line of a CSV file?  Is the HTable being created per
> Task?
>
>

I'm not sure I understand "task per line". Did you mean one map per
line? If so, no: one map will parse ~6K lines
(so ~6K rows are written per map).

Here is a snippet of the job configuration (createJob):

    // Job configuration
    Job job = new Job(conf, "hbase import");
    job.setJarByClass(HBaseImport2.class);
    job.setMapperClass(ImportMapper.class);

    // INPUT
    FileInputFormat.addInputPath(job, new Path(fileName));
                
    // OUTPUT
    job.setOutputFormatClass(CustomTableOutputFormat.class);
    job.getConfiguration().set(CustomTableOutputFormat.OUTPUT_TABLE, tableName);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Writable.class);            
                
    // MISC
    job.setNumReduceTasks(0);

The main method looks like:

    HBaseConfiguration conf = new HBaseConfiguration();
    // parse command line args ...
    Job job = createJob(conf, fileNameFromArgs, tableNameFromArgs);
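
The column qualifiers the mapper reads as "column_name_<i>" are also set
on the configuration while the arguments are parsed, roughly like this
(headerLineFromArgs is just an illustration, the real parsing is a bit longer):

    // Illustrative sketch: push the CSV column names into the job
    // configuration under the "column_name_<i>" keys the mapper reads.
    // The first field is the row key, so qualifiers start at index 1.
    String[] header = headerLineFromArgs.split(",");
    for (int i = 1; i < header.length; i++) {
        conf.set("column_name_" + (i - 1), header[i]);
    }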

and the map part:

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        String name = "";
        try {
            String[] values = value.toString().split(",");

            context.getCounter(Counters.ROWS_WRITTEN).increment(1);

            Put put = new Put(values[0].getBytes());
            put.setWriteToWAL(false);
            for (i = 1; i < values.length; i++) {
                name = values[i];
                put.add("attr".getBytes(),
                        context.getConfiguration().get("column_name_" + (i - 1)).getBytes(),
                        values[i].getBytes());
            }

            context.write(key, put);
        } catch (Exception e) {
            throw new RuntimeException("Values: '" + value + "' ["
                    + i + ":" + name + "]" + "\n" + e.getMessage());
        }
    }

>
>
>
>> We are using a slightly modified TableOutputFormat
>> class (I added support for write buffer size).
>>
>> So, we are instantiating HBaseConfiguration only in main method, and
>> leaving rest to (Custom)TableOutputFormat.
>>
>
> So, you have TOF hooked up as the MR Map output?
>

Yes, see the code above.
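
The only difference between CustomTableOutputFormat and the stock
TableOutputFormat is that the HTable handed to the record writer gets an
explicit write buffer size from the job configuration. A minimal sketch,
assuming the 0.20 TableOutputFormat/TableRecordWriter can be reused this way
(the "custom.write.buffer.size" key name is only illustrative):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class CustomTableOutputFormat extends TableOutputFormat {

        // Config property holding the write buffer size (name is illustrative).
        public static final String WRITE_BUFFER_SIZE = "custom.write.buffer.size";

        @Override
        public RecordWriter<ImmutableBytesWritable, Writable> getRecordWriter(
                TaskAttemptContext context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            HTable table = new HTable(new HBaseConfiguration(conf), conf.get(OUTPUT_TABLE));
            // Buffer puts on the client side and flush in larger batches.
            table.setAutoFlush(false);
            table.setWriteBufferSize(conf.getLong(WRITE_BUFFER_SIZE, 32 * 1024 * 1024));
            return new TableRecordWriter(table);
        }
    }

In createJob that key is then set right next to CustomTableOutputFormat.OUTPUT_TABLE.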

>
>
>>
>> > Regarding logs, enable DEBUG if you can (see the FAQ for how).
>> >
>>
>> Will provide logs soon ...
>>
>
>
> Thanks.
>
>
>
>>
>> >
>> >> The second manifestation is that I can create a new empty table and start
>> >> importing data normally, but if I try to import more data into the same
>> >> table (now having ~33 million rows) I get really bad performance and
>> >> the hbase status page does not work at all (it will not load in the browser).
>> >>
>> >
>> > That's bad.  Can you tell how many regions you have on your cluster?  How
>> > many per server?
>> >
>>
>> ~1800 regions on the cluster and ~250 per node. We are using a replication
>> factor of 2 (there is no particular reason why we used 2 instead of the
>> default 3).
>>
>> Also, if I leave the maps running I get the following errors in the datanode logs:
>>
>> 2010-01-18 23:15:15,795 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.177.88.209:50010,
>> storageID=DS-515966566-10.177.88.209-50010-1263597214826,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.IOException: Block blk_3350193476599136386_135159 is not valid.
>>        at
>> org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>>        at
>> org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>>        at
>> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>>        at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>>        at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>>        at java.lang.Thread.run(Thread.java:619)
>>
>>
> But this does not show up in the regionserver, right?  My guess is that HDFS
> deals with the broken block.
>

No, nothing in the regionserver logs.

> St.Ack
>
>
>> >
>> >
>> >> So my question is: what am I doing wrong? Is the current cluster good
>> >> enough to support 50 million records, or is my current 33 million the
>> >> limit on the current configuration? Also, I'm getting about 800
>> >> inserts per second; is this slow?  Any hint is appreciated.
>> >>
>> >
>> > An insert has 100 columns?  Is this 800/second across the whole cluster?
>> >
>> > St.Ack
>> >
>>
>
