After digging a bit, I've found that my problem comes from the following lines in the Store class:
void bulkLoadHFile(String srcPathStr) throws IOException {
  Path srcPath = new Path(srcPathStr);

  // Move the file if it's on another filesystem
  FileSystem srcFs = srcPath.getFileSystem(conf);
  if (!srcFs.equals(fs)) {
    LOG.info("File " + srcPath + " on different filesystem than " +
        "destination store - moving to this filesystem.");
    Path tmpPath = getTmpPath();
    FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
    LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
    srcPath = tmpPath;
  }

The equality check for the two filesystems fails in my case and I get the
following log:

2012-07-27 14:47:25,321 INFO org.apache.hadoop.hbase.regionserver.Store: File hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357 on different filesystem than destination store - moving to this filesystem.
2012-07-27 14:47:27,286 INFO org.apache.hadoop.hbase.regionserver.Store: Copied to temporary path on dst filesystem: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
2012-07-27 14:47:27,286 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 to hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.StoreFile: HFile Bloom filter type for c4bbf70a6654422db81884f15f34c712: NONE, but ROW specified in column family configuration
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Moved hfile hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store directory hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F - updating store file list.
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Successfully loaded store file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store F (new location: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)

In my hbase-site.xml I have:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://fs0.cm.cluster:8020/hbase</value>
  <description>The directory shared by RegionServers.</description>
</property>

and in my hdfs-site.xml I have:

<property>
  <name>fs.default.name</name>
  <value>hdfs://fs0.cm.cluster:8020</value>
</property>

As you can see, they point to the same namenode, so I really don't
understand why the above check fails.
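To see what is actually being compared, here is a small standalone check
that mirrors the two lookups (a rough sketch only: inside Store the
destination fs is derived from hbase.rootdir, so FileSystem.get(conf)
below is just an approximation, and the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsEqualityCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    // Source HFile path taken from the log above.
    Path srcPath = new Path("hdfs://fs0.cm.cluster:8020/user/sfu200"
        + "/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357");
    FileSystem srcFs = srcPath.getFileSystem(conf); // same lookup Store does
    FileSystem destFs = FileSystem.get(conf);       // approximates Store's fs
    System.out.println("src  URI: " + srcFs.getUri());
    System.out.println("dest URI: " + destFs.getUri());
    // FileSystem does not override equals(), so this is reference equality:
    // it only holds if both lookups return the same cached instance.
    System.out.println("equal?   " + srcFs.equals(destFs));
  }
}

If the two URIs print the same but equals() still returns false, then as
far as I can tell the lookups must be returning different instances from
Hadoop's client-side FileSystem cache, whose key includes the calling
user's UGI and not just the URI.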
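For completeness, the doBulkLoad() call I mention in my reply below boils
down to roughly this (the table name and source directory are the ones
from my logs; the driver class is just a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "String2Id_bsbm");
    // Second stage of the bulk load: hand the generated HFiles to HBase.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/sfu200/outputBsbm/string2Id"), table);
    table.close();
  }
}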
Regards,
Sever

On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
<fundatureanu.se...@gmail.com> wrote:
> Hi Anil,
>
> I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> sourceDir, HTable table) on a LoadIncrementalHFiles object.
>
> Best,
> Sever
>
> On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <anilgupt...@gmail.com> wrote:
>> Hi Sever,
>>
>> That's a very interesting thing. Which Hadoop and HBase versions are
>> you using? I am going to run bulk loads tomorrow. If you can tell me
>> which directories in HDFS you compared with /hbase/$table then I will
>> try to check the same.
>>
>> Best Regards,
>> Anil
>>
>> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu
>> <fundatureanu.se...@gmail.com> wrote:
>>
>>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <lakka...@gmail.com>
>>> wrote:
>>>>> For the bulk-loading process, the HBase documentation mentions that in
>>>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>>>>> into its storage directory and making the data available to clients."
>>>>> But from my experience the files also remain in the original location
>>>>> from where they are "adopted". So I guess the data is actually copied
>>>>> into the HBase directory, right? This means that, compared to online
>>>>> importing, when bulk loading you essentially need twice the disk
>>>>> space on HDFS, right?
>>>>
>>>> Yes, if you are generating HFiles on one cluster and loading into a
>>>> separate HBase cluster. If they are co-located, it's just an HDFS mv.
>>>
>>> Hmm, both the HFile generation and the HBase cluster run on top of
>>> the same HDFS cluster. I did a "du" on both the source HDFS directory
>>> and the destination "/hbase" directory and I got the same sizes (+/- a
>>> few bytes). I deleted the source directory from HDFS and then scanned
>>> the table without any problems. Maybe there is a config parameter I'm
>>> missing?
>>>
>>> Sever
>
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: fundatureanu.se...@gmail.com

--
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.se...@gmail.com