After digging a bit, I've found that my problem comes from the following lines in the Store class:
void bulkLoadHFile(String srcPathStr) throws IOException {
  Path srcPath = new Path(srcPathStr);

  // Move the file if it's on another filesystem
  FileSystem srcFs = srcPath.getFileSystem(conf);
  if (!srcFs.equals(fs)) {
    LOG.info("File " + srcPath + " on different filesystem than " +
        "destination store - moving to this filesystem.");
    Path tmpPath = getTmpPath();
    FileUtil.copy(srcFs, srcPath, fs, tmpPath, false, conf);
    LOG.info("Copied to temporary path on dst filesystem: " + tmpPath);
    srcPath = tmpPath;
  }

The equality check for the two filesystems fails in my case and I get the
following log:

2012-07-27 14:47:25,321 INFO org.apache.hadoop.hbase.regionserver.Store: File hdfs://fs0.cm.cluster:8020/user/sfu200/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357 on different filesystem than destination store - moving to this filesystem.
2012-07-27 14:47:27,286 INFO org.apache.hadoop.hbase.regionserver.Store: Copied to temporary path on dst filesystem: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9
2012-07-27 14:47:27,286 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming bulk load file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 to hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.StoreFile: HFile Bloom filter type for c4bbf70a6654422db81884f15f34c712: NONE, but ROW specified in column family configuration
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Moved hfile hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store directory hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F - updating store file list.
2012-07-27 14:47:27,297 INFO org.apache.hadoop.hbase.regionserver.Store: Successfully loaded store file hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/.tmp/90f6b193e6fd48ba8e814c968179abb9 into store F (new location: hdfs://fs0.cm.cluster:8020/hbase/String2Id_bsbm/9028c6a70b30a089b4312c622729e98e/F/c4bbf70a6654422db81884f15f34c712)

In my hbase-site.xml I have:

<property>
  <name>hbase.rootdir</name>
  <value>hdfs://fs0.cm.cluster:8020/hbase</value>
  <description>The directory shared by RegionServers.</description>
</property>

and in my hdfs-site.xml I have:

<property>
  <name>fs.default.name</name>
  <value>hdfs://fs0.cm.cluster:8020</value>
</property>

As you can see, they point to the same namenode, so I really don't
understand why the above check fails.
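To see what is actually being compared, here is a small standalone check
that mirrors the two lookups (a rough sketch only: inside Store the
destination fs is derived from hbase.rootdir, so FileSystem.get(conf)
below is just an approximation, and the class name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsEqualityCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    // Source HFile path taken from the log above.
    Path srcPath = new Path("hdfs://fs0.cm.cluster:8020/user/sfu200"
        + "/outputBsbm/string2Id/F/e6cf2d1b69354e268b79597bf3855357");
    FileSystem srcFs = srcPath.getFileSystem(conf); // same lookup Store does
    FileSystem destFs = FileSystem.get(conf);       // approximates Store's fs
    System.out.println("src  URI: " + srcFs.getUri());
    System.out.println("dest URI: " + destFs.getUri());
    // FileSystem does not override equals(), so this is reference equality:
    // it only holds if both lookups return the same cached instance.
    System.out.println("equal?   " + srcFs.equals(destFs));
  }
}

If the two URIs print the same but equals() still returns false, then as
far as I can tell the lookups must be returning different instances from
Hadoop's client-side FileSystem cache, whose key includes the calling
user's UGI and not just the URI.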
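For completeness, the doBulkLoad() call I mention in my reply below boils
down to roughly this (the table name and source directory are the ones
from my logs; the driver class is just a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "String2Id_bsbm");
    // Second stage of the bulk load: hand the generated HFiles to HBase.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/user/sfu200/outputBsbm/string2Id"), table);
    table.close();
  }
}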
Regards,
Sever

On Fri, Jul 27, 2012 at 1:17 PM, Sever Fundatureanu
<fundatureanu.se...@gmail.com> wrote:
> Hi Anil,
>
> I am using HBase 0.94.0 with Hadoop 1.0.0. The directories are indeed
> the ones mentioned by Bijeet. I can also add that I am doing the 2nd
> stage programmatically by calling doBulkLoad(org.apache.hadoop.fs.Path
> sourceDir, HTable table) on a LoadIncrementalHFiles object.
>
> Best,
> Sever
>
> On Fri, Jul 27, 2012 at 5:40 AM, Anil Gupta <anilgupt...@gmail.com> wrote:
>> Hi Sever,
>>
>> That's a very interesting thing. Which Hadoop and HBase versions are
>> you using? I am going to run bulk loads tomorrow. If you can tell me
>> which directories in HDFS you compared with /hbase/$table then I will
>> try to check the same.
>>
>> Best Regards,
>> Anil
>>
>> On Jul 26, 2012, at 3:46 PM, Sever Fundatureanu
>> <fundatureanu.se...@gmail.com> wrote:
>>
>>> On Thu, Jul 26, 2012 at 6:47 PM, Sateesh Lakkarsu <lakka...@gmail.com>
>>> wrote:
>>>>> For the bulk-loading process, the HBase documentation mentions that in
>>>>> a 2nd stage "the appropriate Region Server adopts the HFile, moving it
>>>>> into its storage directory and making the data available to clients."
>>>>> But from my experience the files also remain in the original location
>>>>> from where they are "adopted". So I guess the data is actually copied
>>>>> into the HBase directory, right? This means that, compared to online
>>>>> importing, when bulk loading you essentially need twice the disk
>>>>> space on HDFS, right?
>>>>
>>>> Yes, if you are generating HFiles on one cluster and loading into a
>>>> separate HBase cluster. If they are co-located, it's just an HDFS mv.
>>>
>>> Hmm, both the HFile generation and the HBase cluster run on top of
>>> the same HDFS cluster. I did a "du" on both the source HDFS directory
>>> and the destination "/hbase" directory and I got the same sizes (+/- a
>>> few bytes). I deleted the source directory from HDFS and then scanned
>>> the table without any problems. Maybe there is a config parameter I'm
>>> missing?
>>>
>>> Sever
>
>
> --
> Sever Fundatureanu
>
> Vrije Universiteit Amsterdam
> E-mail: fundatureanu.se...@gmail.com

--
Sever Fundatureanu

Vrije Universiteit Amsterdam
E-mail: fundatureanu.se...@gmail.com