Re: Bulk load - #Reducers different from #Regions

2012-08-07 Thread Subir S
Bulk load using ImportTsv with pre-split regions for the target table?

Or do you mean setting the number of reducers that ImportTsv must use?

On 8/7/12, Ioakim Perros imper...@gmail.com wrote:
 Hi,

 I am bulk importing (updating) data iteratively, and I would like to be
 able to set the number of reducers for the M/R job to a value different
 from the number of regions of the table I am updating.

 I tried it through job.setNumReduceTasks(#reducers), but the job ignored
 it.

 Is there a way to avoid an intermediary job and set the number of
 reducers explicitly?
 I would be grateful if anyone could shed some light on this.

 Thanks and regards,
 Ioakim
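
For context, a minimal sketch of the usual HFileOutputFormat bulk-load job setup (HBase 0.92-era API; the mapper, table name, and paths are placeholders, not from this thread). HFileOutputFormat.configureIncrementalLoad() resets the reduce task count to the table's current number of regions so the output HFiles line up with region boundaries, which is why an earlier setNumReduceTasks() call appears to be ignored:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJobSketch {

  /** Illustrative mapper: first CSV field is the row key, the rest is one value. */
  public static class SketchMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("ns"), Bytes.toBytes("v"),
          Bytes.toBytes(fields.length > 1 ? fields[1] : ""));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-sketch");
    job.setJarByClass(BulkLoadJobSketch.class);
    job.setMapperClass(SketchMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    job.setNumReduceTasks(32);   // overwritten by configureIncrementalLoad() below

    HTable table = new HTable(conf, "mytable");
    // Sets the TotalOrderPartitioner, the sorting reducer and HFileOutputFormat,
    // and resets the number of reduce tasks to the table's current region count.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/outputs/bulk"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The coupling appears deliberate: with one reducer per region, each output HFile falls inside a single region, so the final load step does not have to split files.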




Re: HBase bulk loaded region can't be splitted

2012-05-12 Thread Subir S
Wouldn't major_compact trigger a split, if the region really needs to split?

However, if you want to pre-split regions for your table, you can use the
RegionSplitter utility as below:

$ export HADOOP_CLASSPATH=`hbase classpath`; hbase org.apache.hadoop.hbase.util.RegionSplitter

This will print the usage.

A sample invocation is: hbase org.apache.hadoop.hbase.util.RegionSplitter -c 10 'mytable' -f ns
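
If you prefer to create the pre-split table programmatically rather than with RegionSplitter, a hedged sketch using the 0.92-era admin API (the table name, column family, and split keys below are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("ns"));

    // Illustrative split points only; derive real ones from the row-key
    // distribution of the data you intend to bulk load.
    byte[][] splitKeys = new byte[][] {
        Bytes.toBytes("1000"),
        Bytes.toBytes("2000"),
        Bytes.toBytes("3000")
    };

    // Creates the table with splitKeys.length + 1 regions up front.
    admin.createTable(desc, splitKeys);
  }
}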


On Fri, May 11, 2012 at 8:37 AM, Bruce Bian weidong@gmail.com wrote:

 Yes, I understand that.
 But after I complete the bulk load, shouldn't it trigger the region server
 to split that region in order to meet the hbase.hregion.max.filesize
 criterion?
 When I try to split the region manually using the Web UI, nothing happens;
 instead, a "Region mytable,,1334215360439.71611409ea972a65b0876f953ad6377e.
 not splittable because midkey=null" message is found in the region server log.


 On Fri, May 11, 2012 at 10:56 AM, Bryan Beaudreault 
 bbeaudrea...@hubspot.com wrote:

  I haven't done bulk loads using the importtsv tool, but I imagine it works
  similarly to the mapreduce bulk load tool we are provided. If so, the
  following stands.

  In order to do a bulk load you need to have a table ready to accept the
  data. The bulk load does not create regions; it only puts data into the
  right place based on existing regions. Since you only have 1 region to
  start with, it makes sense that all the data would go to that one region.
  You should find a way to calculate the regions that you want, create your
  table with pre-created regions, and then re-run the import.
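
For reference, a hedged sketch of the programmatic equivalent of completebulkload (0.92-era LoadIncrementalHFiles API; the path and table name mirror the importtsv commands quoted below). It only moves prepared HFiles into regions that already exist, which is why pre-creating regions matters:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // Moves the HFiles produced by importtsv/HFileOutputFormat into the
    // table's existing regions; no new regions are created here.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path("/outputs/mytable.bulk"), table);
  }
}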
 
  On Thu, May 10, 2012 at 10:50 PM, Bruce Bian weidong@gmail.com
  wrote:
 
   I use importtsv to load data as HFiles:

   hadoop jar hbase-0.92.1.jar importtsv -Dimporttsv.bulk.output=/outputs/mytable.bulk -Dimporttsv.columns=HBASE_ROW_KEY,ns: -Dimporttsv.separator=, mytable /input

   Then I use completebulkload to load that bulk data into my table:

   hadoop jar hbase-0.92.1.jar completebulkload /outputs/mytable.bulk mytable

   However, the table is very large (4.x GB) and it has only one region.
   Oddly, why doesn't HBase split it into multiple regions? It does exceed
   the split size (256 MB).
  
   /hbase/mytable/71611409ea972a65b0876f953ad6377e/ns:
  
  
   To split it, I tried to use the Split button on the HBase Web UI. Sadly,
   it shows:

   org.apache.hadoop.hbase.regionserver.CompactSplitThread: Region
   mytable,,1334215360439.71611409ea972a65b0876f953ad6377e. not
   splittable because midkey=null

   I have more data to load, about 300GB. No matter how much data I have
   loaded, it is still only one region, and it is still not splittable.
   Any ideas?
  
 



Re: Best Hbase Storage for PIG

2012-05-12 Thread Subir S
Could you use completebulkload and see if that works? That should be faster
than HBaseStorage. You could pre-split using:

export HADOOP_CLASSPATH=`hbase classpath`; hbase org.apache.hadoop.hbase.util.RegionSplitter -c 10 'table_name' -f cf_name

On Sat, Apr 28, 2012 at 8:46 PM, M. C. Srivas mcsri...@gmail.com wrote:

 On Thu, Apr 26, 2012 at 4:38 AM, Rajgopal Vaithiyanathan 
 raja.f...@gmail.com wrote:

  Hey all,
 
  The default HBaseStorage() takes a hell of a lot of time for puts.

  In a cluster of 5 machines, insertion of 175 million records took 4 hours
  45 minutes.
  Question: is this good enough?
  Each machine has 32 cores and 32GB RAM with 7*600GB hard disks. HBase's
  heap has been configured to 8GB.
  If the put speed is low, how can I improve it?
 

 Raj, how big is each record?



 
  I tried tweaking the TableOutputFormat by increasing the WriteBufferSize
  to 24MB and adding multi-put (batching 10,000 puts in an ArrayList and
  putting them as a batch). After doing this, it started throwing
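
For reference, a hedged sketch of that client-side batching approach with the 0.92-era HTable API (the table and family names, buffer size, batch size, and row count below are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPutSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");
    table.setAutoFlush(false);                  // buffer puts client-side
    table.setWriteBufferSize(24 * 1024 * 1024); // 24 MB write buffer

    List<Put> batch = new ArrayList<Put>(10000);
    for (int i = 0; i < 1000000; i++) {
      Put put = new Put(Bytes.toBytes(String.format("row-%09d", i)));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      batch.add(put);
      if (batch.size() == 10000) {              // send 10,000 puts at a time
        table.put(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);
    }
    table.flushCommits();                       // flush anything still buffered
    table.close();
  }
}

With auto-flush disabled, table.put(List<Put>) keeps puts in the client write buffer until it fills, so far fewer RPCs reach the region servers.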
 
 



Re: HBaseStorage not working

2012-04-24 Thread Subir S
Looping HBase group.

On Tue, Apr 24, 2012 at 5:18 PM, Royston Sellman 
royston.sell...@googlemail.com wrote:

 We still haven't cracked this, but a bit more info (HBase 0.95; Pig 0.11):

 The script below runs fine in a few seconds using Pig in local mode, but with
 Pig in MR mode it sometimes works rapidly and usually takes 40 minutes to an
 hour.

 --hbaseuploadtest.pig
 register /opt/hbase/hbase-trunk/lib/protobuf-java-2.4.0a.jar
 register /opt/hbase/hbase-trunk/lib/guava-r09.jar
 register /opt/hbase/hbase-trunk/hbase-0.95-SNAPSHOT.jar
 register /opt/zookeeper/zookeeper-3.4.3/zookeeper-3.4.3.jar
 raw_data = LOAD '/data/sse.tbl1.HEADERLESS.csv' USING PigStorage( ',' ) AS
 (mid : chararray, hid : chararray, mf : chararray, mt : chararray, mind :
 chararray, mimd : chararray, mst : chararray );
 dump raw_data;
 STORE raw_data INTO 'hbase://hbaseuploadtest' USING
 org.apache.pig.backend.hadoop.hbase.HBaseStorage ('info:hid info:mf info:mt
 info:mind info:mimd info:mst');

 i.e.
 [hadoop1@namenode hadoop-1.0.2]$ pig -x local ../pig-scripts/hbaseuploadtest.pig
 WORKS EVERY TIME!!
 But
 [hadoop1@namenode hadoop-1.0.2]$ pig -x mapreduce ../pig-scripts/hbaseuploadtest.pig
 Sometimes (but rarely) runs in under a minute, often takes more than 40
 minutes to get to 50% but then completes to 100% in seconds. The dataset is
 very small.

 Note that the dump of raw_data works in both cases. However, the STORE
 command causes the MR job to stall, and the job setup task shows the
 following errors:
 Task attempt_201204240854_0006_m_02_0 failed to report status for 602 seconds. Killing!
 Task attempt_201204240854_0006_m_02_1 failed to report status for 601 seconds. Killing!

 And task log shows the following stream of errors:

 2012-04-24 11:57:27,427 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=18 watcher=hconnection 0x5567d7fb
 2012-04-24 11:57:27,441 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /127.0.0.1:2181
 2012-04-24 11:57:27,443 WARN org.apache.zookeeper.client.ZooKeeperSaslClient: SecurityException: java.lang.SecurityException: Unable to locate a login configuration occurred when trying to find JAAS configuration.
 2012-04-24 11:57:27,443 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
 2012-04-24 11:57:27,444 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
 2012-04-24 11:57:27,445 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 6846@slave2
 2012-04-24 11:57:27,551 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /127.0.0.1:2181
 2012-04-24 11:57:27,552 WARN org.apache.zookeeper.client.ZooKeeperSaslClient: SecurityException: java.lang.SecurityException: Unable to locate a login configuration occurred when trying to find JAAS configuration.
 2012-04-24 11:57:27,552 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client' could not be found. If you are not using SASL, you may ignore this. On the other hand, if you expected SASL to work, please fix your JAAS configuration.
 2012-04-24 11:57:27,552 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
 java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
 2012-04-24 11:57:27,553 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
 2012-04-24 11:57:27,553 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 2000ms before retry #1...
 2012-04-24 11:57:28,652 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
 2012-04-24 11:57:28,653 WARN
 

Integration of Hadoop Streaming with Ruby and HBase

2012-03-16 Thread Subir S
Hi,

I just joined HBase user list. So this is my first question :-)

Is there any way I can dump the output of my Ruby MapReduce jobs into
HBase directly? In other words, does Hadoop Streaming with Ruby integrate
with HBase, the way Pig has HBaseStorage?

Thanks in advance!

Regards
Subir