> From: Dru Jensen <[EMAIL PROTECTED]>
> I am putting data in the same row and column family as
> I am scanning. According to St.Ack's response, I need
> to put the data in a separate column family.  I will
> see if this helps.  I'm curious, does the commit write
> the data to the same region as the map task is scanning?
> Is this what may cause contention?

Tables are partitioned by row, so yes: if you are
committing back into the same row that you are scanning,
both operations go to the same region server. One
important consequence is that if your inserts cause a
region split, your scanner will be blocked, and meanwhile
the regionserver must wait for the scanner lease to
expire before proceeding with the split. The split can
therefore take up to the full scanner lease period to
get under way. For batch processes this is acceptable,
of course, but the whole situation can be avoided by
using separate tables. I use one table for content and
another for results, for example. This adds application
overhead, because for performance I "join" the tables
through data duplication, so if one table is updated the
other must be synchronized. Another issue is consistency
of the replicas, but JGray's row lock primitives should
help here: it should be possible to lock rows in both
tables before committing the column cells in question,
if I understand them correctly.

> Can you disclose what settings you are using for the
> commons-httpclient?

I set the socket timeout according to the number of
fetch retries so far:
   retry 0 - 10 sec
   retry 1 - 30 sec
   retry 2 - 60 sec
   no further retries after this
While reading from the result body stream I use an
8K buffer and call reporter.progress() whenever the
buffer is filled and flushed.
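
To make the above concrete, here is a rough sketch of the two
pieces in plain Java. The class and method names (FetchSettings,
socketTimeoutMillis, copyWithProgress) are made up for this
example and are not from my actual code. The progress callback
is a plain Runnable standing in for reporter.progress(), and
passing the timeout to commons-httpclient is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper; names are illustrative only.
public class FetchSettings {
    // Socket timeouts in milliseconds, indexed by retry count (0, 1, 2).
    private static final int[] TIMEOUTS_MS = {10_000, 30_000, 60_000};

    // Returns the socket timeout for the given retry count,
    // or -1 when no further retry should be attempted.
    public static int socketTimeoutMillis(int retry) {
        if (retry < 0 || retry >= TIMEOUTS_MS.length) {
            return -1;
        }
        return TIMEOUTS_MS[retry];
    }

    // Copies the body stream through an 8K buffer and invokes the
    // progress callback after each chunk has been written out.
    // In a map task the callback would be reporter.progress().
    public static long copyWithProgress(InputStream in, OutputStream out,
                                        Runnable progress) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
            progress.run();
        }
        out.flush();
        return total;
    }
}
```

Note that a read may return fewer than 8K bytes, so this sketch
reports progress per chunk read rather than strictly per full
buffer. For keeping the task from timing out, that errs on the
safe side.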

> Is there a way to split the regions before the MR task
> runs?  I know it is going to write ~2K per row, is 
> there a way to tell HBase to go ahead and split based
> on this anticipated size?

You can't trigger a split "by hand" (maybe file a JIRA
if you want this?) but you can influence how often a
table splits by setting the maximum size an HStore file
can grow to in DFS. There are relevant per-table and
global settings. See:

http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200808.mbox/[EMAIL PROTECTED]

and

http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200807.mbox/[EMAIL PROTECTED]
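
As a back-of-the-envelope illustration of how the maximum file
size relates to your ~2K per row, here is a small sketch. It
only does the division: it ignores key and timestamp overhead,
compression, and the fact that a region can hold several store
files, so treat the result as a rough order of magnitude. The
class name and the 256 MB figure are assumptions chosen for the
example, not settings read from HBase.

```java
// Hypothetical sizing helper; not part of HBase.
public class SplitEstimate {
    // Rough number of rows that fit in one store file before it
    // reaches the configured maximum size.
    public static long rowsPerSplit(long maxFileSizeBytes, long bytesPerRow) {
        if (maxFileSizeBytes <= 0 || bytesPerRow <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return maxFileSizeBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long maxFileSize = 256L * 1024 * 1024; // assumed 256 MB limit
        long bytesPerRow = 2L * 1024;          // ~2K per row, from the question
        System.out.println(rowsPerSplit(maxFileSize, bytesPerRow));
    }
}
```

Lowering the maximum size makes splits happen after fewer rows;
raising it has the opposite effect.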

Hope this helps,

   - Andy




      
