> From: Dru Jensen <[EMAIL PROTECTED]>
> I am putting data in the same row and column family as
> i am scanning. According the St.Ack's response, I need
> to put the data in a separate column family. I will
> see if this helps. I'm curious, does the commit write
> the data to the same region as the map task is scanning?
> Is this what may cause contention?
Tables are partitioned by row, so yes: if you are committing back into the same row that you are scanning, both operations will interact with the same region server.

One important consequence is that if your inserts cause a region split, your scanner will be blocked, and meanwhile the regionserver must wait for the scanner lease to expire before proceeding with the split. The split can potentially take the full period of the scanner lease to get under way. For batch processes this is of course acceptable, but the entire situation can be avoided by using separate tables. I use one table for content and another for results, for example. There is more application overhead in this, because for performance I "join" the tables through data duplication, so if one table is updated the other must be synchronized. Another issue is consistency of the replicas, but JGray's row lock primitives will help here, because it should be possible to lock rows in both tables in advance of committing the column cells in question, if I understand it correctly.

> Can you disclose what settings you are using for the
> commons-httpclient?

I set the socket timeout depending on the number of fetch retries:

  0 - 10 sec
  1 - 30 sec
  2 - 60 sec

There are no further retries after this. While reading from the result body stream I use an 8K buffer and call reporter.progress() whenever the buffer is filled and flushed.

> Is there a way to split the regions before the MR task
> runs? I know it is going to write ~2K per row, is
> there a way to tell HBase to go ahead and split based
> on this anticipated size?

You can't trigger a split "by hand" (maybe file a JIRA if you want this?), but you can influence how often a table splits by setting the maximum size an HStore file can grow to in DFS. There are relevant per-table and global settings.
See:

  http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200808.mbox/[EMAIL PROTECTED]

and

  http://mail-archives.apache.org/mod_mbox/hadoop-hbase-user/200807.mbox/[EMAIL PROTECTED]

Hope this helps,

  - Andy
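
For illustration only, the fetch settings described above (a socket timeout chosen by retry count, and an 8K read buffer with a progress call on each flush) might be sketched roughly as follows. This is not the actual code from the job: the class name FetchSettings, the Progress interface (a stand-in for Hadoop's reporter.progress()), and both method names are assumptions made up for the example, and the sketch uses only the JDK rather than commons-httpclient.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class FetchSettings {
    // Socket timeouts in milliseconds, indexed by retry count
    // (0 -> 10 sec, 1 -> 30 sec, 2 -> 60 sec, as listed in the message).
    public static final int[] SOCKET_TIMEOUT_MS = {10_000, 30_000, 60_000};

    // 8K buffer used while reading the response body.
    public static final int BUFFER_SIZE = 8 * 1024;

    /** Hypothetical stand-in for the Hadoop Reporter's progress() call. */
    public interface Progress {
        void progress();
    }

    /**
     * Returns the socket timeout for the given retry count,
     * or -1 when no further retry should be attempted.
     */
    public static int socketTimeoutMs(int retry) {
        if (retry < 0 || retry >= SOCKET_TIMEOUT_MS.length) {
            return -1;
        }
        return SOCKET_TIMEOUT_MS[retry];
    }

    /**
     * Copies the body stream to out through an 8K buffer and calls
     * reporter.progress() each time the buffer fills and is flushed.
     * Returns the total number of bytes copied.
     */
    public static long copyWithProgress(InputStream in, OutputStream out,
                                        Progress reporter) throws IOException {
        byte[] buf = new byte[BUFFER_SIZE];
        int filled = 0;
        long total = 0;
        int n;
        while ((n = in.read(buf, filled, buf.length - filled)) != -1) {
            filled += n;
            if (filled == buf.length) {
                // Buffer is full: flush it and report liveness to the framework.
                out.write(buf, 0, filled);
                total += filled;
                filled = 0;
                reporter.progress();
            }
        }
        if (filled > 0) {
            // Flush the final partial buffer.
            out.write(buf, 0, filled);
            total += filled;
        }
        return total;
    }
}
```

In a real map task the value returned by socketTimeoutMs would presumably be applied to the HTTP client before each attempt, and the Progress callback would delegate to the task's reporter so that a long download is not mistaken for a hung task.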
