https://issues.apache.org/jira/browse/HBASE-8691
On 6/4/13 6:11 PM, "Sandy Pratt" <prat...@adobe.com> wrote:
>Haven't had a chance to write a JIRA yet, but I thought I'd pop in here
>with an update in the meantime.
>
>I tried a number of different approaches to eliminate latency and
>"bubbles" in the scan pipeline, and eventually arrived at adding a
>streaming scan API to the region server, along with refactoring the scan
>interface into an event-driven message receiver interface. In so doing, I
>was able to take scan speed on my cluster from 59,537 records/sec with the
>classic scanner to 222,703 records per second with my new scan API.
>Needless to say, I'm pleased ;)
>
>More details forthcoming when I get a chance.
>
>Thanks,
>Sandy
>
>On 5/23/13 3:47 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>>Thanks for the update, Sandy.
>>
>>If you can open a JIRA and attach your producer / consumer scanner there,
>>that would be great.
>>
>>On Thu, May 23, 2013 at 3:42 PM, Sandy Pratt <prat...@adobe.com> wrote:
>>
>>> I wrote myself a Scanner wrapper that uses a producer/consumer queue to
>>> keep the client fed with a full buffer as much as possible. When
>>> scanning my table with scanner caching at 100 records, I see about a
>>> 24% uplift in performance (~35k records/sec with the ClientScanner and
>>> ~44k records/sec with my P/C scanner). However, when I set scanner
>>> caching to 5000, it's more of a wash compared to the standard
>>> ClientScanner: ~53k records/sec with the ClientScanner and ~60k
>>> records/sec with the P/C scanner.
>>>
>>> I'm not sure what to make of those results. I think next I'll shut
>>> down HBase and read the HFiles directly, to see if there's a drop off
>>> in performance between reading them directly vs. via the RegionServer.
>>>
>>> I still think that to really solve this there needs to be a sliding
>>> window of records in flight between disk and RS, and between RS and
>>> client. I'm thinking there's probably a single batch of records in
>>> flight between RS and client at the moment.
>>>
>>> Sandy
>>>
>>> On 5/23/13 8:45 AM, "Bryan Keller" <brya...@gmail.com> wrote:
>>>
>>> >I am considering scanning a snapshot instead of the table. I believe
>>> >this is what the ExportSnapshot class does. If I could use the
>>> >scanning code from ExportSnapshot then I will be able to scan the HDFS
>>> >files directly and bypass the regionservers. This could potentially
>>> >give me a huge boost in performance for full table scans. However, it
>>> >doesn't really address the poor scan performance against a table.
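
For readers following the thread: the producer/consumer scanner wrapper described in the 5/23 message was not attached to these messages. The following is only a minimal sketch of the general idea, not Sandy's code and not an HBase API. It assumes the scanner can be treated as a plain java.util.Iterator; the class name PrefetchingIterator and the capacity parameter are hypothetical. A background thread pulls from the slow source and keeps a bounded queue filled, so the consumer rarely waits on a round trip.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Hypothetical sketch: wraps a slow Iterator and prefetches its items on a
 * background thread into a bounded queue. Null elements are not supported.
 */
final class PrefetchingIterator<T> implements Iterator<T> {
    // Sentinel placed on the queue once the source is exhausted or has failed.
    private static final Object END = new Object();

    private final BlockingQueue<Object> queue;
    private volatile RuntimeException failure; // set by the producer on error
    private Object next;                       // taken from the queue, not yet returned

    PrefetchingIterator(Iterator<T> source, int capacity) {
        BlockingQueue<Object> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    q.put(source.next()); // blocks while the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            } catch (RuntimeException e) {
                failure = e; // surfaced to the consumer when it reaches END
            }
            try {
                q.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "prefetch-producer");
        producer.setDaemon(true);
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // waits only if the producer has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for prefetch", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source iterator failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T item = (T) next;
        next = null;
        return item;
    }

    public static void main(String[] args) {
        // Stand-in for a scanner: any Iterator works here.
        Iterator<Integer> rows = Arrays.asList(1, 2, 3, 4, 5).iterator();
        Iterator<Integer> prefetched = new PrefetchingIterator<>(rows, 2);
        while (prefetched.hasNext()) {
            System.out.println(prefetched.next());
        }
    }
}
```

A wrapper like this only overlaps client-side processing with the next fetch. It does not by itself create the sliding window of records in flight between RS and client that the thread argues for, which may be why the reported gain shrinks at larger scanner caching values.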