Each range might span multiple regions, depending on how much data I want to scan in the MR jobs.
The ranges are dynamic, specified by the user, but the number of bins can be static (when the table/schema is created).

Jianshi

On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. 16 to 256 ranges
>
> Would each range be within single region or the range may span regions ?
> Are the ranges dynamic ?
>
> Using command line for multiple ranges would be out of question. A file with ranges is needed.
>
> Cheers
>
> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> > Thanks Ted for the reference.
> >
> > That's right, extend the row.start and row.end to specify multiple ranges and also getSplits.
> >
> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to 256 ranges.
> >
> > Jianshi
> >
> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Please refer to HBASE-5416 Filter on one CF and if a match, then load and return full row
> > >
> > > bq. to extend TableInputFormat to accept multiple row ranges
> > >
> > > You mean extending hbase.mapreduce.scan.row.start and hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
> > > How many such ranges do you normally need ?
> > >
> > > Cheers
> > >
> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> > >
> > > > Thanks Ted,
> > > >
> > > > I'll pre-split the table during ingestion. The reason to keep the rowkey monotonic is for easier working with TableInputFormat, otherwise I would've binned it into 256 splits. (well, I think a good way is to extend TableInputFormat to accept multiple row ranges, if there's an existing efficient implementation, please let me know :)
> > > >
> > > > Would you elaborate a little more on the heap memory usage during scan? Is there any reference to that?
> > > >
> > > > Jianshi
> > > >
> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > >
> > > > > If you use monotonically increasing rowkeys, separating out the column family into a new table would give you same issue you're facing today.
> > > > >
> > > > > Using a single table, essential column family feature would reduce the amount of heap memory used during scan. With two tables, there is no such facility.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the performance I care most are scan performance.
> > > > > >
> > > > > > It's mostly for analytics, so I don't care much about atomicity currently.
> > > > > >
> > > > > > What's your suggestion?
> > > > > >
> > > > > > Jianshi
> > > > > >
> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > > >
> > > > > > > Is this the same table you mentioned in the thread about RegionTooBusyException ?
> > > > > > >
> > > > > > > If you move the column family to another table, you may have to handle atomicity yourself - currently atomic operations are within region boundaries.
> > > > > > >
> > > > > > > Cheers
> > > > > > >
> > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm currently putting everything into one table (to make cross reference queries easier) and there's one CF which contains rowkeys very different to the rest. Currently it works well, but I'm wondering if it will cause performance issues in the future.
> > > > > > > >
> > > > > > > > So my questions are
> > > > > > > >
> > > > > > > > 1) will there be performance penalties in the way I'm doing?
> > > > > > > > 2) should I move that CF to a separate table?
> > > > > > > >
> > > > > > > > Thanks,

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
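
For illustration only, the binning idea discussed in this thread (16 to 256 bins, hence one scan range per bin for each user-specified range) could be sketched roughly as below. Everything in it is an assumption rather than something stated in the thread: the key layout (a 1-byte bin prefix followed by an 8-byte big-endian timestamp), the hash used to pick a bin, and the names `BinnedRanges`, `rowKey`, `binFor` and `scanRanges`. It uses plain Java only; in an HBase job each returned pair would presumably become the start and stop row of one scan.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: expand one user-specified range into one scan range per bin. */
final class BinnedRanges {

    /** Hypothetical key layout: 1-byte bin prefix + 8-byte big-endian timestamp. */
    static byte[] rowKey(int bin, long timestamp) {
        return ByteBuffer.allocate(1 + Long.BYTES)
                .put((byte) bin)
                .putLong(timestamp)
                .array();
    }

    /** Bin chosen at write time from a stable id; the hash choice is an assumption. */
    static int binFor(String entityId, int numBins) {
        return Math.floorMod(entityId.hashCode(), numBins);
    }

    /** Returns numBins pairs of {startRow (inclusive), stopRow (exclusive)}. */
    static List<byte[][]> scanRanges(int numBins, long startTs, long stopTs) {
        if (numBins < 1 || numBins > 256) {
            throw new IllegalArgumentException("numBins must fit in one prefix byte");
        }
        if (startTs < 0 || stopTs < startTs) {
            throw new IllegalArgumentException("expected 0 <= startTs <= stopTs");
        }
        List<byte[][]> ranges = new ArrayList<>(numBins);
        for (int bin = 0; bin < numBins; bin++) {
            // Same time window, repeated once under every bin prefix.
            ranges.add(new byte[][] { rowKey(bin, startTs), rowKey(bin, stopTs) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (byte[][] range : scanRanges(16, 1000L, 2000L)) {
            System.out.println(Arrays.toString(range[0]) + " .. " + Arrays.toString(range[1]));
        }
    }
}
```

Because the number of bins is fixed when the table is created, the number of ranges per query stays constant, while the window inside each bin remains dynamic and user-specified.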