BTW, a little explanation about the binning I mentioned. Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
And with binning, it looks like <bin_number>#<type_of_events>#<rev_timestamp>#<id>.

The bin_number could be id % 256 or timestamp % 256, and the table could be
pre-split on the bin boundaries. Future ingestions could then insert in
parallel into #<bin> regions (one per bin), even without pre-splitting.

Jianshi

On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

> Each range might span multiple regions, depending on the size of the data
> I want to scan for MR jobs.
>
> The ranges are dynamic, specified by the user, but the number of bins can
> be static (fixed when the table/schema is created).
>
> Jianshi
>
> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. 16 to 256 ranges
>>
>> Would each range be within a single region, or may a range span regions?
>> Are the ranges dynamic?
>>
>> Using the command line for multiple ranges would be out of the question.
>> A file with ranges is needed.
>>
>> Cheers
>>
>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>> > Thanks Ted for the reference.
>> >
>> > That's right, extend row.start and row.stop to specify multiple
>> > ranges, and also getSplits.
>> >
>> > I would probably bin the event sequence CF into 16 to 256 bins. So 16
>> > to 256 ranges.
>> >
>> > Jianshi
>> >
>> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > Please refer to HBASE-5416 Filter on one CF and if a match, then
>> > > load and return full row
>> > >
>> > > bq. to extend TableInputFormat to accept multiple row ranges
>> > >
>> > > You mean extending hbase.mapreduce.scan.row.start and
>> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be
>> > > specified? How many such ranges do you normally need?
>> > >
>> > > Cheers
>> > >
>> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang
>> > > <jianshi.hu...@gmail.com> wrote:
>> > >
>> > > > Thanks Ted,
>> > > >
>> > > > I'll pre-split the table during ingestion. The reason to keep the
>> > > > rowkey monotonic is to make it easier to work with
>> > > > TableInputFormat, otherwise I would've binned it into 256 splits.
>> > > > (well, I think a good way is to extend TableInputFormat to accept
>> > > > multiple row ranges; if there's an existing efficient
>> > > > implementation, please let me know :)
>> > > >
>> > > > Would you elaborate a little more on the heap memory usage during
>> > > > scan? Is there any reference for that?
>> > > >
>> > > > Jianshi
>> > > >
>> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > >
>> > > > > If you use monotonically increasing rowkeys, separating out the
>> > > > > column family into a new table would give you the same issue
>> > > > > you're facing today.
>> > > > >
>> > > > > With a single table, the essential column family feature would
>> > > > > reduce the amount of heap memory used during scan. With two
>> > > > > tables, there is no such facility.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang
>> > > > > <jianshi.hu...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi Ted,
>> > > > > >
>> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But
>> > > > > > the performance I care most about is scan performance.
>> > > > > >
>> > > > > > It's mostly for analytics, so I don't care much about
>> > > > > > atomicity currently.
>> > > > > >
>> > > > > > What's your suggestion?
>> > > > > >
>> > > > > > Jianshi
>> > > > > >
>> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Is this the same table you mentioned in the thread about
>> > > > > > > RegionTooBusyException?
>> > > > > > >
>> > > > > > > If you move the column family to another table, you may
>> > > > > > > have to handle atomicity yourself - currently atomic
>> > > > > > > operations are within region boundaries.
>> > > > > > >
>> > > > > > > Cheers
>> > > > > > >
>> > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang
>> > > > > > > <jianshi.hu...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > I'm currently putting everything into one table (to make
>> > > > > > > > cross-reference queries easier) and there's one CF whose
>> > > > > > > > rowkeys are very different from the rest. Currently it
>> > > > > > > > works well, but I'm wondering if it will cause
>> > > > > > > > performance issues in the future.
>> > > > > > > >
>> > > > > > > > So my questions are
>> > > > > > > >
>> > > > > > > > 1) will there be performance penalties with the way I'm
>> > > > > > > >    doing it?
>> > > > > > > > 2) should I move that CF to a separate table?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > --
>> > > > > > > > Jianshi Huang
>> > > > > > > >
>> > > > > > > > LinkedIn: jianshi
>> > > > > > > > Twitter: @jshuang
>> > > > > > > > Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
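P.S. To make the binned rowkey layout from the top of this message concrete, here is a minimal sketch in plain Python (no HBase client involved; a real job would go through the HBase Java API). Several things are assumptions made only for illustration: ids are integers, rev_timestamp is Long.MAX_VALUE minus the timestamp, fields are zero-padded so that string order matches numeric order, event types do not contain '#', and the function names (binned_rowkey, split_points, scan_ranges) are hypothetical. It shows how a key is built, which split keys a pre-split table would need, and how one user-specified time range turns into one row range per bin, which is what a TableInputFormat extended to accept multiple row ranges would have to be given.

```python
NUM_BINS = 256        # assumed static, fixed when the table is created
LONG_MAX = 2**63 - 1  # assumption: rev_timestamp = Long.MAX_VALUE - timestamp


def rev_timestamp(ts_millis):
    """Reverse a timestamp so that newer events sort first (assumed scheme)."""
    return LONG_MAX - ts_millis


def binned_rowkey(event_type, ts_millis, event_id, num_bins=NUM_BINS):
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id>."""
    bin_number = event_id % num_bins  # could also be ts_millis % num_bins
    # Zero-padding keeps string order equal to numeric order.
    return (f"{bin_number:03d}#{event_type}"
            f"#{rev_timestamp(ts_millis):019d}#{event_id}")


def split_points(num_bins=NUM_BINS):
    """Split keys for pre-splitting: one region per bin needs num_bins - 1 keys."""
    return [f"{b:03d}" for b in range(1, num_bins)]


def scan_ranges(event_type, ts_from, ts_to, num_bins=NUM_BINS):
    """One (start_row, stop_row) pair per bin for ts_from <= ts < ts_to.

    start_row is inclusive and stop_row is exclusive. Because timestamps
    are reversed, the newest timestamp determines the start of each range.
    """
    start_rev = rev_timestamp(ts_to - 1)
    stop_rev = rev_timestamp(ts_from - 1)
    return [
        (f"{b:03d}#{event_type}#{start_rev:019d}",
         f"{b:03d}#{event_type}#{stop_rev:019d}")
        for b in range(num_bins)
    ]
```

With 256 bins, scan_ranges returns 256 (start, stop) pairs for a single user-specified time range, so the ranges stay dynamic while the bin count stays fixed. The three-digit bin prefix only sorts correctly for up to 1000 bins.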