Hi Michael,

Thanks for the questions.

I'm modeling dynamic graphs in HBase. All elements (vertices, edges) have a timestamp, so I can query things like events between A and B over the last 7 days.

CFs are used for grouping different types of data for the same account. However, the data is heavily skewed, and to avoid having too much data in a single row, I had to move what used to be in CQs (column qualifiers) into the RKs (rowkeys). So a CF now acts more like a table.

There's one CF containing a sequence of events ordered by timestamp, and this CF is quite different as its use case is mostly MapReduce jobs.

Jianshi


On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel <michael_se...@hotmail.com> wrote:

> Again, a silly question.
>
> Why are you using column families?
>
> Just to play devil’s advocate in terms of design, why are you not treating
> your row as a record? Think hierarchical not relational.
>
> This really gets into some design theory.
>
> Think Column Family as a way to group data that has the same row key,
> reference the same thing, yet the data in each column family is used
> separately.
> The example I always turn to when teaching, is to think of an order entry
> system at a retailer.
>
> You generate data which is segmented by business process. (order entry,
> pick slips, shipping, invoicing) All reflect a single order, yet the data
> in each process tends to be accessed separately.
> (You don’t need the order entry when using the pick slip to pull orders
> from the warehouse.) So here, the data access pattern is that each column
> family is used separately, except in generating the data (the order entry
> is used to generate the pick slip(s) and set up things like backorders and
> then the pick process generates the shipping slip(s) etc.). And since they
> are all focused on the same order, they have the same row key.
>
> So it's reasonable to ask how you are accessing the data and how you are
> designing your HBase model?
>
> Many times, developers create a model using column families because the
> developer is thinking in terms of relationships. Not access patterns on the
> data.
>
> Does this make sense?
>
>
> On Sep 6, 2014, at 7:46 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> > BTW, a little explanation about the binning I mentioned.
> >
> > Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.
> >
> > And with binning, it looks like
> > <bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could be
> > id % 256 or timestamp % 256. And the table could be pre-split. So future
> > ingestions could do parallel insertion to #<bin> regions, even without
> > pre-split.
> >
> >
> > Jianshi
> >
> >
> > On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> > wrote:
> >
> >> Each range might span multiple regions, depending on the data size I want
> >> to scan for MR jobs.
> >>
> >> The ranges are dynamic, specified by the user, but the number of bins can
> >> be static (when the table/schema is created).
> >>
> >> Jianshi
> >>
> >>
> >> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >>> bq. 16 to 256 ranges
> >>>
> >>> Would each range be within single region or the range may span regions ?
> >>> Are the ranges dynamic ?
> >>>
> >>> Using command line for multiple ranges would be out of question. A file
> >>> with ranges is needed.
> >>>
> >>> Cheers
> >>>
> >>>
> >>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thanks Ted for the reference.
> >>>>
> >>>> That's right, extend the row.start and row.stop to specify multiple ranges
> >>>> and also getSplits.
> >>>>
> >>>> I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
> >>>> 256 ranges.
> >>>>
> >>>> Jianshi
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>
> >>>>> Please refer to HBASE-5416 Filter on one CF and if a match, then load and
> >>>>> return full row
> >>>>>
> >>>>> bq. to extend TableInputFormat to accept multiple row ranges
> >>>>>
> >>>>> You mean extending hbase.mapreduce.scan.row.start and
> >>>>> hbase.mapreduce.scan.row.stop so that multiple ranges can be specified ?
> >>>>> How many such ranges do you normally need ?
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>>
> >>>>> On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks Ted,
> >>>>>>
> >>>>>> I'll pre-split the table during ingestion. The reason to keep the rowkey
> >>>>>> monotonic is for easier working with TableInputFormat, otherwise I would've
> >>>>>> binned it into 256 splits. (well, I think a good way is to extend
> >>>>>> TableInputFormat to accept multiple row ranges, if there's an existing
> >>>>>> efficient implementation, please let me know :)
> >>>>>>
> >>>>>> Would you elaborate a little more on the heap memory usage during scan? Is
> >>>>>> there any reference to that?
> >>>>>>
> >>>>>> Jianshi
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>
> >>>>>>> If you use monotonically increasing rowkeys, separating out the column
> >>>>>>> family into a new table would give you same issue you're facing today.
> >>>>>>>
> >>>>>>> Using a single table, essential column family feature would reduce the
> >>>>>>> amount of heap memory used during scan. With two tables, there is no such
> >>>>>>> facility.
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Ted,
> >>>>>>>>
> >>>>>>>> Yes, that's the table having RegionTooBusyExceptions :) But the performance
> >>>>>>>> I care most about is scan performance.
> >>>>>>>>
> >>>>>>>> It's mostly for analytics, so I don't care much about atomicity currently.
> >>>>>>>>
> >>>>>>>> What's your suggestion?
> >>>>>>>>
> >>>>>>>> Jianshi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Is this the same table you mentioned in the thread about
> >>>>>>>>> RegionTooBusyException ?
> >>>>>>>>>
> >>>>>>>>> If you move the column family to another table, you may have to handle
> >>>>>>>>> atomicity yourself - currently atomic operations are within region
> >>>>>>>>> boundaries.
> >>>>>>>>>
> >>>>>>>>> Cheers
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I'm currently putting everything into one table (to make cross reference
> >>>>>>>>>> queries easier) and there's one CF which contains rowkeys very different to
> >>>>>>>>>> the rest. Currently it works well, but I'm wondering if it will cause
> >>>>>>>>>> performance issues in the future.
> >>>>>>>>>>
> >>>>>>>>>> So my questions are
> >>>>>>>>>>
> >>>>>>>>>> 1) will there be performance penalties in the way I'm doing it?
> >>>>>>>>>> 2) should I move that CF to a separate table?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> --
> >>>>>>>>>> Jianshi Huang
> >>>>>>>>>>
> >>>>>>>>>> LinkedIn: jianshi
> >>>>>>>>>> Twitter: @jshuang
> >>>>>>>>>> Github & Blog: http://huangjs.github.com/

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
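
P.S. To make the binning from the quoted thread concrete, here is a rough Python sketch of the <bin_number>#<type_of_events>#<rev_timestamp>#<id> rowkey and of the per-bin scan ranges a multi-range TableInputFormat would need. It is only an illustration and does not call any HBase API. The names (NUM_BINS, MAX_LONG, make_rowkey, scan_ranges, split_points) are made up for this example, and it assumes numeric ids, bin = id % 256, and timestamps reversed against Long.MAX_VALUE, none of which is fixed in the thread.

```python
NUM_BINS = 256        # assumption: number of bins fixed when the table is created
MAX_LONG = 2**63 - 1  # assumption: timestamps are reversed against Long.MAX_VALUE


def rev_timestamp(ts_millis):
    """Reverse a millisecond timestamp so that newer events sort first."""
    return MAX_LONG - ts_millis


def make_rowkey(event_type, ts_millis, event_id):
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id> with bin = id % NUM_BINS."""
    bin_number = event_id % NUM_BINS
    # Zero-pad the bin and the reversed timestamp so that lexicographic
    # order of the key matches numeric order of those fields.
    return "%03d#%s#%019d#%d" % (
        bin_number, event_type, rev_timestamp(ts_millis), event_id)


def scan_ranges(event_type, after_millis, until_millis):
    """Return one (start_row, stop_row) pair per bin covering events with
    after_millis < ts <= until_millis (start_row inclusive, stop_row exclusive)."""
    ranges = []
    for bin_number in range(NUM_BINS):
        prefix = "%03d#%s#" % (bin_number, event_type)
        # Reversed timestamps flip the order: the newest bound is the start row.
        start_row = prefix + "%019d" % rev_timestamp(until_millis)
        stop_row = prefix + "%019d" % rev_timestamp(after_millis)
        ranges.append((start_row, stop_row))
    return ranges


def split_points():
    """Bin prefixes that could serve as split keys when pre-splitting the table."""
    return ["%03d#" % b for b in range(1, NUM_BINS)]


if __name__ == "__main__":
    print(make_rowkey("click", 1410000000000, 12345))
    print(len(scan_ranges("click", 1409999999999, 1410000000000)))
```

The point of the sketch is that one user-specified time window turns into NUM_BINS separate row ranges, which is why a single row.start/row.stop pair is not enough once the rowkey is binned.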