BTW, a little explanation about the binning I mentioned.

Currently the rowkey looks like <type_of_events>#<rev_timestamp>#<id>.

With binning, it looks like
<bin_number>#<type_of_events>#<rev_timestamp>#<id>. The bin_number could be
id % 256 or timestamp % 256, and the table could be pre-split on the bin
prefix. Future ingestions could then insert in parallel into as many regions
as there are bins, even without pre-splitting.
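
To make the key layout concrete, here is a minimal Python sketch of the binned
rowkey and of split points that give one region per bin. The zero-padded field
widths, the definition rev_timestamp = MAX_TS - timestamp, and the helper names
make_rowkey and split_keys are assumptions for illustration, not HBase API.

```python
from typing import List

NUM_BINS = 256
MAX_TS = 2**63 - 1  # assumption: rev_timestamp = MAX_TS - timestamp (newest first)


def make_rowkey(event_type: str, timestamp_ms: int, event_id: int,
                num_bins: int = NUM_BINS) -> bytes:
    """Build <bin_number>#<type_of_events>#<rev_timestamp>#<id>."""
    bin_number = event_id % num_bins  # timestamp_ms % num_bins would also work
    rev_ts = MAX_TS - timestamp_ms
    # Zero-pad numeric fields so byte order matches numeric order.
    key = f"{bin_number:03d}#{event_type}#{rev_ts:019d}#{event_id}"
    return key.encode("ascii")


def split_keys(num_bins: int = NUM_BINS) -> List[bytes]:
    """Split points for pre-splitting: the prefixes of bins 1..num_bins-1."""
    return [f"{b:03d}#".encode("ascii") for b in range(1, num_bins)]
```

The split points would be passed to whatever call creates the table; because
every key starts with its bin prefix, writes spread across the bins instead of
all landing in the region that holds the newest timestamps.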


Jianshi


On Sun, Sep 7, 2014 at 2:34 AM, Jianshi Huang <jianshi.hu...@gmail.com>
wrote:

> Each range might span multiple regions, depending on how much data I want
> to scan in the MR jobs.
>
> The ranges are dynamic, specified by the user, but the number of bins can
> be static (when the table/schema is created).
>
> Jianshi
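
With binned keys, a user-specified time range turns into one row range per bin,
which is why the input format would need to accept multiple ranges. A rough
Python sketch of computing those ranges follows. It assumes a key layout of
<3-digit bin>#<type>#<19-digit rev_timestamp>#<id> with
rev_timestamp = MAX_TS - timestamp; the helper name bin_scan_ranges is
hypothetical and not part of HBase.

```python
from typing import List, Tuple

MAX_TS = 2**63 - 1  # assumption: rev_timestamp = MAX_TS - timestamp


def bin_scan_ranges(event_type: str, start_ms: int, stop_ms: int,
                    num_bins: int = 256) -> List[Tuple[bytes, bytes]]:
    """One (start_row, stop_row) pair per bin for start_ms <= timestamp < stop_ms.

    start_row is inclusive and stop_row is exclusive, as in an HBase Scan.
    """
    # Reversed timestamps flip the order: the newest event has the smallest key.
    lo = MAX_TS - (stop_ms - 1)  # rev_timestamp of the newest wanted event
    hi = MAX_TS - start_ms + 1   # one past the rev_timestamp of the oldest
    ranges = []
    for b in range(num_bins):
        prefix = f"{b:03d}#{event_type}#"
        ranges.append((f"{prefix}{lo:019d}".encode("ascii"),
                       f"{prefix}{hi:019d}".encode("ascii")))
    return ranges
```

The number of ranges equals the number of bins, so it is fixed when the schema
is created even though the time bounds are chosen per job.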
>
>
> On Sun, Sep 7, 2014 at 2:23 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. 16 to 256 ranges
>>
>> Would each range fall within a single region, or may a range span regions?
>> Are the ranges dynamic?
>>
>> Using the command line for multiple ranges would be out of the question. A
>> file with the ranges is needed.
>>
>> Cheers
>>
>>
>> On Sat, Sep 6, 2014 at 11:18 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>> > Thanks Ted for the reference.
>> >
>> > That's right, extend row.start and row.stop to specify multiple ranges,
>> > and also getSplits.
>> >
>> > I would probably bin the event sequence CF into 16 to 256 bins. So 16 to
>> > 256 ranges.
>> >
>> > Jianshi
>> >
>> >
>> >
>> > On Sun, Sep 7, 2014 at 2:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > Please refer to HBASE-5416 Filter on one CF and if a match, then load
>> > > and return full row
>> > >
>> > > bq. to extend TableInputFormat to accept multiple row ranges
>> > >
>> > > You mean extending hbase.mapreduce.scan.row.start and
>> > > hbase.mapreduce.scan.row.stop so that multiple ranges can be specified?
>> > > How many such ranges do you normally need?
>> > >
>> > > Cheers
>> > >
>> > >
>> > > On Sat, Sep 6, 2014 at 11:01 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks Ted,
>> > > >
>> > > > I'll pre-split the table during ingestion. The reason to keep the
>> > > > rowkey monotonic is that it makes working with TableInputFormat
>> > > > easier; otherwise I would've binned it into 256 splits. (Well, I think
>> > > > a good way is to extend TableInputFormat to accept multiple row
>> > > > ranges. If there's an existing efficient implementation, please let
>> > > > me know :)
>> > > >
>> > > > Would you elaborate a little more on the heap memory usage during
>> > > > scan? Is there any reference for that?
>> > > >
>> > > > Jianshi
>> > > >
>> > > >
>> > > >
>> > > > On Sun, Sep 7, 2014 at 1:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > >
>> > > > > If you use monotonically increasing rowkeys, separating out the
>> > > > > column family into a new table would give you the same issue you're
>> > > > > facing today.
>> > > > >
>> > > > > With a single table, the essential column family feature would
>> > > > > reduce the amount of heap memory used during scan. With two tables,
>> > > > > there is no such facility.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > >
>> > > > > On Sat, Sep 6, 2014 at 10:11 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Ted,
>> > > > > >
>> > > > > > Yes, that's the table having RegionTooBusyExceptions :) But the
>> > > > > > performance I care most about is scan performance.
>> > > > > >
>> > > > > > It's mostly for analytics, so I don't care much about atomicity
>> > > > > > currently.
>> > > > > >
>> > > > > > What's your suggestion?
>> > > > > >
>> > > > > > Jianshi
>> > > > > >
>> > > > > >
>> > > > > > On Sun, Sep 7, 2014 at 1:08 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Is this the same table you mentioned in the thread about
>> > > > > > > RegionTooBusyException?
>> > > > > > >
>> > > > > > > If you move the column family to another table, you may have to
>> > > > > > > handle atomicity yourself - currently atomic operations are
>> > > > > > > within region boundaries.
>> > > > > > >
>> > > > > > > Cheers
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sat, Sep 6, 2014 at 9:49 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > I'm currently putting everything into one table (to make
>> > > > > > > > cross-reference queries easier), and there's one CF whose
>> > > > > > > > rowkeys are very different from the rest. Currently it works
>> > > > > > > > well, but I'm wondering if it will cause performance issues in
>> > > > > > > > the future.
>> > > > > > > >
>> > > > > > > > So my questions are
>> > > > > > > >
>> > > > > > > > 1) Will there be performance penalties with the way I'm doing it?
>> > > > > > > > 2) Should I move that CF to a separate table?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > --
>> > > > > > > > Jianshi Huang
>> > > > > > > >
>> > > > > > > > LinkedIn: jianshi
>> > > > > > > > Twitter: @jshuang
>> > > > > > > > Github & Blog: http://huangjs.github.com/
>> > > > > > > >



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
