Re: Accumulo as a Column Storage

Keith Turner Fri, 20 Oct 2017 09:30:18 -0700

On Thu, Oct 19, 2017 at 9:05 PM, Christopher <[email protected]> wrote:
> There's no expected scaling issue with having each column qualifier in its
> own unique column family, regardless of how large the number of these
> becomes. I've ingested random data like this before for testing, and it
> works fine.
>
> However, there may be an issue trying to create a very large number of
> locality groups. Locality groups are named, and you must explicitly
> configure them to store particular column families. That configuration is
> typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
> and/or in your conf/accumulo-site.xml file) does not scale as well as the
> data storage (HDFS) does. Where, and how, it will break, is probably
> system-dependent and not directly known (at least, not known by me). I would
> expect dozens, and possibly hundreds, of locality groups to work okay, but
> thousands seems like it's too many (but I haven't tried).


Seeking to a random location is O(F*L), where F is the number of files
and L is the number of locality groups used.  So if a tablet had 10
files and 10 locality groups were being used, then a seek on the
tablet would result in 100 seeks at the lowest levels.
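
To make the arithmetic concrete, here is a rough sketch of that seek count. The function name and the assumption that every file holds data for every locality group the scan uses are mine, purely for illustration; this is not Accumulo code.

```python
def low_level_seeks(num_files, num_locality_groups):
    """Estimate low-level seeks for one random seek on a tablet.

    Illustrative assumption: every file contains every locality group
    the scan uses, so each (file, locality group) pair is positioned once.
    """
    return num_files * num_locality_groups


# The example from the text: 10 files and 10 locality groups.
seeks_10_by_10 = low_level_seeks(10, 10)

# Same tablet, but with 1000 locality groups in use.
seeks_10_by_1000 = low_level_seeks(10, 1000)
```

Under those assumptions the cost of a single random seek grows linearly with the number of locality groups the scan touches.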

After the initial seek, scanning over locality groups uses a heap of
heaps.  An outer heap selects the min key from all files.  Within each
file there is an inner heap that selects the min key from each loc group.
So scanning is O(log2(F) * log2(L)) or O(log2(F) + log2(L)); I'm not
sure which.
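
A toy model of the heap-of-heaps idea, in case it helps. It uses Python's heapq.merge, which is a heap-based k-way merge, in place of the real iterators; the data layout (a tablet as a list of files, a file as a list of sorted key lists, one per locality group) and the function names are assumptions for illustration only, not how Accumulo represents anything.

```python
import heapq


def scan_file(locality_groups):
    """Inner heap: merge the sorted keys of one file's locality groups."""
    return heapq.merge(*locality_groups)


def scan_tablet(files):
    """Outer heap: merge the per-file streams into one sorted stream."""
    return heapq.merge(*(scan_file(groups) for groups in files))


# Hypothetical tablet: 2 files, each with 2 locality groups of sorted keys.
files = [
    [["row1", "row4"], ["row2", "row6"]],  # file 0
    [["row3", "row7"], ["row5", "row8"]],  # file 1
]

keys = list(scan_tablet(files))
```

In this toy model, returning the next key advances the outer heap once and one inner heap once. Whether the real scan cost behaves the same way is not something the sketch settles.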

Scanning over lots of locality groups is probably pretty efficient, but
doing lots of random seeks over lots of loc groups may not be.

>
>
> On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <[email protected]> wrote:
>>
>> That makes sense. So this means that there's no limit or concern about
>> having a potentially large number of column families (each holding only
>> one column qualifier), right?
>>
>> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <[email protected]> wrote:
>>>
>>> Yup, that's the intended use case. You have the flexibility to determine
>>> what column families make sense to group together. Your only "cost" in
>>> changing your mind is the speed at which you can re-compact your data.
>>>
>>> There is one concern which comes to mind. Though making many locality
>>> groups does increase the speed at which you can read from specific columns,
>>> it decreases the speed at which you can read from _all_ columns. So, you can
>>> do this trick to make Accumulo act more like a columnar database, but beware
>>> that you're going to have an impact if you still have a use-case where you
>>> read more than just one or two columns at a time.
>>>
>>> Does that make sense?
>>>
>>>
>>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>>>
>>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>>> together on disk which would make it more like  a column-oriented database.
>>>> Considering that "locality groups" are per column family, I was wondering
>>>> what if we treat column families like column qualifiers (creating one
>>>> column family per qualifier) and assign each to a different locality
>>>> group. This way all the data in a given column will be next to each other
>>>> on disk, which makes it easier for analytical applications to query the
>>>> data.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Mohammad
>>>>
>>
>
