On Thu, Oct 19, 2017 at 9:05 PM, Christopher <[email protected]> wrote:
> There's no expected scaling issue with having each column qualifier in its
> own unique column family, regardless of how large the number of these
> becomes. I've ingested random data like this before for testing, and it
> works fine.
>
> However, there may be an issue trying to create a very large number of
> locality groups. Locality groups are named, and you must explicitly
> configure them to store particular column families. That configuration is
> typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
> and/or in your conf/accumulo-site.xml file) does not scale as well as the
> data storage (HDFS) does. Where, and how, it will break, is probably
> system-dependent and not directly known (at least, not known by me). I would
> expect dozens, and possibly hundreds, of locality groups to work okay, but
> thousands seems like it's too many (but I haven't tried).

Seeking to a random location is O(F*L), where F is the number of files and
L is the number of locality groups used. So if a tablet had 10 files and 10
locality groups were being used, then a seek on the tablet would result in
100 seeks at the lowest levels. After the initial seek, scanning over
locality groups uses a heap of heaps: an outer heap selects the min key
across all files, and within each file an inner heap selects the min key
across that file's locality groups. So scanning is O(log2(F) * log2(L)) or
O(log2(F) + log2(L)); I'm not sure which. Scanning over lots of locality
groups is probably pretty efficient, but doing lots of random seeks over
lots of locality groups may not be.

>
>
> On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <[email protected]> wrote:
>>
>> That makes sense. So this means that there's no limit or concern about
>> having, potentially, a large number of column families (each holding only
>> one column qualifier), right?
>>
>> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <[email protected]> wrote:
>>>
>>> Yup, that's the intended use case. You have the flexibility to determine
>>> what column families make sense to group together. Your only "cost" in
>>> changing your mind is the speed at which you can re-compact your data.
>>>
>>> There is one concern which comes to mind. Though making many locality
>>> groups does increase the speed at which you can read from specific columns,
>>> it decreases the speed at which you can read from _all_ columns. So, you can
>>> do this trick to make Accumulo act more like a columnar database, but beware
>>> that you're going to have an impact if you still have a use-case where you
>>> read more than just one or two columns at a time.
>>>
>>> Does that make sense?
>>>
>>>
>>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>>>
>>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>>> together on disk which would make it more like a column-oriented database.
>>>> Considering that "locality groups" are per column family, I was wondering
>>>> what if we treat column families like column qualifiers (creating one
>>>> column family per qualifier) and assign each to a different locality
>>>> group.
>>>> This way all the data in a given column will be next to each other on disk
>>>> which makes it easier for analytical applications to query the data.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Mohammad
>>>>
>>
>
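
To make the seek and scan costs from the top reply in this thread a bit more concrete, here is a minimal Python sketch. It is not Accumulo's code: the names `seek_count`, `scan_file`, and `scan_tablet` and the sample keys are made up for illustration, and `heapq.merge` from the standard library stands in for the per-file and per-tablet heaps.

```python
import heapq


def seek_count(num_files, num_groups):
    """Low-level seeks for one random seek on a tablet (hypothetical model):
    one seek per (file, locality group) pair, i.e. F * L."""
    return num_files * num_groups


def scan_file(groups):
    """Inner heap: merge the sorted key lists of one file's locality groups."""
    return heapq.merge(*groups)


def scan_tablet(files):
    """Outer heap: merge the per-file streams into one sorted key stream."""
    return heapq.merge(*(scan_file(groups) for groups in files))


# Hypothetical tablet: 2 files, each holding 2 locality groups of sorted keys.
files = [
    [["row1:a", "row3:a"], ["row1:b", "row2:b"]],
    [["row2:a"], ["row3:b", "row4:b"]],
]

merged = list(scan_tablet(files))   # keys come back in globally sorted order
seeks = seek_count(10, 10)          # the 10 files x 10 groups example: 100
```

In this toy model, each key returned advances the inner heap of one file and then the outer heap, which would point toward O(log2(F) + log2(L)) per key. That is a property of the sketch only, not a confirmed statement about how Accumulo behaves.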
