Re: Dynamic Scaling of Accumulo

Keith Turner Mon, 27 Mar 2023 11:15:10 -0700

On Fri, Mar 24, 2023 at 9:27 AM Drew Farris <[email protected]> wrote:
>
> I'll echo that the bulk import to offline tables is a useful feature and it
> would be great to maintain this if we can.
>
> In the import table use case, for example, keeping the table offline allows
> us to perform external validation on the metadata table that all expected
> rfiles have been imported prior to allowing compactions to occur. This is
> slightly different from the import directory case but I think it is
> reasonable to extend this concept to that as well.
>
> External compactions for offline tables sounds useful as well.


One realization that came out of examining the different table states is
that export table currently relies on the fact that offline tables
will not delete files.  If we enable compactions on offline tables
then that could cause files to be deleted which would break the
expectation of export table.

Pulling this thread further it was realized that certain Accumulo API
operations currently throw exceptions when a table is offline.  For
example splits, compactions, and merges all throw table offline
exceptions.  So the export table issue plus the current behavior
behind the existing APIs made us think it's probably best to leave the
behavior for offline tables as is.

After we add ondemand tables, from an implementation perspective it
would be easy to change the behavior of offline tables to support
compact, split, and merge, because we are planning to no longer require
tablets to be hosted for these operations.  However, it would be just
as easy to maintain the current behavior of offline tables.  We just
need to decide what we want.
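
To make the two options concrete, here is a rough sketch of the choice
as a toy model.  It is not Accumulo code: TableState, TableOfflineError,
check_operation, and the allow_offline_ops flag are all hypothetical
names made up for illustration.  It only assumes what is described
above, that split, compact, and merge are rejected on offline tables
today.

```python
from enum import Enum


class TableState(Enum):
    # Hypothetical state names for illustration only.
    ONLINE = "online"
    ONDEMAND = "ondemand"
    OFFLINE = "offline"


class TableOfflineError(Exception):
    """Stand-in for the table offline exception described above."""


# Operations that are rejected on offline tables today, per the discussion.
RESTRICTED_WHEN_OFFLINE = {"split", "compact", "merge"}


def check_operation(state, operation, allow_offline_ops=False):
    """Return True if the operation may run, else raise TableOfflineError.

    allow_offline_ops=False models keeping the current behavior;
    allow_offline_ops=True models the alternative where offline tables
    also support these operations.
    """
    if (
        state is TableState.OFFLINE
        and operation in RESTRICTED_WHEN_OFFLINE
        and not allow_offline_ops
    ):
        raise TableOfflineError(f"cannot {operation}: table is offline")
    return True
```

With the flag left at its default, an offline table never compacts, so
its files are never deleted and the expectation of export table holds.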

>
> On Thu, Mar 23, 2023 at 8:15 PM Christopher <[email protected]> wrote:
>
> > In that case, I think it's probably sufficient to let the users know the
> > risks of bulk importing and never bringing it online for compactions. It
> > seems like that's a risk some users might be okay with for their use case.
> >
> > On Thu, Mar 23, 2023, 19:38 Dave Marion <[email protected]> wrote:
> >
> > > Yes, if the table is never brought online. I believe that Keith said that
> > > > the table could still be scanned when offline with existing MapReduce code
> > > > or the OfflineScanner, which presents an issue that is not currently
> > > handled. I think we discussed today that the same thing could be achieved
> > > with tables in the on demand state. The reason to not modify an offline
> > > table is the export case, where the table needs to be immutable until the
> > > files are copied.
> > >
> > > On Thu, Mar 23, 2023, 6:58 PM Christopher <[email protected]> wrote:
> > >
> > >> What do you mean by "when not used in this manner"? What other way is
> > >> there to use that feature? Do you mean simply never being brought
> > >> online?
> > >>
> > >> Would it be possible to support (external) compactions for an offline
> > >> table?
> > >>
> > >> I feel like that's a pretty useful feature to revert, and would want
> > >> to consider alternatives.
> > >>
> > > >> On Thu, Mar 23, 2023 at 6:39 PM Dave Marion <[email protected]> wrote:
> > >> >
> > > >> > Keith and I had a discussion today (that included some user input)
> > > >> > regarding table operations with the new OnDemand table concept. I have put
> > > >> > the notes up on the wiki at:
> > > >> >
> > > >> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=247828052
> > > >> >
> > > >> > One thing that came out of that is that we may want to revert the change in
> > > >> > the new bulk import code that allows a user to import into an offline
> > > >> > table. The feature allows a user to create a table that is initially
> > > >> > offline, bulk import data into it, then bring it online. However, when not
> > > >> > used in this manner the number of bulk import files would continue to grow
> > > >> > because compactions are never run on the table.
> > >> >
> > > >> > On Mon, Mar 20, 2023 at 9:37 AM Dave Marion <[email protected]> wrote:
> > >> >
> > >> > > Following up on this. Discussion and design documents are up on the
> > > >> > > wiki[1]. There is a GitHub project[2] for planning out some of the tasks,
> > > >> > > which are then turned into issues. Some of the issues have draft PRs
> > >> > > submitted for them.
> > >> > >
> > >> > > [1] https://cwiki.apache.org/confluence/display/ACCUMULO/Elasticity
> > >> > > [2] https://github.com/orgs/apache/projects/164
> > >> > >
> > > >> > > On Wed, Feb 22, 2023 at 2:35 PM Dave Marion <[email protected]> wrote:
> > >> > >
> > > >> > >> Except for the new bulk import code, Accumulo requires that tables are in
> > > >> > >> an online state to work with them (ingest, scan, compact, split, etc.). In
> > > >> > >> some cases this could become cost prohibitive and resource inefficient as
> > > >> > >> resources necessary to keep the tables online might be unused. I'd like to
> > > >> > >> propose a new capability for Accumulo - the ability to work with tables
> > > >> > >> that are not online. This could either mean working with tables in an
> > > >> > >> offline state, or maybe the ability to assign/host tables/tablets on
> > > >> > >> demand.
> > >> > >>
> > > >> > >> At a high level the two ideas currently being discussed are below. I
> > > >> > >> think in both cases the root and metadata tables must be online, table
> > > >> > >> management functions move to manager components, and compactions of offline
> > > >> > >> tables move to the external compaction processes. In addition, new metrics
> > > >> > >> would need to be emitted so that an external resource scheduler could spin
> > > >> > >> up/down server processes as demand changes.
> > >> > >>
> > >> > >>
> > >> > >> *Offline Operations*
> > >> > >>
> > > >> > >> This approach allows all operations to occur on offline tables at the
> > > >> > >> cost of having eventual consistency to the data at scan time (via Scan
> > > >> > >> Servers only). Live ingest could be supported through the creation of an
> > > >> > >> ingest server component that just receives mutations and minor compacts.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> *On-demand Tables*
> > > >> > >> This approach allows for user tables to be offline and un-hosted, but
> > > >> > >> hosts them on demand for the purpose of live ingest and immediate scans at
> > > >> > >> the latency cost of possibly assigning and hosting the tablet.
> > >> > >>
> > > >> > >> We have a few releases (1.10.3, 2.1.1, and 3.0.0) coming up in likely the
> > > >> > >> next month or two, but after that I'd like to start implementing something
> > > >> > >> to address this. Please contribute to the discussion if you have thoughts
> > > >> > >> on requirements, design, etc.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >>
> > >
> >
