I guess HDFS has overhead, so I won't worry about that.

So in my case, I had stored a few dozen rows, each with heaps of columns
and values in the 50-100 character range. When doing "scan -t
dataTable" I got back a dozen or more pages filled with more than 100
characters per line, yet "du" was reporting "5170"... Hence I was a bit
surprised.

Yes, my values are highly repetitive and subject to good compression.

So, I am all good!!

Thanks for the accurate and speedy responses. Really appreciated.
Niclas

On Wed, Apr 15, 2020 at 12:41 PM Christopher <[email protected]> wrote:

> The `du` command should report sizes in bytes. Keep in mind that Accumulo
> compresses data in its files. If the number doesn't match what you see
> for the *.rf files in Hadoop, there may be a bug. Please let us know
> if you find this to be the case.
>
> On Tue, Apr 14, 2020 at 10:30 PM Niclas Hedhman <[email protected]> wrote:
> >
> > Yes, a bit of experimentation and I figured that out.
> >
> > As for the "putIfAbsent": I can actually infer that from the data
> > being written in this case; it is effectively an event store, and all
> > rows start with a "created" event.
> >
> > One more small question:
> > there is a "du" command; does it report the storage space needed in
> > bytes or in kB? The number seems too small for bytes, and if it is kB
> > then it exceeds the physical HDFS disk usage...
> >
> > Cheers
> > Niclas
> >
> > On Tue, Apr 14, 2020 at 9:49 PM Adam J. Shook <[email protected]>
> wrote:
> >>
> >> limitVersion = false would *not* set the default VersioningIterator,
> >> effectively keeping every entry you write to Accumulo.  Sounds like it
> >> meets your requirement of "versions never to be removed", though keep in
> >> mind that your static "metadata" qualifier would also never be
> >> versioned/deleted.
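> >>
> >> For what it's worth, a minimal sketch of that call (the table name
> >> "events" is a placeholder, assuming the 1.x Java client API):
> >>
> >>   import org.apache.accumulo.core.client.Connector;
> >>
> >>   // limitVersion = false creates the table *without* the default
> >>   // VersioningIterator, so every timestamped entry is retained.
> >>   void createUnversionedTable(Connector connector) throws Exception {
> >>     if (!connector.tableOperations().exists("events")) {
> >>       connector.tableOperations().create("events", false);
> >>     }
> >>   }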
> >>
> >> On Mon, Apr 13, 2020 at 8:47 PM Niclas Hedhman <[email protected]>
> wrote:
> >>>
> >>> Ah! I had some misunderstandings implanted in me, and good to get
> corrected.
> >>>
> >>> For
> >>>
> >>> connector.tableOperations().create(String tableName, boolean
> >>> limitVersion);
> >>>
> >>>
> >>> Will limitVersion=false disable versioning completely, so that I
> >>> always have only one version, or will it give a "no limit" and "no
> >>> removal" policy for versions?
> >>>
> >>> Well, to be clear, I am looking for "versions never to be removed", a
> >>> requirement that made me smile and remember that "Accumulo can do that
> >>> automatically", rather than implementing it at a higher level.
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Apr 14, 2020 at 12:55 AM Adam J. Shook <[email protected]>
> wrote:
> >>>>
> >>>> Hi Niclas,
> >>>>
> >>>> 1. Accumulo uses a VersioningIterator for all tables, which ensures
> >>>> that you see the latest version of a particular entry, defined as the
> >>>> entry with the highest timestamp.  Older versions of the same key (row
> >>>> ID + family + qualifier + visibility) are compacted away by Accumulo
> >>>> and will eventually be deleted.  You can set the number of versions
> >>>> you want to keep to something other than the default of 1 (see
> >>>> https://accumulo.apache.org/1.9/accumulo_user_manual.html#_versioning_iterators_and_timestamps
> >>>> ).
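> >>>>
> >>>> For illustration, a rough sketch of changing the limit (the table name
> >>>> is a placeholder; the property names mirror the manual page above):
> >>>>
> >>>>   import org.apache.accumulo.core.client.Connector;
> >>>>   import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
> >>>>
> >>>>   // Keep up to 3 versions instead of the default 1, for all three
> >>>>   // iterator scopes (scan, minc, majc).
> >>>>   void keepThreeVersions(Connector connector) throws Exception {
> >>>>     for (IteratorScope scope : IteratorScope.values()) {
> >>>>       String prop = "table.iterator." + scope.name()
> >>>>           + ".vers.opt.maxVersions";
> >>>>       connector.tableOperations().setProperty("dataTable", prop, "3");
> >>>>     }
> >>>>   }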
> >>>>
> >>>> 2. Related to #1, Accumulo will update the value to the latest
> >>>> version of the entry.  I believe that if you keep writing the same
> >>>> entry with the same data over and over again, you'll see all of them
> >>>> if you are keeping more than one version of the same entry.  AFAIK
> >>>> there is no "put if absent" behavior without reading before every
> >>>> write.  You can, of course, configure an existing iterator or write
> >>>> your own to achieve whatever logic you want as far as which versions
> >>>> to keep of which columns of your data model.
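> >>>>
> >>>> If it helps, a rough sketch of the read-before-write approach (the
> >>>> table, family, and qualifier names are placeholders, not from your
> >>>> schema):
> >>>>
> >>>>   import org.apache.accumulo.core.client.BatchWriter;
> >>>>   import org.apache.accumulo.core.client.Connector;
> >>>>   import org.apache.accumulo.core.client.Scanner;
> >>>>   import org.apache.accumulo.core.data.Mutation;
> >>>>   import org.apache.accumulo.core.data.Range;
> >>>>   import org.apache.accumulo.core.data.Value;
> >>>>   import org.apache.accumulo.core.security.Authorizations;
> >>>>
> >>>>   // Emulate putIfAbsent: write "metadata" only if the row lacks it.
> >>>>   void writeMetadataIfAbsent(Connector connector, BatchWriter writer,
> >>>>       String rowId, byte[] metadata) throws Exception {
> >>>>     Scanner scanner = connector.createScanner("events",
> >>>>         Authorizations.EMPTY);
> >>>>     scanner.setRange(Range.exact(rowId, "meta"));  // metadata family
> >>>>     boolean exists = scanner.iterator().hasNext();
> >>>>     scanner.close();
> >>>>     if (!exists) {
> >>>>       Mutation m = new Mutation(rowId);
> >>>>       m.put("meta", "metadata", new Value(metadata));
> >>>>       writer.addMutation(m);
> >>>>     }
> >>>>   }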
> >>>>
> >>>> 3. The "Scanner" will return entries in order.  Related to #1, it
> >>>> will only return the latest version of an entry (by default).  If you
> >>>> are keeping more versions of the same entry, then you would see the
> >>>> newest entry first.  The "BatchScanner" is multi-threaded and
> >>>> communicates with several tablets at once, returning entries out of
> >>>> order.  One common pattern is to use the WholeRowIterator when
> >>>> scanning.  This iterator serializes all entries with the same row into
> >>>> one entry on the server side; you can then deserialize the row on the
> >>>> client side to view the entire contents of a row at once.  The order
> >>>> of the rows themselves is still undefined when using a BatchScanner
> >>>> due to the multi-threaded nature of the scanner.
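> >>>>
> >>>> A rough sketch of that pattern (table name and thread count are
> >>>> placeholders):
> >>>>
> >>>>   import java.util.Collections;
> >>>>   import java.util.Map;
> >>>>   import java.util.SortedMap;
> >>>>   import org.apache.accumulo.core.client.BatchScanner;
> >>>>   import org.apache.accumulo.core.client.Connector;
> >>>>   import org.apache.accumulo.core.client.IteratorSetting;
> >>>>   import org.apache.accumulo.core.data.Key;
> >>>>   import org.apache.accumulo.core.data.Range;
> >>>>   import org.apache.accumulo.core.data.Value;
> >>>>   import org.apache.accumulo.core.iterators.user.WholeRowIterator;
> >>>>   import org.apache.accumulo.core.security.Authorizations;
> >>>>
> >>>>   // Each entry the BatchScanner returns is one serialized row, which
> >>>>   // decodeRow() turns back into the row's individual key/value pairs.
> >>>>   void scanWholeRows(Connector connector) throws Exception {
> >>>>     BatchScanner bs = connector.createBatchScanner("events",
> >>>>         Authorizations.EMPTY, 4);
> >>>>     bs.setRanges(Collections.singleton(new Range()));  // all rows
> >>>>     bs.addScanIterator(
> >>>>         new IteratorSetting(50, "wholeRow", WholeRowIterator.class));
> >>>>     for (Map.Entry<Key, Value> entry : bs) {
> >>>>       SortedMap<Key, Value> row =
> >>>>           WholeRowIterator.decodeRow(entry.getKey(), entry.getValue());
> >>>>       // ... process the fully assembled row here ...
> >>>>     }
> >>>>     bs.close();
> >>>>   }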
> >>>>
> >>>> Hope this helps!
> >>>> --Adam
> >>>>
> >>>> On Mon, Apr 13, 2020 at 12:57 AM Niclas Hedhman <[email protected]>
> wrote:
> >>>>>
> >>>>> Hi,
> >>>>> I am brand new to Accumulo, but tasked with putting it into what
> >>>>> used to be Apache Polygene (now in the Attic) as an entity store, one
> >>>>> that keeps history.
> >>>>>
> >>>>> I have a couple of questions:
> >>>>> 1. Assuming that I can guarantee that no one executes any explicit
> >>>>> deletes, can I rely on the mutation sequences not disappearing over
> >>>>> time?
> >>>>>
> >>>>> 2. As part of storing a row, I have a "metadata" qualifier that
> >>>>> contains static information. But since I don't know whether the row
> >>>>> exists without reading it first, IIUIC I will fill "metadata" with
> >>>>> the same information over and over again... OR does Accumulo realize
> >>>>> that this is the same byte[] as before and not update the value, or
> >>>>> alternatively create a new Key pointing to the same Value?  I
> >>>>> effectively want a "putIfAbsent()".
> >>>>>
> >>>>> 3. The Scanner can fetch multiple rows, constrained by CF and
> >>>>> qualifier. I think that is quite clear. But what does the iterator()
> >>>>> actually return? I presume that it is many key/value pairs, of ALL
> >>>>> timestamped values. But what are the order guarantees here? I get the
> >>>>> impression that within a row->cf->qualifier, the returned values are
> >>>>> in timestamp order, newest first. And I think that within a row, I am
> >>>>> guaranteed that the order is maintained, i.e. row -> cf -> qualifier
> >>>>> (all ascending). But am I also guaranteed that the iterator is "done"
> >>>>> with a row when the row changes? Or can rows be interleaved in the
> >>>>> iterator?
> >>>>>
> >>>>> Thanks in advance
> >>>>> Niclas
>
