I guess hdfs has overhead, so I don't worry about that. So in my case, I had stored some dozens of rows, and heaps of columns in each, with values in the 50-100 character range. When doing "scan -t dataTable" I got back a dozen or more pages filled with more than 100 characters per line, and "du" was reporting "5170"... Hence I was a bit surprised.
Yes, my values are highly repetitive and subject to good compression. So, I am all good!!

Thanks for accurate and speedy responses. Really appreciated.

Niclas

On Wed, Apr 15, 2020 at 12:41 PM Christopher <[email protected]> wrote:
> The `du` command should show in bytes. Keep in mind that Accumulo compresses data in its files. If the number doesn't match what you see for the *.rf files in Hadoop, there may be a bug. Please let us know if you find this to be the case.
>
> On Tue, Apr 14, 2020 at 10:30 PM Niclas Hedhman <[email protected]> wrote:
> >
> > Yes, a bit of experimentation and I figured that out.
> >
> > As for the "putIfAbsent"; I can actually figure that out from the data being written in this case, effectively an event store, and all rows start with a "created" event.
> >
> > One more small question;
> > there is a "du" command, does it really report "bytes" or is it kB, of storage space needed? The number seems too small for bytes, and if in kB then it is over the hdfs physical disk usage...
> >
> > Cheers
> > Niclas
> >
> > On Tue, Apr 14, 2020 at 9:49 PM Adam J. Shook <[email protected]> wrote:
> >>
> >> limitVersion = false would *not* set the default VersioningIterator, effectively keeping every entry you write to Accumulo. Sounds like it hits your requirement of "versions never to be removed", though keep in mind that your static "metadata" qualifier would also never be versioned/deleted.
> >>
> >> On Mon, Apr 13, 2020 at 8:47 PM Niclas Hedhman <[email protected]> wrote:
> >>>
> >>> Ah! I had some misunderstandings implanted in me, and good to get corrected.
> >>>
> >>> For
> >>>
> >>> connector.tableOperations().create(String tableName, boolean limitVersion);
> >>>
> >>>
> >>> Will limitVersion=false disable versioning completely and I will always only have one version, or will it have a "no limit" and "no removal" policy of versions?
> >>>
> >>> Well, to be clear, I am looking for "versions never to be removed", a requirement that made me smile and remember "Accumulo can do that automatically", rather than implement that at a higher level.
> >>>
> >>> Thanks
> >>>
> >>> On Tue, Apr 14, 2020 at 12:55 AM Adam J. Shook <[email protected]> wrote:
> >>>>
> >>>> Hi Niclas,
> >>>>
> >>>> 1. Accumulo uses a VersioningIterator for all tables which ensures that you see the latest version of a particular entry, defined as the entry that has the highest value for the timestamp. Older versions of the same key (row ID + family + qualifier + visibility) are compacted away by Accumulo and will eventually be deleted. You can set the number of versions you want to keep to something other than the default of 1 (see https://accumulo.apache.org/1.9/accumulo_user_manual.html#_versioning_iterators_and_timestamps ).
> >>>>
> >>>> 2. Related to #1, Accumulo will update the value to the latest version of entry. I believe if you keep writing the same entry with the same data over and over again, you'll see them if you are keeping more than one version of the same entry. AFAIK there is no "put if absent" behavior without reading for every write. You can, of course, configure an existing iterator or write your own to achieve whatever logic you want as far as what versions to keep of what columns of your data model.
> >>>>
> >>>> 3. The "Scanner" will return entries in order. Related to #1, it will only return the latest version of an entry (by default). If you are keeping more versions of the same entry, then you would see the newest entry first. The "BatchScanner" is multi-threaded and communicates to several tablets at once, returning entries out of order. One common pattern is to use the WholeRowIterator when scanning.
> >>>> This iterator serializes all entries with the same row into one entry on the server side, then you can deserialize the row on the client side to view the entire contents of a row at once. The order of the rows themselves is still undefined when using a BatchScanner due to the multi-threaded nature of the scanner.
> >>>>
> >>>> Hope this helps!
> >>>> --Adam
> >>>>
> >>>> On Mon, Apr 13, 2020 at 12:57 AM Niclas Hedhman <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>> I am steaming new on Accumulo, but tasked to put it into what used to be Apache Polygene (now in Attic) as an entity store, one that keeps history.
> >>>>>
> >>>>> I have a couple of questions;
> >>>>> 1. Assuming that I can guarantee that no one executes any explicit deletes, can I rely on the mutation sequences not disappearing over time?
> >>>>>
> >>>>> 2. Part of storing a row, I have a "metadata" qualifier, that contains static information. But since I don't know whether the row exists without reading it first, then IIUIC I will fill the "metadata" with the same information over and over again.... OR, does Accumulo realize that this is the same byte[] as before and won't update the value, alternatively creating a new Key, but pointing to the same Value? I effectively want a "putIfAbsent()"
> >>>>>
> >>>>> 3. The Scanner can fetch multiple rows, and constrained by CF and qualifier. I think that is quite clear. But what does the iterator() actually return? I presume that it is many key/value pairs, of ALL timestamped values. But what are the order guarantees here? I get the impression that within a row->cf->qualifier, the returned values are in timestamp order, newest first. And I think that within a row, I am guaranteed that the order is maintained, i.e. row -> cf -> qualifier (all ascending). But am I also guaranteed that the iterator is "done" with a row when the row has changed? Or can rows be interleaved in the iterator?
> >>>>>
> >>>>> Thanks in advance
> >>>>> Niclas
>
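
A note for readers of this thread: the ordering Adam describes in point 3, together with the effect of the version limit from point 1, can be illustrated with a small plain-Java model. This is only a sketch and does not use the Accumulo client API. The class `VersionOrderSketch`, its `Entry` type and the `scan` method are hypothetical names invented for illustration, visibility is left out of the key, and Java strings stand in for Accumulo's byte sequences. It sorts entries by row, family and qualifier ascending with the newest timestamp first, then keeps at most `maxVersions` entries per column, which is roughly what a single-threaded Scanner would show with a VersioningIterator configured.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionOrderSketch {

    /** Simplified stand-in for an Accumulo key/value pair (visibility omitted). */
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;
        final String value;

        Entry(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return row + " " + family + ":" + qualifier + " @" + timestamp + " = " + value;
        }
    }

    /** Row, family and qualifier ascending; timestamp descending (newest first). */
    static int compareKeys(Entry a, Entry b) {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.qualifier.compareTo(b.qualifier);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp);
        return c;
    }

    /** True when both entries belong to the same row, family and qualifier. */
    static boolean sameColumn(Entry a, Entry b) {
        return a.row.equals(b.row)
            && a.family.equals(b.family)
            && a.qualifier.equals(b.qualifier);
    }

    /** Returns entries in key order, keeping at most maxVersions per column. */
    static List<Entry> scan(List<Entry> written, int maxVersions) {
        List<Entry> sorted = new ArrayList<>(written);
        sorted.sort(VersionOrderSketch::compareKeys);

        List<Entry> result = new ArrayList<>();
        Entry previous = null;
        int seen = 0; // versions seen so far for the current column
        for (Entry e : sorted) {
            seen = (previous != null && sameColumn(previous, e)) ? seen + 1 : 1;
            if (seen <= maxVersions) {
                result.add(e);
            }
            previous = e;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> written = List.of(
            new Entry("entity-1", "event", "state", 1L, "created"),
            new Entry("entity-1", "event", "state", 2L, "updated"),
            new Entry("entity-1", "meta", "type", 1L, "Person"),
            new Entry("entity-1", "meta", "type", 2L, "Person"));

        System.out.println("Latest version only:");
        scan(written, 1).forEach(System.out::println);

        System.out.println("All versions kept:");
        scan(written, Integer.MAX_VALUE).forEach(System.out::println);
    }
}
```

In this model, passing 1 mimics the default table setup, while `Integer.MAX_VALUE` approximates a table created with limitVersion=false, where every written version, including each repeated "metadata" write, stays visible.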
