Re: pre-sorting row keys vs not pre-sorting row keys

2015-10-29 Thread Christopher
How many tablets were these batches going to?

How much were the column updates spread across mutations? 1 mutation
per update? or grouped by row?

10k also seems like a very small number. I'd be curious to know where
the error bars are around that 50% value.
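
For reference, a rough sketch of the two write patterns being compared (not
from the thread; "conn" is an assumed Connector, "test" an assumed table, and
it uses Mutation/BatchWriter/BatchWriterConfig from the client API plus
Hadoop's WritableComparator). The only difference is whether the batch is
ordered by row key before being handed to the BatchWriter:

    List<Mutation> batch = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
      Mutation m = new Mutation("row" + i);
      m.put("cf", "cq", "value" + i);
      batch.add(m);
    }
    // "sorted" case only: order the batch by row key first
    batch.sort((a, b) -> WritableComparator.compareBytes(
        a.getRow(), 0, a.getRow().length, b.getRow(), 0, b.getRow().length));
    BatchWriter bw = conn.createBatchWriter("test", new BatchWriterConfig());
    bw.addMutations(batch);
    bw.close();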

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii


On Thu, Oct 29, 2015 at 3:30 PM, Ara Ebrahimi
 wrote:
> Hi,
>
> We just did a simple test:
>
> - insert 10k batches of columns
> - sort the same 10k batch based on row keys and insert
>
> So basically the batch writer in the first test has items in non-sorted order 
> and in the second one in sorted order. We noticed 50% better performance in 
> the sorted version! Why is that the case? Is this something we need to 
> consider doing for live ingest scenarios?
>
> Thanks,
> Ara.


Re: how to maintain versioning in D4M schema?

2015-11-30 Thread Christopher
I can think of two options:

1. Instead of "field|value", use "field|value|version", where version
behaves similarly to Accumulo's timestamp field, and add a custom iterator
which achieves the same effect as the VersioningIterator using this part of
the colq.

2. Instead of putting each "value" in its own field, you could combine them
into an ordered set: field|{time1:value1,time2:value2,time3:value3}. For
this to work well, you'd have to write a custom combining iterator that
kept only the most recent 3 during scans and compactions, based on time (or
whatever you use to denote version).

Of the two, I think the second is simpler and fits best within the existing
D4M schema. At the most, it just adds some structure to the value, which
can be processed with an additional combining iterator, but doesn't
fundamentally change the table structure.
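
For option 2, a minimal sketch (untested) of such a combining iterator,
assuming the value holds comma-separated "time:value" pairs, which is just one
possible encoding:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Combiner;

    public class KeepLatestNCombiner extends Combiner {
      private static final int MAX_VERSIONS = 3; // keep the 3 most recent

      @Override
      public Value reduce(Key key, Iterator<Value> iter) {
        // Merge all "time:value" pairs across the values, keyed (and sorted) by time.
        TreeMap<String,String> byTime = new TreeMap<>();
        while (iter.hasNext()) {
          for (String pair : new String(iter.next().get()).split(",")) {
            if (pair.isEmpty())
              continue;
            int idx = pair.indexOf(':');
            byTime.put(pair.substring(0, idx), pair.substring(idx + 1));
          }
        }
        // Drop the oldest entries until only MAX_VERSIONS remain.
        while (byTime.size() > MAX_VERSIONS) {
          byTime.remove(byTime.firstKey());
        }
        // Re-encode the surviving pairs.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String,String> e : byTime.entrySet()) {
          if (sb.length() > 0)
            sb.append(',');
          sb.append(e.getKey()).append(':').append(e.getValue());
        }
        return new Value(sb.toString().getBytes());
      }
    }

It would be attached for scan, minc, and majc like any other combiner, with
Combiner.setColumns(...) or the combine-all-columns option on the
IteratorSetting.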

On Sun, Nov 29, 2015 at 11:10 PM shweta.agrawal 
wrote:

> The example I am working with is:
>
> rowid    colf    colq            value
> id               field|value1    1
> id               field|value2    1
> id               field|value3    1
> id               field|value4    1
> id               field|value5    1
> id               field|value6    1
>
> This is my schema in D4M style. Here one field has multiple values. And
> I want to keep latest 3 values and I want that automatically other
> values to be deleted as in case of versioning iterator.
>
> So after versioning my table should look like this:
>
> rowid    colf    colq            value
> id               field|value1    1
> id               field|value2    1
> id               field|value3    1
>
> Thanks
> Shweta
>
> On Friday 27 November 2015 07:15 PM, Jeremy Kepner wrote:
> > Can you provide a made up specific example?  I think that will
> > make the discussion easier.
> >
> >
> > On Fri, Nov 27, 2015 at 02:46:33PM +0530, shweta.agrawal wrote:
> >> Thanks for the answer.
> >> But I am asking about versioning in D4M style. How can I use the
> >> versioning iterator in D4M style? In D4M style, only the id is stored
> >> in the RowID and field|value is stored in the ColumnQualifier. So, as
> >> the value is stored in the ColumnQualifier, I cannot maintain versions
> >> through the versioning iterator. So I am asking: how will I maintain
> >> versioning in D4M style?
> >>
> >> Thanks
> >> Shweta
> >>
> >> On Friday 27 November 2015 12:45 PM, Dylan Hutchison wrote:
> >>> In order to store five versions of a key but return only one of
> >>> them during a scan, set the minc and majc VersioningIterator to 5
> >>> and set the scan VersioningIterator to 1.  You can set scanning
> >>> iterators on a per-scan basis if this helps.
> >>>
> >>> It is not necessary to put the timestamp in the column family if
> >>> you are going with the VersioningIterator approach.
> >>>
> >>> There are many ways to achieve versioning in Accumulo. As the
> >>> designer/programmer, you must choose one that fits your
> >>> application, of which we do not know the full details. It sounds
> >>> like you have narrowed your choice to (1) putting the timestamp in
> >>> the column family, or (2) not putting the timestamp anywhere else
> >>> but instead changing the VersioningIterator such that Accumulo
> >>> stores more versions than the latest version of a
> >>> (row,colfam,colqual,colvis) key.
> >>>
> >>>
> >>>
> >>> On Thu, Nov 26, 2015 at 8:45 PM, mohit.kaushik wrote:
> >>>
> >>> David,
> >>>
> >>> But this is the case when we store versions based on timestamp
> >>> field. The point is, in D4M schema we can not achieve it by doing
> >>> this. In this case we are considering CF to store timestamp in
> >>> reverse order as described by Dylan. Then how can we configure
> >>> Accumulo to return only latest version and store only 5 versions?
> >>>
> >>> Thanks
> >>> Mohit Kaushik
> >>>
> >>> On 11/27/2015 09:54 AM, David Medinets wrote:
>   From the user manual:
> 
>  user@myinstance mytable> config -t mytable -s table.iterator.scan.vers.opt.maxVersions=5
>  user@myinstance mytable> config -t mytable -s table.iterator.minc.vers.opt.maxVersions=5
>  user@myinstance mytable> config -t mytable -s table.iterator.majc.vers.opt.maxVersions=5
> 
>  On Thu, Nov 26, 2015 at 11:10 PM, shweta.agrawal wrote:
> 
>  I want to maintain 5 versions only and user can enter any
>  number of versions but I want to keep only 5 latest version.
> 
> 
>  On Friday 27 November 2015 09:38 AM, David Medinets wrote:
> > Do you want five versions of every entry or will the number
> > of versions vary?
> >
> > On Thu, Nov 26, 2015 at 10:53 PM, shweta.agrawal
> >  >  

Re: Trigger for Accumulo table

2015-12-02 Thread Christopher
You could also implement a constraint to notify an external system when a
row is updated.

On Wed, Dec 2, 2015, 22:54 Josh Elser  wrote:

> oops :)
>
> [1] http://fluo.io/
>
> Josh Elser wrote:
> > Hi Thai,
> >
> > There is no out-of-the-box feature provided with Accumulo that does what
> > you're asking for. Accumulo doesn't provide any functionality to push
> > notifications to other systems. You could potentially maintain other
> > tables/columns in which you maintain the last time a row was updated,
> > but the onus is on your "other services" to read the table to find out
> > when a change occurred (which is probably not scalable at "real time").
> >
> > There are other systems you could likely leverage to solve this,
> > depending on the durability and scalability that your application needs.
> >
> > For a system "close" to Accumulo, you could take a look at Fluo [1]
> > which is an implementation of Google's "Percolator" system. This is a
> > system based on throughput rather than low-latency, so it may not be a
> > good fit for your needs. There are probably other systems in the Apache
> > ecosystem (Kafka, Storm, Flink or Spark Streaming maybe?) that may be
> > helpful to your problem. I'm not enough of an expert on these to recommend one (nor
> > do I think I understand your entire architecture well enough).
> >
> > Thai Ngo wrote:
> >> Hi list,
> >>
> >> I have a use-case where existing rows in a table will be updated by an
> >> internal service. Data in a row of this table is composed of 2 parts:
> >> the 1st part is immutable and the 2nd one will be updated (filled in) a
> >> little later.
> >>
> >> Currently, I need to know when and which rows will be updated in the
> >> table so that other services can wisely start consuming the data. It
> >> will matter even more when I need to consume the data in near realtime.
> >> So developing a notification function, or more simply a trigger, is
> >> what I really want to do now.
> >>
> >> I am curious to know if someone has done a similar job, or whether there
> >> are features, APIs, or best practices available for Accumulo so far. I'm
> >> thinking of letting the internal service which updates the data notify
> >> us whenever it updates the data.
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> Thai
>


Re: Trigger for Accumulo table

2015-12-08 Thread Christopher
Look at org.apache.accumulo.core.constraints.Constraint for a description
and org.apache.accumulo.core.constraints.DefaultKeySizeConstraint as an
example.

In short, Mutations which are live-ingested into a tablet server are
validated against constraints you specify on the table. That means that all
Mutations written to a table go through this bit of user-provided code at
least once. You could use that fact to your advantage. However, this would
be highly experimental and might have some caveats to consider.

You can configure a constraint on a table with
connector.tableOperations().addConstraint(...)
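
A minimal sketch (untested) of what such a "notifying" constraint could look
like; it just logs the updated row here, where a real implementation would
call out to your external system instead:

    import java.util.List;

    import org.apache.accumulo.core.constraints.Constraint;
    import org.apache.accumulo.core.data.Mutation;

    public class RowUpdateNotifierConstraint implements Constraint {

      @Override
      public String getViolationDescription(short violationCode) {
        return null; // this constraint never reports violations
      }

      @Override
      public List<Short> check(Environment env, Mutation mutation) {
        try {
          // Replace this with a call to your external system.
          System.out.println("row updated: " + new String(mutation.getRow()));
        } catch (Exception e) {
          // Never fail the write just because the notification failed.
        }
        return null; // no violations, so the mutation is accepted as usual
      }
    }

It would then be attached with something like
connector.tableOperations().addConstraint("mytable",
RowUpdateNotifierConstraint.class.getName()), keeping in mind (as noted later
in this thread) that constraints are checked before the data is written.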

On Sun, Dec 6, 2015 at 10:49 PM Thai Ngo  wrote:

> Christopher,
>
> This is interesting! Could you please give me more details about this?
>
> Thanks,
> Thai
>
> On Thu, Dec 3, 2015 at 12:17 PM, Christopher  wrote:
>
>> You could also implement a constraint to notify an external system when a
>> row is updated.
>>
>> On Wed, Dec 2, 2015, 22:54 Josh Elser  wrote:
>>
>>> oops :)
>>>
>>> [1] http://fluo.io/
>>>
>>> Josh Elser wrote:
>>> > Hi Thai,
>>> >
>>> > There is no out-of-the-box feature provided with Accumulo that does
>>> what
>>> > you're asking for. Accumulo doesn't provide any functionality to push
>>> > notifications to other systems. You could potentially maintain other
>>> > tables/columns in which you maintain the last time a row was updated,
>>> > but the onus is on your "other services" to read the table to find out
>>> > when a change occurred (which is probably not scalable at "real time").
>>> >
>>> > There are other systems you could likely leverage to solve this,
>>> > depending on the durability and scalability that your application
>>> needs.
>>> >
>>> > For a system "close" to Accumulo, you could take a look at Fluo [1]
>>> > which is an implementation of Google's "Percolator" system. This is a
>>> > system based on throughput rather than low-latency, so it may not be a
>>> > good fit for your needs. There are probably other systems in the Apache
> >>> > ecosystem (Kafka, Storm, Flink or Spark Streaming maybe?) that may be
> >>> > helpful to your problem. I'm not enough of an expert on these to recommend one
> >>> > (nor do I think I understand your entire architecture well enough).
>>> >
>>> > Thai Ngo wrote:
>>> >> Hi list,
>>> >>
>>> >> I have a use-case when existing rows in a table will be updated by an
>>> >> internal service. Data in a row of this table is composed of 2 parts:
>>> >> 1st part - immutable and the 2nd one - will be updated (filled in) a
>>> >> little later.
>>> >>
>>> >> Currently, I have a need of knowing when and which rows will be
>>> updated
>>> >> in the table so that other services will be wisely start consuming the
>>> >> data. It will make more sense when I need to consume the data in near
>>> >> realtime. So developing a notification function or simpler - a trigger
>>> >> is what I really want to do now.
>>> >>
>>> >> I am curious to know if someone has done similar job or there are
>>> >> features or APIs or best practices available for Accumulo so far. I'm
>>> >> thinking of letting the internal service which updates the data notify
>>> >> us whenever it updates the data.
>>> >>
>>> >> What do you think?
>>> >>
>>> >> Thanks,
>>> >> Thai
>>>
>>
>


Re: Trigger for Accumulo table

2015-12-08 Thread Christopher
In the future, it might be useful to provide a supported API hook here. It
certainly would've made implementing replication easier, but could also be
useful as a notification system.

On Tue, Dec 8, 2015 at 4:51 PM Keith Turner  wrote:

> Constraints are checked before data is written.  In the case of failures, a
> constraint may see data that's never successfully written.
>
> On Tue, Dec 8, 2015 at 4:18 PM, Christopher  wrote:
>
>> Look at org.apache.accumulo.core.constraints.Constraint for a description
>> and org.apache.accumulo.core.constraints.DefaultKeySizeConstraint as an
>> example.
>>
>> In short, Mutations which are live-ingested into a tablet server are
>> validated against constraints you specify on the table. That means that all
>> Mutations written to a table go through this bit of user-provided code at
>> least once. You could use that fact to your advantage. However, this would
>> be highly experimental and might have some caveats to consider.
>>
>> You can configure a constraint on a table with
>> connector.tableOperations().addConstraint(...)
>>
>>
>> On Sun, Dec 6, 2015 at 10:49 PM Thai Ngo  wrote:
>>
>>> Christopher,
>>>
>>> This is interesting! Could you please give me more details about this?
>>>
>>> Thanks,
>>> Thai
>>>
>>> On Thu, Dec 3, 2015 at 12:17 PM, Christopher 
>>> wrote:
>>>
>>>> You could also implement a constraint to notify an external system when
>>>> a row is updated.
>>>>
>>>> On Wed, Dec 2, 2015, 22:54 Josh Elser  wrote:
>>>>
>>>>> oops :)
>>>>>
>>>>> [1] http://fluo.io/
>>>>>
>>>>> Josh Elser wrote:
>>>>> > Hi Thai,
>>>>> >
>>>>> > There is no out-of-the-box feature provided with Accumulo that does
>>>>> what
>>>>> > you're asking for. Accumulo doesn't provide any functionality to push
>>>>> > notifications to other systems. You could potentially maintain other
>>>>> > tables/columns in which you maintain the last time a row was updated,
>>>>> > but the onus is on your "other services" to read the table to find
>>>>> out
>>>>> > when a change occurred (which is probably not scalable at "real
>>>>> time").
>>>>> >
>>>>> > There are other systems you could likely leverage to solve this,
>>>>> > depending on the durability and scalability that your application
>>>>> needs.
>>>>> >
>>>>> > For a system "close" to Accumulo, you could take a look at Fluo [1]
>>>>> > which is an implementation of Google's "Percolator" system. This is a
>>>>> > system based on throughput rather than low-latency, so it may not be
>>>>> a
>>>>> > good fit for your needs. There are probably other systems in the
>>>>> Apache
> >>>>> > ecosystem (Kafka, Storm, Flink or Spark Streaming maybe?) that may be
> >>>>> > helpful to your problem. I'm not enough of an expert on these to recommend one
> >>>>> > (nor do I think I understand your entire architecture well enough).
>>>>> >
>>>>> > Thai Ngo wrote:
>>>>> >> Hi list,
>>>>> >>
>>>>> >> I have a use-case when existing rows in a table will be updated by
>>>>> an
>>>>> >> internal service. Data in a row of this table is composed of 2
>>>>> parts:
>>>>> >> 1st part - immutable and the 2nd one - will be updated (filled in) a
>>>>> >> little later.
>>>>> >>
>>>>> >> Currently, I have a need of knowing when and which rows will be
>>>>> updated
>>>>> >> in the table so that other services will be wisely start consuming
>>>>> the
>>>>> >> data. It will make more sense when I need to consume the data in
>>>>> near
>>>>> >> realtime. So developing a notification function or simpler - a
>>>>> trigger
>>>>> >> is what I really want to do now.
>>>>> >>
>>>>> >> I am curious to know if someone has done similar job or there are
>>>>> >> features or APIs or best practices available for Accumulo so far.
>>>>> I'm
>>>>> >> thinking of letting the internal service which updates the data
>>>>> notify
>>>>> >> us whenever it updates the data.
>>>>> >>
>>>>> >> What do you think?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Thai
>>>>>
>>>>
>>>
>


Re: Accismus & Fluo

2016-01-21 Thread Christopher
Same thing. Accismus was renamed to Fluo.

On Thu, Jan 21, 2016, 07:30 mohit.kaushik  wrote:

> What is Accismus? What is the relation between Accismus and Fluo? The same thing?
>
>
>
> On 01/20/2016 08:39 AM, Thai Ngo wrote:
>
> That's awesome.
> +1
>
> On Wed, Jan 20, 2016 at 12:53 AM, Josh Elser  wrote:
>
>> +1
>>
>> William Slacum wrote:
>>
>>> Cool beans, Keith!
>>>
>>> On Tue, Jan 19, 2016 at 11:30 AM, Keith Turner wrote:
>>>
>>> The Fluo project is happy to announce a 1.0.0-beta-2[1] release
>>> which is the
>>> third release of Fluo and likely the final release before 1.0.0. Many
>>> improvements in this release were driven by the creation of two new
>>> Fluo
>>> related projects:
>>>
>>>* Fluo recipes[2] is a collection of common development patterns
>>> designed to
>>>  make Fluo application development easier. Creating Fluo recipes
>>> required
>>>  new Fluo functionality and updates to the Fluo API. The first
>>> release of
>>>  Fluo recipes has been made and is available in Maven Central.
>>>
>>>* WebIndex[3] is an example Fluo application that indexes links
>>> to web pages
>>>  in multiple ways. Webindex enabled the testing of Fluo on real
>>> data at
>>>  scale.  It also inspired improvements to Fluo to allow it to
>>> work better
>>>  with Apache Spark.
>>>
>>> Fluo is now at a point where its two cluster test suites,
>>> Webindex[3] and
>>> Stress[4], are running well for long periods on Amazon EC2. We
>>> invite early
>>> adopters to try out the beta-2 release and help flush out problems
>>> before
>>> 1.0.0.
>>>
>>> [1]: http://fluo.io/1.0.0-beta-2-release/
>>> [2]: https://github.com/fluo-io/fluo-recipes
>>> [3]: https://github.com/fluo-io/webindex
>>> [4]: https://github.com/fluo-io/fluo-stress
>>>
>>>
>>>
>


Re: Kerberos Client Configuration

2016-02-02 Thread Christopher
The third property was "kerberos.server.realm". It looks like it was
removed from the docs perhaps because the property doesn't exist, and we
just assume that the default realm in the client and the servers are the
same.

In any case, this is a documentation bug, at the very least. I think you
can ignore the missing "third property".
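
In other words, a minimal client configuration for Kerberos needs only the two
properties quoted below (the realm comes from your krb5 configuration rather
than an Accumulo client property), e.g. in client.conf:

    instance.rpc.sasl.enabled=true
    kerberos.server.primary=accumulo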

On Tue, Feb 2, 2016 at 8:50 PM Tristen Georgiou  wrote:

> Hi all,
>
> Searched through the mail lists and couldn't find an answer, so hopefully
> this hasn't been asked already, but under this section of the documentation:
> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_configuration_3
>
> It says:
>
> Three items need to be set to enable access to Accumulo:
>
>- instance.rpc.sasl.enabled=true
>- kerberos.server.primary=accumulo
>
> The second and third properties must match the configuration of the
> accumulo servers; this is required to set up the SASL transport.
>
> Does anyone know what the 3rd item is?
>
> Thanks,
>
> Tristen
>


Re: Accumulo 1.6.0 Import Fails From 1.5.0 Data

2016-02-04 Thread Christopher
Make sure you are using "importtable" and not "importdirectory". That error
you are seeing is a bulk importing error (importdirectory).
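
For example, from the shell (a sketch; the directory is hypothetical and
should be wherever the exported table files were distcp'd to):

    user@myinstance> importtable mynewtable /exports/mytable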

On Thu, Feb 4, 2016 at 1:27 PM Donald Mackert  wrote:

> Hello,
>
>   I am using the Accumulo 1.6.0 Cloudera distribution.   I exported
> three tables form Accumulo 1.5.0 and I am attempting to import the data
>   I am using the Accumulo 1.6.0 Cloudera distribution. I exported
> three tables from Accumulo 1.5.0 and I am attempting to import the data
> into Accumulo 1.6.0.
>
>   Everything appears to be working except the import of the
> exportMetadata.zip, which fails with the following error: "exportMetadata.zip does not
> have a valid extension, ignoring"
>
>   The data is in the table but the required metadata is not imported.
>
> Don
>


[ANNOUNCE] Apache Accumulo 1.6.5

2016-02-17 Thread Christopher
The Apache Accumulo project is pleased to announce its 1.6.5 release.

Version 1.6.5 is the most recent bug-fix release in its 1.6.x release line.
This version includes several bug fixes since 1.6.4. Existing users of the
1.6.x release line are encouraged to upgrade immediately with confidence.

The Apache Accumulo sorted, distributed key/value store is a robust,
scalable, high performance data storage system that features cell-based
access control and customizable server-side processing. It is based on
Google's BigTable design and is built on top of Apache Hadoop, Apache
ZooKeeper, and Apache Thrift.

This release is available at http://accumulo.apache.org/downloads/ and
release notes at http://accumulo.apache.org/release_notes/1.6.5.html.

- The Apache Accumulo Team


Re: Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Christopher
MiniAccumuloConfig has a method called "setNativeLibPaths(String...
nativePathItems)".
You should call that method with the absolute path for your compiled native
map shared library file (.so), before you start Mini.
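
Something like the following (a rough sketch, untested; the .so path is
hypothetical, and per Josh's note below the method may only be available on
MiniAccumuloConfigImpl in some 1.6.x releases):

    MiniAccumuloConfig config =
        new MiniAccumuloConfig(new File("/tmp/mini"), "rootPassword");
    config.setNativeLibPaths("/path/to/native/map/libaccumulo.so");
    MiniAccumuloCluster mini = new MiniAccumuloCluster(config);
    mini.start();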

On Tue, Feb 23, 2016 at 2:03 PM Josh Elser  wrote:

> MiniAccumuloCluster spawns its own processes, though. Calling
> NativeMap.isLoaded() in your test JVM isn't proving anything.
>
> That's why you need to call these methods on MAC, you would need to
> check the TabletServer*.log file(s), and make sure that its
> configuration is set up properly to find the .so.
>
> Does that make sense? Did I misinterpret you?
>
> Dan Blum wrote:
> > I'll see what I can do, but there's no simple way to pull out something
> > small we can share (and it would have to be a gradle project).
> >
> > I confirmed that the path is not the immediate issue by adding an
> explicit
> > call to NativeMap.isLoaded() at the start of my test - that produces
> logging
> > from NativeMap saying it can't find the library, which is what I expect.
> > Without this call NativeMap still logs nothing so the setting that should
> > cause it to be referenced is getting overridden somewhere. Calling
> > InstanceOperations.getSiteConfiguration and getSystemConfiguration shows
> > that the native maps are enabled, however.
> >
> > -Original Message-
> > From: Josh Elser [mailto:josh.el...@gmail.com]
> > Sent: Tuesday, February 23, 2016 12:56 PM
> > To: user@accumulo.apache.org
> > Subject: Re: Unable to get Mini to use native maps - 1.6.2
> >
> > Well, I'm near positive that 1.6.2 had native maps working, so there
> > must be something unexpected happening :). MAC should be very close to
> > what a real standalone instance is doing -- if you have the ability to
> > share some end-to-end project with where you are seeing this, that'd be
> > extremely helpful (e.g. a Maven project that we can just run would be
> > superb).
> >
> > Dan Blum wrote:
> >> I'll take a look but I don't think the path is the problem - NativeMap
> >> should try to load the library regardless of whether this path is set
> and
> >> will log if it can't find it. This isn't happening.
> >>
> >> -Original Message-
> >> From: Josh Elser [mailto:josh.el...@gmail.com]
> >> Sent: Tuesday, February 23, 2016 12:27 PM
> >> To: user@accumulo.apache.org
> >> Subject: Re: Unable to get Mini to use native maps - 1.6.2
> >>
> >> Hi Dan,
> >>
> >> I'm seeing in our internal integration tests that we have some
> >> configuration happening which (at least, intends to) configure the
> >> native maps for the minicluster.
> >>
> >> If you're not familiar, the MiniAccumuloConfig and MiniAccumuloCluster
> >> classes are thin wrappers around MiniAccumuloConfigImpl and
> >> MiniAccumuloClusterImpl. There is a setNativeLibPaths method on
> >> MiniAccumuloConfigImpl which you can use to provide the path to the
> >> native library shared object (.so). You will probably have to switch
> >> from MiniAccumuloConfig/MiniAccumuloCluster to
> >> MiniAccumuloConfigImpl/MiniAccumuloClusterImpl to use the "hidden"
> > methods.
> >> You could also look at MiniClusterHarness.java in>=1.7 if you want a
> >> concrete example of how we initialize things for our tests.
> >>
> >> - Josh
> >>
> >> Dan Blum wrote:
> >>> In order to test to make sure we don't have more code that needs a
> >>> workaround for https://issues.apache.org/jira/browse/ACCUMULO-4148 I
> am
> >>> trying again to enable the native maps for Mini, which we use for
> > testing.
> >>> I set tserver.memory.maps.native.enabled to true in the site XML, and
> > this
> >>> is getting picked up since I see this in the Mini logs:
> >>>
> >>> [server.Accumulo] INFO : tserver.memory.maps.native.enabled = true
> >>>
> >>> However, NativeMap should log something when it tries to load the
> > library,
> >>> whether it succeeds or fails, but it logs nothing. The obvious
> conclusion
> >> is
> >>> that something about how MiniAccumuloCluster starts means that this
> >> setting
> >>> is ignored or overridden, but I am not finding it. (I see the mergeProp
> >> call
> >>> in MiniAccumuloConfigImpl.initialize which will set
> >> TSERV_NATIVEMAP_ENABLED
> >>> to false, but that should only set it if it's not already in the
> >> properties,
> >>> which it should be, and as far as I can tell the log message above is
> >> issued
> >>> after this.)
> >>>
> >
>


Re: Unable to get Mini to use native maps - 1.6.2

2016-02-23 Thread Christopher
Looking at the NativeMap, it looks like it will always log some message at
the INFO level if it successfully loaded the native maps, or at the ERROR
level if it failed to do so (with some extra DEBUG messages while it
searches the path).

I thought maybe there was a class loading race condition where
NativeMap.isLoaded() returns false while it's still trying to load... that
might still be a possibility (I'm not sure if this can happen with static
initializer blocks?), but if it were, you'd still see the log messages
about loading or not.

I can't see your code, so I don't know what's wrong, but something like the
following should work fine:

1. MiniAccumuloConfig config = new MiniAccumuloConfig(new
File("/path/to/miniDir"), "rootPassword");
2. HashMap<String,String> map = new HashMap<String,String>();
3. map.put(Property.TSERV_NATIVEMAP_ENABLED.getKey(), "true");
4. config.setSiteConfig(map);
5. MiniAccumuloCluster mini = new MiniAccumuloCluster(config);


On Tue, Feb 23, 2016 at 2:21 PM Dan Blum  wrote:

> In fact, we are calling that (in Groovy which is why I missed it before,
> not being that familiar with Groovy). I verified that the path is correct –
> doesn’t help.
>
>
>
> *From:* Christopher [mailto:ctubb...@apache.org]
> *Sent:* Tuesday, February 23, 2016 2:06 PM
>
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: Unable to get Mini to use native maps - 1.6.2
>
>
>
> MiniAccumuloConfig has a method, called "setNativeLibPaths(String...
> nativePathItems)".
>
> You should call that method with the absolute path for your compiled
> native map shared library file (.so), before you start Mini.
>
>
>
> On Tue, Feb 23, 2016 at 2:03 PM Josh Elser  wrote:
>
> MiniAccumuloCluster spawns its own processes, though. Calling
> NativeMap.isLoaded() in your test JVM isn't proving anything.
>
> That's why you need to call these methods on MAC, you would need to
> check the TabletServer*.log file(s), and make sure that its
> configuration is set up properly to find the .so.
>
> Does that make sense? Did I misinterpret you?
>
> Dan Blum wrote:
> > I'll see what I can do, but there's no simple way to pull out something
> > small we can share (and it would have to be a gradle project).
> >
> > I confirmed that the path is not the immediate issue by adding an
> explicit
> > call to NativeMap.isLoaded() at the start of my test - that produces
> logging
> > from NativeMap saying it can't find the library, which is what I expect.
> > Without this call NativeMap still logs nothing so the setting that should
> > cause it to be referenced is getting overridden somewhere. Calling
> > InstanceOperations.getSiteConfiguration and getSystemConfiguration shows
> > that the native maps are enabled, however.
> >
> > -Original Message-
> > From: Josh Elser [mailto:josh.el...@gmail.com]
> > Sent: Tuesday, February 23, 2016 12:56 PM
> > To: user@accumulo.apache.org
> > Subject: Re: Unable to get Mini to use native maps - 1.6.2
> >
> > Well, I'm near positive that 1.6.2 had native maps working, so there
> > must be something unexpected happening :). MAC should be very close to
> > what a real standalone instance is doing -- if you have the ability to
> > share some end-to-end project with where you are seeing this, that'd be
> > extremely helpful (e.g. a Maven project that we can just run would be
> > superb).
> >
> > Dan Blum wrote:
> >> I'll take a look but I don't think the path is the problem - NativeMap
> >> should try to load the library regardless of whether this path is set
> and
> >> will log if it can't find it. This isn't happening.
> >>
> >> -Original Message-
> >> From: Josh Elser [mailto:josh.el...@gmail.com]
> >> Sent: Tuesday, February 23, 2016 12:27 PM
> >> To: user@accumulo.apache.org
> >> Subject: Re: Unable to get Mini to use native maps - 1.6.2
> >>
> >> Hi Dan,
> >>
> >> I'm seeing in our internal integration tests that we have some
> >> configuration happening which (at least, intends to) configure the
> >> native maps for the minicluster.
> >>
> >> If you're not familiar, the MiniAccumuloConfig and MiniAccumuloCluster
> >> classes are thin wrappers around MiniAccumuloConfigImpl and
> >> MiniAccumuloClusterImpl. There is a setNativeLibPaths method on
> >> MiniAccumuloConfigImpl which you can use to provide the path to the
> >> native library shared object (.so). You will probably have to switch
> >> from MiniAccumuloConfig/MiniAccumuloClu

[ANNOUNCE] Apache Accumulo 1.7.1

2016-02-26 Thread Christopher
The Accumulo team is proud to announce the release of Accumulo version
1.7.1!

This release contains over 150 bugfixes and improvements over 1.7.0, and is
backwards-compatible with 1.7.0. Existing users of 1.7.0 are encouraged to
upgrade immediately.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The full release notes can be viewed at:
https://accumulo.apache.org/release_notes/1.7.1.html

The Apache Accumulo™ sorted, distributed key/value store is a robust,
scalable, high performance data storage system that features cell-based
access control and customizable server-side processing. It is based on
Google's BigTable design and is built on top of Apache Hadoop, Apache
ZooKeeper, and Apache Thrift.

--
The Apache Accumulo Team


Re: 1.6 Javadoc missing classes

2016-03-04 Thread Christopher
Sure, we can include that. Are there any other classes which would be good
to have javadocs for which aren't public API?

On Fri, Mar 4, 2016 at 4:03 PM Josh Elser  wrote:

> Good catch, Dan. Thanks for letting us know. Moving this one over to the
> dev list to discuss further.
>
> Christopher, looks like it might also be good to include iterator
> javadocs despite not being in public API (interfaces, and o.a.a.c.i.user?).
>
>  Original Message 
> Subject: 1.6 Javadoc missing classes
> Date: Fri, 4 Mar 2016 15:59:26 -0500
> From: Dan Blum 
> Reply-To: user@accumulo.apache.org
> To: 
>
> A lot of classes seem to have gone missing from
> http://accumulo.apache.org/1.6/apidocs/ - SortedKeyValueIterator would be
> an
> obvious example.
>
>


Re: 1.6 Javadoc missing classes

2016-03-04 Thread Christopher
The tracing APIs vary significantly from version to version. That puts a
lot of extra effort on the person updating the included packages. How
important are those now that we're transitioning to an external dependency?

On Fri, Mar 4, 2016 at 5:17 PM Josh Elser  wrote:

> Maybe the distributed tracing APIs?
>
> Christopher wrote:
> > Sure, we can include that. Are there any other classes which would be
> > good to have javadocs for which aren't public API?
> >
> > On Fri, Mar 4, 2016 at 4:03 PM Josh Elser  wrote:
> >
> > Good catch, Dan. Thanks for letting us know. Moving this one over to
> the
> > dev list to discuss further.
> >
> > Christopher, looks like it might also be good to include iterator
> > javadocs despite not being in public API (interfaces, and
> > o.a.a.c.i.user?).
> >
> >  Original Message 
> > Subject: 1.6 Javadoc missing classes
> > Date: Fri, 4 Mar 2016 15:59:26 -0500
> > From: Dan Blum 
> > Reply-To: user@accumulo.apache.org
> > To: user@accumulo.apache.org
> >
> > A lot of classes seem to have gone missing from
> > http://accumulo.apache.org/1.6/apidocs/ - SortedKeyValueIterator
> > would be an
> > obvious example.
> >
>


Re: Class path for shell commands

2016-03-07 Thread Christopher
Try $CLASSPATH instead of $CLASS_PATH
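
That is, something like the following before launching the shell (paths are
hypothetical; note the later reply in this thread about versions before 1.7.0
overriding a user-specified CLASSPATH):

    $ export CLASSPATH="$CLASSPATH:/path/to/my-formatter.jar"
    $ accumulo shell -u root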

On Mon, Mar 7, 2016 at 4:11 PM Sravankumar Reddy Javaji (BLOOMBERG/ 731
LEX)  wrote:

> Hello Everyone,
>
> I built custom formatter by extending DefaultFormatter class. When I am
> trying to use DefaultFormatter, I am getting "Class not found" exception.
> Below are the steps I followed:
>
> From local machine:
>
> $ export CLASS_PATH=$CLASS_PATH:/*.jar
>
> $ CONNECTED to accumulo shell server
>
> $ formatter -f fully_qualified_classname -t table_name
>
> $ scan -t table
> ERROR: Class not found
>
>
> Could someone please let me know any other way to set class path for
> shell? Also is there anyway to debug this issue like checking current
> classpaths that shell is using?
>
> Thanks for your time.
>
> -
> Regards,
> Sravan
>


Re: Class path for shell commands

2016-03-07 Thread Christopher
Are you using a 1.6.x or earlier version? I think we fixed a bug in 1.7.0
where the user-specified CLASSPATH was overridden.


On Mon, Mar 7, 2016 at 4:23 PM Sravankumar Reddy Javaji (BLOOMBERG/ 731
LEX)  wrote:

> Sorry, that's a typo. I used CLASSPATH; it's still not working.
>
> From: user@accumulo.apache.org At: Mar 7 2016 16:23:04
> To: Sravankumar Reddy Javaji (BLOOMBERG/ 731 LEX) ,
> user@accumulo.apache.org
> Subject: Re: Class path for shell commands
>
> Try $CLASSPATH instead of $CLASS_PATH
>
> On Mon, Mar 7, 2016 at 4:11 PM Sravankumar Reddy Javaji (BLOOMBERG/ 731
> LEX)  wrote:
>
>> Hello Everyone,
>>
>> I built custom formatter by extending DefaultFormatter class. When I am
>> trying to use DefaultFormatter, I am getting "Class not found" exception.
>> Below are the steps I followed:
>>
>> From local machine:
>>
>> $ export CLASS_PATH=$CLASS_PATH:/*.jar
>>
>> $ CONNECTED to accumulo shell server
>>
>> $ formatter -f fully_qualified_classname -t table_name
>>
>> $ scan -t table
>> ERROR: Class not found
>>
>>
>> Could someone please let me know any other way to set class path for
>> shell? Also is there anyway to debug this issue like checking current
>> classpaths that shell is using?
>>
>> Thanks for your time.
>>
>> -
>> Regards,
>> Sravan
>>
>
>


Javadoc hosting service

2016-04-07 Thread Christopher
Found a very cool javadoc hosting service at:
http://www.javadoc.io

Example:
http://www.javadoc.io/doc/org.apache.accumulo/accumulo-core/1.7.1

Looks like it works from maven javadoc artifacts from Maven Central. Very
very cool. Also, great for linking between javadocs.

The first time you access a new artifact, it could take a few minutes to
download, but after that, it's very easy to switch between different
artifacts in the same group, and different versions for the same artifact.

Could be a good way to link to the full javadocs, beyond just our public
API (which we're not publishing on our site).


Re: Fwd: why compaction failure on one table brings other tables offline, how to recover

2016-04-11 Thread Christopher
You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160

On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel  wrote:

> There really aren't a lot of log messages that can explain why tablets for
> other tables went offline except the following:
>
> 2016-04-11 13:32:18,258
> [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
> tserver:instance-accumulo-3 Assignment for 2<< has been running for at
> least 973455566ms
> java.lang.Exception: Assignment of 2<<
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
> Source)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown
> Source)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown
> Source)
> at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown
> Source)
> at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
> at
> org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
> at
> org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
> at
> org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at
> org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
> at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Unknown Source)
>
> Table 2<< here doesn't have the issue with minc failing and so shouldn’t
> be offline.  These messages happened on a restart of a tserver if that
> offers any clues.  All the nodes were rebooted at that time due to a power
> failure.  I'm assuming that its tablet went offline soon after this
> message first appeared in the logs.
>
> Other tidbit of note is that the Accumulo operates for hours/days without
> taking the tablets offline even though minc is failing and it's the crash
> of a tserver due to OutOfMemory situation in one case that seems to have
> taken the tablet offline.  Is it safe to assume that other tservers are not
> able to pick up the tablets that are failing minc from a crashed tserver?
>
> -Original Message-
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Friday, April 08, 2016 10:52 AM
> To: user@accumulo.apache.org
> Subject: Re: Fwd: why compaction failure on one table brings other tables
> offline, how to recover
>
>
>
> Billie Rinaldi wrote:
> > *From:* Jayesh Patel
> > *Sent:* Thursday, April 07, 2016 4:36 PM
> > *To:* 'user@accumulo.apache.org'
> > *Subject:* RE: why compaction failure on one table brings other tables
> > offline, how to recover
> >
> > __ __
> >
> > I have a 3 node Accumulo 1.7 cluster with a few small tables (few MB
> > in size at most).
> >
> > __ __
> >
> > I had one of those table fail minc because I had configured a
> > SummingCombiner with FIXEDLEN but had smaller values:
> >
> > MinC failed (trying to convert to long, but byte array isn't long
> > enough, wanted 8 found 1) to create
> > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bc
> > s.rf_tmp
> > retrying ...
> >
> > __ __
> >
> > I have learned since to set the ‘lossy’ parameter to true to avoid this.
> > *Why is the default value for it false* if it can cause catastrophic
> > failure that you’ll read about ahead.
>
> I'm pretty sure I told you this on StackOverflow, but if you're not
> writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
>
> > However, this brought other the tablets for other tables offline
> > without any apparent errors or warnings. *Can someone please explain
> > why?*
>
> Can you provide logs? We are not wizards :)
>
> > In order to recover from this, I did a ‘droptable’ from the shell on
> > the affected tables, but they all got stuck in the ‘DELETING’ state.
> > I was able to finally delete them using zkcli ‘rmr’ command. *Is there
> > a better way?*
>
> Again, not sure why they would have gotten stuck in the deleting phase
> without more logs/context (nor how far along in the deletion process they
> got). It's possible that there were still entries in the accumulo.metadata
> table.
>
> > I’m assuming there is a more proper way because when I created the
> > tables again (with the same name), they went back to having a single
> > offline tablet right away. *Is this because there are “traces” of the
> > old table left behind that affect the new table even though the new
> > table has a

Re: Fwd: why compaction failure on one table brings other tables offline, how to recover

2016-04-11 Thread Christopher
I just meant that if there is a problem loading one tablet, other tablets
may stay indefinitely in an offline state due to ACCUMULO-4160, however it
got to that point.

On Mon, Apr 11, 2016 at 6:35 PM Josh Elser  wrote:

> Do you mean that after an OOME, the tserver process didn't die and got
> into this bad state with an permanently offline tablet?
>
> Christopher wrote:
> > You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160
> >
> > On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel  wrote:
> >
> > There really aren't a lot of log messages that can explain why
> > tablets for other tables went offline except the following:
> >
> > 2016-04-11 13:32:18,258
> > [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
> > tserver:instance-accumulo-3 Assignment for 2<< has been running for
> > at least 973455566ms
> > java.lang.Exception: Assignment of 2<<
> >  at sun.misc.Unsafe.park(Native Method)
> >  at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> >  at
> >
>  
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown
> > Source)
> >  at
> >
>  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown
> > Source)
> >  at
> >
>  java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown
> Source)
> >  at
> > java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown
> Source)
> >  at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
> >  at
> >
>  
> org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
> >  at
> >
>  org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
> >  at
> >
>  
> org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
> >  at
> >
>  org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> >  at
> >
>  
> org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
> >  at
> > org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
> >  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> > Source)
> >  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> > Source)
> >  at
> >
>  org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> >  at java.lang.Thread.run(Unknown Source)
> >
> > Table 2<< here doesn't have the issue with minc failing and so
> > shouldn’t be offline.  These messages happened on a restart of a
> > tserver if that offers any clues.  All the nodes were rebooted at
> > that time due to a power failure.  I'm assuming that it's tablet
> > went offline soon after this message first appeared in the logs.
> >
> > Other tidbit of note is that the Accumulo operates for hours/days
> > without taking the tablets offline even though minc is failing and
> > it's the crash of a tserver due to OutOfMemory situation in one case
> > that seems to have taken the tablet offline.  Is it safe to assume
> > that other tservers are not able to pick up the tablets that are
> > failing minc from a crashed tserver?
> >
> > -Original Message-
> > From: Josh Elser [mailto:josh.el...@gmail.com]
> > Sent: Friday, April 08, 2016 10:52 AM
> > To: user@accumulo.apache.org
> > Subject: Re: Fwd: why compaction failure on one table brings other
> > tables offline, how to recover
> >
> >
> >
> > Billie Rinaldi wrote:
> >  > *From:* Jayesh Patel
> >  > *Sent:* Thursday, April 07, 2016 4:36 PM
> >  > *To:* 'user@accumulo.apache.org'
> >  > *Subject:* RE: why compaction failure on one table brings other
> > tables
> >  > offline, how to recover
> >  >
> >  > __ __
> >  >
> >  > I have a 

Fluo Proposal submitted to incubator

2016-04-19 Thread Christopher
Just a heads-up, the Fluo developers have submitted a proposal to the
Apache Incubator (gene...@incubator.apache.org), and it may be of interest
to additional Accumulo users:

https://wiki.apache.org/incubator/FluoProposal


Re: Reuse Accumulo lexicographical ordering

2016-05-10 Thread Christopher
You can also use Guava's UnsignedBytes.lexicographicalComparator().
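
For example, a small sketch (this yields the same unsigned, byte-wise ordering
Accumulo uses for the components of its keys):

    import java.util.Comparator;
    import com.google.common.primitives.UnsignedBytes;

    Comparator<byte[]> cmp = UnsignedBytes.lexicographicalComparator();
    int c = cmp.compare("row1".getBytes(), "row2".getBytes()); // negative: "row1" sorts first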

On Tue, May 10, 2016 at 10:40 AM Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> Hi Josh,
>
> Thanks for the answer and sorry for my question not being clear. I need
> the  same comparator that accumulo is using for arrays of bytes and I think
> your suggestion pointed me to the right class: I can use Hadoop
> WritableComparable.compareBytes static method to obtain the lexicographic
> order of binary data that is used by Accumulo.
>
> Thanks for the help,
> Mario
>
> On Tue, May 10, 2016 at 4:22 PM, Josh Elser  wrote:
>
>> Hi Mario,
>>
>> I'm not sure I 100% understand your question. Are you asking about the
>> code which sorts Accumulo Keys?
>>
>> If so, Key implements the Comparable interface (the `compareTo(Key)`
>> method). You might be able to make use of the `compareTo(Key, PartialKey)`
>> method as well. You can use this with standard sorting implementations
>> (e.g. Collections.sort(..) or any SortedMap implementation).
>>
>> - Josh
>>
>> Mario Pastorelli wrote:
>>
>>> Hi,
>>> I would like to reuse the ordering of byte arrays that Accumulo uses for
>>> the keys. Is it exposed to the users? Where can I find it?
>>>
>>> Thanks,
>>> Mario
>>>
>>> --
>>> Mario Pastorelli| TERALYTICS
>>>
>>> *software engineer*
>>>
>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>> phone:+41794381682
>>> email: mario.pastore...@teralytics.ch
>>> 
>>> www.teralytics.net 
>>>
>>>
>>>
>
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
>


Re: I have a problem about change HDFS address

2016-05-25 Thread Christopher
I believe you need to configure instance.volumes.replacements
http://accumulo.apache.org/1.7/accumulo_user_manual#_instance_volumes_replacements
to map your metadata from the old location to the new one.
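
That is, something like the following in accumulo-site.xml (hostnames are
hypothetical; the replacements value is a comma-separated list of
"old-URI new-URI" pairs):

    instance.volumes=hdfs://new-namenode:8020/accumulo
    instance.volumes.replacements=hdfs://old-namenode:8020/accumulo hdfs://new-namenode:8020/accumulo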

On Wed, May 25, 2016 at 11:23 AM Keith Turner  wrote:

> Do you seen any errors in the Accumulo master log?
>
> On Wed, May 25, 2016 at 11:17 AM, Lu Qin  wrote:
>
>>
>> I am only using the new HDFS. I changed instance.volumes to the new one, and set
>> instance.volumes.replacements.
>> When Accumulo starts, I exec ./bin/accumulo
>> org.apache.accumulo.server.util.FindOfflineTablets, and it shows the
>> accumulo.root table UNASSIGNED
>>
>>
>> On May 25, 2016, at 22:04, Keith Turner wrote:
>>
>> Accumulo stores data in HDFS and Zookeeper.   Are you using new zookeeper
>> servers?  If so, did you copy zookeepers data?
>>
>> On Wed, May 25, 2016 at 4:08 AM, Lu Qin  wrote:
>>
>>> I have Accumulo 1.7.1 working with an old HDFS 2.6 cluster. Now I have a new HDFS
>>> 2.6 cluster, and I changed the Accumulo volume to the new one.
>>>
>>> I used distcp to move the data from the old HDFS to the new HDFS, and started
>>> Accumulo up.
>>>
>>> Now the 'Accumulo Overview' shows Tables is 0 and Tablets is 0 with a
>>> red background, but in 'Table Status' I can see all the tables I have.
>>> I use bin/accumulo shell and the tables command; it also shows all tables, but
>>> I cannot scan any of them.
>>>
>>> How can I resolve it? Thanks
>>>
>>
>>
>>
>


Re: [ANNOUNCE] Apache Accumulo 1.7.2 Released

2016-06-23 Thread Christopher
Minor correction: this release is version 1.7.2 :)

On Thu, Jun 23, 2016 at 11:47 AM Mike Drob  wrote:

> The Accumulo team is proud to announce the release of Accumulo version
> 1.7.1!
>
> This release contains over 30 bugfixes and improvements over 1.7.1, and is
> backwards-compatible with 1.7.0 and 1.7.1. Existing users of 1.7.1 are
> encouraged to
> upgrade immediately.
>
> This version is now available in Maven Central, and at:
> https://accumulo.apache.org/downloads/
>
> The full release notes can be viewed at:
> https://accumulo.apache.org/release_notes/1.7.2.html
>
> The Apache Accumulo™ sorted, distributed key/value store is a robust,
> scalable, high performance data storage system that features cell-based
> access control and customizable server-side processing. It is based on
> Google's BigTable design and is built on top of Apache Hadoop, Apache
> ZooKeeper, and Apache Thrift.
>
> --
> The Apache Accumulo Team
>


Re: default for tserver.total.mutation.queue.max increased from 1M to 50M in 1.7

2016-07-07 Thread Christopher
The change was introduced in
https://issues.apache.org/jira/browse/ACCUMULO-1950, and it's an entirely
new property. The old property was a per-session property. The new one is
per-tserver, and is a better strategy, because it reduces the risk of
multiple writers exhausting tserver memory, while still giving
the user control over how frequently flushes/syncs occur.
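
For reference, the property is set like any other tserver property, e.g. in
accumulo-site.xml (50M is the 1.7 default mentioned above):

    tserver.total.mutation.queue.max=50M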

On Thu, Jul 7, 2016 at 10:32 AM Jeff Kubina  wrote:

> I noticed that the default value for tserver.total.mutation.queue.max in
> 1.7 is 50M but in 1.6 it is 1M (tserver.mutation.queue.max). Is this
> increase to compensate for the performance hit of moving the WALs to the
> HDFS or some other factor?
>
> Is there a way to compute the number of times the buffer is flushed to
> calculate how this affects performance?
>
>
> --
> Jeff Kubina
>
>
>


Re: Making a RowCounterIterator

2016-07-15 Thread Christopher
Dylan, that would make a great contribution to Accumulo :)

On Fri, Jul 15, 2016, 16:28 Dylan Hutchison 
wrote:

> Hi Mario,
>   You can reuse or adapt the RowCountingIterator
> 
> code here.
>
> The main trick is understanding how each tablet needs to emit a row within
> its seek range.  An iterator should not emit an entry whose row lies
> outside the seek range of the tablet the iterator is running on.  Instead,
> you can emit *partial sums* whose row stays within the seek range.  Each
> tablet server communicates one partial sum.  Then sum the partial sums at
> the client.  (I am probably mixing up tablet vs. tablet server.)
>
> Cheers, Dylan
>
>
> On Fri, Jul 15, 2016 at 1:02 PM, William Slacum  wrote:
>
>> The iterator in the gist also counts cells/entries/KV pairs, not unique
>> rows. You'll want to have some way to skip to the next row value if you
>> want the count to be reflective of the number of rows being read.
>>
>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker 
>> wrote:
>>
>>> My read is that you're mistaking the sequence of calls Accumulo will be
>>> making to your iterator.  The sequence isn't quite the same as a Java
>>> iterator (initially positioned "before" the first element), and is more
>>> like a C++ iterator:
>>>
>>> 0. Accumulo calls seek(...)
>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
>>> retrieve the data. You return a key indicating 0 columns seen (since next()
>>> hasn't yet been called)
>>> 3. First datum done, Accumulo calls next()
>>> ...
>>>
>>> I imagine that if you pull the second item out of your scan result,
>>> it'll have the number you expect.  Alternately, you might consider
>>> performing the count computation during an override of the seek(...)
>>> method, instead of in the next(...) method.
>>>
>>> --
>>> Shawn Walker
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>> mario.pastore...@teralytics.ch> wrote:
>>>
 I'm trying to create a RowCounterIterator that counts all the rows and
 returns only one key-value with the counter inside. The problem is that I
 can't get it work. The Scala code is available in the gist
 
 together with some pseudo-code of a test. The problem is that if I add an
 entry to my table, this iterator will return 0 instead of 1 and apparently
 the reason is that super.hasTop() is always false. I've tried without the
 iterator and the scanner returns 1 element. Any idea of what I'm doing
 wrong here? Is WrappingIterator the right class to extend for this kind of
 behaviour?

 Thanks,
 Mario

 --
 Mario Pastorelli | TERALYTICS

 *software engineer*

 Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
 phone: +41794381682
 email: mario.pastore...@teralytics.ch
 www.teralytics.net


>>>
>>>
>>
>


Re: Making a RowCounterIterator

2016-07-15 Thread Christopher
It'd be more efficient to use the FirstEntryInRowIterator to just grab one
each, rather than the WholeRowIterator which could use up a lot of memory
unnecessarily.

On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> I'm actually using this after a wholerowiterator, which is used to filter
> rows with the same rowId.
>
> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum 
> wrote:
>
>> The iterator in the gist also counts cells/entries/KV pairs, not unique
>> rows. You'll want to have some way to skip to the next row value if you
>> want the count to be reflective of the number of rows being read.
>>
>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker 
>> wrote:
>>
>>> My read is that you're mistaking the sequence of calls Accumulo will be
>>> making to your iterator.  The sequence isn't quite the same as a Java
>>> iterator (initially positioned "before" the first element), and is more
>>> like a C++ iterator:
>>>
>>> 0. Accumulo calls seek(...)
>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
>>> retrieve the data. You return a key indicating 0 columns seen (since next()
>>> hasn't yet been called)
>>> 3. First datum done, Accumulo calls next()
>>> ...
>>>
>>> I imagine that if you pull the second item out of your scan result,
>>> it'll have the number you expect.  Alternately, you might consider
>>> performing the count computation during an override of the seek(...)
>>> method, instead of in the next(...) method.
>>>
>>> --
>>> Shawn Walker
>>>
>>>
>>>
>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>> mario.pastore...@teralytics.ch> wrote:
>>>
 I'm trying to create a RowCounterIterator that counts all the rows and
 returns only one key-value with the counter inside. The problem is that I
 can't get it work. The Scala code is available in the gist
 
 together with some pseudo-code of a test. The problem is that if I add an
 entry to my table, this iterator will return 0 instead of 1 and apparently
 the reason is that super.hasTop() is always false. I've tried without the
 iterator and the scanner returns 1 elements. Any idea of what I'm doing
 wrong here? Is WrappingIterator the right class to extend for this kind of
 behaviour?

 Thanks,
 Mario

 --
 Mario Pastorelli | TERALYTICS

 *software engineer*

 Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
 phone: +41794381682
 email: mario.pastore...@teralytics.ch
 www.teralytics.net


>>>
>>>
>>
>
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: mario.pastore...@teralytics.ch
> www.teralytics.net
>
>


Re: Making a RowCounterIterator

2016-07-15 Thread Christopher
Ah, I thought you were doing WholeRowIterator -> RowCounterIterator.
I now understand you're doing WholeRowIterator -> SomeCustomFilter (column
predicate) -> RowCounterIterator.

That's okay to do, but it may be better to have an iterator that creates a
clone of its source at the beginning of each row, advances to do the
filtering, and then informs the spawning iterator to either accept or
reject. This is, admittedly, far more complicated than WholeRowIterator,
but it can be safer if you have really big rows which don't fit in memory.

To your question about WholeRowIterator, yes, it's fine. The iterator will
always see sorted data (unless it's sitting on top of another iterator
which breaks this... which is possible, but not recommended at all), even
though the client may not. And yes, rows are never split (but if the query
range doesn't include the full row, it may return early). Their usage is
orthogonal, and can be used together or not.
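
As a rough sketch of the WholeRowIterator/BatchScanner combination being
discussed (the table name, thread count, and range are made up; assumes an
existing Connector):

import java.util.Collections;
import java.util.Map.Entry;
import java.util.SortedMap;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.WholeRowIterator;
import org.apache.accumulo.core.security.Authorizations;

public class WholeRowSketch {
  public static void scanWholeRows(Connector conn) throws Exception {
    BatchScanner bs = conn.createBatchScanner("mytable", Authorizations.EMPTY, 4);
    try {
      bs.setRanges(Collections.singleton(new Range())); // all rows
      // Each row comes back to the client encoded as a single key/value pair.
      bs.addScanIterator(new IteratorSetting(50, "wholeRow", WholeRowIterator.class));
      for (Entry<Key,Value> rowEntry : bs) {
        // Decode the encoded row back into its individual columns.
        SortedMap<Key,Value> columns =
            WholeRowIterator.decodeRow(rowEntry.getKey(), rowEntry.getValue());
        // columns now holds all cells of one row; apply cross-column predicates here
      }
    } finally {
      bs.close();
    }
  }
}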

On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> The WholeRowIterator is for filtering: I need all the columns that the
> filter requires so that the filter can see if the row matches or not the
> query. That's the only proper way I found to implement logic operators on
> predicated over columns of the same row.
>
> Actually I do have a question about WholeRowIterator, while we are talking
> about them. Do they make sense when used with a BatchScanner? My guess is
> yes because while the BatchScanner can return data non-sorted to the
> client, when it is scanning a single tablet the data is sorted. Because the
> data of the same rowId is never split (right?) then there is no problem in
> using a WholeRowIterator with a BatchScanner. Is this correct? I really
> can't find much documentation for Accumulo and the book doesn't help enough.
>
> On Sat, Jul 16, 2016 at 12:29 AM, Christopher  wrote:
>
>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>> one each, rather than the WholeRowIterator which could use up a lot of
>> memory unnecessarily.
>>
>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>> mario.pastore...@teralytics.ch> wrote:
>>
>>> I'm actually using this after a wholerowiterator, which is used to
>>> filter rows with the same rowId.
>>>
>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum 
>>> wrote:
>>>
>>>> The iterator in the gist also counts cells/entries/KV pairs, not unique
>>>> rows. You'll want to have some way to skip to the next row value if you
>>>> want the count to be reflective of the number of rows being read.
>>>>
>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>> accum...@shawn-walker.net> wrote:
>>>>
>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>> be making to your iterator.  The sequence isn't quite the same as a Java
>>>>> iterator (initially positioned "before" the first element), and is more
>>>>> like a C++ iterator:
>>>>>
>>>>> 0. Accumulo calls seek(...)
>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue() to
>>>>> retrieve the data. You return a key indicating 0 columns seen (since 
>>>>> next()
>>>>> hasn't yet been called)
>>>>> 3. First datum done, Accumulo calls next()
>>>>> ...
>>>>>
>>>>> I imagine that if you pull the second item out of your scan result,
>>>>> it'll have the number you expect.  Alternately, you might consider
>>>>> performing the count computation during an override of the seek(...)
>>>>> method, instead of in the next(...) method.
>>>>>
>>>>> --
>>>>> Shawn Walker
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jul 15, 2016 at 2:24 PM, Mario Pastorelli <
>>>>> mario.pastore...@teralytics.ch> wrote:
>>>>>
>>>>>> I'm trying to create a RowCounterIterator that counts all the rows
>>>>>> and returns only one key-value with the counter inside. The problem is 
>>>>>> that
>>>>>> I can't get it work. The Scala code is available in the gist
>>>>>> <https://gist.github.com/melrief/5f2ca248f1a980ddead2f2eeb19e6389>
>>>>>> together with some pseudo-code of a test.

Re: Making a RowCounterIterator

2016-07-15 Thread Christopher
+1 and we'll add you to the contributors list for doing so, if you want and
aren't already on it.

On Fri, Jul 15, 2016, 20:18 Dylan Hutchison 
wrote:

> Hi Mario,
>
> As you gain more experience with Accumulo, feel free to write or modify
> Accumulo's documentation in the places you find it lacking and send a PR.
> If you find a topic confusing, probably many others do too.
>
> Cheers, Dylan
>
> On Fri, Jul 15, 2016 at 4:04 PM, Christopher  wrote:
>
>> Ah, I thought you were doing WholeRowIterator -> RowCounterIterator
>> I now understand you're doing WholeRowIterator -> SomeCustomFilter
>> (column predicate) -> RowCounterIterator
>>
>> That's okay to do, but it may be better to have an iterator that creates
>> a clone of its source at the beginning of each row, advances to do the
>> filtering, and then informs the spawning iterator to either accept or
>> reject. This is, admittedly, far more complicated than WholeRowIterator,
>> but it can be safer if you have really big rows which don't fit in memory.
>>
>> To your question about WholeRowIterator, yes, it's fine. The iterator
>> will always see sorted data (unless it's sitting on top of another iterator
>> which breaks this... which is possible, but not recommended at all), even
>> though the client may not. And yes, rows are never split (but if the query
>> range doesn't include the full row, it may return early). Their usage is
>> orthogonal, and can be used together or not.
>>
>> On Fri, Jul 15, 2016 at 6:35 PM Mario Pastorelli <
>> mario.pastore...@teralytics.ch> wrote:
>>
>>> The WholeRowIterator is for filtering: I need all the columns that the
>>> filter requires so that the filter can see if the row matches or not the
>>> query. That's the only proper way I found to implement logic operators on
>>> predicated over columns of the same row.
>>>
>>> Actually I do have a question about WholeRowIterator, while we are
>>> talking about them. Do they make sense when used with a BatchScanner? My
>>> guess is yes because while the BatchScanner can return data non-sorted to
>>> the client, when it is scanning a single tablet the data is sorted. Because
>>> the data of the same rowId is never split (right?) then there is no problem
>>> in using a WholeRowIterator with a BatchScanner. Is this correct? I really
>>> can't find much documentation for Accumulo and the book doesn't help enough.
>>>
>>> On Sat, Jul 16, 2016 at 12:29 AM, Christopher 
>>> wrote:
>>>
>>>> It'd be more efficient to use the FirstEntryInRowIterator to just grab
>>>> one each, rather than the WholeRowIterator which could use up a lot of
>>>> memory unnecessarily.
>>>>
>>>> On Fri, Jul 15, 2016 at 6:20 PM Mario Pastorelli <
>>>> mario.pastore...@teralytics.ch> wrote:
>>>>
>>>>> I'm actually using this after a wholerowiterator, which is used to
>>>>> filter rows with the same rowId.
>>>>>
>>>>> On Fri, Jul 15, 2016 at 10:02 PM, William Slacum 
>>>>> wrote:
>>>>>
>>>>>> The iterator in the gist also counts cells/entries/KV pairs, not
>>>>>> unique rows. You'll want to have some way to skip to the next row value 
>>>>>> if
>>>>>> you want the count to be reflective of the number of rows being read.
>>>>>>
>>>>>> On Fri, Jul 15, 2016 at 3:34 PM, Shawn Walker <
>>>>>> accum...@shawn-walker.net> wrote:
>>>>>>
>>>>>>> My read is that you're mistaking the sequence of calls Accumulo will
>>>>>>> be making to your iterator.  The sequence isn't quite the same as a Java
>>>>>>> iterator (initially positioned "before" the first element), and is more
>>>>>>> like a C++ iterator:
>>>>>>>
>>>>>>> 0. Accumulo calls seek(...)
>>>>>>> 1. Is there more data? Accumulo calls hasTop(). You return yes.
>>>>>>> 2. Ok, so there's data.  Accumulo calls getTopKey(), getTopValue()
>>>>>>> to retrieve the data. You return a key indicating 0 columns seen (since
>>>>>>> next() hasn't yet been called)
>>>>>>> 3. First datum done, Accumulo calls next()
>>>>>>> ...
>>>>>>>
>>>>>>> I imagine

Re: Testing Spark Job that uses the AccumuloInputFormat

2016-08-03 Thread Christopher
On Wed, Aug 3, 2016 at 10:19 AM Keith Turner  wrote:

> As for the MiniDFSCluster issue, that should be ok.   We use mini
> accumulo cluster to test Accumulo itself.  Some of this ends up
> bringing in a dependency on MiniDFSCluster, even thought the public
> API for mini cluster does not support using it.  We need to fix this,
> so that there is no dependency on MiniDFSCluster.
>
>
MiniDFSCluster was changed to an optional dependency for MAC in the 1.8
branch. So, it won't be resolved automatically as a transitive dependency
in the future.


Re: Testing Spark Job that uses the AccumuloInputFormat

2016-08-03 Thread Christopher
On Wed, Aug 3, 2016 at 1:34 PM Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> Do you guys know if it is possible to use MockInstance to test Spark jobs?
> It's so much faster...
>
>
Probably not. It's faster, but only by behaving very differently than
Accumulo. It has also largely been neglected.

To speed things up, it's best to run a single Mini instance with something
like accumulo-maven-plugin, rather than start a new one for each test.
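
If you drive it from JUnit instead of the maven plugin, one common pattern is
to share one cluster per test class, roughly like this (a sketch; the root
password is a placeholder, and each test creates and drops its own tables):

import java.io.File;
import java.nio.file.Files;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.minicluster.MiniAccumuloCluster;
import org.junit.AfterClass;
import org.junit.BeforeClass;

public class SharedMiniClusterIT {
  private static MiniAccumuloCluster mac;

  @BeforeClass
  public static void startCluster() throws Exception {
    // One MiniAccumuloCluster for the whole test class.
    File dir = Files.createTempDirectory("mac").toFile();
    mac = new MiniAccumuloCluster(dir, "rootPassword");
    mac.start();
  }

  @AfterClass
  public static void stopCluster() throws Exception {
    mac.stop();
  }

  protected static Connector connector() throws Exception {
    return mac.getConnector("root", "rootPassword");
  }
}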


Re: Testing Spark Job that uses the AccumuloInputFormat

2016-08-03 Thread Christopher
On Wed, Aug 3, 2016 at 4:57 PM Mario Pastorelli <
mario.pastore...@teralytics.ch> wrote:

> I do run a single Mini instance for the entire test suite but I need to
> destroy and recreate tables for each instance of each test.
>
> Out of curiosity: am I doing a huge mistake in using MockInstance to test
> Accumulo? Because right now I rely mainly on it.
>
>
I think it's probably in a "use at your own risk" state. If it provides you
with sufficient confidence of the behavior of your code, then it's fine.
But, it has been marked deprecated, and you should be aware that your code
may behave differently in a real instance.

>


Re: reboot autostart

2016-08-05 Thread Christopher
RHEL7 is systemd, but systemd still has sysvinit support. I wouldn't expect
sysvinit to go away until at least RHEL8.


On Fri, Aug 5, 2016 at 5:52 PM Josh Elser  wrote:

> Most of the time, your operating system can do this for you via init.d
> scripts (chkconfig on RHEL6, I forget if they moved to systemd in RHEL7).
>
> Most mechanisms also have some sort of "rc.local" script which you can
> provide your own commands to that is automatically run when the OS boots.
>
> Michael Wall wrote:
> > What do you mean auto reboot sequence?  Are you asking about the service
> > start order?  Start dfs, then yarn, then zookeeper, then accumulo's
> > start-all.  Shutdown is the reverse.
> >
> > On Fri, Aug 5, 2016 at 4:20 PM, Kevin Cho  > > wrote:
> >
> > Thanks again for helping on last ticket.  I'm trying to create auto
> > reboot sequence for Accumulo but it's not working right.  Did anyone
> > did this before? Tried googling but couldn't find much resource.
> >
> >
>


Tuned performance profiles for Accumulo

2016-08-30 Thread Christopher
Has anybody used tuned (pronounced "tune-D") to manage their system performance
profiles on an Accumulo cluster?

I've recently been looking into tuned, and found it a very convenient tool
for switching between performance profiles, and verifying the current
configuration. It beats manually setting sysctl settings (which I usually
forget to do right away).

I haven't actually created my own tuned profiles, though, because I'm not
an expert on Linux system tuning. However, I have found the built-in
latency-network profile to be useful.

Has anybody tried a custom profile (for Accumulo specifically, or Hadoop
clusters in general)?

Has anybody else found using tuned profiles to be a useful way to manage
(CM and verification) system configuration for your clusters?


Re: 1 of 20 TServers unresponsive/slow, all writes fail?

2016-09-09 Thread Christopher
What version of Accumulo? Could narrow down the search for known issue
potentials.

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss  wrote:

> Upon further internal discussion, it looks like the metadata/root tables
> are served from the tservers (not an HA master for example) and the one in
> question was serving it. It was unable to run MajC (compaction) for many
> hours leading up to the time where it couldn't service requests any longer,
> but it was still up, hosting tablets, just very slow or unable to respond.
> So all writes ended up timing out.
>
> If this condition is possible and there is a SPOF here, it'd be good to
> see what's on the roadmap to address it.
>
> On Fri, Sep 9, 2016 at 10:24 AM,  wrote:
>
>> What was happening on that 1 tserver? Was it in garbage collection? Was
>> it having network or O/S issues?
>>
>> --
>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" 
>> *To: *user@accumulo.apache.org
>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>
>>
>> Hi,
>>
>> We are starting to investigate an issue where 1 tserver was up, but
>> became slow/unresponsive for several hours, yet all writes to our 20+
>> servers began to fail. We could see leading up to the failure that the
>> writes were distributed among all of the tablet servers, so it wasn't a
>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>> code, but any ideas what could cause this issue? Is there some sort of
>> initialization or healthchecking that the client does where 1 server could
>> impact all?
>>
>> Thanks.
>>
>> -Mike
>>
>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>   at
>>
>>
>


[ANNOUNCE] Apache Accumulo 1.6.6

2016-09-21 Thread Christopher
All-

The Accumulo team is proud to announce the release of Accumulo
version 1.6.6!  This release contains changes from more than 40 issues,
comprised of bug-fixes, performance improvements, build quality
improvements, and more. This is a maintenance (patch) release. Users of any
previous 1.6.x release are strongly encouraged to update as soon as
possible to benefit from the improvements with very little concern in
change of underlying functionality.

As of this release, active development has ceased for the 1.6 release
line, so users should consider upgrading to a newer, actively maintained
version when they can. While the developers may release another 1.6 version
to address a severe issue, there’s a strong possibility that this will be
the last 1.6 release. This would also mean that this is the last version to
support Java 6 and Hadoop 1.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The full release notes can be viewed at:
http://accumulo.apache.org/release_notes/1.6.6

The Apache Accumulo™ sorted, distributed key/value store is a
robust, scalable, high performance data storage system that features
cell-based access control and customizable server-side processing. It is
based on Google's BigTable design and is built on top of Apache Hadoop,
Apache ZooKeeper, and Apache Thrift.

--
The Apache Accumulo Team


Re: how do I list user permissions per table

2016-09-23 Thread Christopher
Currently, there's no single command to run to list permissions for a
particular table.
However, you can iterate through the users, and get the list of table
permissions per user.

I did something like:

TABLENAME=trace
for x in $(bin/accumulo shell -u root -p "$PASS" -e users | grep -v "^$(date +%Y)"); do
  output=$(bin/accumulo shell -u root -p "$PASS" -e "userpermissions -u $x" | fgrep "($TABLENAME)")
  if [[ $? -eq 0 ]]; then
    echo "$x has $output"
  fi
done

The "grep -v" was to filter out the timestamped log4j messages. That may or
may not be necessary for you, depending on your client's log4j
configuration.

Hope that helps.

On Fri, Sep 23, 2016 at 2:33 PM Jeff Kubina  wrote:

> From the accumulo shell how do I list all the users who have access to a
> specific table?
>
>


Re: setting tserver configs from the accumulo shell

2016-10-04 Thread Christopher
Some do, some don't. One thing we could add to the shell is a notification
that a restart is necessary for a particular change. Possibly.

On Tue, Oct 4, 2016, 20:25 Dave  wrote:

> I don't think so.
>
> On Oct 4, 2016 8:21 PM, Jeff Kubina  wrote:
>
> Does changing the values of tserver configs in the accumulo shell, like
> "config -s tserver.server.threads.minimum=256", require a restart of all
> the tservers to become effective?
>
>
>


Re: setting tserver configs from the accumulo shell

2016-10-04 Thread Christopher
Right now, I think you'd probably have to track down where that particular
property is used in the code to determine its lifecycle. I think it's going
to take some work to wrangle these into discrete sets for documentation
purposes, in the shell or otherwise. Some properties are only used during
certain times early in the server's lifecycle. Other properties are used on
demand. Some of those on demand properties are probably cached into
internal state for indefinite periods of time. It's hard to say which are
which without investigating each property individually (or through
empirical testing).

On Tue, Oct 4, 2016 at 9:04 PM Jeff Kubina  wrote:

> That would be very helpful, but a note in the documentation would be fine
> initially. Is there an easy way to determine this from the source code?
>
> --
> Jeff Kubina
> 410-988-4436 <(410)%20988-4436>
>
>
> On Tue, Oct 4, 2016 at 8:59 PM, Christopher  wrote:
>
> Some do, some don't. One thing we could add to the shell is a notification
> that a restart is necessary for a particular change. Possibly.
>
> On Tue, Oct 4, 2016, 20:25 Dave  wrote:
>
> I don't think so.
>
> On Oct 4, 2016 8:21 PM, Jeff Kubina  wrote:
>
> Does changing the values of tserver configs in the accumulo shell, like
> "config -s tserver.server.threads.minimum=256", require a restart of all
> the tservers to become effective?
>
>
>
>


Re: setting tserver configs from the accumulo shell

2016-10-04 Thread Christopher
At most, yes.

The properties which affect both would be the ones which start with
"general." or "instance.", with the latter being ones which must be the
same across the cluster in order for servers to participate in the same
cluster.

For all X, not in {general, instance}, properties starting with "X." should
only affect servers of type X. Otherwise, that's almost certainly a bug.

On Tue, Oct 4, 2016 at 9:21 PM Jeff Kubina  wrote:

> So just to clarify, changing a tserver.* option would at most only
> require a restart of all the tservers, not a restart of the master?
>
>
> On Tue, Oct 4, 2016 at 9:14 PM, Christopher  wrote:
>
> Right now, I think you'd probably have to track down where that particular
> property is used in the code to determine its lifecycle. I think it's going
> to take some work to wrangle these into discrete sets for documentation
> purposes, in the shell or otherwise. Some properties are only used during
> certain times early in the server's lifecycle. Other properties are used on
> demand. Some of those on demand properties are probably cached into
> internal state for indefinite periods of time. It's hard to say which are
> which without investigating each property individually (or through
> empirical testing).
>
> On Tue, Oct 4, 2016 at 9:04 PM Jeff Kubina  wrote:
>
> That would be very helpful, but a note in the documentation would be fine
> initially. Is there an easy way to determine this from the source code?
>
> --
> Jeff Kubina
> 410-988-4436 <(410)%20988-4436>
>
>
> On Tue, Oct 4, 2016 at 8:59 PM, Christopher  wrote:
>
> Some do, some don't. One thing we could add to the shell is a notification
> that a restart is necessary for a particular change. Possibly.
>
> On Tue, Oct 4, 2016, 20:25 Dave  wrote:
>
> I don't think so.
>
> On Oct 4, 2016 8:21 PM, Jeff Kubina  wrote:
>
> Does changing the values of tserver configs in the accumulo shell, like
> "config -s tserver.server.threads.minimum=256", require a restart of all
> the tservers to become effective?
>
>
>
>
>


Re: New Accumulo Blog Post

2016-11-02 Thread Christopher
I'm aware of at least one person who has patched Accumulo to allow
customizing the HDFS volume on which the WALs are stored. This reminds me
that I need to check on the status of that patch. I'm hoping it'll be
contributed soon.

I'm also curious if it'd make a difference writing to HDFS with the data
nodes mounted with sync, instead of doing a separate sync call.

On Wed, Nov 2, 2016 at 9:49 PM  wrote:

> Regarding #2 – I think there are two options here:
>
>
>
> 1. Modify Accumulo to take advantage of HDFS Heterogeneous Storage
>
> 2. Modify Accumulo WAL code to support volumes
>
>
>
> *From:* Jeff Kubina [mailto:jeff.kub...@gmail.com]
> *Sent:* Wednesday, November 02, 2016 9:02 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: New Accumulo Blog Post
>
>
>
> Thanks for the blog post, very interesting read. Some questions ...
>
>
>
> 1. Are the operations "Writes mutation to tablet servers’ WAL/Sync or
> flush tablet servers’ WAL" and "Adds mutations to sorted in memory map of
> each tablet." performed by threads in parallel?
>
>
>
> 2. Could the latency of hsync-ing the WALs be overcome by modifying
> Accumulo to write them to a separate SSD-only HDFS? To maintain data
> locality it would require two datanode processes (one for the HDDs and one
> for the SSD), running on the same node, which is not hard to do.
>
>
>


Re: HDFS Replication of data

2016-11-10 Thread Christopher
HDFS replication is transparent to Accumulo (though, the number of replicas
is configurable in Accumulo, on a per-table basis). Its primary purpose is
failure tolerance, but it *may* have an impact on read performance. I'm not
certain how significant that is, though.

There are no separate read-only or write-only copies of data on HDFS. HDFS
replication is at the block level, and files are updated by appending new
blocks to the files. All blocks are readable, and only new blocks are
written.
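
For reference, the per-table replication mentioned above is just a table
property; a minimal sketch, assuming an existing Connector and a table named
"mytable":

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.conf.Property;

public class ReplicationSketch {
  public static void setReplication(Connector conn) throws Exception {
    // table.file.replication defaults to 0, which means "use the HDFS default";
    // a positive value overrides it for files written for this table.
    conn.tableOperations().setProperty("mytable",
        Property.TABLE_FILE_REPLICATION.getKey(), "2");
  }
}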

On Thu, Nov 10, 2016 at 11:28 AM Yamini Joshi  wrote:

> Hello all
>
> Does the HDFS replication improve performance of queries on Accumulo or is
> it transparent to the Accumulo system? If it does improve the performance
> by some notion of load balancing, is there is a Read Only or Write Only
> copy of data on HDFS for Accumulo?
>
> Best regards,
> Yamini Joshi
>


Re: List of Metrics2 Metrics

2016-11-10 Thread Christopher
It may be out of date, but it did not get lost. See
https://github.com/apache/accumulo/blob/rel/1.8.0/docs/src/main/resources/metrics.html
These additional docs should be merged into the manual, and not maintained
as separate HTML files, as a continuation of the work from
https://issues.apache.org/jira/browse/ACCUMULO-1490


On Thu, Nov 10, 2016 at 11:15 AM  wrote:

>
>  It used[1] to be in the documentation when it was hosted on the monitor.
> I did not see it looking at the current documentation. Looks like it was
> lost (and [1] is likely now out of date).
>
> [1] https://github.com/apache/accumulo/blob/1.4.0/docs/metrics.html
>
> --
> *From: *"Noe Detore" 
> *To: *user@accumulo.apache.org
> *Sent: *Thursday, November 10, 2016 11:06:21 AM
> *Subject: *List of Metrics2 Metrics
>
>
> hello,
>
> Is there a documented list of produced metrics of metrics2 for accumulo?
> Any documentation explaining what the metrics are?
>


Re: Detecting database changes

2016-11-22 Thread Christopher
Apache Fluo can do this with Accumulo: https://fluo.apache.org

On Tue, Nov 22, 2016, 07:26 vaibhav thapliyal <
vaibhav.thapliyal...@gmail.com> wrote:

> Hi,
>
> I have a use case where I need to send out notifications based on changes
> in a table. Are there any kind of listeners which can be used to listen to
> a change in table event in accumulo?
>
> How do I go about this use case?
>
> Thanks
> Vaibhav


Re: Accumulo Working

2016-11-22 Thread Christopher
That's basically how it works, yes.

1. The data from tserver1 and tserver2 necessarily comes from at least two
different tablets. This is because tables are divided into discrete,
non-overlapping tablets, and each tablet is hosted only on a single
tserver. So, it is not normally necessary to merge the data from these two
sources. Your application may do a join between the two tablets on the
client side, but that is outside the scope of Accumulo.

2. Custom iterators can be applied to minc, majc, and scan scopes. I
suggest starting here:
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
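
As a rough sketch of attaching a custom iterator to all three scopes (the
iterator class name and table name here are placeholders, not real classes):

import java.util.EnumSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

public class AttachIteratorSketch {
  public static void attach(Connector conn) throws Exception {
    // Hypothetical SortedKeyValueIterator implementation; substitute your own.
    IteratorSetting setting = new IteratorSetting(25, "myIter", "com.example.MyCustomIterator");
    // Apply at scan time, minor compaction, and major compaction.
    conn.tableOperations().attachIterator("mytable", setting,
        EnumSet.of(IteratorScope.scan, IteratorScope.minc, IteratorScope.majc));
  }
}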


On Tue, Nov 22, 2016 at 12:05 PM Yamini Joshi  wrote:

> Hello all
>
> I am trying to understand Accumulo scan workflow. I've checked the
> official docs but I couldn't understand the workflow properly. Could anyone
> please tell me if I'm on the right track? For example if I want to scan
> rows in the range e-g in a table mytable which is sharded across 3 nodes in
> the cluster:
>
> Step1: Client connects to the Zookeeper and gets the location of the root
> tablet.
> Step2: Client connects to tserver with the root tablet and gets the
> location of mytable.
> the row distribution is as follows:
> tserver1 tserver2   tserver3
> a-g   h-kl-z
>
> Step3: Client connects to tserver1 and tserver2.
> Step4: tservers merge and sort data from in-memory maps, minc files and
> majc files, apply versioning iterator, seek the requested range and send
> data back to the client.
>
> Is this how a scan works? Also, I have some doubts:
> 1. Where is the data from tserver1 and tserver2 merged?
> 2. when and how are custom iterators applied?
>
>
> Also, if there is any resource explaining this, please point me to it.
> I've found some slides but no detailed explanation.
>
>
> Best regards,
> Yamini Joshi
>


Re: Accumulo Working

2016-11-22 Thread Christopher
The names of the scanners don't clearly reflect how they behave.

The regular Scanner is really a sequential scanner. It queries one tablet
at a time, sequentially, in-order, for a given range. So, the data it will
return is always in-order, and doesn't need to be merged explicitly in the
client.

The BatchScanner is really a parallel scanner, which queries multiple
ranges simultaneously, and the API does not have ordering guarantees. So,
whichever threads have data first will have their data seen first.

Regarding iterators, the server side constructs a "stack" of iterators,
based on their priority, and the data traverses this stack before being
sent back to the client:

scan on tserver (system iterators -> user iter 1 -> user iter 2 -> user
iter 3) -> client

Only data coming out of the end of the pipeline is returned to the client.
The iterator stack could get torn-down and reconstructed during the
lifetime of the scan.
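
In client code, that ordering comes from the priorities you assign when adding
the iterators; a sketch with made-up iterator class names (lower priority runs
closer to the data):

import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.Scanner;

public class IteratorStackSketch {
  // Assumes an existing Scanner; It1/It2/It3 are hypothetical iterator classes.
  public static void configureStack(Scanner scanner) {
    scanner.addScanIterator(new IteratorSetting(21, "it1", "com.example.It1"));
    scanner.addScanIterator(new IteratorSetting(22, "it2", "com.example.It2"));
    scanner.addScanIterator(new IteratorSetting(23, "it3", "com.example.It3"));
    // The default table iterators (e.g. the VersioningIterator at priority 20)
    // run first, so data flows: system/table iterators -> it1 -> it2 -> it3 -> client.
  }
}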

On Tue, Nov 22, 2016 at 1:09 PM Yamini Joshi  wrote:

> So, for a batch scan, the merge is not required but, for a scan, since it
> returns sorted data, data from tserver1 and tserver2 is merged at the
> client?
>
> I know how to write iterators but I can't vsiualize the workflow. Lets say
> in the same example I have 3 custom iterators to be applied on data: it1,
> it2, it3 respectively. When are the iterators applied:
>
> 1. scan on tserver -> client -> it1 on tserver -> client -> it2 on
> tserver  -> client -> it3 on tserver -> client
> I'm sure this is not the case, it adds a lot of overhead
>
> 2. scan on tserver ->  it1 on tserver ->  it2 on tserver  -> it3 on
> tserver -> client
> The processing is done in batches?
> Data is returned to the client when it reaches the max limit for
> table.scan.max.memory even if it is in the middle of the pipeline above?
>
> Best regards,
> Yamini Joshi
>
> On Tue, Nov 22, 2016 at 11:56 AM, Christopher  wrote:
>
> That's basically how it works, yes.
>
> 1. The data from tserver1 and tserver2 necessarily comes from at least two
> different tablets. This is because tables are divided into discrete,
> non-overlapping tablets, and each tablet is hosted only on a single
> tserver. So, it is not normally necessary to merge the data from these two
> sources. Your application may do a join between the two tablets on the
> client side, but that is outside the scope of Accumulo.
>
> 2. Custom iterators can be applied to minc, majc, and scan scopes. I
> suggest starting here:
> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterators
>
>
> On Tue, Nov 22, 2016 at 12:05 PM Yamini Joshi 
> wrote:
>
> Hello all
>
> I am trying to understand Accumulo scan workflow. I've checked the
> official docs but I couldn't understand the workflow properly. Could anyone
> please tell me if I'm on the right track? For example if I want to scan
> rows in the range e-g in a table mytable which is sharded across 3 nodes in
> the cluster:
>
> Step1: Client connects to the Zookeeper and gets the location of the root
> tablet.
> Step2: Client connects to tserver with the root tablet and gets the
> location of mytable.
> the row distribution is as follows:
> tserver1 tserver2   tserver3
> a-g   h-kl-z
>
> Step3: Client connects to tserver1 and tserver2.
> Step4: tservers merge and sort data from in-memory maps, minc files and
> majc files, apply versioning iterator, seek the requested range and send
> data back to the client.
>
> Is this how a scan works? Also, I have some doubts:
> 1. Where is the data from tserver1 and tserver2 merged?
> 2. when and how are custom iterators applied?
>
>
> Also, if there is any resource explaining this, please point me to it.
> I've found some slides but no detailed explanation.
>
>
> Best regards,
> Yamini Joshi
>
>
>


Re: BatchScanner behavior with AccumuloRowInputFormat

2016-11-30 Thread Christopher
You'd only have to worry about this behavior if you set
RowInputFormat.setBatchScan(job, true), available since 1.7.0.
By default, our InputFormats use a regular Accumulo Scanner.

See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)
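
For reference, opting in looks roughly like this when configuring the job (a
sketch; connector info and ranges are omitted, and the table name is a
placeholder):

import org.apache.accumulo.core.client.mapreduce.AccumuloRowInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BatchScanJobSketch {
  public static void configure(Job job) throws Exception {
    AccumuloRowInputFormat.setInputTableName(job, "mytable");
    // Off by default; enabling it makes each mapper read with a BatchScanner,
    // which can interleave entries from different rows.
    AccumuloRowInputFormat.setBatchScan(job, true);
  }
}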


On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti 
wrote:

Hi all,

as you already know, the AccumuloRowInputFormat is internally using a
RowIterator for iterating over all the key value pairs of a single row. In
the past when I was using the RowIterator together with a BatchScanner I
had the problem of a single row be split into multiple rows due to the fact
that a BatchScanner can interleave key-value pairs of different rows.
Should I expect the same behavior when using the AccumuloRowInputFormat
with a BatchScanner (enabled via setBatchScan)?
Thanks,
Max


Re: BatchScanner behavior with AccumuloRowInputFormat

2016-12-01 Thread Christopher
The benefit of using a BatchScanner in the AccumuloRowInputFormat is that
it can fetch multiple ranges in parallel within each Mapper. This may be
able to help you manage your MapReduce job resources a bit better (see the
discussion in the JIRA issue for details). If you don't need to use it, I
wouldn't use that option. If you have to use it because of performance
issues, then you can mitigate the row-splitting problem using the
WholeRowIterator, but that will come with its own performance implications.
You might also be able to mitigate by resolving the
single-row-represented-as-multiple-rows problem with a Combiner or in your
Reducer.

On Thu, Dec 1, 2016 at 1:51 AM Massimilian Mattetti 
wrote:

> I see, so the only solution here would be either to use a WholeRowIterator
> or to avoid enabling the BatchScanner. Since each executor will work on a
> single tablet I guess that the benefit of using a BatchScanner is that it
> can fetch multiple ranges over the same tablet in parallel, am I correct?
> Thanks,
> Max
>
>
>
>
> From:Christopher 
> To:user@accumulo.apache.org
> Date:30/11/2016 18:48
> Subject:Re: BatchScanner behavior with AccumuloRowInputFormat
> --
>
>
>
> You'd only have to worry about this behavior if you set
> RowInputFormat.setBatchScan(job, true), available since 1.7.0.
> By default, our InputFormats use a regular Accumulo Scanner.
>
> See https://issues.apache.org/jira/browse/ACCUMULO-3602 and
> https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.7.0/org/apache/accumulo/core/client/mapreduce/InputFormatBase.html#setBatchScan(org.apache.hadoop.mapreduce.Job,%20boolean)
>
>
> On Wed, Nov 30, 2016 at 9:42 AM Massimilian Mattetti <
> *massi...@il.ibm.com* > wrote:
> Hi all,
>
> as you already know, the AccumuloRowInputFormat is internally using a
> RowIterator for iterating over all the key value pairs of a single row. In
> the past when I was using the RowIterator together with a BatchScanner I
> had the problem of a single row be split into multiple rows due to the fact
> that a BatchScanner can interleave key-value pairs of different rows.
> Should I expect the same behavior when using the AccumuloRowInputFormat
> with a BatchScanner (enabled via setBatchScan)?
> Thanks,
> Max
>
>
>
>


Re: openjdk, Accumulo master state doesn't change from HAVE_LOCK to NORMAL

2016-12-01 Thread Christopher
This issue described doesn't seem related to the JDK. Yes, you should
expect Accumulo to work with OpenJDK. While we don't prescribe a JDK for
users, most of the developers have an interest in ensuring at least OpenJDK
and Oracle JDK work well. A few people care about IBM JDK also, and we
accept contributions to fix issues relevant to IBM JDK.

Personally, I use OpenJDK 8 exclusively, and haven't seen this issue.

On Thu, Dec 1, 2016 at 5:24 PM Jayesh Patel  wrote:

> Accumulo 1.7.0 with HDFS 2.7.1
>
>
>
> I was experimenting with openjdk instead of Oracle JRE for Accumulo and
> ran into this issue.  It seems like because it never transitions from
> HAVE_LOCK to NORMAL, it never gets around to start the thrift server.
> Changing back to Oracle JRE didn’t make a difference.
>
>
>
> Here’s what I get in the logs for the master:
>
> 2016-12-01 17:01:07,970 [trace.DistributedTrace] INFO : SpanReceiver
> org.apache.accumulo.tracer.ZooTraceCli
>
> ent was loaded successfully.
>
> 2016-12-01 17:01:07,971 [master.Master] INFO : trying to get master lock
>
> 2016-12-01 17:01:07,987 [master.EventCoordinator] INFO : State changed
> from INITIAL to HAVE_LOCK
>
> 2016-12-01 17:01:08,038 [master.Master] INFO : New servers:
> [instance-accumulo:9997[258ac79197e000d]]
>
>
>
> On the successful install it transitions right away to NORMAL and goes on
> to listen on the port:
>
> 2015-06-07 17:50:35,393 [master.Master] INFO : trying to get master lock
>
> 2015-06-07 17:50:35,408 [master.EventCoordinator] INFO : State changed
> from INITIAL to HAVE_LOCK
>
> 2015-06-07 17:50:35,432 [master.EventCoordinator] INFO : State changed
> from HAVE_LOCK to NORMAL
>
> 2015-06-07 17:50:35,524 [balancer.TableLoadBalancer] INFO : Loaded class
> org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>
> 2015-06-07 17:50:35,631 [master.Master] INFO : Setting master lock data to
> 127.0.0.1:
>
>
>
> I found ACCUMULO-4513, but it doesn’t
> seem relevant as I didn’t try to stop.
>
>
>
> Any ideas as to what is going on?
>
>
>
> HDFS seems fine based on my limited tests with openjdk 1.8.  I did find
> some old posts about Accumulo issues with IBM JDK.  Should I expect
> Accumulo to work with openjdk?
>
>
>
> Thank you,
> Jayesh
>
>
>


Re: openjdk, Accumulo master state doesn't change from HAVE_LOCK to NORMAL

2016-12-01 Thread Christopher
I can't be sure why the master is stuck in that state. It could be a bug in
1.7.0. Can you reproduce in 1.7.2?

On Thu, Dec 1, 2016 at 5:56 PM Jayesh Patel  wrote:

> Thank you!
>
>
>
> What might be going on with the state transition of the master?  Looks
> like the tservers can’t talk to the master without the thrift interface.
> Is there another interface I can enable on the master?
>
>
>
>
>
> *From:* Christopher [mailto:ctubb...@apache.org]
> *Sent:* Thursday, December 01, 2016 5:34 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: openjdk, Accumulo master state doesn't change from
> HAVE_LOCK to NORMAL
>
>
>
> This issue described doesn't seem related to the JDK. Yes, you should
> expect Accumulo to work with OpenJDK. While we don't prescribe a JDK for
> users, most of the developers have an interest in ensuring at least OpenJDK
> and Oracle JDK work well. A few people care about IBM JDK also, and we
> accept contributions to fix issues relevant to IBM JDK.
>
> Personally, I use OpenJDK 8 exclusively, and haven't seen this issue.
>
>
>
> On Thu, Dec 1, 2016 at 5:24 PM Jayesh Patel  wrote:
>
> Accumulo 1.7.0 with HDFS 2.7.1
>
>
>
> I was experimenting with openjdk instead of Oracle JRE for Accumulo and
> ran into this issue.  It seems like because it never transitions from
> HAVE_LOCK to NORMAL, it never gets around to start the thrift server.
> Changing back to Oracle JRE didn’t make a difference.
>
>
>
> Here’s what I get in the logs for the master:
>
> 2016-12-01 17:01:07,970 [trace.DistributedTrace] INFO : SpanReceiver
> org.apache.accumulo.tracer.ZooTraceCli
>
> ent was loaded successfully.
>
> 2016-12-01 17:01:07,971 [master.Master] INFO : trying to get master lock
>
> 2016-12-01 17:01:07,987 [master.EventCoordinator] INFO : State changed
> from INITIAL to HAVE_LOCK
>
> 2016-12-01 17:01:08,038 [master.Master] INFO : New servers:
> [instance-accumulo:9997[258ac79197e000d]]
>
>
>
> On the successful install it transitions right away to NORMAL and goes on
> the listen on port :
>
> 2015-06-07 17:50:35,393 [master.Master] INFO : trying to get master lock
>
> 2015-06-07 17:50:35,408 [master.EventCoordinator] INFO : State changed
> from INITIAL to HAVE_LOCK
>
> 2015-06-07 17:50:35,432 [master.EventCoordinator] INFO : State changed
> from HAVE_LOCK to NORMAL
>
> 2015-06-07 17:50:35,524 [balancer.TableLoadBalancer] INFO : Loaded class
> org.apache.accumulo.server.master.balancer.DefaultLoadBalancer for table !0
>
> 2015-06-07 17:50:35,631 [master.Master] INFO : Setting master lock data to
> 127.0.0.1:
>
>
>
> I found ACCUMULO-4513, but it doesn’t seem relevant as I didn’t try to stop.
>
>
>
> Any ideas as to what is going on?
>
>
>
> HDFS seems fine based on my limited tests with openjdk 1.8.  I did find
> some old posts about Accumulo issues with IBM JDK.  Should I expect
> Accumulo to work with openjdk?
>
>
>
> Thank you,
> Jayesh
>
>
>
>


Re: VFS version in 1.6.6 binary release

2016-12-02 Thread Christopher
The release notes for 1.6.6 are in error. I'll update them.

On Fri, Dec 2, 2016 at 11:36 AM Michael Wall  wrote:

> Andrew,
>
> The commons-vfs 2.1 jar broke the accumulo build in 1.6.6 using the hadoop
> 1 profile.  That profile is remove in 1.7+, so the commons-vfs update was
> left out of 1.6.6.  You should just replace the commons-vfs jar in your
> deployment.  See https://issues.apache.org/jira/browse/ACCUMULO-3470
>
> Mike
>
> On Fri, Dec 2, 2016 at 10:27 AM, Andrew Hulbert  wrote:
>
> Hi all,
>
> It appears that the commons-vfs2 jar that ships with the 1.6.6 binary
> tar.gz is still version 2.0 according the the META-INF/MANIFEST.MF and
> other maven artifacts in the META-INF instead of 2.1 which is what I
> thought it should be according to the release notes.
>
> Wondering if this is something that can be fixed in the distro or would it
> require a new 1.6.7 release?
>
> Andrew
>
>
>


Re: VFS version in 1.6.6 binary release

2016-12-02 Thread Christopher
For what it's worth, the Accumulo RPMs in the Fedora 25 repositories do
have the VFS 2.1 patch backported (
http://pkgs.fedoraproject.org/cgit/rpms/accumulo.git/tree/ACCUMULO-3470.patch?h=f25
)

On Fri, Dec 2, 2016 at 12:37 PM Christopher  wrote:

> The release notes for 1.6.6 are in error. I'll update them.
>
> On Fri, Dec 2, 2016 at 11:36 AM Michael Wall  wrote:
>
> Andrew,
>
> The commons-vfs 2.1 jar broke the accumulo build in 1.6.6 using the hadoop
> 1 profile.  That profile is remove in 1.7+, so the commons-vfs update was
> left out of 1.6.6.  You should just replace the commons-vfs jar in your
> deployment.  See https://issues.apache.org/jira/browse/ACCUMULO-3470
>
> Mike
>
> On Fri, Dec 2, 2016 at 10:27 AM, Andrew Hulbert  wrote:
>
> Hi all,
>
> It appears that the commons-vfs2 jar that ships with the 1.6.6 binary
> tar.gz is still version 2.0 according the the META-INF/MANIFEST.MF and
> other maven artifacts in the META-INF instead of 2.1 which is what I
> thought it should be according to the release notes.
>
> Wondering if this is something that can be fixed in the distro or would it
> require a new 1.6.7 release?
>
> Andrew
>
>
>


Re: VFS version in 1.6.6 binary release

2016-12-02 Thread Christopher
The backport may only be necessary if you are building Accumulo from
source. You may be able to drop in 2.1 as a replacement for 2.0 in the
classpath on the pre-built binaries, without a problem.

On Fri, Dec 2, 2016 at 1:00 PM Michael Wall  wrote:

> Andrew,
>
> You should be fine upgrading commons-vfs to 2.1 with Accumulo 1.6.4.  Ran
> that way for a long time with no problems.
>
> Mike
>
> On Fri, Dec 2, 2016 at 12:57 PM, Andrew Hulbert  wrote:
>
> Thanks all. Thanks! Think it would be safe to upgrade the VFS then with
> 1.6.4 as well?
>
> Andrew
> On 12/02/2016 12:37 PM, Christopher wrote:
>
> The release notes for 1.6.6 are in error. I'll update them.
>
> On Fri, Dec 2, 2016 at 11:36 AM Michael Wall < 
> mjw...@gmail.com> wrote:
>
> Andrew,
>
> The commons-vfs 2.1 jar broke the accumulo build in 1.6.6 using the hadoop
> 1 profile.  That profile is remove in 1.7+, so the commons-vfs update was
> left out of 1.6.6.  You should just replace the commons-vfs jar in your
> deployment.  See https://issues.apache.org/jira/browse/ACCUMULO-3470
>
> Mike
>
> On Fri, Dec 2, 2016 at 10:27 AM, Andrew Hulbert  wrote:
>
> Hi all,
>
> It appears that the commons-vfs2 jar that ships with the 1.6.6 binary
> tar.gz is still version 2.0 according the the META-INF/MANIFEST.MF and
> other maven artifacts in the META-INF instead of 2.1 which is what I
> thought it should be according to the release notes.
>
> Wondering if this is something that can be fixed in the distro or would it
> require a new 1.6.7 release?
>
> Andrew
>
>
>
>
>


Re: Master server throw AccessControlException

2016-12-04 Thread Christopher
The stack trace doesn't include anything from Accumulo, so it's not clear
where in the Accumulo code this occurred. Do you have the full stack trace?

In particular, it's not clear to me that there should be a directory called
failed/da at that location, nor is it clear why Accumulo would be trying to
check for the execute permission on it, unless it's trying to recurse into
a directory. There is one part of the code where, if the directory exists
when log recovery begins, it may try to do a recursive delete, but I can't
see how this location would have been created by Accumulo. If that is the
case, then it should be safe to manually delete this directory and its
contents. The failed marker should be a regular file, though, and should
not be a directory with another directory called "da" in it. So, I can't
see how this was even created, unless by an older version or another
program.

The only way I can see this occurring is if you recently did an upgrade,
while Accumulo had not yet finished outstanding log recoveries from a
previous shutdown, AND the previous version did something different than
1.7.2. If that was the case, then perhaps the older version could have
created this problematic directory. It seems unlikely, though... because
directories are usually not created without the execute bit... and the
error message looks like a directory missing that bit.

It's hard to know more without seeing the full stack trace with the
relevant accumulo methods included. It might also help to see the master
debug logs leading up to the error.

On Sun, Dec 4, 2016 at 2:35 AM Takashi Sasaki  wrote:

> I use Accumulo 1.7.2 with Hadoop 2.7.2 and ZooKeeper 3.4.8.
>
> The master server suddenly throws an AccessControlException.
>
> java.io.IOException:
> org.apache.hadoop.security.AccessControlException: Permission denied:
> user=accumulo, access=EXECUTE,
> inode="/accumulo/recovery/603194f3-dd41-44ed-8ad6-90d408149952/failed/da":accumulo:accumulo:-rw-r--r--
>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1720)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3855)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1011)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:843)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>
>
> How can I solve this Exception?
>
>
> Thank you,
> Takashi.
>


Re: Closing a ZooKeeperInstance client connection

2016-12-13 Thread Christopher
The Instance does not represent a stateful connection. Thus, a close
operation does not make sense.

The Instance interface is supposed to represent a strategy for identifying
a particular deployed instance of Accumulo.

Over time, it may have gathered state. In addition, there are several
places in the client code where we store static state in the JVM, and it
may appear that these are coming from the Instance, only because that's the
logical entry point to Accumulo. These are areas we know we need to work
on, and a proper client resources lifecycle *is* expected at some future
point. However, there's a lot of development work to get to that point.


On Tue, Dec 13, 2016 at 5:42 PM Eric Daniels <
edani...@researchinnovations.com> wrote:

> This question may be naive, but I was just wondering why the Java
> Zookeeper implementation Accumulo provides gives no way to close the client
> connection(s).
>
> The reason I ask is that we spin up a mini cluster to use for various test
> classes. As far as I've been able to tell, there is no way to kill the
> client after you get a Connector from a Zookeeper instance connected to the
> cluster.   Hence every time we shut down the cluster it dumps socket and
> connection error messages because the Zookeeper client can no longer talk
> to the cluster (obviously).  Given we have a very large number of tests
> classes, it makes our build logs full of these exception messages.
>
> Is there something I'm missing here or not understanding?  It seems like
> the main Apache Zookeeper api has close methods, just the Accumulo one does
> not.  Any insight here would be appreciated.
>
> Thanks,
>
> Eric
>
-- 
Christopher


Re: New Accumulo Blog Post

2016-12-20 Thread Christopher
I believe the work is done, or nearly done. I was coordinating with Mike
Walch off list to prepare the code, before it's officially submitted as a
patch to the Apache project. I've asked him to give me a chance to review
it before it gets submitted.

If you'd like to take a preview, you can see it in this branch:
https://github.com/mikewalch/accumulo/tree/volume-chooser

I'd definitely like it to be a blocker for 2.0.0. I think it's an essential
feature.

On Tue, Dec 20, 2016 at 3:00 PM Jeff Kubina  wrote:

> Chris,
>
> Any status on the patch to Accumulo to allow customizing the HDFS volume
> on which the WALs are stored.
>
>
> --
> Jeff Kubina
> 410-988-4436 <(410)%20988-4436>
>
>
> On Wed, Nov 2, 2016 at 10:34 PM, Christopher  wrote:
>
> I'm aware of at least one person who has patched Accumulo to allow
> customizing the HDFS volume on which the WALs are stored. This reminds me
> that I need to check on the status of that patch. I'm hoping it'll be
> contributed soon.
>
> I'm also curious if it'd make a difference writing to HDFS with the data
> nodes mounted with sync, instead of doing a separate sync call.
>
> On Wed, Nov 2, 2016 at 9:49 PM  wrote:
>
> Regarding #2 – I think there are two options here:
>
>
>
> 1. Modify Accumulo to take advantage of HDFS Heterogeneous Storage
>
> 2. Modify Accumulo WAL code to support volumes
>
>
>
> *From:* Jeff Kubina [mailto:jeff.kub...@gmail.com]
> *Sent:* Wednesday, November 02, 2016 9:02 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: New Accumulo Blog Post
>
>
>
> Thanks for the blog post, very interesting read. Some questions ...
>
>
>
> 1. Are the operations "Writes mutation to tablet servers’ WAL/Sync or
> flush tablet servers’ WAL" and "Adds mutations to sorted in memory map of
> each tablet." performed by threads in parallel?
>
>
>
> 2. Could the latency of hsync-ing the WALs be overcome by modifying
> Accumulo to write them to a separate SSD-only HDFS? To maintain data
> locality it would require two datanode processes (one for the HDDs and one
> for the SSD), running on the same node, which is not hard to do.
>
>
>
>
> --
Christopher


Re: is there any "trick" to save the state of an iterator?

2017-01-09 Thread Christopher
FWIW, there is an open pull request on that issue that puts the work very
near to completion. It could probably use a bit more testing and review,
though.

On Mon, Jan 9, 2017 at 9:37 PM Josh Elser  wrote:

> And yet, Accumulo still doesn't have the API to safely do it.
>
> See ACCUMULO-1280 if you'd like to contribute towards those efforts for
> the community.
>
> On Jan 9, 2017 20:23, "Jeremy Kepner"  wrote:
>
> It's done in D4M (d4m.mit.edu), you might look there.
> Dylan can explain (if necessary).
> Regards.  -Jeremy
>
> On Mon, Jan 09, 2017 at 07:30:03PM -0500, Josh Elser wrote:
> > Great. Glad I wasn't derailing things :)
> >
> > Unfortunately, I don't think this is a very well-documented area of the
> > code (it's quite advanced and would just confuse most users).
> >
> > I'll have to think about it some more and see if I can come up with
> > anything clever. I know there are some others subscribed to this list
> > who might be more clever than I am -- I'm sure they'll weigh in if they
> > have any suggestions.
> >
> > Finally, if you're interested in helping us put together some sort of
> > "advanced indexing" docs for the project, I'm sure we could find a few
> > people who would be happy to get something published on the Accumulo
> > website.
> >
> > Massimilian Mattetti wrote:
> > > Thank you for your answer John, you understood perfectly what my use
> > > case is.
> > >
> > > The possible solutions that you propose came to mind to me, too. This
> > > confirms to me that, unfortunately, there is no fancy way to overcome
> > > this problem.
> > >
> > > Is there any good documentation on different query planning for
> Accumulo
> > > that could help with my use case?
> > > Thanks.
> > >
> > > Regards,
> > > Max
> > >
> > >
> > >
> > >
> > > From: Josh Elser 
> > > To: user@accumulo.apache.org
> > > Date: 09/01/2017 21:55
> > > Subject: Re: is there any "trick" to save the state of an iterator?
> > >
> 
> > >
> > >
> > >
> > > Hey Max,
> > >
> > > There is no provided mechanism to do this, and this is a problem with
> > > supporting "range queries". I'm hoping I'm understanding your use-case
> > > correctly; sorry in advance if I'm going off on a tangent.
> > >
> > > When performing the standard sort-merge join across some columns to
> > > implement intersections and unions, the un-sorted range of values you
> > > want to scan over (500k-600k) breaks the ordering of the docIds which
> > > you are trying to catch.
> > >
> > > The trivial solution is to convert a range into a union of discrete
> > > values (50 || 51 || 52 || ..) but you can see how this
> > > quickly falls apart. An inverted index could be used to enumerate the
> > > values that exist in the range.
> > >
> > > Another trivial solution would be to select all records matching the
> > > smaller condition, and then post-filter the other condition.
> > >
> > > There might be some more trickier query planning decisions you could
> > > also experiment with (I'd have to give it lots more thought). In short,
> > > I'd recommend against trying to solve the problem via saving state.
> > > Architecturally, this is just not something that Accumulo Iterators are
> > > designed to support at this time.
> > >
> > > - Josh
> > >
> > > Massimilian Mattetti wrote:
> > >  > Hi all,
> > >  >
> > >  > I am working with a Document-Partitioned Index table whose index
> > >  > sections are accessed using ranges over the indexed properties (e.g.
> > >  > property A ∈ [500,000 - 600,000], property B ∈ [0.1 - 0.4], etc.).
> The
> > >  > iterator that handles this table works by: 1st - calculating (doing
> > >  > intersection and union on different properties) all the result from
> the
> > >  > index section of a single bin; 2nd - using the ids retrieved from
> the
> > >  > index, it goes over the data section of the specific bin.
> > >  > This iterator has proved to have significant performance penalty
> > >  > whenever the amount of data retrieved from the index is orders of
> > >  > magnitude bigger than the table_scan_max_memory i.e. the iterator is
> > >  > teardown tens of times for each bin. Since there is no explicit way
> to
> > >  > save the state of an iterator, is there any other mechanism/approach
> > >  > that I could use/follow in order to avoid to re-calculate the index
> > >  > result set after each teardown?
> > >  > Thanks.
> > >  >
> > >  >
> > >  > Regards,
> > >  > Max
> > >  >
> > > .
> > >
> > >
> > >
> > >
>
> --
Christopher


Re: data miss when use rowiterator

2017-02-09 Thread Christopher
Does it matter if your scanner is a BatchScanner or a Scanner?
I wonder if this is due to the way BatchScanner could split rows up.

On Thu, Feb 9, 2017 at 9:50 PM Lu Q  wrote:

>
> I use Accumulo 1.8.0, and I am developing an ORM framework to convert scan
> results into objects.
>
> I have been using RowIterator because it is faster than using the scan directly:
>
> RowIterator rows = new RowIterator(scan);
> rows.forEachRemaining(rowIterator -> {
>     while (rowIterator.hasNext()) {
>         Map.Entry<Key, Value> entry = rowIterator.next();
>         ...
>     }
> });
>
> It works ok until I query 1000+ at once. I found that when the range size is
> bigger than 1000, some data goes missing.
> I thought maybe my conversion was wrong, so I changed it to a map struct, with
> the row_id as the map key and everything else as the map value, but the
> problem still exists.
>
> Then, when I do not use RowIterator, it works ok:
> for (Map.Entry<Key, Value> entry : scan) {
>     ...
> }
>
>
> Is the bug or my program error ?
> Thanks.
>
-- 
Christopher


Re: data miss when use rowiterator

2017-02-09 Thread Christopher
I suspected that was the case. BatchScanner does not guarantee ordering of
entries, which is needed for the behavior you're expecting with
RowIterator. This means that the RowIterator could see the same row
multiple times with different subsets of the row's columns. This is
probably affecting your count.
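
A rough sketch of the Scanner-based alternative, which does return entries in
key order so RowIterator can group them correctly (the connector, table name,
and range below are placeholders):

Scanner scanner = conn.createScanner("mytable", Authorizations.EMPTY);
scanner.setRange(new Range());  // or a narrower range
RowIterator rows = new RowIterator(scanner);
while (rows.hasNext()) {
    Iterator<Map.Entry<Key,Value>> row = rows.next();
    while (row.hasNext()) {
        Map.Entry<Key,Value> entry = row.next();
        // ... map the entry onto your object
    }
}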

On Thu, Feb 9, 2017 at 10:29 PM Lu Q  wrote:

> I use BatchScanner
>
> On Feb 10, 2017, at 11:24, Christopher wrote:
>
> Does it matter if your scanner is a BatchScanner or a Scanner?
> I wonder if this is due to the way BatchScanner could split rows up.
>
> On Thu, Feb 9, 2017 at 9:50 PM Lu Q  wrote:
>
>
> I use Accumulo 1.8.0, and I have developed an ORM framework for converting
> scan results to objects.
>
> Previously I used RowIterator because it is faster than using the scan directly:
>
> RowIterator rows = new RowIterator(scan);
> rows.forEachRemaining(rowIterator -> {
>     while (rowIterator.hasNext()) {
>         Map.Entry<Key,Value> entry = rowIterator.next();
>         ...
>     }
> });
>
> It works OK until I query 1000+ at once. I found that when the range size is
> bigger than 1000, some data is missing.
> I thought maybe my conversion was wrong, so I changed it to a map structure,
> with the row_id as the map key and the rest as the map value, but the problem
> still exists.
>
> Then, when I do not use RowIterator, it works OK:
> for (Map.Entry<Key,Value> entry : scan) {
>     ...
> }
>
>
> Is this a bug or an error in my program?
> Thanks.
>
> --
> Christopher
>
>
> --
Christopher


Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

2017-02-20 Thread Christopher
Removing them is probably a bad idea. The root table entries correspond to
split points in the metadata table. There is no need for the tables which
existed when the metadata table split to still exist for this to continue
to act as a valid split point.

Would need to see the exception stack trace, or at least an error message,
to troubleshoot the shell scanning error you saw.

On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR 
wrote:

> UNOFFICIAL
>
> In case it is ok to remove these from the root table, how can I scan the
> root table for rows with a rowid starting with !0;1vm?
>
> Running "scan -b !0;1vm" throws an exception and exits the shell.
>
>
> -Original Message-
> From: Dickson, Matt MR [mailto:matt.dick...@defence.gov.au]
> Sent: Tuesday, 21 February 2017 09:30
> To: 'user@accumulo.apache.org'
> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> UNOFFICIAL
>
>
> Does that mean I should have entries for 1vm in the metadata table
> corresponding to the root table?
>
> We are running 1.6.5
>
>
> -Original Message-
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Tuesday, 21 February 2017 09:22
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> The root table should only reference the tablets in the metadata table.
> It's a hierarchy: like metadata is for the user tables, root is for the
> metadata table.
>
> What version are ya running, Matt?
>
> Dickson, Matt MR wrote:
> > *UNOFFICIAL*
> >
> > I have a situation where all tablet servers are progressively being
> > declared dead. From the logs the tservers report errors like:
> > 2017-02- DEBUG: Scan failed thrift error
> > org.apache.thrift.trasport.TTransportException null
> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > 1vm was a table id that was deleted several months ago so it appears
> > there is some invalid reference somewhere.
> > Scanning the metadata table "scan -b 1vm" returns no rows returned for
> 1vm.
> > A scan of the accumulo.root table returns approximately 15 rows that
> > start with; !0:1vm;/::2016103 /blah/ // How are the root
> > table entries used and would it be safe to remove these entries since
> > they reference a deleted table?
> > Thanks in advance,
> > Matt
> > //
>
-- 
Christopher


Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

2017-02-21 Thread Christopher
It should be safe to merge on the metadata table. That was one of the goals
of moving the root tablet into its own table. I'm pretty sure we have a
build test to ensure it works.

On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR 
wrote:

> *UNOFFICIAL*
> Firstly, thankyou for your advice its been very helpful.
>
> Increasing the tablet server memory has allowed the metadata table to come
> online.  From using the rfile-info and looking at the splits for the
> metadata table it appears that all the metadata table entries are in one
> tablet.  All tablet servers then query the one node hosting that tablet.
>
> I suspect the cause of this was a poorly designed table that at one point
> the Accumulo gui reported 1.02T tablets for.  We've subsequently deleted
> that table but it might be that there were so many entries in the metadata
> table that all splits on it were due to this massive table that had the
> table id 1vm.
>
> To rectify this, is it safe to run a merge on the metadata table to force
> it to redistribute?
>
> --
> *From:* Michael Wall [mailto:mjw...@gmail.com]
> *Sent:* Wednesday, 22 February 2017 02:44
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> Matt,
>
> If I am reading this correctly, you have a tablet that is being loaded
> onto a tserver.  That tserver dies, so the tablet is then assigned to
> another tserver.  While the tablet is being loaded, that tserver dies, and
> so on.  Is that correct?
>
> Can you identify the tablet that is bouncing around?  If so, try using
> rfile-info -d to inspect the rfiles associated with that tablet.  Also look
> at the rfiles that compose that tablet to see if anything sticks out.
>
> Any logs that would help explain why the tablet server is dying?  Can you
> increase the memory of the tserver?
>
> Mike
>
> On Tue, Feb 21, 2017 at 10:35 AM Josh Elser  wrote:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely
> ones.
>
> * JVM gc pauses
> * ZooKeeper max client connections
> * Operating System/Hardware-level pauses
>
> The former should be noticeable by the Accumulo log. There is a daemon
> running which watches for pauses that happen and then reports them. If
> this is happening, you might have to give the process some more Java
> heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the latter, swappiness is the most likely candidate (assuming this
> is hopping across different physical nodes), as are "transparent huge
> pages". If it is limited to a single host, things like bad NICs, hard
> drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>  wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup
> > the server that hosts a particular metadata tablet gets scanned by all
> other
> > tablet servers in the cluster.  This then crashes that tablet server
> with an
> > error in the tserver log;
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> > communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for
> >
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host which then
> > fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following
> log
> > statement from all tserver.logs in the cluster:
> >
> >  [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> > org.apache.thrift.transport.TTransportException  null
> > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > Hope that helps complete the picture.
> >
> >
> > 
> > From: Christopher [mailto:ctubb...@apache.org]
> > Sent: Tuesday, 21 February 2017 13:17
> >
> > To: user@accumulo.apache.org
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond to
> > split points in the metadata table. There is no need for the tables which
> > existed when the metadata table split to still exist for this to continue
> > to act as a valid split point.

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

2017-02-22 Thread Christopher
On Wed, Feb 22, 2017 at 8:18 PM Dickson, Matt MR <
matt.dick...@defence.gov.au> wrote:

> UNOFFICIAL
>
> I ran the compaction with no luck.
>
> I've had a close look at the split points on the metadata table and
> confirmed that due to the initial large table we now have 90% of the
> metadata for existing tables hosted on one tablet which creates a hotspot.
> I've now manually added better split points to the metadata table that has
> created tablets with only 4-5M entries rather than 12M+.
>
> The split points I created isolate the metadata for large tables to
> separate tablets but ideally I'd like to split these further which raises 3
> questions.
>
> 1. If I have table 1xo, is there a smart way to determine the mid point of
> the data in the metadata table eg 1xo; to allow me to create a split
> based on that?
>
> 2. I tried to merge tablets on the metadata table where the size was
> smaller than 1M but was met with a warning stating merge on the metadata
> table was not allowed. Due to the deletion of the large table we have
> several tablets with zero entries and they will never be populated.
>
>
Hmm. That seems to ring a bell. It was a goal of moving the root tablet
into its own table, that users would be able to merge the metadata table.
However, we may still have an unnecessary constraint on that in the
interface, which is no longer needed. If merging on the metadata table
doesn't work, please file a JIRA at
https://issues.apache.org/browse/ACCUMULO with any error messages you saw,
so we can track it as a bug.


> 3. How Accumulo should deal with the deletion of a massive table? Should
> the metadata table redistribute the tablets to avoid hotspotting on a
> single tserver which appears to be whats happening?
>
> Thanks for all the help so far.
>
> -Original Message-
> From: Josh Elser [mailto:josh.el...@gmail.com]
> Sent: Thursday, 23 February 2017 10:00
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> There's likely a delete "tombstone" in another file referenced by that
> tablet which is masking those entries. If you compact the tablet, you
> should see them all disappear.
>
> Yes, you should be able to split/merge the metadata table just like any
> other table. Beware, the implications of this are system wide instead of
> localized to a single user table :)
>
> Dickson, Matt MR wrote:
> > *UNOFFICIAL*
> >
> > When I inspect the rfiles associated with the metadata table using the
> > rfile-info there are a lot of entries for the old deleted table, 1vm.
> > Querying the metadata table returns nothing for the deleted table.
> > When a table is deleted should the rfiles have any records referencing
> > the old table?
> > Also, am I able to manually create new split point on the metadata
> > table to force it to break up the large tablet?
> > --
> > --
> > *From:* Christopher [mailto:ctubb...@apache.org]
> > *Sent:* Wednesday, 22 February 2017 15:46
> > *To:* user@accumulo.apache.org
> > *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > It should be safe to merge on the metadata table. That was one of the
> > goals of moving the root tablet into its own table. I'm pretty sure we
> > have a build test to ensure it works.
> >
> > On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> > mailto:matt.dick...@defence.gov.au>>
> wrote:
> >
> > __
> >
> > *UNOFFICIAL*
> >
> > Firstly, thankyou for your advice its been very helpful.
> > Increasing the tablet server memory has allowed the metadata table
> > to come online. From using the rfile-info and looking at the splits
> > for the metadata table it appears that all the metadata table
> > entries are in one tablet. All tablet servers then query the one
> > node hosting that tablet.
> > I suspect the cause of this was a poorly designed table that at one
> > point the Accumulo gui reported 1.02T tablets for. We've
> > subsequently deleted that table but it might be that there were so
> > many entries in the metadata table that all splits on it were due to
> > this massive table that had the table id 1vm.
> > To rectify this, is it safe to run a merge on the metadata table to
> > force it to redistribute?
> >
> >
>  
> > *From:* Michael Wall [mailto:mjw...@gmail.com
> > <mailto:mjw...@gmail.com>]
> > 

Re: Master takes awhile to start after Accumulo start-all.sh run

2017-05-25 Thread Christopher
Hello,

How long is "awhile"? A few seconds, tens of seconds, minutes?

It could take some extra time in a virtualized environment, but it's hard
to know exactly whether what you're seeing is normal or not, without
knowing how many tablets your system has, how fast your networking is, what
specs your nodes have, how many tablet servers you are running, and any
number of other factors.

If you're starting HDFS and ZK immediately prior to starting Accumulo,
Accumulo may be waiting on those to finish starting first. You may see some
indication of this in the Accumulo logs.

The other thing is... it can take about 30 seconds for the ZooKeeper lock
from a previously killed Master to disappear, if you're killing and
restarting it.

The shell may be able to start faster with the --disable-tab-completion
option.


On Thu, May 25, 2017 at 10:04 PM o haya  wrote:

> Hi,
>
> I followed the procedure on this page to stand up my test Accumulo 1.8.1
> instance:
>
>
> https://www.digitalocean.com/community/tutorials/how-to-install-the-big-data-friendly-apache-accumulo-nosql-database-on-ubuntu-14-04
>
> Everything seems to be working correctly.  To startup, I:
>
> - Run /apps/hadoop/sbin/start-dfs.sh
> - Run /apps/zookeeper/bin/zkServer.sh start
> - Run /apps/accumulo/bin/start-all.sh
>
> The accumulo startup seems to run all right and I can get to the Accumulo
> website, but the Master is not running.  If I wait awhile and check the
> website again, then everything is running, including Master.
>
> Is it normal for the Master to take awhile to start running?  Is there
> something I can do to get it to start faster?
>
> It seems like I cannot start the Accumulo shell until the Master is
> running...
>
> Thanks,
> Jim
>


Re: ~delhdfs entries in metadata table [SEC=UNOFFICIAL]

2017-05-30 Thread Christopher
The tablet will split if it gets too big. You can manually add a split
point if you want it to split sooner.
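
If you do want to isolate them, a sketch using the Java API (the split point is
illustrative; the ~del rows sort after the per-table tablet entries, so a single
split at "~" would separate them):

SortedSet<Text> splits = new TreeSet<>();
splits.add(new Text("~"));  // ~del... rows sort after the regular tablet entries
conn.tableOperations().addSplits("accumulo.metadata", splits);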

These entries should go away on their own if the accumulo-gc service is
running. If it has died, you should check the logs to find out why, and
then restart it when you can. If the accumulo-gc service is running but the
files aren't going away, you should check the logs to determine why.

On Wed, May 31, 2017 at 12:23 AM Dickson, Matt MR <
matt.dick...@defence.gov.au> wrote:

> *UNOFFICIAL*
> Hi,
>
> I have in excess of 500K entries in the metadata table that look like:
>
> ~delhdfs: //root-**/accumulo/tables/**/...rf : []
>
> Is this number of records normal?
>
> I'm concerned that these are all hosted on a single tablet so wanted to
> either split the tablet or know if it is safe to delete these?
>
> Thanks in advance.
> Matt
>


Re: maximize usage of cluster resources during ingestion

2017-07-05 Thread Christopher
Huge GC pauses can be mitigated by ensuring you're using the Accumulo
native maps library.
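
For reference, the property involved, written in the same notation as the
tserver settings quoted below (this assumes the native map shared library has
actually been built and installed on each tablet server; otherwise the setting
has no effect):

"tserver.memory.maps.native.enabled": "true"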

On Wed, Jul 5, 2017 at 11:05 AM Cyrille Savelief 
wrote:

> Hi Massimilian,
>
> Using a MultiTableBatchWriter we are able to ingest about 600K entries/s
> on a single node (30Gb of memory, 8 vCPU) running Hadoop, Zookeeper,
> Accumulo and our ingest process. For us, "valleys" came from huge GC pauses.
>
> Best,
>
> Cyrille
>
> On Wed, Jul 5, 2017 at 14:37, Massimilian Mattetti
> wrote:
>
>> Hi all,
>>
>> I have an Accumulo 1.8.1 cluster made by 12 bare metal servers. Each
>> server has 256GB of Ram and 2 x 10 cores CPU. 2 machines are used as
>> masters (running HDFS NameNodes, Accumulo Master and Monitor). The other 10
>> machines has 12 Disks of 1 TB (11 used by HDFS DataNode process) and are
>> running Accumulo TServer processes. All the machines are connected via a
>> 10Gb network and 3 of them are running ZooKeeper. I have run some heavy
>> ingestion tests on this cluster but I have never been able to reach more
>> than 20% CPU usage on each Tablet Server. I am running an ingestion
>> process (using batch writers) on each data node. The table is pre-split in
>> order to have 4 tablets per tablet server. Monitoring the network I have
>> seen that data is received/sent from each node with a peak rate of about
>> 120MB/s / 100MB/s while the aggregated disk write throughput on each tablet
>> server is around 120MB/s.
>>
>> The table configuration I am playing with are:
>> "table.file.replication": "2",
>> "table.compaction.minor.logs.threshold": "10",
>> "table.durability": "flush",
>> "table.file.max": "30",
>> "table.compaction.major.ratio": "9",
>> "table.split.threshold": "1G"
>>
>> while the tablet server configuration is:
>> "tserver.wal.blocksize": "2G",
>> "tserver.walog.max.size": "8G",
>> "tserver.memory.maps.max": "32G",
>> "tserver.compaction.minor.concurrent.max": "50",
>> "tserver.compaction.major.concurrent.max": "8",
>> "tserver.total.mutation.queue.max": "50M",
>> "tserver.wal.replication": "2",
>> "tserver.compaction.major.thread.files.open.max": "15"
>>
>> the tablet server heap has been set to 32GB
>>
>> From the Monitor UI (screenshot of the ingest rate graph omitted):
>>
>> As you can see I have a lot of valleys in which the ingestion rate
>> reaches 0.
>> What would be a good procedure to identify the bottleneck which causes
>> the 0 ingestion rate periods?
>> Thanks.
>>
>> Best Regards,
>> Max
>>
>>


Re: Kerberos ticket renewal

2017-07-10 Thread Christopher
It certainly sounds like the same issue. I'd recommend upgrading to the
latest 1.7.3 (currently the latest 1.7 version) to include all the bugs
we've found and fixed in that release line.

On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan 
wrote:

> I'm using Accumulo 1.7.0 and finding that after some period of time
> (>8 hours, <3 days - happened over the weekend) my ingest fails with
> errors regarding "Failed to find any Kerberos tgt". My guess is that
> the ticket from the keytab has expired, and needs to be renewed - from
> memory, I had seen a Kerberos tgt renewer thread running in my client,
> so assumed it happened automagically. Is that the case? Perhaps I am
> hitting this bug? https://issues.apache.org/jira/browse/ACCUMULO-4069
>
> Thanks,
>
> James
>


Re: Another VisibilityEvaluator question

2017-08-14 Thread Christopher
Not in the current implementation. As I understand it, though, you are
writing an alternate VisibilityEvaluator. The IteratorEnvironment that is
passed in contains a reference to the current table's configuration. It
doesn't have the table name or id, but it does have its configuration, so
if you were to insert a configuration property into that particular table,
you could read it in the VisibilityFilter, and modify that to pass it to
the VisibilityEvaluator.

Alternatively, you could update the IteratorEnvironment interface to
include a table ID getter (or name, but ID is more reliable). This would
avoid requiring you to put anything in the table's configuration, but may
require you to modify more code. (Might be a good idea to add this
upstream; I created an issue:
https://issues.apache.org/jira/browse/ACCUMULO-4695)

Either strategy would involve getting some information from the environment
from within the VisibilityFilter, and passing that along to the
VisibilityEvaluator.
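
A rough sketch of the first approach (names are illustrative; it assumes your
filter holds on to the IteratorEnvironment handed to init()):

// e.g. in the custom VisibilityFilter's init(...), read a table-scoped custom property
// set beforehand with: config -t mytable -s table.custom.visibilityHint=<value>
Map<String,String> custom =
    env.getConfig().getAllPropertiesWithPrefix(Property.TABLE_ARBITRARY_PROP_PREFIX);
String hint = custom.get("table.custom.visibilityHint");
// ... pass 'hint' along to the alternate VisibilityEvaluator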


On Mon, Aug 14, 2017 at 12:40 AM o haya  wrote:

> Hi,
>
> I am wondering if there is a way for code inside the VisibilityEvaluator
> to get the name of the "current" table (the table that is being processed)?
>
> Thanks,
> Jim
>


Re: Another VisibilityEvaluator question

2017-08-15 Thread Christopher
It isn't currently being passed. You'd have to modify the VisibilityFilter
to pass it. As I said, both strategies involve modifying the
VisibilityFilter to pass something from it to the VisibilityEvaluator.

On Tue, Aug 15, 2017 at 8:10 AM o haya  wrote:

> Hi Christopher,
>
> I may consider your second suggestion, but really prefer to "minimize" the
> parts that I am working with (if you know what I mean :)).
>
> But, about the comment in your first paragraph, I don't see where a
> reference to the IteratorEnvironment is being passed into the
> VisibilityEvaluator?
>
> How can I access that from within the VE?
>
> Thanks again,
> Jim
>
> ----
> On Mon, 8/14/17, Christopher  wrote:
>
>  Subject: Re: Another VisibilityEvaluator question
>  To: user@accumulo.apache.org, "o haya" 
>  Date: Monday, August 14, 2017, 7:43 PM
>
>  Not in the
>  current implementation. As I understand it, though, you are
>  writing an alternate VisibilityEvaluator. The
>  IteratorEnvironment that is passed in contains a reference
>  to the current table's configuration. It doesn't
>  have the table name or id, but it does have its
>  configuration, so if you were to insert a configuration
>  property into that particular table, you could read it in
>  the VisibilityFilter, and modify that to pass it to the
>  VisibilityEvaluator.
>
>  Alternatively, you could update the
>  IteratorEnvironment interface to include a table ID getter
>  (or name, but ID is more reliable). This would avoid
>  requiring you to put anything in the table's
>  configuration, but may require you to modify a more code.
>  (Might be a good idea to add this upstream; I created an
>  issue: https://issues.apache.org/jira/browse/ACCUMULO-4695)
>
>  Either strategy would involve
>  getting some information from the environment from within
>  the VisibilityFilter, and passing that along to the
>  VisibilityEvaluator.
>
>
>  On Mon, Aug
>  14, 2017 at 12:40 AM o haya 
>  wrote:
>  Hi,
>
>
>
>  I am wondering if there is a way for code inside the
>  VisibilityEvaluator to get the name of the
>  "current" table (the table that is being
>  processed)?
>
>
>
>  Thanks,
>
>  Jim
>
>
>


[NOTICE] Accumulo git repositories have moved

2017-08-29 Thread Christopher
Hello Accumulo developers and users,

Accumulo has moved its source code repositories from git.apache.org /
git-wip-us.apache.org to gitbox.apache.org. If you were using the mirrors
at GitHub.com, then nothing has changed for you. Otherwise, you should
update your git remote to point to the new location in all of your git
clones, mirrors, and any Jenkins jobs or tools which check out the source
code.

Example:
  `git remote -v` shows "origin" remote with a URL of:
https://git-wip-us.apache.org/repos/asf/accumulo OR git://
git.apache.org/accumulo.git
  You should execute: `git remote set-url origin
https://gitbox.apache.org/repos/asf/accumulo` to update your local clone.

Example:
  `git remote -v` shows "mirror" remote with a URL of:
https://github.com/apache/accumulo.git OR git@github.com:apache/accumulo.git

  Congratulations. You do not have to change a thing. The GitHub repo
locations have not changed.

Here's a full list of our current repositories for cloning (.git suffix is
optional; clone works either way):

https://gitbox.apache.org/repos/asf/accumulo.git
https://gitbox.apache.org/repos/asf/accumulo-bsp.git
https://gitbox.apache.org/repos/asf/accumulo-examples.git
https://gitbox.apache.org/repos/asf/accumulo-instamo-archetype.git
https://gitbox.apache.org/repos/asf/accumulo-pig.git
https://gitbox.apache.org/repos/asf/accumulo-testing.git
https://gitbox.apache.org/repos/asf/accumulo-website.git
https://gitbox.apache.org/repos/asf/accumulo-wikisearch.git

And, if you prefer, here's our GitHub mirrors:
https://github.com/apache?q=accumulo

Links to these new locations have already been updated on our website at
https://accumulo.apache.org
Email the dev@accumulo.apache.org list if you have any questions.


Re: IPv6-only hosts for MAC

2017-08-29 Thread Christopher
I would love to have Accumulo work well with IPv6, but unfortunately, I
haven't been able to try it, and don't have the right test environment to
do so. There's no reason we shouldn't support it, though.

On Tue, Aug 29, 2017 at 5:35 PM Adam J. Shook  wrote:

> Howdy folks,
>
> Anyone have any experience running Accumulo on IPv6-only hosts?
> Specifically the MiniAccumloCluster?
>
> There is an open issue in the Presto-Accumulo connector (see [1] and [2])
> saying the MAC doesn't work in an IPv6-only environment, and the PR comment
> thread has some suggestions to change the JVM arguments within the server
> and client code to prefer IPv6 addresses.
>
> From a brief look at the Accumulo source code, this might require changes
> to make MAC's JVM arguments configurable, changes to the client code, or a
> different approach to testing the Presto/Accumulo connector altogether.
>
> Any pointers in the right direction would be appreciated.  Looking to get
> a heading before I dig myself into a hole on this one.
>
> [1] Issue: https://github.com/prestodb/presto/issues/8789
> [2] PR and comment thread: https://github.com/prestodb/presto/pull/8869
>
> Thanks,
> --Adam
>


Re: Help

2017-09-29 Thread Christopher
Accumulo is written primarily in Java, which is platform-independent. So,
it may be possible to run Accumulo in Windows. However, I do not know
anybody actually doing this, and there will certainly be roadblocks and
pitfalls along the way.

For one, the scripts provided in Accumulo's distribution will almost
certainly not work on Windows. So, you may have to create your own way to
launch the Java applications and create any necessary environment variables
for class path, configuration file locations, logging, etc.

Another possible pitfall: although Java is platform independent, in theory,
it is possible to write code in Java which makes assumptions that are only
valid on Linux or other Unix-like environments, and it's probable that
we've done that in a few places. If this is the case, then that's something
we could fix, given a bug report which brings it to our attention. So, if
you do end up trying this, please let the community know how it went.
Although I wouldn't recommend it for production Accumulo instances, there
may be a use case for it (development?) that we could support if there were
contributors regularly using Windows who could provide bug reports and
testing feedback.

On Fri, Sep 29, 2017 at 10:03 PM Yue Zhao  wrote:

> Hello,
>
>
>
> Can I use Accumulo on Windows 7? Or it only works with Linux?
>
>
>
> Thanks,
>
>
>
> Yue Zhao
>
> Team Leader, Software Development
>
> Physical Optics Corporation 
>
> 1845 W. 205th St., Torrance, CA 90501
>
> Phone: (310)320-3088 ext. 256
>
>
> This e-mail message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information. Any unauthorized
> review, use, disclosure or distribution is prohibited. If you are not the
> intended recipient, please contact the sender by reply e-mail and destroy
> all copies of the original message. EXPORT CONTROL NOTICE: This e-mail may
> contain technical data whose export, transfer, and /or disclosure may be
> controlled by the US international Traffic in Arms Regulation (ITAR) 22 CFR
> part 120-130 or the Export Administration Regulations (Commerce.)
>


Re: Backup and Recovery

2017-10-03 Thread Christopher
Hi Mike. This is a great question. Accumulo has several options for backup.

Accumulo is backed by HDFS for persisting its data on disk. It may be
possible to use S3 directly at this layer. I'm not sure what the current
state is for doing something like this, but a brief Googling for "HDFS on
S3" shows a few historical projects which may still be active and mature.

Accumulo also has a replication feature to automatically mirror live ingest
to a pluggable external receiver, which could be a backup service you've
written to store data in S3. Recovery would depend on how you store the
data in S3. You could also implement an ingest system which stores data to
a backup as well as to Accumulo, to handle both live and bulk ingest.

Accumulo also has an "exporttable" feature, which exports the metadata for
a table, along with a list of files in HDFS for you to back up to S3 (or
another file system). Recovery involves using the "importtable" feature
which recreates the metadata, and bulk importing the files after you've
moved them from your backup location back onto HDFS.
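
A sketch of that flow using the Java API (table and directory names are
placeholders; the exported table has to be offline, so a clone is usually
exported rather than the live table):

conn.tableOperations().clone("mytable", "mytable_backup", true,
    Collections.<String,String>emptyMap(), Collections.<String>emptySet());
conn.tableOperations().offline("mytable_backup", true);
conn.tableOperations().exportTable("mytable_backup", "/backups/mytable_export");
// copy that directory, plus the RFiles listed in its distcp.txt, to S3 or elsewhere;
// to recover, put them back on HDFS and then:
conn.tableOperations().importTable("mytable_restored", "/backups/mytable_export");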

This is just a rough outline of 3 possible solutions. I don't know which
(if any) would match your requirements best. There may be many other
solutions as well.

On Tue, Oct 3, 2017 at 4:10 PM  wrote:

> Please forgive the newbie question. What options are there for backup and
> recovery of accumulo data?
>
>
>
> Ideally I would like something that would replicate to S3 in realtime.
>
>


Re: Backup and Recovery

2017-10-03 Thread Christopher
Oh, sorry, no. That's not the case. I did not mean to mislead. You also
need to back up the metadata from ZooKeeper for a complete backup. We have
a utility for that, which I believe is mentioned in the documentation. If
not, that's a documentation bug and we should add it. (Sorry, unable to
check at the moment, but please file a bug if you can't find it.)

On Tue, Oct 3, 2017 at 4:47 PM  wrote:

> So if I backup the HDFS I have a backup of accumulo? There isn’t any other
> data that I’d need to grab?
>
>
>
> *From:* Christopher [mailto:ctubb...@apache.org]
> *Sent:* Tuesday, October 3, 2017 1:41 PM
> *To:* user@accumulo.apache.org
> *Subject:* Re: Backup and Recovery
>
>
>
> Hi Mike. This is a great question. Accumulo has several options for backup.
>
> Accumulo is backed by HDFS for persisting its data on disk. It may be
> possible to use S3 directly at this layer. I'm not sure what the current
> state is for doing something like this, but a brief Googling for "HDFS on
> S3" shows a few historical projects which may still be active and mature.
>
> Accumulo also has a replication feature to automatically mirror live
> ingest to a pluggable external receiver, which could be a backup service
> you've written to store data in S3. Recovery would depend on how you store
> the data in S3. You could also implement an ingest system which stores data
> to a backup as well as to Accumulo, to handle both live and bulk ingest.
>
> Accumulo also has an "exporttable" feature, which exports the metadata for
> a table, along with a list of files in HDFS for you to back up to S3 (or
> another file system). Recovery involves using the "importtable" feature
> which recreates the metadata, and bulk importing the files after you've
> moved them from your backup location back onto HDFS.
>
> This is just a rough outline of 3 possible solutions. I don't know which
> (if any) would match your requirements best. There may be many other
> solutions as well.
>
> On Tue, Oct 3, 2017 at 4:10 PM  wrote:
>
> Please forgive the newbie question. What options are there for backup and
> recovery of accumulo data?
>
>
>
> Ideally I would like something that would replicate to S3 in realtime.
>
>
>
>


Re: Accumulo as a Column Storage

2017-10-19 Thread Christopher
There's no expected scaling issue with having each column qualifier in its
own unique column family, regardless of how large the number of these
becomes. I've ingested random data like this before for testing, and it
works fine.

However, there may be an issue trying to create a very large number of
locality groups. Locality groups are named, and you must explicitly
configure them to store particular column families. That configuration is
typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
and/or in your conf/accumulo-site.xml file) does not scale as well as the
data storage (HDFS) does. Where, and how, it will break, is probably
system-dependent and not directly known (at least, not known by me). I
would expect dozens, and possibly hundreds, of locality groups to work
okay, but thousands seems like it's too many (but I haven't tried).
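
For a modest number of groups, a sketch with the Java API (family and group
names are placeholders; the compaction rewrites existing files into the new
layout):

Map<String,Set<Text>> groups = new HashMap<>();
groups.put("lg_price", Collections.singleton(new Text("price")));
groups.put("lg_volume", Collections.singleton(new Text("volume")));
conn.tableOperations().setLocalityGroups("mytable", groups);
conn.tableOperations().compact("mytable", null, null, true, false);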

On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar  wrote:

> That makes sense. So this means that there are no limits or concerns on
> having, potentially, a large number of column families (holding only one
> column qualifier), right?
>
> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser  wrote:
>
>> Yup, that's the intended use case. You have the flexibility to determine
>> what column families make sense to group together. Your only "cost" in
>> changing your mind is the speed at which you can re-compact your data.
>>
>> There is one concern which comes to mind. Though making many locality
>> groups does increase the speed at which you can read from specific columns,
>> it decreases the speed at which you can read from _all_ columns. So, you
>> can do this trick to make Accumulo act more like a columnar database, but
>> beware that you're going to have an impact if you still have a use-case
>> where you read more than just one or two columns at a time.
>>
>> Does that make sense?
>>
>>
>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>
>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>> together on disk which would make it more like  a column-oriented database.
>>> Considering that "locality groups" are per column family, I was wondering
>>> what if we treat column families like column qualifiers (creating one
>>> column family per each qualifier) and assigning each to a different
>>> locality group. This way all the data in a given column will be next to
>>> each other on disk which makes it easier for analytical applications to
>>> query the data.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Mohammad
>>>
>>>
>


Re: Connecting java client to Accumulo VM

2017-11-09 Thread Christopher
So, Accumulo TServers publish their hostname from their config into the
Accumulo metadata. So, yes, this hostname must be reachable by clients as
well, because that's how clients will identify it in the Accumulo metadata.

I'm not sure what the deal was with the JAVA_HOME variable. It's probably
one of the launch scripts which ssh'd to the hostname in the config to
start the tserver. When it's set to localhost, it probably doesn't ssh, but
just starts it directly. My guess is that .profile is not read when ssh'ing
because it's not creating a logon session and instead just executing the
given command. You could try putting it in .bashrc instead (assuming your
default shell is bash).
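
For example (the JDK path is only illustrative; adjust it to your installation):

# in ~/.bashrc on each node, or directly in conf/accumulo-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64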

On Thu, Nov 9, 2017 at 11:42 AM Geoffry Roberts 
wrote:

> All,
>
> Muchas gracias for the help.
>
> When I had masters/slaves populated with "localhost", 9997 was listening
> but not remotely.  When I had m/s populated with the IP address,  9997 was
> not listening and the log said JAVA_HOME was not set and that I should set
> it in accumulo-env.sh.  I did that and 9997 began listening remotely and
> was accessible.  Let us then, declare victory.
>
> Two points, however, I should like to pass along:
>
>
>1. Why the problem with JAVA_HOME?  I have JAVA_HOME set in .profile.
>When m/s is "localhost" no problem; but when "hostname" (from /etc/hosts),
>big problem.
>2. Before I could get my remote client to connect, I had to make a
>mirror entry in the host's /etc/hosts file.  i.e. Both host and guest
>machines needed the same entry.  This is because the client tries to work
>with the name from the guest's /etc/hosts and not the IP.
>
> Anyhow, I am off to the races with Accumulo.
>
> On Wed, Nov 8, 2017 at 5:49 PM, Edward Gleeck  wrote:
>
>> Yep should work. would suggest checking the logs at this point to see
>>  what’s causing the failure.  If it’s not starting up there would be
>> exceptions thrown by the service.
>>
>> On Wed, Nov 8, 2017 at 5:36 PM Geoffry Roberts 
>> wrote:
>>
>>> I tried the IP address (a 192 number) but the same result--no 9997.
>>> Using said IP I can access from either the host or from within the guest.
>>>
>>> So far nothing works in master/slaves except localhost.
>>>
>>> I gather this is supposed to work correct?
>>>
>>> On Wed, Nov 8, 2017 at 5:16 PM, Edward Gleeck  wrote:
>>>
 You wouldn't want the 0.0.0.0 on your /etc/hosts as this wouldn't be
 valid. I don't recall exactly which configuration file under
 $ACCUMULO_CONF_DIR you would want this in as Josh pointed out, but if you
 were to go the /etc/host route, you want to put the IP address of that
 interface VM host. for example /etc/hosts:

 192.168.56.101 localhost localhost.localdomain

 HTH



 On Wed, Nov 8, 2017 at 4:13 PM, Geoffry Roberts >>> > wrote:

> I gave your suggestion a try.  I made an entry in /etc/hosts that
> resolves to 0.0.0.0 then set that name in master and slaves.  (I am 
> running
> single node for now.). The upshot is port 9997 does not appear as 
> listening
> at all.  If I change back to localhost, then it appears again.   My guess
> is the tablet server only starts when it's port is localhost.
>
> Am I using Accumulo correctly?  Is it not designed to be accessed
> remotely?
>
>
>
> On Wed, Nov 8, 2017 at 2:20 PM, Josh Elser 
> wrote:
>
>> Accumulo chooses the network interface to bind given the resolution
>> of the hostname that you provide in the "hosts" files in 
>> ACCUMULO_CONF_DIR.
>>
>> If you have "localhost" (the default) still in the files (e.g.
>> masters, slaves), this presumably resolves to 127.0.0.1 which will result
>> in Accumulo not accepting connections from your VM's network adapter.
>>
>> A quick hack would be to put "0.0.0.0" in those files instead of
>> "localhost". I think the Accumulo scripts only have the ability to 
>> override
>> the bound interface for the Monitor, not all processes, to be 0.0.0.0. 
>> You
>> could also use a hostname you define in /etc/hosts that binds to the 
>> proper
>> interface instead (which would be a bit more like reality).
>>
>> On 11/8/17 10:43 AM, Geoffry Roberts wrote:
>>
>>> All,
>>>
>>> I have used Accumulo before, but a few versions ago (1.5.1), maybe
>>> something has changed.  Also, I've never before run it in a VM.
>>>
>>> I am running Accumulo from withn a VM and attempting to connect from
>>> without.  I am getting a complaint regarding port 9997, which, within 
>>> the
>>> VM, is listening on 127.0.0.1:9997 .
>>> Apparently, I need to get it onto 0.0.0.0:9997 .
>>> Am I correct?
>>>
>>> Hadoop 2.6.2
>>> Zookeeper 3.4.10
>>> Accumulo 1.8.1
>>> Thrift 0.10.0
>>> Ubuntu 16.04 as a VBox guest
>>> OSX 10.12.06 as the host
>>

Re: Retrieve all keys in a single table

2017-12-12 Thread Christopher
Hi,

If you know all possible authorizations in advance, then you can grant
those to a particular user. You may be able to write and use a custom
Authorizations security provider, which ensures a user always has every
authorization encountered in a visibility string while visibility labels
are being parsed for filtering.

A major compaction iterator is a better choice than the scanning API.
Iterators applied at the major compaction scope have access to all the
underlying data, unfiltered by visibilities, and it would be capable of
generating indexes of keys in parallel... but you'd have to rely on some
external mechanism for aggregating those indexes from all the compactions
across the table.
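
Mechanically, that is just an iterator attached only at the majc scope; a
sketch (MyKeyIndexingIterator is a hypothetical class you would write, and the
priority and names are placeholders):

IteratorSetting indexer = new IteratorSetting(30, "keyIndexer", MyKeyIndexingIterator.class);
conn.tableOperations().attachIterator("mytable", indexer,
    EnumSet.of(IteratorUtil.IteratorScope.majc));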

However, if all you want to do is see if entries have changed over time,
there's a much simpler way to do that. You can rely on the
timestamp/versioning field of Accumulo, to allow multiple versions over
time. You may be able to write a Combiner that aggregates the different
versions into a single version which tracks how many times it has been
changed.
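
One concrete sketch of that idea, under the assumption that your ingest also
writes a value of "1" into a dedicated change-count column on every update
(the table, column, and iterator names below are placeholders): the built-in
SummingCombiner then collapses the versions into a running count.

IteratorSetting counter = new IteratorSetting(15, "changeCount", SummingCombiner.class);
SummingCombiner.setEncodingType(counter, LongCombiner.Type.STRING);
SummingCombiner.setColumns(counter,
    Collections.singletonList(new IteratorSetting.Column("meta", "changes")));
conn.tableOperations().attachIterator("mytable", counter);
// priority 15 is applied before the default VersioningIterator (20), so all versions are summed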

Another option would be to use something like Apache Fluo, in your ingest,
so that you can incrementally update the counts for how often an entry is
modified, during ingest. That's the kind of thing Apache Fluo was designed
for.

On Mon, Dec 11, 2017 at 8:14 PM Edward Armes  wrote:

> Hi there,
>
> I was wondering if it was possible in accumulo to retrieve every key in
> the table regardless of the visibility and classifiers, via the Java API. If
> not would it be possible via an iterator? . The idea here would be to build
> an index of the keys in accumulo to see when a record is changed over time
> in a given accumulo table.
>
> Thanks
>


Re: BloomFilter error: stream is closed

2017-12-21 Thread Christopher
I'm not an expert on bloom filters, but I asked a colleague and they think
that what may be happening is that the file is opened for read (to support
a scan, probably), but then the file is closed before the background bloom
filter thread can load the bloom filters to optimize future queries of that
file.

This could happen for a number of reasons. Keep in mind that this
is not an ERROR or WARN message, but a DEBUG one, so, it may be safe to
ignore, or if it happens frequently, it may indicate that there's room for
further system tuning to optimize your use of bloom filters.

Some things you can try are:

* Modify `tserver.scan.files.open.max` to increase it so that files don't
get evicted and closed as quickly.
* Modify `tserver.files.open.idle` to increase the amount of idle time
after the most recently read file before closing it (in case the background
bloom filter threads need more time to load bloom filters, and so it can
still be open the next time it is read).
* Modify `tserver.bloom.load.concurrent.max` to increase the number of
background threads for loading bloom filters (in case they aren't getting
loaded fast enough to be used). Or, set it to 0 to force it to load in the
foreground instead of the background.
* Modify other `table.bloom.*` parameters to make bloom filters smaller so
they load faster or are utilized more optimally for your work load and
access patterns.

Other possibilities might involve changing how big your RFiles are, or the
compaction ratio, or other settings to try to reduce the number of files
open concurrently on the tablet servers.

On Thu, Dec 21, 2017 at 10:34 AM vLex Systems  wrote:

> Hi
>
> We've activated the bloomfilter on an accumulo table to see if it
> helped with the CPU usage and we're seeing these messages in our
> tserver debug log:
>
> 2017-12-20 12:08:28,800 [impl.CachableBlockFile] DEBUG: Error full
> blockRead for file
> hdfs://10.0.32.143:9000/accumulo/tables/6/t-013/F0008k42.rf for
> block acu_bloom
> java.io.IOException: Stream is closed!
> at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1404)
> at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
> at
> org.apache.accumulo.core.file.rfile.bcfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:98)
> at
> org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
> at
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
> at
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at java.io.DataInputStream.readFully(DataInputStream.java:195)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.cacheBlock(CachableBlockFile.java:335)
> at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:318)
> at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:368)
> at
> org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:137)
> at
> org.apache.accumulo.core.file.rfile.RFile$Reader.getMetaStore(RFile.java:974)
> at
> org.apache.accumulo.core.file.BloomFilterLayer$BloomFilterLoader$1.run(BloomFilterLayer.java:211)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> 2017-12-20 12:08:28,801 [file.BloomFilterLayer] DEBUG: Can't open
> BloomFilter, file closed : Stream is closed!
>
>
> Does anyone know what these mean or what is causing them?
>
> Thank you.
>


Re: Large number of used ports from tserver

2018-01-24 Thread Christopher
I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
Accumulo version you're running. I'm assuming you verified that it was the
TabletServer process holding these TCP sockets open using `netstat -p` and
cross-referencing the PID with `jps -ml` (or similar)? Are you able to
confirm based on the port number that these were Thrift connections or
could they be ZooKeeper or Hadoop connections? Do you have any special
non-default Accumulo RPC configuration (SSL or SASL)?

On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook  wrote:

> Hello all,
>
> Has anyone come across an issue with a TabletServer occupying a large
> number of ports in a CLOSED_WAIT state?  'Normal' number of used ports on a
> 12-node cluster are around 12,000 to 20,000 ports.  In one instance, there
> were over 68k and it was affecting other applications from getting a free
> port and they would fail to start (which is how we found this in the first
> place).
>
> Thank you,
> --Adam
>


Re: Large number of used ports from tserver

2018-01-25 Thread Christopher
Interesting. It's possible we're mishandling an IOException from DFSClient
or something... but it's also possible there's a bug in DFSClient
somewhere. I found a few similar issues from the past... some might still
be not fully resolved:

https://issues.apache.org/jira/browse/HDFS-1836
https://issues.apache.org/jira/browse/HDFS-2028
https://issues.apache.org/jira/browse/HDFS-6973
https://issues.apache.org/jira/browse/HBASE-9393

The HBASE issue is interesting, because it indicates a new HDFS feature in
2.6.4 to clear readahead buffers/sockets (
https://issues.apache.org/jira/browse/HDFS-7694). That might be a feature
we're not yet utilizing, but it would only work on a newer version of HDFS.

I would probably also try to grab some jstacks of the tserver, to try to
figure out what HDFS client code paths are being taken to see where the
leak might be occurring. Also, if you have any debug logs for the tserver,
that might help. There might be some DEBUG or WARN items that indicate
retries or other failures that are occurring, but perhaps handled
improperly.

It's probably less likely, but it could also be a Java or Linux issue. I
wouldn't even know where to begin debugging at that level, though, other
than to check for OS updates.  What JVM are you running?

It's possible it's not a leak... and these are just getting cleaned up too
slowly. That might be something that can be tuned with sysctl.

On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook  wrote:

> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
> 1.8.1.  I'm using `lsof -i` and grepping for the tserver PID to list all
> the connections.  Just now there are ~25k connections for this one tserver,
> of which 99.9% of them are all writing to various DataNodes on port 50010.
> It's split about 50/50 for connections that are CLOSED_WAIT and ones that
> are ESTABLISHED.  No special RPC configuration.
>
> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser  wrote:
>
>> +1 to looking at the remote end of the socket and see where they're
>> going/coming to/from. I've seen a few HDFS JIRA issues filed about sockets
>> left in CLOSED_WAIT.
>>
>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>
>> (
>> https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>> covers some of the technical details)
>>
>> On 1/24/18 6:37 PM, Christopher wrote:
>>
>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>> Accumulo version you're running. I'm assuming you verified that it was the
>>> TabletServer process holding these TCP sockets open using `netstat -p` and
>>> cross-referencing the PID with `jps -ml` (or similar)? Are you able to
>>> confirm based on the port number that these were Thrift connections or
>>> could they be ZooKeeper or Hadoop connections? Do you have any special
>>> non-default Accumulo RPC configuration (SSL or SASL)?
>>>
>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook >> <mailto:adamjsh...@gmail.com>> wrote:
>>>
>>> Hello all,
>>>
>>> Has anyone come across an issue with a TabletServer occupying a
>>> large number of ports in a CLOSED_WAIT state?  'Normal' number of
>>> used ports on a 12-node cluster are around 12,000 to 20,000 ports.
>>>In one instance, there were over 68k and it was affecting other
>>> applications from getting a free port and they would fail to start
>>> (which is how we found this in the first place).
>>>
>>> Thank you,
>>> --Adam
>>>
>>>
>


Re: Question on how Accumulo binds to Hadoop

2018-02-01 Thread Christopher
Normally, you'd set up Accumulo to use the HDFS volume in your
accumulo-site.xml file for your servers by setting the instance.volumes
field (in your case to the value of 'hdfs://haz0-m:8020/accumulo' or
similar).
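
For example, in accumulo-site.xml on the servers (host and port taken from the
hdfs-site.xml you quoted; adjust as needed):

<property>
  <name>instance.volumes</name>
  <value>hdfs://haz0-m:8020/accumulo</value>
</property>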

The shell typically connects to either ZooKeeper using client configuration
files or command-line options as its entry point. Run it with the '--help'
or '-?' options to see the available options.

If it has read permission for your accumulo-site.xml file and the Accumulo
conf directory where this file is located is on its class path, the shell
may fall back to using your hdfs-site.xml or your accumulo-site.xml to try
to figure out things using HDFS... but that's mostly a
backwards-compatible/legacy mode. It's better if you explicitly specify on
the command line the ZK entry point.

On Thu, Feb 1, 2018 at 10:59 AM Geoffry Roberts 
wrote:

> Thanks Adam, that worked.  Accumulo starts but when I try the shell I get:
>
> ERROR: unable obtain instance id at file:/accumulo/instance_id
>
> $ hadoop fs -ls /
>
>
> Shows the id file and the Hadoop configuration directory is on the
> Accumulo class path according to accumulo-site.xml.
>
> Is the shell looking in the local file system or in hdfs?  I never had
> this problem until I started up with Google.
>
> Thanks
>
> On Wed, Jan 31, 2018 at 5:06 PM, Adam J. Shook 
> wrote:
>
>> Yes, it does use RPC to talk to HDFS.  You will need to update the value
>> of instance.volumes in accumulo-site.xml to reference this address,
>> haz0-m:8020, instead of the default localhost:9000.
>>
>> --Adam
>>
>> On Wed, Jan 31, 2018 at 4:45 PM, Geoffry Roberts 
>> wrote:
>>
>>> I have a situation where Accumulo cannot find Hadoop.
>>>
>>> Hadoop is running and I can access hdfs from the cli.
>>> Zookeeper also says it is ok and I can log in using the client.
>>> Accumulo init is failing with a connection refused for localhost:9000.
>>>
>>> netstat shows nothing listening on 9000.
>>>
>>> Now the plot thickens...
>>>
>>> The Hadoop I am running is Google's Dataproc and the Hadoop installation
>>> is not my own.  I have already found a number of differences.
>>>
>>> Here's my question:  Does Accumulo use RPC to talk to Hadoop? I ask
>>> because of things like this:
>>>
>>> From hfs-site.xml
>>>
>>>   
>>>
>>> dfs.namenode.rpc-address
>>>
>>> haz0-m:8020
>>>
>>> 
>>>
>>>   RPC address that handles all clients requests. If empty then we'll get
>>>   the value from fs.default.name. The value of this property will take
>>>   the form of hdfs://nn-host1:rpc-port.
>>>
>>> 
>>>
>>>   
>>>
>>> Or does it use something else?
>>>
>>> Thanks
>>> --
>>> There are ways and there are ways,
>>>
>>> Geoffry Roberts
>>>
>>
>>
>
>
> --
> There are ways and there are ways,
>
> Geoffry Roberts
>


Re: Question on how Accumulo binds to Hadoop

2018-02-01 Thread Christopher
On Thu, Feb 1, 2018 at 2:00 PM Geoffry Roberts 
wrote:

> >> It's better if you explicitly specify on the command line the ZK entry
> point.
>
> Can you give an example?
>
>
bin/accumulo shell -u root -zh zoohost1:2181,zoohost2:2181,zoohost3:2181
-zi myInstance

You can also put a client configuration file containing the following in
~/.accumulo/client.conf:

instance.zookeeper.host=zoohost1:2181,zoohost2:2181,zoohost3:2181
instance.name=myInstance



> On Thu, Feb 1, 2018 at 12:54 PM, Christopher  wrote:
>
>> Normally, you'd set up Accumulo to use the HDFS volume in your
>> accumulo-site.xml file for your servers by setting the instance.volumes
>> field (in your case to the value of 'hdfs://haz0-m:8020/accumulo' or
>> similar).
>>
>> The shell typically connects to either ZooKeeper using client
>> configuration files or command-line options as its entry point. Run it with
>> the '--help' or '-?' options to see the available options.
>>
>> If it has read permission for your accumulo-site.xml file and the
>> Accumulo conf directory where this file is located is on its class path,
>> the shell may fall back to using your hdfs-site.xml or your
>> accumulo-site.xml to try to figure out things using HDFS... but that's
>> mostly a backwards-compatible/legacy mode. It's better if you explicitly
>> specify on the command line the ZK entry point.
>>
>> On Thu, Feb 1, 2018 at 10:59 AM Geoffry Roberts 
>> wrote:
>>
>>> Thanks Adam, that worked.  Accumulo starts but when I try the shell I
>>> get:
>>>
>>> ERROR: unable obtain instance id at file:/accumulo/instance_id
>>>
>>> $ hadoop fs -ls /
>>>
>>>
>>> Shows the id file and the Hadoop configuration directory is on the
>>> Accumulo class path according to accumulo-site.xml.
>>>
>>> Is the shell looking in the local file system or in hdfs?  I never had
>>> this problem until I started up with Google.
>>>
>>> Thanks
>>>
>>> On Wed, Jan 31, 2018 at 5:06 PM, Adam J. Shook 
>>> wrote:
>>>
>>>> Yes, it does use RPC to talk to HDFS.  You will need to update the
>>>> value of instance.volumes in accumulo-site.xml to reference this address,
>>>> haz0-m:8020, instead of the default localhost:9000.
>>>>
>>>> --Adam
>>>>
>>>> On Wed, Jan 31, 2018 at 4:45 PM, Geoffry Roberts <
>>>> threadedb...@gmail.com> wrote:
>>>>
>>>>> I have a situation where Accumulo cannot find Hadoop.
>>>>>
>>>>> Hadoop is running and I can access hdfs from the cli.
>>>>> Zookeeper also says it is ok and I can log in using the client.
>>>>> Accumulo init is failing with a connection refused for localhost:9000.
>>>>>
>>>>> netstat shows nothing listening on 9000.
>>>>>
>>>>> Now the plot thickens...
>>>>>
>>>>> The Hadoop I am running is Google's Dataproc and the Hadoop
>>>>> installation is not my own.  I have already found a number of differences.
>>>>>
>>>>> Here's my question:  Does Accumulo use RPC to talk to Hadoop? I ask
>>>>> because of things like this:
>>>>>
>>>>> From hfs-site.xml
>>>>>
>>>>>   
>>>>>
>>>>> dfs.namenode.rpc-address
>>>>>
>>>>> haz0-m:8020
>>>>>
>>>>> 
>>>>>
>>>>>   RPC address that handles all clients requests. If empty then
>>>>> we'll get
>>>>>
>>>>>   thevalue from fs.default.name.The value of this property will
>>>>> take the
>>>>>
>>>>>   form of hdfs://nn-host1:rpc-port.
>>>>>
>>>>> 
>>>>>
>>>>>   
>>>>>
>>>>> Or does it use something else?
>>>>>
>>>>> Thanks
>>>>> --
>>>>> There are ways and there are ways,
>>>>>
>>>>> Geoffry Roberts
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> There are ways and there are ways,
>>>
>>> Geoffry Roberts
>>>
>>
>
>
> --
> There are ways and there are ways,
>
> Geoffry Roberts
>


Re: Monitor keeps binding to localhost

2018-02-08 Thread Christopher
Glad you got it working. Also, watch out for
https://issues.apache.org/jira/browse/ACCUMULO-4776, which may bite you.

On Tue, Feb 6, 2018 at 9:44 AM Geoffry Roberts 
wrote:

> Today I restarted the whole Linux instance, and lo and behold, the monitor now
> binds to 0.0.0.0.  Go figure.
>
> On Tue, Feb 6, 2018 at 8:54 AM, Geoffry Roberts 
> wrote:
>
>> I tried uncommenting:
>>
>> export ACCUMULO_MONITOR_BIND_ALL=“true”
>>
>> I bounced Accumulo but netstat still shows port 9995 as being bound to
>> localhost.
>>
>> Do I need to do anything else?
>> Thanks
>> --
>> There are ways and there are ways,
>>
>> Geoffry Roberts
>>
>
>
>
> --
> There are ways and there are ways,
>
> Geoffry Roberts
>


[ANNOUNCE] Apache Accumulo 1.7.4

2018-03-28 Thread Christopher
The Apache Accumulo project is pleased to announce the release of Apache
Accumulo 1.7.4! This release contains many bug fixes, performance
improvements, build quality improvements, and more. This is a maintenance
(patch) release. Users of any previous 1.7.x release are strongly
encouraged to update as soon as possible to benefit from the improvements
with very little concern in change of underlying functionality.

Apache Accumulo® is a sorted, distributed key/value store that provides
robust, scalable data storage and retrieval. With Apache Accumulo, users
can store and manage large data sets across a cluster. Accumulo uses Apache
Hadoop's HDFS to store its data and Apache ZooKeeper for consensus.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The full release notes can be viewed at:
https://accumulo.apache.org/release/accumulo-1.7.4/

NOTICE: As development shifts to maintaining the newer versions of Accumulo
(1.9 and beyond), this will likely be the last maintenance release of the
1.7 series, so users should begin making plans to upgrade to 1.8 or later,
if they haven't already. (See the release notes for the version you are
upgrading to for guidance; 1.8.1's can be found at
https://accumulo.apache.org/release/accumulo-1.8.1/#upgrading)

--
The Apache Accumulo Team


[ANNOUNCE] Apache Accumulo 1.9.0

2018-04-24 Thread Christopher
The Apache Accumulo project is pleased to announce the release
of Apache Accumulo 1.9.0! This release contains many bug fixes,
performance improvements, security enhancements, build quality
improvements, and more.

This release is effectively a patch release for 1.8.x, and replaces the
1.8.x series. It was bumped to a minor release version, under
Semantic Versioning rules, to accommodate a new API addition in
order to deprecate the use of some third-party library types leaking
into our public API.

Users of any previous 1.8.x release are strongly encouraged to
update as soon as possible to benefit from the improvements with
very little concern in change of underlying functionality. Users of
versions 1.7 and earlier are encouraged to develop a timely upgrade
plan to transition to 1.9.0, as their maintenance schedules allow.

***

Apache Accumulo® is a sorted, distributed key/value store that
provides robust, scalable data storage and retrieval. With
Apache Accumulo, users can store and manage large data sets
across a cluster. Accumulo uses Apache Hadoop's HDFS to store
its data and Apache ZooKeeper for consensus.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The full release notes can be viewed at:
https://accumulo.apache.org/release/accumulo-1.9.0/

--
The Apache Accumulo Team


Re: Question on missing RFiles

2018-05-11 Thread Christopher
This is strange. I've only ever seen this when HDFS has reported problems,
such as missing blocks, or another obvious failure. What are your durability
settings (were WALs turned on)?
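
For reference, a quick way to check the relevant settings from the shell (a
sketch; the table name is a placeholder, and this assumes a 1.7+ shell where
table.durability controls write-ahead logging per table):

  config -t mytable -f table.durability
  config -t accumulo.metadata -f table.durability

With a durability of none, recent writes can be lost when a tserver dies.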

On Fri, May 11, 2018 at 12:45 PM Adam J. Shook  wrote:

> Hello all,
>
> On one of our clusters, there are a good number of missing RFiles from
> HDFS, however HDFS is not/has not reported any missing blocks.  We were
> experiencing issues with HDFS; some flapping DataNode processes that needed
> more heap.
>
> I don't anticipate I can do much besides create a bunch of empty RFiles
> (open to suggestions).  My question is, Is it possible that Accumulo could
> have written the metadata for these RFiles but failed to write it to HDFS?
> In which case it would have been re-tried later and the data was persisted
> to a different RFile?  Or is it an 'RFile is in Accumulo metadata if and
> only if it is in HDFS' situation?
>
> Accumulo 1.8.1 on HDFS 2.6.0.
>
> Thank you,
> --Adam
>


Re: Question on missing RFiles

2018-05-11 Thread Christopher
Oh, it occurs to me that this may be related to the WAL bugs that Keith
fixed for 1.9.1... which could affect the metadata table recovery after a
failure.
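
For anyone doing the comparison Mike describes below, a rough sketch of listing
the file references Accumulo knows about versus what is actually in HDFS
(assuming a 1.8.x metadata table name and the default /accumulo volume; the
table ID is a placeholder):

  scan -t accumulo.metadata -c file -np
  hadoop fs -ls -R /accumulo/tables/<tableId>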

On Fri, May 11, 2018 at 6:11 PM Michael Wall  wrote:

> Adam,
>
> Do you have GC logs?  Can you see if those missing RFiles were removed by
> the GC process?  That could indicate you somehow got old metadata info
> replayed.  Also, the rfiles increment so compare the current rfile names in
> the srv.dir directory vs what is in the metadata table.  Are the existing
> files numbered after the files in the metadata?  Finally, pick a few of the missing
> files and grep all your master and tserver logs to see if you can learn
> anything.  This sounds ungood.
>
> Mike
>
> On Fri, May 11, 2018 at 6:06 PM Christopher  wrote:
>
>> This is strange. I've only ever seen this when HDFS has reported
>> problems, such as missing blocks, or another obvious failure. What are your
>> durability settings (were WALs turned on)?
>>
>> On Fri, May 11, 2018 at 12:45 PM Adam J. Shook 
>> wrote:
>>
>>> Hello all,
>>>
>>> On one of our clusters, there are a good number of missing RFiles from
>>> HDFS, however HDFS is not/has not reported any missing blocks.  We were
>>> experiencing issues with HDFS; some flapping DataNode processes that needed
>>> more heap.
>>>
>>> I don't anticipate I can do much besides create a bunch of empty RFiles
>>> (open to suggestions).  My question is, Is it possible that Accumulo could
>>> have written the metadata for these RFiles but failed to write it to HDFS?
>>> In which case it would have been re-tried later and the data was persisted
>>> to a different RFile?  Or is it an 'RFile is in Accumulo metadata if and
>>> only if it is in HDFS' situation?
>>>
>>> Accumulo 1.8.1 on HDFS 2.6.0.
>>>
>>> Thank you,
>>> --Adam
>>>
>>


[ANNOUNCE] Apache Accumulo 1.9.1 (Critical Bug Fixes)

2018-05-14 Thread Christopher
The Apache Accumulo project is pleased to announce the release
of Apache Accumulo 1.9.1! This release contains fixes for **critical**
bugs, which could result in *data loss* during recovery from a previous
failure. (See the release notes linked below for details.)

Versions 1.8.0, 1.8.1, and 1.9.0 are affected, and users of those
versions are encouraged to upgrade to this version immediately to
avoid data loss. Users of earlier versions who are planning to
upgrade to one of the affected versions are encouraged to upgrade
directly to this version instead.

***

Apache Accumulo® is a sorted, distributed key/value store that
provides robust, scalable data storage and retrieval. With
Apache Accumulo, users can store and manage large data sets
across a cluster. Accumulo uses Apache Hadoop's HDFS to store
its data and Apache ZooKeeper for consensus.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The full release notes can be viewed at:
https://accumulo.apache.org/release/accumulo-1.9.1/

--
The Apache Accumulo Team


Re: Corrupt WAL

2018-06-11 Thread Christopher
What version are you using?

On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook  wrote:

> Hey all,
>
> The root tablet on one of our dev systems isn't loading due to an illegal
> state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
> the best way to mitigate this issue?  This was likely caused due to both of
> our NameNodes failing.
>
> Thank you,
> --Adam
>


Re: Custom authorisation

2018-06-11 Thread Christopher
Yes, that's certainly one option. You could develop a Query Service Layer
which wraps Accumulo's API, implements its own authorization policy, and
then uses a singular set of credentials to authenticate to Accumulo.

Personally, I call this the "Database User" approach, since it is a common
strategy when using traditional relational databases where a set of
database credentials are stored in an application's own configuration
somewhere, and the application implements its own security policies within
the application which are separate from the database credentials.

Another option is to make use of Accumulo's "pluggable" Authentication and
Authorization interfaces and to provide your own implementation on your
class path. See:
https://accumulo.apache.org/1.7/accumulo_user_manual.html#_pluggable_security
https://accumulo.apache.org/1.7/accumulo_user_manual.html#_instance_security_authenticator
https://accumulo.apache.org/1.7/accumulo_user_manual.html#_instance_security_authorizor
https://accumulo.apache.org/1.7/accumulo_user_manual.html#_instance_security_permissionhandler

Note: this is an advanced feature, and it may require substantial
investment to develop and maintain a secure implementation suitable for
your situation.
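
As a rough illustration of the "Database User" approach only (not a complete or
hardened implementation; the instance name, ZooKeeper hosts, credentials, table
name, and labels below are all placeholders, and the per-user label lookup is
whatever your service layer decides):

  import java.util.Map;

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;

  public class QueryServiceSketch {
    public static void main(String[] args) throws Exception {
      // One application-held "database user" credential for all requests.
      Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
          .getConnector("appUser", new PasswordToken("appSecret"));

      // The service layer decides, per end user and per request, which labels
      // may be used; they must be a subset of what appUser has been granted.
      Authorizations endUserAuths = new Authorizations("labelA", "labelB");

      Scanner scanner = conn.createScanner("mytable", endUserAuths);
      for (Map.Entry<Key,Value> entry : scanner) {
        System.out.println(entry.getKey() + " -> " + entry.getValue());
      }
    }
  }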


On Thu, May 24, 2018 at 11:36 AM mhd wrk  wrote:

> Hi,
>
> What are the best practices for Accumulo to implement a custom
> authorisation module where user authorisations assigned dynamically based
> on different attributes like time, location and ...
>
> Is implementing "Query Services Layer
> "
> recommended for power users who access Accumulo for large data analysis via
> clients like Spark?
>
> Thanks,
> Mohammad
>


Re: Corrupt WAL

2018-06-11 Thread Christopher
That's what I was thinking it was related to. Do you know if the particular
WAL file was created from a previous version, from before you upgraded?

On Mon, Jun 11, 2018 at 6:00 PM Adam J. Shook  wrote:

> Sorry would have been good to include that :)  It's the newest 1.9.1.  I
> think it relates to https://github.com/apache/accumulo/pull/458, just not
> sure what the best thing to do here is.
>
> On Mon, Jun 11, 2018 at 5:46 PM, Christopher  wrote:
>
>> What version are you using?
>>
>> On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook 
>> wrote:
>>
>>> Hey all,
>>>
>>> The root tablet on one of our dev systems isn't loading due to an
>>> illegal state exception -- COMPACTION_FINISH preceding COMPACTION_START.
>>> What'd be the best way to mitigate this issue?  This was likely caused due
>>> to both of our NameNodes failing.
>>>
>>> Thank you,
>>> --Adam
>>>
>>
>


Re: Accumulo on Google Cloud Storage

2018-06-20 Thread Christopher
For what it's worth, this is an Apache project, not a Sqrrl project. Amazon
is free to contribute to Accumulo to improve its support of their platform,
just as anybody is free to do. Amazon may start contributing more as a
result of their acquisition... or they may not. There is no reason to
expect that their acquisition will have any impact whatsoever on the
platforms Accumulo supports, because Accumulo is not, and has not ever
been, a Sqrrl project (although some Sqrrl employees have contributed), and
thus will not become an Amazon project. It has been, and will remain, a
vendor-neutral Apache project. Regardless, we welcome contributions from
anybody which would improve Accumulo's support of any additional platform
alternatives to HDFS, whether it be GCS, S3, or something else.

As for the WAL closing issue on GCS, I recall a previous thread about
that... I think a simple patch might be possible to solve that issue, but
to date, nobody has contributed a fix. If somebody is interested in using
Accumulo on GCS, I'd like to encourage them to submit any bugs they
encounter, and any patches (if they are able) which resolve those bugs. If
they need help submitting a fix, please ask on the dev@ list.


On Wed, Jun 20, 2018 at 8:21 AM Geoffry Roberts 
wrote:

> Maxim,
>
> Interesting that you were able to run A on GCS.  I never thought of
> that--good to know.
>
> Since I am now an AWS guy (at least or the time being), in light of the
> fact that Amazon purchased Sqrrl,  I am interested to see what develops.
>
>
> On Wed, Jun 20, 2018 at 5:15 AM, Maxim Kolchin 
> wrote:
>
>> Hi Geoffry,
>>
>> Thank you for the feedback!
>>
>> Thanks to [1, 2], I was able to run Accumulo cluster on Google VMs and
>> with GCS instead of HDFS. And I used Google Dataproc to run Hadoop jobs on
>> Accumulo. Almost everything was good until I've not faced some connection
>> issues with GCS. Quite often, the connection to GCS breaks on writing or
>> closing WALs.
>>
>> To all,
>>
>> Does Accumulo have a specific write pattern that some file systems may not
>> support? Are there Accumulo properties which I can play with to adjust
>> the write pattern?
>>
>> [1]: https://github.com/cybermaggedon/accumulo-gs
>> [2]: https://github.com/cybermaggedon/accumulo-docker
>>
>> Thank you!
>> Maxim
>>
>> On Tue, Jun 19, 2018 at 10:31 PM Geoffry Roberts 
>> wrote:
>>
>>> I tried running Accumulo on Google.  I first tried running it on
>>> Google's pre-made Hadoop.  I found the various file paths one must contend
>>> with are different on Google than on a straight download from Apache.  It
>>> seems they moved things around.  To counter this, I installed my own Hadoop
>>> along with Zookeeper and Accumulo on a Google node.  All went well until
>>> one fine day when I could no longer log in.  It seems Google had pushed out
>>> some changes over night that broke my client side Google Cloud
>>> installation.  Google referred the affected to a lengthy,
>>> easy-to-make-a-mistake procedure for resolving the issue.
>>>
>>> I decided life was too short for this kind of thing and switched to
>>> Amazon.
>>>
>>> On Tue, Jun 19, 2018 at 7:34 AM, Maxim Kolchin 
>>> wrote:
>>>
 Hi all,

 Does anyone have experience running Accumulo on top of Google Cloud
 Storage instead of HDFS? In [1] you can see some details if you never heard
 about this feature.

 I see some discussion (see [2], [3]) around this topic, but it looks to
 me that this isn't as popular as, I believe, should be.

 [1]:
 https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
 [2]: https://github.com/apache/accumulo/issues/428
 [3]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103

 Best regards,
 Maxim

>>>
>>>
>>>
>>> --
>>> There are ways and there are ways,
>>>
>>> Geoffry Roberts
>>>
>>
>
>
> --
> There are ways and there are ways,
>
> Geoffry Roberts
>


Re: Accumulo on Google Cloud Storage

2018-06-22 Thread Christopher
Unfortunately, that feature wasn't added until 2.0, which hasn't yet been
released, but I'm hoping it will be later this year.

However, I'm not convinced this is a write pattern issue. I
commented on
https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103#issuecomment-399608543

On Fri, Jun 22, 2018 at 1:50 PM Stephen Meyles  wrote:

> Knowing that HBase has been run successfully on ADLS, went looking there
> (as they have the same WAL write pattern). This is informative:
>
>
> https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_using_adls_storage_with_hbase.html
>
> which suggests a need to split the WALs off on HDFS proper versus ADLS (or
> presumably GCS) barring changes in the underlying semantics of each. AFAICT
> you can't currently configure Accumulo to send WAL logs to a separate
> cluster - is this correct?
>
> S.
>
>
> On Fri, Jun 22, 2018 at 9:07 AM, Stephen Meyles  wrote:
>
>> > Did you try to adjust any Accumulo properties to do bigger writes less
>> frequently or something like that?
>>
>> We're using BatchWriters and sending reasonably large batches of
>> Mutations. Given the stack traces in both our cases are related to WAL
>> writes it seems like batch size would be the only tweak available here
>> (though, without reading the code carefully it's not even clear to me that
>> is impactful) but if there others have suggestions I'd be happy to try.
>>
>> Given we have this working well and stable in other clusters atop
>> traditional HDFS I'm currently pursuing this further with the MS to
>> understand the variance to ADLS. Depending what emerges from that I may
>> circle back with more details and a bug report and start digging in more
>> deeply to the relevant code in Accumulo.
>>
>> S.
>>
>>
>> On Fri, Jun 22, 2018 at 6:09 AM, Maxim Kolchin 
>> wrote:
>>
>>> > If somebody is interested in using Accumulo on GCS, I'd like to
>>> encourage them to submit any bugs they encounter, and any patches (if they
>>> are able) which resolve those bugs.
>>>
>>> I'd like to contribute a fix, but I don't know where to start. We tried
>>> to get any help from the Google Support about [1] over email, but they just
>>> say that the GCS doesn't support such write pattern. In the end, we can
>>> only guess how to adjust the Accumulo behaviour to minimise broken
>>> connections to the GCS.
>>>
>>> BTW although we observe this exception, the tablet server doesn't fail,
>>> so it means that after some retries it is able to write WALs to GCS.
>>>
>>> @Stephen,
>>>
>>> > as discussions with MS engineers have suggested, similar to the GCS
>>> thread, that small writes at high volume are, at best, suboptimal for ADLS.
>>>
>>> Did you try to adjust any Accumulo properties to do bigger writes less
>>> frequently or something like that?
>>>
>>> [1]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>>>
>>> Maxim
>>>
>>> On Thu, Jun 21, 2018 at 7:17 AM Stephen Meyles 
>>> wrote:
>>>
>>>> I think we're seeing something similar but in our case we're trying to
>>>> run Accumulo atop ADLS. When we generate sufficient write load we start to
>>>> see stack traces like the following:
>>>>
>>>> [log.DfsLogger] ERROR: Failed to write log entries
>>>> java.io.IOException: attempting to write to a closed stream;
>>>> at
>>>> com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:88)
>>>> at
>>>> com.microsoft.azure.datalake.store.ADLFileOutputStream.write(ADLFileOutputStream.java:77)
>>>> at
>>>> org.apache.hadoop.fs.adl.AdlFsOutputStream.write(AdlFsOutputStream.java:57)
>>>> at
>>>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:48)
>>>> at java.io.DataOutputStream.write(DataOutputStream.java:88)
>>>> at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
>>>> at
>>>> org.apache.accumulo.tserver.logger.LogFileKey.write(LogFileKey.java:87)
>>>> at org.apache.accumulo.tserver.log.DfsLogger.write(DfsLogger.java:537)
>>>>
>>>> We have developed a rudimentary LogCloser implementation that allows us
>>>> to recover from this but overall performance is significantly impacted by
>>>> this.
>>>>
>>>> > A

Re: Accumulo on Google Cloud Storage

2018-06-24 Thread Christopher
Ah, ok. One of the comments on the issue led me to believe that it was the
same issue as the missing custom log closer.

On Sat, Jun 23, 2018, 01:10 Stephen Meyles  wrote:

> > I'm not convinced this is a write pattern issue, though. I commented
> on..
>
> The note there suggests the need for a LogCloser implementation; in my
> (ADLS) case I've written one and have it configured - the exception I'm
> seeing involves failures during writes, not during recovery (though it then
> leads to a need for recovery).
>
> S.
>
> On Fri, Jun 22, 2018 at 4:33 PM, Christopher  wrote:
>
>> Unfortunately, that feature wasn't added until 2.0, which hasn't yet been
>> released, but I'm hoping it will be later this year.
>>
>> However, I'm not convinced this is a write pattern issue. I
>> commented on
>> https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103#issuecomment-399608543
>>
>> On Fri, Jun 22, 2018 at 1:50 PM Stephen Meyles  wrote:
>>
>>> Knowing that HBase has been run successfully on ADLS, went looking there
>>> (as they have the same WAL write pattern). This is informative:
>>>
>>>
>>> https://www.cloudera.com/documentation/enterprise/5-12-x/topics/admin_using_adls_storage_with_hbase.html
>>>
>>> which suggests a need to split the WALs off on HDFS proper versus ADLS
>>> (or presumably GCS) barring changes in the underlying semantics of each.
>>> AFAICT you can't currently configure Accumulo to send WAL logs to a
>>> separate cluster - is this correct?
>>>
>>> S.
>>>
>>>
>>> On Fri, Jun 22, 2018 at 9:07 AM, Stephen Meyles 
>>> wrote:
>>>
>>>> > Did you try to adjust any Accumulo properties to do bigger writes
>>>> less frequently or something like that?
>>>>
>>>> We're using BatchWriters and sending reasonably large batches of
>>>> Mutations. Given the stack traces in both our cases are related to WAL
>>>> writes it seems like batch size would be the only tweak available here
>>>> (though, without reading the code carefully it's not even clear to me that
>>>> is impactful) but if there others have suggestions I'd be happy to try.
>>>>
>>>> Given we have this working well and stable in other clusters atop
>>>> traditional HDFS I'm currently pursuing this further with the MS to
>>>> understand the variance to ADLS. Depending what emerges from that I may
>>>> circle back with more details and a bug report and start digging in more
>>>> deeply to the relevant code in Accumulo.
>>>>
>>>> S.
>>>>
>>>>
>>>> On Fri, Jun 22, 2018 at 6:09 AM, Maxim Kolchin 
>>>> wrote:
>>>>
>>>>> > If somebody is interested in using Accumulo on GCS, I'd like to
>>>>> encourage them to submit any bugs they encounter, and any patches (if they
>>>>> are able) which resolve those bugs.
>>>>>
>>>>> I'd like to contribute a fix, but I don't know where to start. We
>>>>> tried to get any help from the Google Support about [1] over email, but
>>>>> they just say that the GCS doesn't support such write pattern. In the end,
>>>>> we can only guess how to adjust the Accumulo behaviour to minimise broken
>>>>> connections to the GCS.
>>>>>
>>>>> BTW although we observe this exception, the tablet server doesn't
>>>>> fail, so it means that after some retries it is able to write WALs to GCS.
>>>>>
>>>>> @Stephen,
>>>>>
>>>>> > as discussions with MS engineers have suggested, similar to the GCS
>>>>> thread, that small writes at high volume are, at best, suboptimal for 
>>>>> ADLS.
>>>>>
>>>>> Did you try to adjust any Accumulo properties to do bigger writes less
>>>>> frequently or something like that?
>>>>>
>>>>> [1]: https://github.com/GoogleCloudPlatform/bigdata-interop/issues/103
>>>>>
>>>>> Maxim
>>>>>
>>>>> On Thu, Jun 21, 2018 at 7:17 AM Stephen Meyles 
>>>>> wrote:
>>>>>
>>>>>> I think we're seeing something similar but in our case we're trying
>>>>>> to run Accumulo atop ADLS. When we generate sufficient write load we 
>>>>>> start
>>>>&

Re: Connector user switches between threads!

2018-07-03 Thread Christopher
It is known that Hadoop's implementation of Kerberos authentication tokens
is plagued by lack of thread safety (see
https://issues.apache.org/jira/browse/HADOOP-13066 for some discussion) and
UserGroupInformation is notoriously difficult to reason about.

Accumulo does not currently support the kind of multi-threaded behavior
you're using, but with some work, we probably could. Have you any insight
into what kinds of code changes would be required to properly support this
multi-threaded case with separate Kerberos users in Accumulo?
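
Not a fix for the client-side limitation described above, but for anyone
experimenting: the usual Hadoop-level pattern is to give each thread its own
UserGroupInformation instance (via loginUserFromKeytabAndReturnUGI and doAs)
rather than the static login methods that mutate the process-wide login user.
A minimal sketch (the principal and keytab path are placeholders; given the
UGI thread-safety issues noted above, this may still not be sufficient):

  import java.security.PrivilegedExceptionAction;
  import org.apache.hadoop.security.UserGroupInformation;

  public class PerThreadUgiSketch {
    public static void main(String[] args) throws Exception {
      // This UGI belongs to this thread only; no static loginUserFromKeytab().
      UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
          "user1@EXAMPLE", "/etc/security/keytabs/user1.keytab");

      ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
        // build the KerberosToken/Connector and run the scan here, as user1
        return null;
      });
    }
  }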

On Tue, Jul 3, 2018 at 7:33 PM mhd wrk  wrote:

> Here's the test case conditions:
>
> -Kerberized cluster
> -Thread one authenticates as user1 (using keytab) and start performing a
> long running task on a specific table.
> -Thread two simply authenticates as user2 (using username  and password ).
>
> My observation is that as soon as thread two logins, thread one runs into
> the exception below.
>
> java.lang.RuntimeException:
> org.apache.accumulo.core.client.AccumuloSecurityException: Error
> BAD_CREDENTIALS for user Principal in credentials object should match
> kerberos principal. Expected 'user2@example' but was 'user1@example' on
> table user1.test_table(ID:3) - Username or Password is Invalid
> at
> org.apache.accumulo.core.client.impl.ScannerIterator.hasNext(ScannerIterator.java:161)
> at java.lang.Iterable.forEach(Iterable.java:74)
> at com.example.test.TestBug$1.run(TestBug.java:53)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.accumulo.core.client.AccumuloSecurityException:
> Error BAD_CREDENTIALS for user Principal in credentials object should match
> kerberos principal. Expected 'user2@example' but was 'user1@example' on
> table user1.test_table(ID:3) - Username or Password is Invalid
> at
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:465)
> at
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:285)
> at
> org.apache.accumulo.core.client.impl.ScannerIterator$Reader.run(ScannerIterator.java:80)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> ... 1 more
> Caused by: ThriftSecurityException(user:Principal in credentials object
> should match kerberos principal. Expected 'user2@example' but was
> 'user1@example', code:BAD_CREDENTIALS)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startScan_result$startScan_resultStandardScheme.read(TabletClientService.java:6696)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startScan_result$startScan_resultStandardScheme.read(TabletClientService.java:6673)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$startScan_result.read(TabletClientService.java:6596)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startScan(TabletClientService.java:232)
> at
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startScan(TabletClientService.java:208)
> at
> org.apache.accumulo.core.client.impl.ThriftScanner.scan(ThriftScanner.java:410)
> ... 6 more
>
>
>
> 
> Java class to reproduce the issue
> 
>
> package com.example.test;
>
> import org.apache.accumulo.core.client.ClientConfiguration;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.ZooKeeperInstance;
> import org.apache.accumulo.core.client.security.tokens.KerberosToken;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.security.UserGroupInformation;
>
> import javax.security.auth.callback.Callback;
> import javax.security.auth.callback.CallbackHandler;
> import javax.security.auth.callback.NameCallback;
> import javax.security.auth.callback.PasswordCallback;
> import javax.security.auth.callback.UnsupportedCallbackException;
> import javax.security.auth.login.LoginContext;
> import java.io.File;
> import java.io.IOException;
> import java.security.PrivilegedExceptionAction;
>
> public class TestBug {
>
> public static void main(String[] args) throws Exception {
>
> final String hadoopHome = "/path/to/hadoophome";
> final String hadoopConfigFile = "/path/to/my-site.xml";
>
> final String accumuloTableName = "test_table";
>
> final String user1Name = "user1@example";
> final String user1Keytab = "/etc/security/keytabs/user1.keytab";
>
> final String user2Name = "user2@example";
> final String user2Password = "user2password";
>
> System.out.println("==

Re: Namespace vs table permissions

2018-07-17 Thread Christopher
You are correct in your understanding of namespace permissions.

That check is a sanity check for fast failure of your job if you can't read
the table. I think you might be right that it's not checking if you have
read permission inherited from the namespace. It is possible that the
check's implementation will also check if you have the permission at the
table's namespace level, but I can't verify the implementation at the
moment. If it doesn't, then this sanity check's lack of consideration for
namespaces is a bug.
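
For reference, granting and checking the namespace-level permission from the
shell looks like this (a sketch; the namespace and user names are placeholders):

  grant Namespace.READ -ns myns -u bob
  userpermissions -u bob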

On Tue, Jul 17, 2018 at 4:28 PM James Srinivasan 
wrote:

> Hi all,
>
> I'm a little confused regarding Namespace and table permissions. My
> assumption is that granting Namespace.READ will allow a user to read
> all tables in a namespace, even those which are created after the
> permission is granted, but before the client tries to access the
> table. My specific issue seems to be that
> InputConfigurator.validatePermisisons
> (
> https://github.com/apache/accumulo/blob/rel/1.9.0/core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/InputConfigurator.java#L782
> )
> seems to only check the table, and not the namespace permissions. Is
> my assumption correct? Is there a way of granting the permission I
> need?
>
> Many thanks,
>
> James
>


Re: Accumulo init.d script

2018-08-09 Thread Christopher
In general, I've found Accumulo's launch scripts for the tarball
distribution to be unsuitable for automating with system init stuffs, due
to their complexity and reliance on various environment pieces (though
that's not to say others haven't had a good experience using them).

For Fedora's RPM packaging (vs. the binary tarball), I opted to rewrite
Accumulo's launch scripts with a standard "JPackage"-style generated launch
script (a simple script which sources an env script, sets up class path,
locates Java, and launches the main class with the provided args, all in a
very simple way), and I called that script in very simple systemd scripts.

Perhaps rewriting the launch scripts isn't the best idea for your
situation... I don't know... but if you find that Accumulo's default
scripts are too problematic, it may be something to consider. I know one of
the main benefits I saw was that I didn't have to bother with any of the
complicated log capturing output-redirection stuff... I just let the
process log to STDOUT with a simple log4j ConsoleAppender, and let
systemd/journald handle the log management with its system-wide log
management policy. I also didn't have to do any PID file management or
complicated environment setup. If you write custom scripts tailored to your
requirements, you may be able to see similar benefits.
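
For illustration, a minimal systemd unit along those lines might look like the
following. This is only a sketch: the paths, user, and environment values are
assumptions, it calls the stock bin/accumulo launcher for brevity (a simpler
custom wrapper as described above would be invoked the same way), and it
assumes the process logs to stdout via a ConsoleAppender so journald captures
the output:

  [Unit]
  Description=Accumulo Tablet Server
  After=network.target

  [Service]
  User=accumulo
  # Whatever your launcher needs; accumulo-env.sh may already set some of these.
  Environment=JAVA_HOME=/usr/lib/jvm/java
  Environment=HADOOP_PREFIX=/opt/hadoop
  Environment=ZOOKEEPER_HOME=/opt/zookeeper
  ExecStart=/opt/accumulo/bin/accumulo tserver
  Restart=on-failure

  [Install]
  WantedBy=multi-user.target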

The good news is that we've tried to take steps to ensure we don't rely on
our provided scripts too heavily, and that processes can be run directly
with your own scripts, by simply calling the main method with an
appropriate classpath and args. We've taken some steps in existing releases
to help decouple the code from the scripts, and we've done even more in
that regard for 2.0.

I realize none of this helps you troubleshoot your existing init scripts...
and I'm sorry for that... but I think it's worth noting that Accumulo
services aren't too tightly coupled to the launch scripts, so if they can't
work for you, it's an option to use your own.

As for possible things to look for troubleshooting your existing scripts
(in addition to the other steps people have suggested):

1. Consider disabling SELinux to see if that changes anything. Accumulo
services may use network resources disabled by default in your SELinux
policy.
2. If you're using SysVInit scripts (init.d) instead of systemd, use
existing functions from /etc/init.d/functions in your init script whenever
possible... it can save you lots of headaches.
3. Avoid bash-isms; init.d scripts should be POSIX, and your scripts may
behave differently at bootup than when run manually for this reason.
4. Use shellcheck to check for problems.
5. Check file permissions of all scripts and paths to scripts to ensure
they are executable/readable/traversable by the root user and the Accumulo
user.


On Thu, Aug 9, 2018 at 9:49 AM Keith Turner  wrote:

> I would try adding set -x to the Accumuo script you are running and
> see what that outputs.  Could add that as the second line of a script
> as follows.
>
>  #! /usr/bin/env bash
> set -x
>
> Hopefully that will shed some light on the problem.
>
> On Thu, Aug 9, 2018 at 7:20 AM, Maria Krommyda 
> wrote:
> > Hello Josh,
> >
> > Thank you for your time and suggestions.
> >
> > I am setting the ACCUMULO_HOME variable and I see the logs in the proper
> > directory being updated on every reboot so this should not be the
> problem.
> >
> > I do not get any error/warning in the syslog. Only the confirmation that
> the
> > service was started. If there is any other log that I can check please
> let
> > me know.
> >
> > What is leading me to believe that it is a path/permission/variable
> issue,
> > is that the exact same script runs without any problem from the home
> > directory after the login.
> >
> > Best regards,
> > Maria.
> >
> >
> >
> > On Wednesday, 8 August 2018 at 6:06 p.m., Josh Elser <
> els...@apache.org>
> > wrote:
> >
> >
> > Every Accumulo service creates log files in the directory you specified
> > via the ACCUMULO_LOG_DIR environment variable in accumulo-env.sh
> >
> > If you didn't define this, it likely defaults to ACCUMULO_HOME/logs.
> >
> > Have you looked at your syslog or similar to understand what your init.d
> > script's output was? If Hadoop comes up correctly and you know your
> > steps work otherwise, it sounds like it might be a typo in your script.
> >
> > On 8/8/18 3:32 AM, Maria Krommyda wrote:
> >> Hello everyone,
> >>
> >> I have set up a VM, Ubuntu 16.04, where I run Zookeeper (3.4.6), Hadoop
> >> (2.2.0) and Accumulo (1.7.3)
> >> I am trying to set up a script at init.d that would start all three
> >> services in case of an unexpected reboot of the machine.
> >>
> >> I already have a start_me script that I use to start the services which
> >> works great, but making it an init.d script has proven a challenge.
> >>
> >> I have set all the environmental variables in /etc/default/my_script,
> >> which I include at the beginning of my script
> >> I run all the comman

Re: vfs classpath could not replicate

2018-09-26 Thread Christopher
It looks like the error is when vfs tries to create a temporary file in
your tmpdir. I would check that your java.io.tmpdir points to a directory
that exists and has the appropriate permissions for the user running the
Accumulo process.
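
For example, one way to point the processes at a known-good temp directory is
via the JVM options in accumulo-env.sh (a sketch; the directory is a
placeholder, must exist, and must be writable by the Accumulo user, and this
assumes the 1.7-style env script where ACCUMULO_GENERAL_OPTS is applied to
every process):

  mkdir -p /var/tmp/accumulo && chown accumulo /var/tmp/accumulo
  # in accumulo-env.sh
  export ACCUMULO_GENERAL_OPTS="${ACCUMULO_GENERAL_OPTS} -Djava.io.tmpdir=/var/tmp/accumulo"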

On Wed, Sep 26, 2018, 16:08 Rob Verkuylen  wrote:

> Hi,
>
>
> I'm trying to get vfs loading of my jar working in 1.7.2. On a test
> cluster of the same version this works fine and I see the jar being
> replicated in the cache in '/tmp/accumulo-vfs*/fstore-filters-1.3.2.jar',
> but when I do the exact same thing on prod, I get the stacktrace below. Any
> ideas?
>
>
> Setup I used:
>
> config -s
> general.vfs.context.classpath.fstore=hdfs://nameservice1/libs/accumulo/fstore/.*jar
> config -t test.fstore_index -s table.classpath.context=fstore
>
> Stacktrace on tablet server:
> Failed to load class
> java.lang.ClassNotFoundException: IO Error loading class
> org.apache.accumulo.tserver.compaction.DefaultCompactionStrategy
> at
> org.apache.accumulo.start.classloader.vfs.ContextManager.loadClass(ContextManager.java:188)
> at
> org.apache.accumulo.core.conf.Property.createInstance(Property.java:873)
> at
> org.apache.accumulo.core.conf.Property.createTableInstanceFromPropertyName(Property.java:910)
> at
> org.apache.accumulo.tserver.TabletServerResourceManager$TabletResourceManager.needsMajorCompaction(TabletServerResourceManager.java:647)
> at
> org.apache.accumulo.tserver.tablet.Tablet.needsMajorCompaction(Tablet.java:1594)
> at
> org.apache.accumulo.tserver.tablet.Tablet.initiateMajorCompaction(Tablet.java:1574)
> at
> org.apache.accumulo.tserver.TabletServer$MajorCompactor.run(TabletServer.java:1875)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.commons.vfs2.FileSystemException: Could
> not replicate
> "hdfs://nameservice1/libs/accumulo/fstore/fstore-filters-1.3.2.jar".
> at
> org.apache.commons.vfs2.provider.AbstractFileSystem.replicateFile(AbstractFileSystem.java:426)
> at
> org.apache.commons.vfs2.provider.zip.ZipFileSystem.(ZipFileSystem.java:66)
> at
> org.apache.commons.vfs2.provider.jar.JarFileSystem.(JarFileSystem.java:48)
> at
> org.apache.commons.vfs2.provider.jar.JarFileProvider.doCreateFileSystem(JarFileProvider.java:80)
> at
> org.apache.commons.vfs2.provider.AbstractLayeredFileProvider.createFileSystem(AbstractLayeredFileProvider.java:87)
> at
> org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1022)
> at
> org.apache.commons.vfs2.impl.DefaultFileSystemManager.createFileSystem(DefaultFileSystemManager.java:1042)
> at
> org.apache.commons.vfs2.impl.VFSClassLoader.addFileObjects(VFSClassLoader.java:156)
> at
> org.apache.commons.vfs2.impl.VFSClassLoader.(VFSClassLoader.java:119)
> at
> org.apache.accumulo.start.classloader.vfs.AccumuloReloadingVFSClassLoader.(AccumuloReloadingVFSClassLoader.java:147)
> at
> org.apache.accumulo.start.classloader.vfs.AccumuloReloadingVFSClassLoader.(AccumuloReloadingVFSClassLoader.java:162)
> at
> org.apache.accumulo.start.classloader.vfs.ContextManager$Context.getClassLoader(ContextManager.java:46)
> at
> org.apache.accumulo.start.classloader.vfs.ContextManager.getClassLoader(ContextManager.java:174)
> at
> org.apache.accumulo.start.classloader.vfs.ContextManager.loadClass(ContextManager.java:186)
> ... 8 more
> Caused by: org.apache.commons.vfs2.FileSystemException:
> Unknown message with code "No such file or directory".
> at
> org.apache.accumulo.start.classloader.vfs.UniqueFileReplicator.replicateFile(UniqueFileReplicator.java:68)
> at
> org.apache.commons.vfs2.provider.AbstractFileSystem.doReplicateFile(AbstractFileSystem.java:473)
> at
> org.apache.commons.vfs2.provider.AbstractFileSystem.replicateFile(AbstractFileSystem.java:422)
> ... 21 more
> Caused by: java.io.IOException: No such file or directory
> at
> java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createTempFile(File.java:2024)
> at
> org.apache.accumulo.start.classloader.vfs.UniqueFileReplicator.replicateFile(UniqueFileReplicator.java:60)
> ... 23 more
>
>
>
>


Hacktoberfest 2018 - DigitalOcean

2018-09-27 Thread Christopher
Anybody interested in organizing one of these?

https://hacktoberfest.digitalocean.com/eventkit


[ANNOUNCE] Apache Accumulo 2.0.0-alpha-1

2018-10-15 Thread Christopher
The Apache Accumulo project is pleased to announce the release of
Apache Accumulo 2.0.0-alpha-1! This *alpha* release is a preview for
2.0.0 and contains numerous feature enhancements and API changes.

While this version is *not* recommended for production use, it is made
available for feedback, testing, and evaluation. Please report any issues
you find to our issue tracker[1] or discuss on our dev email list[2].

***

Apache Accumulo® is a sorted, distributed key/value store that
provides robust, scalable data storage and retrieval. With
Apache Accumulo, users can store and manage large data sets
across a cluster. Accumulo uses Apache Hadoop's HDFS to store
its data and Apache ZooKeeper for consensus.

This version is now available in Maven Central, and at:
https://accumulo.apache.org/downloads/

The release notes for this alpha can be viewed at:
https://accumulo.apache.org/release/accumulo-2.0.0-alpha-1/

[1]: https://github.com/apache/accumulo
[2]: d...@accumulo.apache.org

--
The Apache Accumulo Team


Re: hadoop 3 / accumulo 1.9 accumulo-site.xml classpath updates

2018-12-19 Thread Christopher
A build of the 1.9.3 source (once released) should now be bundled with
commons-configuration, so that should help a little. If you're
interested in contributing a pull request and testing it, the
other changes could be added to the
assemble/conf/templates/accumulo-site.xml for use with
bin/bootstrap_config.sh
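
For anyone hitting this before 1.9.3 ships, the additions reported below would
go into the general.classpaths value in accumulo-site.xml (a sketch, assuming
the stock template as a starting point; the trailing "..." stands for the
template's existing entries, which should be kept):

  <property>
    <name>general.classpaths</name>
    <value>
      $HADOOP_PREFIX/share/hadoop/client/[^.].*.jar,
      $ACCUMULO_HOME/lib/ext/commons-configuration-.*.jar,
      $HADOOP_PREFIX/share/hadoop/common/lib/commons-[^.].*.jar,
      $HADOOP_PREFIX/share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar,
      ...
    </value>
  </property>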

On Wed, Dec 19, 2018 at 2:31 PM Bulldog20630405
 wrote:
>
> configured accumulo 1.9.2 with hadoop 3.1.1; it required the following 
> updates to the "general classpath" of accumulo-site.xml; see below:
>
> (note: if I missed something, please let me know; else, maybe update the
> config file for the next release?)
>
> 
> $HADOOP_PREFIX/share/hadoop/client/[^.].*.jar,
> 
> $ACCUMULO_HOME/lib/ext/commons-configuration-.*.jar,
> $HADOOP_PREFIX/share/hadoop/common/lib/commons-[^.].*.jar,
> $HADOOP_PREFIX/share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar,
>


Re: IS not authorized => code:BAD_AUTHORIZATIONS

2019-03-26 Thread Christopher
In general, this means the user is trying to initiate a scan by
passing in authorizations (security labels) which that user has not
been granted.
This is available in the API in order to support users ability to
limit their own authorizations. The error results from an attempt to
expand their authorizations, rather than limit them.

Compare the scan command being executed with the output (in the shell)
of `userpermissions `
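
The authorizations side can be inspected and set from the shell as well (a
sketch; the user, table, and label names are placeholders, and -s on scan
passes the scan-time authorizations, which must be a subset of the user's
granted authorizations):

  getauths -u alice
  setauths -u alice -s labelA,labelB
  scan -t mytable -s labelA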

On Tue, Mar 26, 2019 at 1:48 PM Bulldog20630405
 wrote:
>
> we are running an Accumulo cluster which was running ok up until a day ago...
>
> we are getting the following error:
>
> note: we are using accumulo 1.8
>
>  is not authorized
> ThriftSecurityException(user:, code: BAD_AUTHORIZATIONS)
> at 
> org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:637)
> 
>
> unfortunately I cannot tell from the exception which table the scan is going
> against; in general, what does this issue mean?
>
>

