Re: [DISCUSS] Proposed binary packaging changes

2016-07-01 Thread William Slacum
Yeah, I wasn't really suggesting it as a course of action. It was more of a
mental exercise so I could grasp the issue better.

On Fri, Jul 1, 2016 at 12:13 PM, Christopher <ctubb...@apache.org> wrote:

> On Fri, Jul 1, 2016 at 3:07 PM William Slacum <wsla...@gmail.com> wrote:
>
> > Could another action we take be adding profiles for each version of
> > dependencies, to include the appropriate dependencies (and dependencies'
> > metadata)?
> >
> >
> That could be a potentially huge number of profiles, and it would add a lot
> of complexity which is certainly going to suffer from lack of maintenance
> over time. I really think this kind of thing (integration) is a distinct
> responsibility better suited to external/downstream tasks than
> internal/upstream.
>
>
> > I guess right now the problem is we throw in a generic "one size fits
> all"
> > distribution, and we're seeing the cracks in it?
> >
> >
> Yes. That's my opinion.
>


Re: [DISCUSS] Proposed binary packaging changes

2016-07-01 Thread William Slacum
Could another action we take be adding profiles for each version of
dependencies, to include the appropriate dependencies (and dependencies'
metadata)?

I guess right now the problem is we throw in a generic "one size fits all"
distribution, and we're seeing the cracks in it?

On Fri, Jul 1, 2016 at 11:48 AM, Christopher  wrote:

> On Fri, Jul 1, 2016 at 12:34 PM Josh Elser  wrote:
>
> > This leads me to wonder: what problem are we trying to solve? By
> > avoiding the binary release, we're making our lives easier to release
> > code (the continual L&N work). The build becomes a bit simpler with only
> > a source-release.
> >
> > If this is *really* about ease-of-use for downstream packagers (which
> > seemed to be your original intent, Christopher), is there a different
> > way we could solve this problem that would meet your needs (again,
> > assuming you're trying to make life easier as a package maintainer for
> > Fedora) that would not involve completely removing the binary tarball?
> >
>
> I'm finding it difficult to express clearly the one or two top problems I'm
> trying to solve. I think this is one of those things that addresses several
> smaller problems, each of which on its own isn't that important, but they add
> up. Some of those are:
>
> * Reduce developer workload so that we can more easily bump dependencies
> when needed for features, bugfixes, and security fixes.
>
> * Reduce the technical and licensing debt on our part (current and future),
> because we're taking on unnecessary bundling tasks which are prone to
> faulty assumptions.
>
> * Better communicate downstream responsibilities for integration so
> upstream Accumulo is not harmed by negative perceptions when it's not our
> fault (we made faulty assumptions and the user didn't reconcile them).
>
> * Refocus/narrow our responsibilities to the upstream project, and draw a
> distinction with additional integration responsibilities we might
> voluntarily take on, so that we can provide a better experience for
> integrators and ease/encourage greater adoption.
>
> * In general, encourage making fewer upstream assumptions about downstream
> use cases, so we can better support a wider audience of users.
>
> * Prefer extensible tools for users to customize their integration
> experience, rather than hard-code decisions for them.
>
> FWIW, it was reported to me today that a user ran into an issue where my
> recent update of commons-configuration caused an integration problem
> because our scripts/packaging do not bundle commons-configuration and we
> just assume it will work with the version provided by the Hadoop lib directory.
> That's the kind of thing I'd like to avoid... users should understand that
> assumptions in our packaging may not work for them, and we're creating work
> for ourselves while failing to communicate that when we try to bundle
> everything for them.
>
> If we were a self-contained application, we could even go the opposite way,
> and bundle everything. But, we're not. We're picking and choosing what to
> bundle, and our choices might not be right. We should make it easier for
> the users to choose, instead.
>


Re: Apache Accumulo integrated with Presto

2016-06-13 Thread William Slacum
I think the generic hash-join strategy is: for some small set A, send the
whole set to each partition of a larger set B and do the join in parallel.
In this case, whichever is the smaller set would be consumed on some worker,
and then distributed out to each worker participating in the hash join.
Outside of Presto, this is often done in an iterator where the smaller set
is an argument to the iterator.
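
Roughly, the per-worker probe step looks something like the Java sketch below.
This is my own illustration, not Presto's code; the tuple layout, class name,
and column positions are made-up assumptions.

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastHashJoinSketch {
  /** Build side: hash the small set (e.g. Supplier tuples) by the join key. */
  static Map<String, String> buildSide(List<String[]> smallSet) {
    Map<String, String> byKey = new HashMap<String, String>();
    for (String[] tuple : smallSet) {
      byKey.put(tuple[0], tuple[1]); // tuple[0] = s_suppkey, tuple[1] = rest of the tuple
    }
    return byKey;
  }

  /** Probe side: each worker streams its partition of the larger set (PartSupp). */
  static List<String> probeSide(Iterable<String[]> partition, Map<String, String> byKey) {
    List<String> joined = new ArrayList<String>();
    for (String[] tuple : partition) {
      String match = byKey.get(tuple[0]); // tuple[0] = ps_suppkey
      if (match != null) {
        joined.add(tuple[1] + " | " + match); // emit the joined tuple
      }
    }
    return joined;
  }
}
```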

On Mon, Jun 13, 2016 at 4:03 PM, Dylan Hutchison  wrote:

> Thanks for clarifying Adam.
>
> I am interested in learning more about the hash-join strategy, in case
> you're familiar with them.  Suppose we want to join the PartSupp and
> Supplier table on ps_suppkey = s_suppkey.  The s_suppkey is a primary key
> of the Supplier table and it is stored in the Accumulo row.  The ps_suppkey
> is neither a key nor stored in the row of the PartSupp table.  (The
> PartSupp table's row is a UUID.)
>
> Is the hash-join strategy to (1) scan tuples (whole rows) from PartSupp to
> a Presto worker, (2) for a batch of PartSupp tuples fetch the
> matching Supplier tuples, (3) repeat until all tuples are read from
> PartSupp?
>
> Regards, Dylan
>
> On Mon, Jun 13, 2016 at 8:24 AM, Adam J. Shook 
> wrote:
>
> > A few clarifications:
> >
> > - Presto supports hash-based distributed joins as well as broadcast joins
> >
> > - Presto metadata is stored in ZooKeeper, but metadata storage is
> pluggable
> > and could be stored in Accumulo instead
> >
> > - The connector does use tablet locality when scanning Accumulo, but our
> > testing has shown you get better performance by giving Accumulo and
> Presto
> > their own dedicated machines, making locality a moot point.  This will
> > certainly change based on types of queries, data sizes, network quality,
> > etc.
> >
> > - You can insert the results of a query into a Presto table using INSERT
> > INTO foo SELECT ..., as well as create a table from the results of a
> query
> > (CTAS).  Though, for large inserts, it is typically best to bypass the
> > Presto layer and insert directly into the Accumulo tables using the
> > PrestoBatchWriter API
> >
> > Cheers,
> > --Adam
> >
> > On Mon, Jun 13, 2016 at 7:20 AM, Christopher 
> wrote:
> >
> > > Thanks for that summary, Dylan! Very helpful.
> > >
> > > On Mon, Jun 13, 2016, 01:36 Dylan Hutchison <
> dhutc...@cs.washington.edu>
> > > wrote:
> > >
> > > > Thanks for sharing Sean.  Here are some notes I wrote after reading
> the
> > > > article on Presto-Accumulo design.  I have a research interest in the
> > > > relationship between relational (SQL) and non-relational (Accumulo)
> > > > systems, so I couldn't resist reading the post in detail.
> > > >
> > > >- Places the primary key in the Accumulo row.
> > > >- Performs row-at-a-time processing (each tuple is one row in
> > > >Accumulo) using WholeRowIterator behavior.
> > > >- Relational table metadata is stored in the Presto infrastructure
> > (as
> > > >opposed to an Accumulo table).
> > > >- Supports the creation of index tables for any attributes. These
> > > >index tables speed up queries that filter on indexed attributes.
> It
> > > is
> > > >standard secondary indexing, which provides speedups when the
> > > selectivity
> > > >of the query is roughly <10% of the original table.
> > > >- Only database->client querying is supported.  You cannot run
> > "select
> > > >... into result_table".
> > > >- As far as I can see, Presto only has one join strategy:
> *broadcast
> > > >join*.  The right table of every join is scanned into one of the
> > > >Presto worker's memory.  Subsequently the size of the right table
> is
> > > >limited by worker memory.
> > > >- There is one Presto worker for each Accumulo tablet, which
> enables
> > > >good scaling.
> > > >- The Presto bridge classes track internal Accumulo information
> such
> > > >as the assignment of tablets to tablet servers by reading
> Accumulo's
> > > >Metadata table. Presto uses tablet locations to provide better
> > > locality.
> > > >- The Presto bridge comes with several Accumulo server-side
> > iterators
> > > >for filtering and aggregating.
> > > >- The code is quite nice and clean.
> > > >
> > > > This image below gives Presto's architecture.  Accumulo takes the
> role
> > of
> > > > the DB icon in the bottom-right corner.
> > > >
> > > > [image: Inline image 2]
> > > >
> > > > Bloomberg ran 13 out of the 22 TPC-H queries.  There is no
> fundamental
> > > > reason why they cannot run all the queries; they just have not
> > > implemented
> > > > everything required ('exists' clauses, non-equi join, etc.).
> > > >
> > > > The interface looks like this, though they use a compiled java jar to
> > > > insert entries from a csv file (it wraps around a BatchWriter).
> > > >
> > > > [image: Inline image 3]
> > > >
> > > > Here are performance results.  They don't say 

Re: measuring perf

2016-06-10 Thread William Slacum
I think it's reasonable to measure from the start of a for/while loop over
the Scanner. Such as:

```
// .. my initialization code
// (assumes a configured Scanner plus the usual imports: java.util.Map.Entry,
//  org.apache.accumulo.core.data.Key/Value, com.google.common.base.Stopwatch)
scanner.setRange(someRange);
Stopwatch timer = Stopwatch.createStarted();
for (Entry<Key,Value> e : scanner) {
  // my logic
}
timer.stop();
```
I've personally done this when measuring query performance, and it usually
gives a good estimate of what's going on, especially if the network has
low, constant latency.


On Fri, Jun 10, 2016 at 3:35 PM, z11373  wrote:

> Good morning!
> I have a service running against different Accumulo instance (in different
> datacenter).
> Both Accumulo instances should have the same configuration, but consumers of
> my service told me they experience one being faster than the one in the other
> datacenter. The service is deployed on machines with the same spec, and most
> operations are against Accumulo, hence I am interested in capturing the perf
> (including network latency from the Accumulo servers to my service) and
> comparing them to verify whether the problem is indeed that accessing one
> Accumulo instance is slower than the other. Right now I capture the time from
> my service being called to results being returned, but that doesn't tell me
> how much time was spent on Accumulo.
>
> Unlike in a traditional SQL database, where I could measure the time it takes
> to run a SELECT statement, for example, in Accumulo nothing is read from the
> server until we iterate (my understanding may be wrong). So for now I am
> thinking perhaps I'd set the start time before setting the ranges, and set
> the stop time when there are no more items from that iterator. Is this
> reasonable, or perhaps there is a better way?
>
> For additional info, my service will read from an iterator; for each item, it
> will make another scanner (and set its range), and iterate again, and so on.
> So if it ends up with 10 scanners, my current approach will log 10 perf
> captures.
>
>
> Thanks,
> Z
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/measuring-perf-tp17245.html
> Sent from the Developers mailing list archive at Nabble.com.
>


Re: [DISCUSS] Java 8 support (was Fwd: [jira] [Commented] (ACCUMULO-4177) TinyLFU-based BlockCache)

2016-05-03 Thread William Slacum
They'll at least get runtime errors.

On Tue, May 3, 2016 at 4:18 PM, Mike Drob <md...@mdrob.com> wrote:

> If our code ends up using java 8 bytecode in any classes required by a
> consumer, then I think they will get compilation (linking?) errors,
> regardless of java 8 types in our methods signatures.
>
> On Tue, May 3, 2016 at 3:09 PM, Josh Elser <josh.el...@gmail.com> wrote:
>
> > That's a new assertion ("we can't actually use Java 8 features until
> > Accumulo-2"), isn't it? We could use new Java 8 features internally which
> > would require a minimum of Java 8 and not affect the public API. These are
> > related, not mutually exclusive, IMO.
> >
> > To Shawn's point: introducing Java 8 types/APIs was exactly the point --
> > we got here from ACCUMULO-4177 which does exactly that.
> >
> >
> > Mike Drob wrote:
> >
> >> I agree with Shawn's implied statement -- why bother dropping Java 7 in
> >> any Accumulo 1.x if we can't actually make use of Java 8 features until
> >> Accumulo 2.0
> >>
> >> On Tue, May 3, 2016 at 1:29 PM, Christopher<ctubb...@apache.org>
> wrote:
> >>
> >> Right, these are competing and mutually exclusive goals, so we need to
> >>> decide which is a priority and on what timeline we should transition to
> >>> Java 8 to support those goals.
> >>>
> >>> On Tue, May 3, 2016 at 9:16 AM Shawn Walker<accum...@shawn-walker.net>
> >>> wrote:
> >>>
> >>> I'm not sure that guaranteeing build-ability under Java 7 would address
> >>>>
> >>> the
> >>>
> >>>> issue that raised this discussion:  We (might) want to add a
> dependency
> >>>> which requires Java 8.  Or, following Keith's comment, we might wish
> to
> >>>> introduce Java 8 types (e.g. CompletableFuture) into Accumulo's
> >>>>
> >>> "public"
> >>>
> >>>> API.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, May 2, 2016 at 6:42 PM, Christopher<ctubb...@apache.org>
> >>>> wrote:
> >>>>
> >>>> I don't feel strongly about this, but I was kind of thinking that we'd
> >>>>>
> >>>> bump
> >>>>
> >>>>> to Java 8 dependency (opportunistically) when we were ready to
> develop
> >>>>>
> >>>> a
> >>>
> >>>> 2.0 version. But, I'm not opposed to doing it on the 1.8 branch.
> >>>>>
> >>>>> On Mon, May 2, 2016 at 2:50 PM William Slacum<wsla...@gmail.com>
> >>>>>
> >>>> wrote:
> >>>
> >>>> So my point about versioning WRT to the Java runtime is more about
> >>>>>>
> >>>>> how
> >>>
> >>>> there are incompatibilities within the granularity of Java versions
> >>>>>>
> >>>>> we
> >>>
> >>>> talk
> >>>>>
> >>>>>> about (I'm specifically referencing a Kerberos incompatibility
> within
> >>>>>> versions of Java 7), so I think that just blanket saying "We support
> >>>>>>
> >>>>> Java X
> >>>>>
> >>>>>> or Y" really isn't enough. I personally feel some kind of version
> >>>>>>
> >>>>> bump
> >>>
> >>>> is
> >>>>
> >>>>> nice to say that something has changed, but until the public API
> >>>>>>
> >>>>> starts
> >>>
> >>>> exposing Java 8 features, it's a total cop out to say, "Here's all
> >>>>>>
> >>>>> these
> >>>>
> >>>>> bug fixes and some new features, oh by the way upgrade your
> >>>>>>
> >>>>> infrastructure
> >>>>>
> >>>>>> because we decided to use a new Java version for an optional
> >>>>>>
> >>>>> feature".
> >>>
> >>>> The best parallel I can think of is in Scala, where there's no binary
> >>>>>> compatibility between minor versions (ie, 2.10, 2.11,etc), so
> there's
> >>>>>> generally an extra qualifier on libraries marking the scala
> >>>>>>
&

Re: [DISCUSS] Java 8 support (was Fwd: [jira] [Commented] (ACCUMULO-4177) TinyLFU-based BlockCache)

2016-05-02 Thread William Slacum
So my point about versioning WRT to the Java runtime is more about how
there are incompatibilities within the granularity of Java versions we talk
about (I'm specifically referencing a Kerberos incompatibility within
versions of Java 7), so I think that just blanket saying "We support Java X
or Y" really isn't enough. I personally feel some kind of version bump is
nice to say that something has changed, but until the public API starts
exposing Java 8 features, it's a total cop out to say, "Here's all these
bug fixes and some new features, oh by the way upgrade your infrastructure
because we decided to use a new Java version for an optional feature".

The best parallel I can think of is in Scala, where there's no binary
compatibility between minor versions (i.e., 2.10, 2.11, etc.), so there's
generally an extra qualifier on libraries marking the Scala compatibility
level. Would we ever want to have accumulo-server-1.7-j[7|8]-styled
artifacts to signal some general JRE compatibility? It's a total mess, but
I haven't seen a better solution.

Another idea is we could potentially have some guarantee for Java 7, such
as making sure we can build a distribution using Java 7, but only
distribute Java 8 artifacts by default?

On Mon, May 2, 2016 at 2:30 PM, Josh Elser  wrote:

> Sean Busbey wrote:
>
>> On Mon, May 2, 2016 at 8:55 AM, Josh Elser  wrote:
>>
>>> >  Thanks for the input, Sean.
>>> >
>>> >  Playing devil's advocate: we didn't have a major version bump when we
>>> >  dropped JDK6 support (in Accumulo 1.7.0). Oracle EOL'ed Java 7 back in
>>> >  April 2015. Was the 6->7 upgrade different than a 7->8 upgrade?
>>> >
>>>
>>
>> On Mon, May 2, 2016 at 10:31 AM, Keith Turner  wrote:
>>
>>> >  On Mon, May 2, 2016 at 1:54 AM, Sean Busbey
>>> wrote:
>>> >
>>>
 >>  If we drop jdk7 support, I would strongly prefer a major version bump.
 >>

>>> >
>>> >
>>> >  Whats the rationale for binding a bump to Accumulo 2.0 with a bump in
>>> the
>>> >  JDK version?
>>> >
>>>
>>
> >> The decision to drop JDK6 support happened in late March / early April
> >> 2014[1], long before any of our discussions or decisions on semver.
>> AFAICT it didn't get discussed again, presumably because by the time
>> we got to 1.7.0 RCs it was too far in the past.
>>
>
> Thanks for the correction, Sean. I hadn't dug around closely enough.
>


Re: Accumulo on s3

2016-04-25 Thread William Slacum
Ephemeral storage and EBS are friendlier than S3. Ephemeral storage is generally
the fastest and most HDFS-friendly.

On Mon, Apr 25, 2016 at 1:13 PM, Dylan Hutchison  wrote:

> Hey Josh,
>
> Are there other platforms on AWS (or another cloud provider) that
> Accumulo/HDFS are friendly to run on?  I thought I remembered you and
> others running the agitation tests on Amazon instances during
> release-testing time.  If there are alternatives, what advantages would S3
> have over the current method?
>
> On Mon, Apr 25, 2016 at 8:09 AM, Josh Elser  wrote:
>
> > I'm not sure on the guarantees of s3 (much less the s3 or s3a Hadoop
> > FileSystem implementations), but, historically, the common issue is
> > lacking/incorrect implementations of sync(). For durability (read-as: not
> > losing your data), Accumulo *must* know that when it calls sync() on a
> > file, the data is persisted.
> >
> > I don't know definitively what S3 guarantees (or asserts to guarantee),
> > but I would be very afraid until I ran some testing (we have one good
> test
> > in Accumulo that can run for days and verify data integrity called
> > continuous ingest).
> >
> > You might have luck reaching out to the Hadoop community to get some
> > understanding from them about what can reasonably be expected with the
> > current S3 FileSystem implementations, and then run your own tests to
> make
> > sure that data is not lost.
> >
> >
> > vdelmeglio wrote:
> >
> >> Hi everyone,
> >>
> >> I recently got this answer on stackoverflow (link:
> >>
> >>
> http://stackoverflow.com/questions/36602719/accumulo-cluster-in-aws-with-s3-not-really-stable/36772874#36772874
> >> ):
> >>
> >>
> >>   Yes, I would expect that running Accumulo with S3 would result in
> >>> problems. Even though S3 has a FileSystem implementation, it does not
> >>> behave like a normal file system. Some examples of the differences are
> >>> that operations we would expect to be atomic are not atomic in S3,
> >>> exceptions may mean different things than we expect, and we assume our
> >>> view of files and their metadata is consistent rather than the eventual
> >>> consistency S3 provides.
> >>>
> >>> It's possible these issues could be mitigated if we made some
> >>> modifications to the Accumulo code, but as far as I know no one has
> tried
> >>> running Accumulo on S3 to figure out the problems and whether those
> could
> >>> be fixed or not.
> >>>
> >>
> >> Since we're currently running an Accumulo cluster on AWS with S3 for
> >> evaluation purposes, this answer makes me wonder: could someone explain to
> >> me why running Accumulo on S3 is not a good idea? Specifically, which
> >> operations are expected to be atomic in Accumulo?
> >>
> >> Is there perhaps a roadmap for S3 compatibility?
> >>
> >> Thanks!
> >> Valerio
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-accumulo.1065345.n5.nabble.com/Accumulo-on-s3-tp16737.html
> >> Sent from the Developers mailing list archive at Nabble.com.
> >>
> >
>


Re: Checking what a BatchWriter is stuck on; failure during split

2016-04-19 Thread William Slacum
Good digs, Dylan. I don't think it's too rare to matter. I notice it often
during MR jobs, and there's usually a point where I give up and just start
writing RFiles.

It could possibly be related to what I saw back in the day with:
https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E

On Tue, Apr 19, 2016 at 6:26 PM, Josh Elser  wrote:

> Nice findings. Sorry I haven't had any cycles to dig into this myself.
>
> I look forward to hearing what you find :)
>
>
> Dylan Hutchison wrote:
>
>> I investigated a bit more and I am pretty sure the problem is that the
>> BatchWriter is not recognizing that the tablet vb<<  split into vb;2436<
>> and
>> vb<;2436.  It keeps trying to update the closed tablet vb<<.  Each update
>> writes 0 mutations and records a failure at the tablet server
>> UpdateSession
>> because vb<<  is closed.
>>
>> I'm not sure why this is happening because the BatchWriter should have
>> invalidated its tablet locator cache upon recognizing a failure.  Then it
>> would recognize that the entries it wants to write fall into the new
>> tablets vb;2436<  and vb<;2436.  I think there is a timing bug for this
>> edge
>> case, when a table split occurs during heavy writes.
>>
>> I will write this up if I can reproduce it.  Maybe it is too rare to
>> matter.
>>
>> Cheers, Dylan
>>
>> On Mon, Apr 18, 2016 at 2:38 PM, Dylan Hutchison<
>> dhutc...@cs.washington.edu
>>
>>> wrote:
>>>
>>
>> Hi devs,
>>>
>>> I'd like to ask your help in figuring out what is happening to a
>>> BatchWriter.  The following gives my reasoning so far.
>>>
>>> In Accumulo 1.7.1, I have a BatchWriter that is stuck in WAITING status
>>> in
>>> its addMutation method.  I saw that it is stuck by jstack'ing the
>>> Accumulo
>>> client.  It's been stuck like this for 16 hours.
>>>
>>> The BatchWriter is supposed to wait when a mutation is added if no
>>> failures have been recorded and either (a) the total memory used exceeds the
>>> maximum allowed for the BatchWriter, or (b) the batchwriter is currently
>>> flushed.  So we conclude that one of (a) or (b) have occurred and no
>>> failures were recorded, at the time when addMutation was called.  I think
>>> (a) is likely.
>>>
>>> The BatchWriter is supposed to notify itself when either (1) a flush
>>> finishes, (2) a constraint violation or authorization failure or server
>>> error or unknown error occurs, or (3) memory usage decreases, which happens
>>> when entries are successfully sent to the tablet server.  Since the
>>> BatchWriter
>>> is stuck on WAITING, none of these conditions are occurring.
>>>
>>> The BatchWriter has 3 write threads (the default number).  All three have
>>> status TIMED_WAITING (parked) in jstack.  Their stack traces don't give
>>> useful information.
>>>
>>> Here's what I can tell from the tserver logs.  A new table (and tablet)
>>> was created successfully.  The BatchWriter started writing to this tablet
>>> steadily.  The logs show that the tablet (vb<<) flushed every 5 seconds
>>> or
>>> so and major compacted at a steady periodic rate.
>>>
>>> Everything looks good, until vb<<  grew large enough that it needed
>>> splitting.  This occurred about 42 minutes after the BatchWriter started
>>> writing entries.  The logs show a failure in an UpdateSession that popped
>>> up in the middle of the split operation.  This failure continues to show
>>> for the next 15 hours.
>>>
>>> I copied the portion of the tserver logs that look relevant to the split
>>> below.  I highlighted the line reporting the first failure.  It occurs in
>>> between when the split starts and when it finishes.
>>>
>>> Any idea what could have caused this?  I don't know if the failure is
>>> related to the BatchWriter being stuck in WAITING.  It seems likely.  I
>>> think it is weird that the 3 write threads are all idle; at least one of
>>> them should be doing something if the thread calling addMutation() is
>>> waiting.
>>>
>>> Here is a pastebin of the jstack, though I
>>> think I wrote the useful parts from it.
>>>
>>> 2016-04-17 22:38:06,436 [tablet.Tablet] TABLET_HIST: vb<<  closed
>>> 2016-04-17 22:38:06,439 [tablet.Tablet] DEBUG: Files for low split
>>> vb;2436<
>>>   [hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C8lh.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/C9iz.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Ca08.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Ca4t.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Ca7m.rf,
>>> hdfs:
>>> //localhost:9000/accumulo/tables/vb/default_tablet/Ca8f.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Ca8n.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Fa8p.rf,
>>> hdfs://localhost:9000/accumulo/tables/vb/default_tablet/Fa8q.rf,
>>> 

Re: Pros and Cons of moving SKVI to public API

2016-03-24 Thread William Slacum
It should be public API. It's one of the core reasons for choosing Accumulo
over a similar project like HBase or Cassandra. Allegedly, Jeff "Mean Gene"
Dean said we got the concept correct as well :)

Personally I hate the current API from a usability standpoint (i.e., the
generic types are useless and already encoded in the name, and it needlessly
diverges from standard Java Iterator calling conventions), but it's a
strong, identifying feature we have.

On Thu, Mar 24, 2016 at 2:50 PM, Christopher  wrote:

> Accumulators,
>
> What are the pros and cons that you can see for moving the
> SortedKeyValueIterator into the public API?
>
> Right now, I think there's still some need for improvement in the Iterator
> API, and many of the iterators may not be stable enough to really recommend
> people use without some serious caveats (because we may not be able to keep
> their API stable very easily). So, there's a con.
>
> In the pros side, iterators are a core feature of Accumulo, and nearly all
> of Accumulo's distributed processing capabilities are dependent upon them.
> It is reasonable to expect users to take advantage of them, and we've at
> least tried to be cautious about changing the iterators in incompatible
> ways, even if they aren't in the public API.
>
> Recently, this came up when we stripped out all the non-public API javadocs
> from the website. (reported by Dan Blum on the user list on March 4th:
>
> http://mail-archives.apache.org/mod_mbox/accumulo-user/201603.mbox/%3C066a01d17658%24bc9dc1b0%2435d94510%24%40bbn.com%3E
> )
>
> What would it take for us to feel comfortable moving them to the public
> API? Do we need a better interface first, or should we isolate the other
> iterators into another package (some of that has already been done), or
> should we wait for a proper public API package (2.0?) to provide this
> interface in?
>


Re: delete + insert case

2016-03-19 Thread William Slacum
Be aware of the OS's underlying granularity for time as well:

http://docs.oracle.com/javase/6/docs/api/java/lang/System.html#currentTimeMillis%28%29

I almost wonder if it's better to use the RowDeletingIterator in this case.
If the check it does is "if TS < delete marker TS", in theory you could get
away with putting the delete marker inside the same Mutation as the update
and the iterator will mask any data marked with a TS before the delete
marker.
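
A rough sketch of that idea (mine, not from this thread): attach the
RowDeletingIterator to the table, then put the row-delete marker and the
replacement data in one Mutation, with the marker timestamped just before the
new data. The `conn`, table, row, and column names here are assumptions.

```
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.RowDeletingIterator;
import org.apache.hadoop.io.Text;

public class RowReplaceSketch {
  public static void replaceRow(Connector conn, String table, String row) throws Exception {
    // One-time setup: the iterator must be active for the marker to hide old data.
    IteratorSetting rowDelete = new IteratorSetting(30, "rowDelete", RowDeletingIterator.class);
    conn.tableOperations().attachIterator(table, rowDelete);

    long now = System.currentTimeMillis();
    Mutation m = new Mutation(row);
    // Row-delete marker: empty family/qualifier, special value, slightly older timestamp.
    m.put(new Text(""), new Text(""), now - 1, RowDeletingIterator.DELETE_ROW_VALUE);
    // The replacement data, timestamped after the marker so it is not masked.
    m.put(new Text("fam"), new Text("qual"), now, new Value("new-value".getBytes()));

    BatchWriter bw = conn.createBatchWriter(table, new BatchWriterConfig());
    bw.addMutation(m);
    bw.close();
  }
}
```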

On Thu, Mar 17, 2016 at 11:18 AM, Josh Elser  wrote:

> Server-assigned timestamps aren't noticeably slower than user-assigned
> timestamps, if that's what you're referring to WRT throughput.
>
> As for using currentTimeMillis(), probably fine, but not always.
>
> 1) NTP updates might cause currentTimeMillis() to change in reverse
> 2) You need to make sure the delete and update always come from the same
> host (otherwise two hosts might have different values for
> currentTimeMillis())
>
> Time is hard in distributed systems.
>
>
> z11373 wrote:
>
>> Thanks Josh! For better throughput, I think I'd just assign the timestamp
>> from my code.
>> Using this code, System.currentTimeMillis(), for the timestamp should be
>> ok, right?
>>
>>
>> Thanks,
>> Z
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-accumulo.1065345.n5.nabble.com/delete-insert-case-tp16375p16382.html
>> Sent from the Developers mailing list archive at Nabble.com.
>>
>


Re: git-based site and jekyll

2016-03-10 Thread William Slacum
I would like to request at least one frame and one scrolling marquee. Can
we blingee the Accumulo logo?

On Thursday, March 10, 2016, Josh Elser  wrote:

> * Some companies on http://ctubbsii.github.io/accumulo/people.html are
> goofed as are the timezones.
> * Some broken links on http://ctubbsii.github.io/accumulo/source.html.
> Coding practices are also messed up.
> * http://ctubbsii.github.io/accumulo/contrib.html contrib project entries
> are a little wacky.
> * http://ctubbsii.github.io/accumulo/screenshots.html is weird with the
> monitor screenshot (should be beneath the text?)
> * Just noticed that Other and Documentation both have a link to the
> papers/presentations. That might actually be how the site is now, just
> realized it's duplicative.
>
> Thanks again for doing this. It's great!
>
> Christopher wrote:
>
>> Actually, I now have it all working (as far as I can tell) with everything
>> pretty much the same as it looks with CMS today. After people have taken
>> the time to give it a glance, I'll push it to the ASF repo, and then push
>> the generated site to a separate branch. Then we can put in the INFRA
>> ticket to switch from svn to git.
>>
>> On Thu, Mar 10, 2016 at 6:42 PM Christopher  wrote:
>>
>> I'm working on converting our current site contents over to jekyll at
>>> https://github.com/ctubbsii/accumulo/tree/gh-pages
>>> (view at http://ctubbsii.github.io/accumulo)
>>>
>>> Yes, it's terrible right now... it's in progress. :)
>>>
>>> On Tue, Mar 8, 2016 at 4:21 PM Josh Elser  wrote:
>>>
>>> Lazy consensus is fine. If there are no objections, I don't want to hold
 things up. I feel like I've adequately expressed my concerns. Silence
 can and should be treated as acknowledgement for this, IMO.

 Christopher wrote:

> Another reason we probably shouldn't worry about this: anybody can
>
 create a

> DNS name at their leisure which transparently redirects to
> accumulo.apache.org and serves its contents. This is perfectly
>
 legitimate

> for a number of reasons, including corporate proxies/mirrors,
> URL-shortening services, caching services, archiving services,
> vision-impaired accessibility services, foreign-language DNS mappings,
>
 and

> so-on.
>
> I think when it comes to trademarks and our website, our area of
> concern
> should mostly focus on when people misrepresent our trademark in the
>
 course

> of their mirroring/archiving. There's no risk of that for a mirror that
>
 is

> explicitly under our control, but I'm really leaning towards the
>
 javascript

> to detect and display a message about the canonical location just to
> mitigate any possibility for concern.
>
> If you still have concerns, I'd be happy to put it up for a formal vote
> from the PMC, or to get feedback from ASF trademarks folks before we
> proceed.
>
> On Tue, Mar 8, 2016 at 3:22 PM Josh Elser
>  wrote:
>
> Well, I think the difference is that archive.org (and others -- google
>> cached pages come to mind) are devoted/known for that specific
>> purpose.
>> The fact that Github ends up being a "de-facto" location for software
>> projects, I'm just nervous about the expecting good faith from the
>> denizens of the internet. Maybe I'm just worrying too much. If there's
>> sufficient "it'll be ok" opinion coming from the PMC, it's fine by me.
>>
>> Christopher wrote:
>>
>>> I can't imagine there's a trademark issue since it's really just
>>>
>> acting

> as
>>
>>> a mirror. If there were trademark issues, I imagine sites like
>>> http://archive.org would be in big trouble. But, it certainly
>>>
>> couldn't

> hurt
>>
>>> to find out.
>>>
>>> Another option to sabotage the GH-rendered site is to add some
>>>
>> javascript

> which detects the location and displays an informative link back to
>>>
>> the

> canonical location for the site. That should be simple enough to do.
>>>
>>> On Tue, Mar 8, 2016 at 1:36 PM Josh Elser
>>>
>>   wrote:

> It's also probably worth mentioning that this concern only comes

>>> about

> for point #4 (or if we use the branch name gh-pages in point #1).

 Josh Elser wrote:

> The one concern I had was regarding automatic rendering of what
>
 would

> look like "the Apache Accumulo website" on Github (both
>
 apache/accumulo

> github account and other forks).
>
> Christopher had said that no one seemed to object in comdev@ when
>
 he

> talked about this a while back. I wanted to make sure 

Re: Trouble connecting to Kerberized Accumulo/Zookeeper

2016-03-08 Thread William Slacum
This is one of the tests I used to figure out how Kerberos worked with
Accumulo.

https://gist.github.com/wjsl/93c8528e8f27bbeb31bf

You'll see the pattern where I would call `val someUser =
loginUserFromKeytabAndReturnUGI(...)` and then execute connections inside
of a doAs call.
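
A minimal Java version of that pattern (the gist itself is Scala), assuming
you don't want to replace the process-wide login: get a UGI from the keytab,
then open the Accumulo connection inside doAs. The principal, keytab path,
instance name, and ZooKeeper string below are placeholders.

```
import java.security.PrivilegedExceptionAction;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.KerberosToken;
import org.apache.hadoop.security.UserGroupInformation;

public class DoAsConnectSketch {
  public static Connector connect() throws Exception {
    final UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "someuser@EXAMPLE.COM", "/path/to/someuser.keytab");
    return ugi.doAs(new PrivilegedExceptionAction<Connector>() {
      @Override
      public Connector run() throws Exception {
        // Inside doAs, the Kerberos credentials from the UGI are on the thread,
        // so the SASL/GSSAPI handshake happens as that user.
        KerberosToken token = new KerberosToken("someuser@EXAMPLE.COM");
        return new ZooKeeperInstance("myinstance", "zkhost:2181")
            .getConnector(token.getPrincipal(), token);
      }
    });
  }
}
```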


On Tue, Mar 8, 2016 at 7:31 PM, William Slacum <wsla...@gmail.com> wrote:

> I think one thing is that we can at least guarantee you can connect to the
> KDC.
>
> It kind of seems like there's an issue with communication between the
> client and Accumulo. Can you try `new KerberosToken(principal, keytab,
> true)`? I think I ran into this when figuring things out on my own. By passing
> in `false`, the connection won't be made as that user. I think you have to
> manually execute the connection in a PrivilegedAction if you don't replace
> the currently logged in user.
>
> On Tue, Mar 8, 2016 at 7:16 PM, Tristen Georgiou <tgeorg...@phemi.com>
> wrote:
>
>> One thing I've noticed is that my client stack trace makes no mention of
>> using an SASL transport (It's the last log dump in this email). Maybe this
>> is the problem; Accumulo wants an SASL connection, but for some reason the
>> client app isn't using an SASL transport even though I'm using a Kerberos
>> Token?
>>
>> *Tablet server log:*
>>
>> 2016-03-08 16:09:07,875 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:09:17,884 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:09:27,892 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:09:37,902 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:09:43,142 [server.TThreadPoolServer] ERROR: Error occurred
>> during processing of message.
>> java.lang.RuntimeException:
>> org.apache.thrift.transport.TTransportException
>> at
>>
>> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
>> at
>>
>> org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:51)
>> at
>>
>> org.apache.accumulo.core.rpc.UGIAssumingTransportFactory$1.run(UGIAssumingTransportFactory.java:48)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:360)
>> at
>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637)
>> at
>>
>> org.apache.accumulo.core.rpc.UGIAssumingTransportFactory.getTransport(UGIAssumingTransportFactory.java:48)
>> at
>>
>> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at
>> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.thrift.transport.TTransportException
>> at
>>
>> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>> at
>>
>> org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:178)
>> at
>>
>> org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
>> at
>> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
>> at
>>
>> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>> at
>>
>> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>> ... 11 more
>> 2016-03-08 16:09:47,908 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:09:57,915 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:10:01,035 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>> 2016-03-08 16:10:01,037 [handler.PhemiAuthenticator] INFO : Authenticating
>> user: phemi...@dev.phemi.com
>>
>> *Master server log:*
>>
>> 2016-03-08 16:11:37,702 [replication.WorkMaker] INFO : Replication table
>> is
>> not yet online
>> 2016-03-08 16:11:43,160 [server.TThreadPoolServer] ERROR: Error occurred
>> during p

Re: Trouble connecting to Kerberized Accumulo/Zookeeper

2016-03-08 Thread William Slacum
sumingTransportFactory.java:48)
> at
>
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:208)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.thrift.transport.TTransportException
> at
>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
>
> org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:178)
> at
>
> org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
> at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
> at
>
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
> at
>
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
> ... 11 more
> 2016-03-08 16:12:07,705 [replication.WorkMaker] INFO : Replication table is
> not yet online
>
> *And the debug log for the client application:*
>
> 2016-03-08 16:11:39,241 DEBUG [main-SendThread(dev:2181)]
> zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping
> response for sessionid: 0x15342d74d18477f after 0ms
> 2016-03-08 16:11:49,250 DEBUG [main] impl.ServerClient
> (ServerClient.java:executeRaw(101)) - ClientService request failed
> dev:9997, retrying ...
> org.apache.thrift.transport.TTransportException:
> java.net.SocketTimeoutException: 12 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.129.0.116:37532
> remote=dev/10.129.0.110:9997]
> at
>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
>
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
> at
>
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
> at
>
> org.apache.accumulo.core.client.impl.ThriftTransportPool$CachedTTransport.readAll(ThriftTransportPool.java:270)
> at
>
> org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
> at
>
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
> at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at
>
> org.apache.accumulo.core.client.impl.thrift.ClientService$Client.recv_authenticate(ClientService.java:500)
> at
>
> org.apache.accumulo.core.client.impl.thrift.ClientService$Client.authenticate(ClientService.java:486)
> at
>
> org.apache.accumulo.core.client.impl.ConnectorImpl$1.execute(ConnectorImpl.java:70)
> at
>
> org.apache.accumulo.core.client.impl.ConnectorImpl$1.execute(ConnectorImpl.java:67)
> at
>
> org.apache.accumulo.core.client.impl.ServerClient.executeRaw(ServerClient.java:98)
> at
>
> org.apache.accumulo.core.client.impl.ServerClient.execute(ServerClient.java:61)
> at
>
> org.apache.accumulo.core.client.impl.ConnectorImpl.(ConnectorImpl.java:67)
> at
>
> org.apache.accumulo.core.client.ZooKeeperInstance.getConnector(ZooKeeperInstance.java:248)
> at
>
> com.phemi.testing.AccumuloKerberosConnection.main(AccumuloKerberosConnection.java:18)
> Caused by: java.net.SocketTimeoutException: 12 millis timeout while
> waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.129.0.116:37532
> remote=dev/10.129.0.110:9997]
> at
>
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at
>
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
> ... 17 more
>
>
> On Tue, Mar 8, 2016 at 4:03 PM William Slacum <wsla...@gmail.com> wrote:
>
> > Any logs on the Accumulo and/or KDC side?
> >
> > On Tue, Mar 8, 2016 at 5:05 PM, Tristen Georgiou <tgeorg...@phemi.com>
> > wrote:
> >

Re: Trouble connecting to Kerberized Accumulo/Zookeeper

2016-03-08 Thread William Slacum
Any logs on the Accumulo and/or KDC side?

On Tue, Mar 8, 2016 at 5:05 PM, Tristen Georgiou 
wrote:

> Here is a simple Java program to attempt to get a connection to Accumulo
> and list the local users:
>
> package com.phemi.testing;
>
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.Instance;
> import org.apache.accumulo.core.client.ZooKeeperInstance;
> import org.apache.accumulo.core.client.security.tokens.KerberosToken;
>
> import java.io.File;
>
> public class AccumuloKerberosConnection {
>     public static void main(String[] args) throws Exception {
>         Instance inst = new ZooKeeperInstance("agile_accumulo", "dev");
>         KerberosToken token = new KerberosToken("accumulo-ph...@dev.phemi.com",
>                 new File("/etc/security/keytabs/accumulo.headless.keytab"), false);
>         Connector conn = inst.getConnector(token.getPrincipal(), token);
>         System.out.println(conn.securityOperations().listLocalUsers());
>     }
> }
>
> It always hangs at the getConnector function.
>
> NOTE: the user and keytab were created by Ambari as the default Accumulo
> user.
>
> At first I noticed that there was an error saying that it was not
> attempting to connect using SASL (unspecified error) but after some digging
> I found that I could get around this using a jaas.conf file and specifying it
> as a Java argument on the command line:
>
> -Djava.security.auth.login.config=/tmp/jaas.conf
>
> Where the file contains:
>
> Client {
> com.sun.security.auth.module.Krb5LoginModule required
> useKeyTab=true
> keyTab="/etc/security/keytabs/accumulo.headless.keytab"
> principal="accumulo-phemi"
> useTicketCache=false
> debug=true;
> };
>
> Now I'm at a point where it says it's using GSSAPI as SASL mechanism
> (good!) but it still hangs at the "Connector conn =
> inst.getConnector(token.getPrincipal(), token);" line.
>
> Any pointers on what I'm doing wrong?
>
> Tristen
>
> PS: Here is the debug output:
> 2016-03-08 14:02:03,600 WARN  [main] client.ClientConfiguration
> (ClientConfiguration.java:loadFromSearchPath(227)) - Found no client.conf
> in default paths. Using default client configuration values.
> 2016-03-08 14:02:03,668 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client
> environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
> 2016-03-08 14:02:03,670 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client environment:host.name
> =tgeorgiou-ubuntu-dev
> 2016-03-08 14:02:03,670 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client environment:java.version=1.7.0_95
> 2016-03-08 14:02:03,670 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client environment:java.vendor=Oracle
> Corporation
> 2016-03-08 14:02:03,670 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client
> environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre
> 2016-03-08 14:02:03,670 INFO  [main] zookeeper.ZooKeeper
> (Environment.java:logEnv(100)) - Client
>
> 

Re: [ATTN] Cleaning up extra refs in git

2016-03-04 Thread William Slacum
Any stats on what the repo size is after removing the refs and doing
something like `git gc`?

On Fri, Mar 4, 2016 at 4:25 PM, Christopher  wrote:

> I was able to delete 135 duplicate refs of the kind I described. Only one
> resulted in a new branch being created (ACCUMULO-722). We probably don't
> need that at all, but it might be useful to turn into patches to attach to
> the "Won't Fix" ticket, rather than preserve them as an inactive branch.
>
> Also note that the ACCUMULO-722 branch is not rooted on any other branches
> in our git repo. It was essentially just a sandbox in svn where Eric had
> been working.
>
> On Wed, Mar 2, 2016 at 6:14 PM Christopher  wrote:
>
> > (tl;dr version: I'm going to clean up refs/remotes/** in git, which
> > contains duplicate history and messes with 'git clone --mirror'; these
> are
> > refs which are neither branches nor tags and leftover from git-svn)
> >
> > So, when we switched from svn to git, there were a lot of leftover refs
> > left in the git repository that are from old branches/history which has
> > already been merged into the branches/tags that we've since created. I
> > think these were leftover from weird git-svn behavior. These can, and
> > should, be cleaned up.
> >
> > You can see all of them when you do a:
> > git ls-remote origin
> >
> > In that output, our current branches are the refs/heads/*, and our tags
> > are the refs/tags/*
> > The extras which need to be cleaned up are the refs/remotes/* (including
> > refs/remotes/tags/*)
> >
> > As you can see, these are duplicates of branches which have been merged
> in
> > already, or temporary tags which didn't make it to a release (release
> > candidates) but whose relevant history is already in our normal git
> > history, or they are branches which were abandoned on purpose
> > (ACCUMULO-722).
> >
> > Usually these extra refs don't present a problem, because we don't
> > normally see them when we clone (they aren't branches which are normally
> > fetched). However, there are a few cases where this is a problem. In
> > particular, they show up when you do "git clone --mirror", and if you
> push
> > this mirror to another git repository, like a GitLab mirror (git push
> > --mirror), they show up as extra branches which don't appear to exist in
> > the original (a very confusing situation for a "mirror").
> >
> > The interesting thing about these, is that even when they have the same
> > history as the git branches/tags we maintain now, the SHA1s don't match
> up.
> > This seems to imply they were leftover from a previous invocation of
> > git-svn.
> >
> > So, what I'd like to do is go through each of these extra refs one by
> one,
> > and figure out if we already have this history in our branches/tags. If
> we
> > do, then I'd delete these extras. If we don't (as in the case of
> > ACCUMULO-722), I'd just convert that to a normal git branch
> (refs/heads/*)
> > until we decide what to do with it at some future point in time (for
> > example, perhaps do a 'git format-patch' on it and attach the files to
> the
> > "Won't Fix" ticket so we can delete the dead branch? not sure, but that
> can
> > be deferred).
> >
>


Re: Monitor Tablet Id mapping

2016-02-23 Thread William Slacum
It's md5sum'd then base64'd. I think you'd have to build a mapping of
tablet id <-> md5sum to translate them.
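
A small sketch of building that mapping yourself; this is only my assumption
about how the displayed id is derived (MD5 of the serialized extent, then
Base64), and the extent string is a placeholder.

```
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class MonitorTabletId {
  // Hash a tablet's serialized extent and Base64 the digest.
  public static String displayId(String extent) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] digest = md5.digest(extent.getBytes(StandardCharsets.UTF_8));
    return Base64.getEncoder().encodeToString(digest);
  }

  public static void main(String[] args) throws Exception {
    // Compute this for every extent you care about, then invert the map to
    // translate what the monitor shows back to the real tablet.
    System.out.println(displayId("vb;2436<"));
  }
}
```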

On Tue, Feb 23, 2016 at 3:59 PM,  wrote:

> Anyone know the utility to map the tablet id on the monitor to its actual
> value? It looks like its Base64 encoded on the monitor, but I don't think
> that's actually the case.
>
>
>
>


Re: [DRAFT] [ANNOUNCE] Apache Accumulo 1.6.5

2016-02-17 Thread William Slacum
Thanks, Christopher!

On Wed, Feb 17, 2016 at 4:06 PM, Christopher  wrote:

> Staging build looks stuck for extpaths. Might need to wait for it to time
> out, but all the important stuff is published.
>
> On Wed, Feb 17, 2016, 18:55 Josh Elser  wrote:
>
> > LGTM, made two slight tweaks to the release notes (basic
> grammar/english).
> >
> > Christopher wrote:
> > > The Apache Accumulo project is pleased to announce its 1.6.5 release.
> > >
> > > Version 1.6.5 is the most recent bug-fix release in its 1.6.x release
> > line.
> > > This version includes several bug fixes since 1.6.4. Existing users of
> > the
> > > 1.6.x release line are encouraged to upgrade immediately with
> confidence.
> > >
> > > The Apache Accumulo sorted, distributed key/value store is a robust,
> > > scalable, high performance data storage system that features cell-based
> > > access control and customizable server-side processing. It is based on
> > > Google's BigTable design and is built on top of Apache Hadoop, Apache
> > > ZooKeeper, and Apache Thrift.
> > >
> > > This release is available at http://accumulo.apache.org/downloads/ and
> > > release notes at http://accumulo.apache.org/release_notes/1.6.5.html.
> > >
> > > - The Apache Accumulo Team
> > >
> >
>


Re: On 1.7.1 rc1 (was Re: [VOTE] Accumulo 1.6.5-rc2)

2016-02-17 Thread William Slacum
"William" is my grandfather. Please refer to me as "Sir William".

On Wed, Feb 17, 2016 at 8:44 AM, Josh Elser  wrote:

>
>
> Christopher wrote:
>
>> I'll wrap up this release tomorrow and get started on 1.7.1 soon.
>>
>
> FYI, I want to ping William about ACCUMULO-4140. His wording is a little
> scary. I'd like to find some time today to look into it.
>


Re: [DISCUSS] Trivial changes and git

2016-01-06 Thread William Slacum
I've worked on teams who had perpetually open tickets like "Solve warnings"
or "Fix typos". Those issues could be referenced in commits just say they
were involved with some changes.

From your list, I think #1 is fine, and that #3 is preferable to #2. I
don't feel strongly, to be honest, as the trivialness of the changes doesn't
pose a large QA risk either way. Do we realistically see the scenario of,
"I need to hunt down the commit where this spelling mistake was corrected
and it's hard to do because it's lumped in with a different patch set"
posing some great roadblock to us?

Not to nitpick too much, but I think the significance of git in this
discussion is minimal-- the same problem could be had w/ any VCS.

On Wed, Jan 6, 2016 at 3:28 PM, Christopher  wrote:

> Accumulo Devs,
>
> We typically create a JIRA for every change, and then explicitly reference
> that JIRA in the git commit log. Sometimes, this seems like a lot of work
> (or, at the very least, a big distraction) for *really* trivial changes[1].
>
> My question(s):
>
> What are the pros and cons of being strict about this for trivial issues?
> What value does creating a JIRA actually add for such things? Is the
> creation of a JIRA issue worth the distraction and time in ALL cases, or
> should developer discretion apply? How strict to we want to be about JIRA
> references?
>
> * * *
>
> For additional consideration, I've noticed that trivial fixes tend to get
> addressed in the following ways:
>
> 1. "Drive-by" - rolled into another, unrelated, commit (will get
> reviewed/reverted/merged along with a non-trivial issue, simply due to its
> vicinity in space or time)
> 2. "One-JIRA-to-rule-them-all" - a JIRA without much of a description,
> created "just so we have a ticket to reference" for several (perhaps
> unrelated) trivial fixes
> 3. "One-JIRA-each" - each trivial issue gets its own JIRA issue, its own
> commit, and its own description (many of each are nearly identical)
>
> In each case, it seems like it would have been sufficient to simply
> describe the trivial change in a separate git commit which is included in
> the next push.
>
> * * *
>
> [1]: By "*really* trivial changes", I mean small typos,
> spelling/grammar/punctuation/capitalization issues in docs, formatting,
> String literal alignment/wrapping issues, perhaps even missing @Overrides
> annotations, extra semicolons, unneeded warnings suppressions, etc.
> Essentially, things that are typically one-off changes that don't change
> the behavior or substance of the code or documentation, or that are
> self-contained, easily-understood, can be reasonably expected to be
> non-controversial, and which couldn't be further elaborated upon with a
> description in JIRA. Such changes would not include trivial bug fixes or
> feature enhancements, and are more likely to be described as style or typo
> fixes.
>


Re: delete rows test result

2015-11-16 Thread William Slacum
"Reading" all of the rows first implies you're bringing back the entire
result to a client, which provides you serial access to the data.

I think you should re-run test #3 that measures the time it takes to call
deleteRows only. I'm emphasizing this because I've worked on projects that
could quickly define a range to be deleted without reading any data, and
using deleteRows decreased our latency significantly.

On Mon, Nov 16, 2015 at 11:19 AM, z11373  wrote:

> I didn't do that, but I am sure can extrapolate that from Test 1.
>
> Test 1 is doing:
> foreach k/v in scanner's iterator
> create a new mutation with that row
> call putDelete
>
> Test 3 is doing
> foreach k/v in scanner's iterator
> assign the row of first entry to 'first' var
> assign the row to a 'last' var
> After the loop is done, pass 'first' and 'last' vars to deleteRows.
>
> So, if I'd extrapolate the time without reading all rows, then we can
> subtract result from Test 1 from result from Test 3, i.e. for Table 1 is
> 196,597 - 5,702 = 190,895 (this is still way too long)
>
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15571.html
> Sent from the Developers mailing list archive at Nabble.com.
>


Re: delete rows test result

2015-11-16 Thread William Slacum
What happens when you subtract the time to read all of your rows?
deleteRows is designed so you don't have to read any data-- you can compute
a range to delete. For instance, in a time series table, it's trivial to give
a start and end date as your rows and call deleteRows.
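
A small sketch of what that looks like (mine, not from the thread): delete a
date range from a time-series table whose row is an ISO-8601 date, without
scanning any data back to the client. The `conn` and table name are assumptions.

```
import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class DeleteDateRange {
  public static void deleteDays(Connector conn, String table) throws Exception {
    // deleteRows removes rows after `start` (exclusive) up to and including `end`,
    // so pass the row just before the first day you want gone as the start.
    conn.tableOperations().deleteRows(table, new Text("2015-11-01"), new Text("2015-11-07"));
  }
}
```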

On Mon, Nov 16, 2015 at 10:35 AM, z11373  wrote:

> Last week on a separate thread it was suggested that I use
> tableOperations.deleteRows for deleting rows that match specific
> ranges. So I was curious to try it out to see if it's better than my current
> implementation, which iterates all rows and calls putDelete for each.
> While researching, I also found Accumulo already provides BatchDeleter,
> which also does the same thing.
> I tried all of three, and below is my test results against three different
> tables (numbers are in milliseconds):
>
> Test 1 (using iterator and call putDelete for each):
> Table 1: 5,702
> Table 2: 6,912
> Table 3: 4,694
>
> Test 2 (using BatchDeleter class):
> Table 1: 8,089
> Table 2: 10,405
> Table 3: 7,818
>
> Test 3 (using tableOperations.deleteRows, note that I first iterate all
> rows, just to get the last row id, which then being passed as argument to
> the function):
> Table 1: 196,597
> Table 2: 226,496
> Table 3: 8,442
>
>
> I ran the tests a few times, and pretty much got the consistent results
> above.
> I didn't look at what the deleteRows code is really doing, but looking at my
> test results, I can say it sucks!
> Note that for that test, I did scan and iterate just to get the last row id,
> but even if I subtract the time for doing that, it's still way too slow.
> Therefore, I'd recommend anyone avoid using deleteRows for this scenario.
> YMMV, but I'd stick with my original approach, which does the same thing as
> Test 1 above.
>
> Thanks,
> Z
>
>
>
>
>


Re: total table rows

2015-11-12 Thread William Slacum
There is a performance difference. You have an upper bound of returning all
scanned data to the client, even with a FirstEntryInRowIterator. Imagine
a table layout where each Key/Value pair represents a single row or
document. Using a counting iterator will return a count (most likely a
64-bit long) for each tablet, which the client can then add together.

There is a deleteRows feature (TableOperations#deleteRows) which may be
what you want. It avoids having to bring data back to the client.
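
A minimal sketch of the counting approach described above, using the
FirstEntryInRowIterator that ships with Accumulo; the priority and iterator name
are arbitrary, and imports are assumed as in the other examples:

    // Returns one entry per row, so counting entries on the client gives the
    // row count. Note the first entry of every row still travels to the client.
    static long countRows(Connector conn, String table) throws Exception {
      Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
      scanner.addScanIterator(
          new IteratorSetting(30, "firstEntry", FirstEntryInRowIterator.class));
      long rows = 0;
      for (Map.Entry<Key,Value> entry : scanner)
        rows++;
      return rows;
    }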

On Thu, Nov 12, 2015 at 9:23 AM, z11373  wrote:

> Thanks all for the reply.
>
> @Josh: Is my understanding correct that iterating the rows to get the count
> on client side and server side doesn't have significant performance diff?
>
> Besides the counting iterator, I'd like to see if we can add a feature for
> deleting in bulk. Right now, I have to go through each of them and then call
> putDelete from the client. I wish there were a magic way to tell the server to delete
> all rows for a specific range.
>
>
> Thanks,
> Z
>
>
>
>


Re: total table rows

2015-11-09 Thread William Slacum
Pranked... you can't use a CountingIterator, because it can't be init'd.
Can we get rid of that limitation?

On Mon, Nov 9, 2015 at 10:43 AM, William Slacum <wsla...@gmail.com> wrote:

> An iterator stack of FirstEntryInRowIterator + CountingIterator will
> return the count of rows in each tablet, which can then be combined on the
> client side.
>
> On Mon, Nov 9, 2015 at 10:25 AM, Josh Elser <josh.el...@gmail.com> wrote:
>
>> Yeah, there's no explicit tracking of all rows in Accumulo, you're stuck
>> with enumerating them (or explicitly tracking them yourself at ingest time).
>>
>> The easiest approach you can take is probably using the
>> FirstEntryInRowIterator and counting each row on the client-side.
>>
>> You could do another summation in a second iterator but this is a little
>> tricky to get correct. I tried to touch on this a little in a blog post[1].
>> If this is a one-off question you want to answer, doing the summation on
>> the client side is likely not to take excessively longer than a server-side
>> summation.
>>
>> [1]
>> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>>
>>
>> z11373 wrote:
>>
>>> I want to get total rows of a table (likely has more than 100M rows), I
>>> think
>>> to get that information, Accumulo would have to iterate all rows :-( This
>>> may not be typical Accumulo scenario.
>>>
>>> Is there a more efficient way to get total number of rows in a table?
>>> When Accumulo iterating those items, does it mean it will pull the data
>>> to
>>> the client? If yes, is there a way to ask it to return just the number,
>>> since that's the only data I care.
>>>
>>> Thanks,
>>> Z
>>>
>>>
>>>
>>>
>>
>


Re: total table rows

2015-11-09 Thread William Slacum
An iterator stack of FirstEntryInRowIterator + CountingIterator will
return the count of rows in each tablet, which can then be combined on the
client side.

On Mon, Nov 9, 2015 at 10:25 AM, Josh Elser  wrote:

> Yeah, there's no explicit tracking of all rows in Accumulo, you're stuck
> with enumerating them (or explicitly tracking them yourself at ingest time).
>
> The easiest approach you can take is probably using the
> FirstEntryInRowIterator and counting each row on the client-side.
>
> You could do another summation in a second iterator but this is a little
> tricky to get correct. I tried to touch on this a little in a blog post[1].
> If this is a one-off question you want to answer, doing the summation on
> the client side is likely not to take excessively longer than a server-side
> summation.
>
> [1]
> https://blogs.apache.org/accumulo/entry/thinking_about_reads_over_accumulo
>
>
> z11373 wrote:
>
>> I want to get total rows of a table (likely has more than 100M rows), I
>> think
>> to get that information, Accumulo would have to iterate all rows :-( This
>> may not be typical Accumulo scenario.
>>
>> Is there a more efficient way to get total number of rows in a table?
>> When Accumulo iterating those items, does it mean it will pull the data to
>> the client? If yes, is there a way to ask it to return just the number,
>> since that's the only data I care.
>>
>> Thanks,
>> Z
>>
>>
>>
>>
>


Re: [DISCUSS] What to do about encryption at rest?

2015-11-05 Thread William Slacum
> > >
> > > > > > > +1 I think this is the right step. My hunch is that some of the
> > > > common
> > > > > > > data access patterns that we have in Accumulo (over HBase) is
> > that
> > > > the
> > > > > > > per-colfam encryption isn't quite as common a design pattern as
> > it
> > > is
> > > > > > > for HBase (please tell me I'm wrong if anyone disagrees -- this
> > is
> > > > > > > mostly a gut reaction). I think our users would likely benefit
> > more
> > > > > from
> > > > > > > a per-namespace/table encryption control like you suggest.
> > > > > > >
> > > > > > > Implementing RFile encryption at HDFS level (e.g. tie a
> specific
> > > > > > > zone/key for a table) is probably straightforward. Changing the
> > > > > > > TServer's WAL use would likely be trickier to get right (a
> > tserver
> > > > > would
> > > > > > > have multiple WALs, one for each unique zone/key from Tablet it
> > > > happens
> > > > > > > to host). Maybe worrying about that is getting ahead of things
> --
> > > > just
> > > > > > > thought about it and figured I'd mention it :)
> > > > > > >
> > > > > > > William Slacum wrote:
> > > > > > > > Yup, #2. I also don't know if it's worth the effort for that
> > > > specific
> > > > > > > > feature. It might be easier to add something like
> per-namespace
> > > > > and/or
> > > > > > > > per-table encryption, then define common access patterns for
> > > > > > applications
> > > > > > > > that want to use multiple keys for encryption.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Nov 4, 2015 at 8:10 PM, Adam Fuchs<afu...@apache.org
> >
> > > > > wrote:
> > > > > > > >
> > > > > > > >> Bill,
> > > > > > > >>
> > > > > > > >> Do you envision one of the following as the driver behind
> > > > > > finer-grained
> > > > > > > >> encryption?:
> > > > > > > >>
> > > > > > > >> 1. We would only encrypt certain columns in order to get
> > better
> > > > > > > >> performance;
> > > > > > > >>
> > > > > > > >> 2. We would use different keys on different columns in order
> > to
> > > > > revoke
> > > > > > > >> access to a column via the key store;
> > > > > > > >>
> > > > > > > >> 3. We would only give a tablet server access to a subset of
> > > > columns
> > > > > at
> > > > > > > any
> > > > > > > >> given time in order to protect something, and figure out
> what
> > to
> > > > do
> > > > > > for
> > > > > > > >> compactions, etc.;
> > > > > > > >>
> > > > > > > >> 4. Something entirely different...
> > > > > > > >>
> > > > > > > >> Seems like thing #2 might have merit, but I'm not sure it's
> > > worth
> > > > > the
> > > > > > > >> effort.
> > > > > > > >>
> > > > > > > >> Adam
> > > > > > > >> On Nov 4, 2015 7:38 PM, "William Slacum"<wsla...@gmail.com>
> > > > wrote:
> > > > > > > >>
> > > > > > > >>> @Adam, column family level encryption can be useful for
> > > > > multi-tenant
> > > > > > > >>> environments, and I think it maps pretty well to the
> document
> > > > > > > >>> partitioning/sharding/wikisearch style tables. Things are
> > > > trickier
> > > > > in
> > > > > > > >>> Accumulo than in HBase since there isn't a 1:1 mapping
> > between
> > > > > column
> > > > > > > >>> families and files. The built in RFile encryption scheme

[DISCUSS] What to do about encryption at rest?

2015-11-04 Thread William Slacum
@Adam, column family level encryption can be useful for multi-tenant
environments, and I think it maps pretty well to the document
partitioning/sharding/wikisearch style tables. Things are trickier in
Accumulo than in HBase since there isn't a 1:1 mapping between column
families and files. The built in RFile encryption scheme seems better
suited to this.

@Christopher & Keith, it's something we can evaluate. Is there a good test
harness for just writing an RFile, opening a reader to it, and just poking
around? I was looking at the constructors and they didn't seem
straightforward enough for me to comprehend them within a few seconds.



On Tue, Nov 3, 2015 at 9:56 PM, Keith Turner <ke...@deenlo.com> wrote:

> On Mon, Nov 2, 2015 at 1:37 PM, Keith Turner <ke...@deenlo.com> wrote:
>
> >
> >
> > On Mon, Nov 2, 2015 at 12:27 PM, William Slacum <wsla...@gmail.com> wrote:
> >
> >> Is "the code being 'at rest'" you making a funny about active
> development?
> >> Making sure I haven't lost my ability to get jokes :)
> >>
> >> I see two reasons why the code would be inactive: the feature is good
> >> enough as is or it's not interesting enough to attract attention.
> >> Considering it's not public API, there are no discussions to bring into
> >> the
> >> public API, and there's no effort to document how to use it, my
> intuition
> >> tells me that there isn't enough interest in it from a project
> >> perspective.
> >>
> >> From a user perspective, I've been getting asked about it when I work
> with
> >> Accumulo users. My recommendation, exclusively, is to use HDFS
> encryption
> >> because I can go to Hadoop's website and find documentation on it. When
> I
> >> go to find documentation on Accumulo's offerings, any usability
> >> information
> >> comes from vendor SlideShares. Most mentions of the feature on official
> >> Apache Accumulo channels echo Christopher's sentiments on the feature
> >> being
> >> experimental and not being officially recommended for use.
> >>
> >> I wouldn't want to rip out the feature first and then figure things out
> >> later. Sean already alluded to it, but a roadmap should contain
> something
> >> (tool or documentation) to help users migrate if we go down that route.
> >>
> >> What I'm trying to figure out is, when the question of "How do I do
> >> encryption at rest in Accumulo?" comes up, what is our community's
> answer?
> >>
> >> If we went down the route of using HDFS encryption zones, can we offer
> the
> >> same features? At the very least, we'd be offering the same
> database-level
> >>
> >
> > Where does the decryption happen with DFS, is it in the DFS client?  If
> > so, using HDFS level encryption seems to offer the same functionality???
> >
> > Has anyone written a tool that takes an
> > Accumulo-encrypted-HDFS-unencrypted-RFile and rewrites it is as an
> > Accumulo-unencrypted-HDFS-encrypted-RFile?  Wondering if there are any
> > unexpected gotchas w/ this.
> >
>
> I was discussing my questions w/ Christopher today and he mentioned an
> experiment that I thought was interesting.   What is the random seek
> performance of Accumulo-encrypted-HDFS-unencrypted-RFile vs
> Accumulo-unencrypted-HDFS-encrypted-RFile?
>
>
> >
> >
> >
> >> encryption scheme. I don't know the details of "more advanced key
> stores",
> >> but it seems like we could potentially take any custom implementation
> and
> >> map it to a KeyProvider [1]. I could also envision table level
> encryption
> >> being implementable via zones, but probably not down to the column
> family
> >> level.
> >>
> >> [1]
> >>
> >>
> https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/crypto/key/KeyProvider.html
> >>
> >>
> >> > On Sun, Nov 1, 2015 at 10:19 AM, Adam Fuchs <afu...@apache.org> wrote:
> >>
> >> > Responses inline.
> >> >
> >> > Adam
> >> >
> >> > On Nov 1, 2015 9:58 AM, "Christopher" <ctubb...@apache.org> wrote:
> >> > >
> >> > > 1. I'm not sure I'd call an incomplete solution 'great

[DISCUSS] What to do about encryption at rest?

2015-10-30 Thread William Slacum
So I've been looking into options for providing encryption at rest, and it
seems like what Accumulo has is abandonware from a project perspective.
There is no official documentation on how to perform encryption at rest,
and the best information on its status comes from year-old (or older)
ticket comments about how the feature is still experimental. Recently there
was a talk that described using HDFS encryption zones as an alternative.

From my perspective, this is what I see as the current situation:

1- Encryption at rest in Accumulo isn't actively being worked on
2- Encryption at rest in Accumulo isn't part of the public API or marketed
capabilities
3- Documentation for what does exist is scattered throughout Jira comments
or presentations
4- A viable alternative exists that appears to have feature parity in HDFS
encryption
5- HBase has finer grained encryption capabilities that extend beyond what
HDFS provides

Moving forward, what's the consensus for supporting this feature?
Personally, I see two options:

1- Start going down a path to bring the feature into the forefront and
start providing feature parity with HBase

or

2- Remove the feature and place emphasis on upstream encryption offerings

Any input is welcomed & appreciated!


Re: HBase and Accumulo

2015-08-19 Thread William Slacum
If you drew a Venn diagram of HBase features compared to Accumulo features,
it's pretty much going to be a single circle.

If you want performance anecdotes, the most succinct summary I've seen is
that Accumulo can handle heavier write loads whereas HBase will handle
heavier read loads. From these two points you can venture into the many
tablets/regions discussions, the different emphasis the two place on column
families, and the use of server-side processing capabilities, etc.


On Wed, Aug 19, 2015 at 1:30 PM, Josh Elser josh.el...@gmail.com wrote:

 Like I've said many times now, it's relative to your actual problem. If
 you don't have that much data (or intend to grow into that much data), it's
 not an issue. Obviously, this is the case for you.

 However, it is an architectural difference between the two projects with
 known limitations for a single metadata region. It's a difference as what
 was asked for by Jerry.


 Ted Malaska wrote:

 I've been doing HBase for a long time and never had an issue with region
 count limits, and I have clusters with 10s of billions of records.  Maybe
 there would be issues around a couple trillion records, but I never got that
 high yet.

 Ted Malaska

 On Wed, Aug 19, 2015 at 2:24 PM, Josh Elser josh.el...@gmail.com  wrote:

 Oh, one other thing that I should mention (was prompted off-list).

 (definition time since cross-list now: HBase regions == Accumulo tablets)

 Accumulo will handle many more regions than HBase does now due to a
 splittable metadata table. While I was told this was a very long and
 arduous journey to implement correctly (WRT splitting, merges and bulk
 loading), users with too many regions problems are extremely few and
 far
 between for Accumulo.

 I was very happy to see effort/design being put into this in HBase. And,
 just to be fair in criticism/praises, HBase does appear to me to do
 assignments of regions much faster than Accumulo does on a small cluster
 (~5-10 nodes). Accumulo may take a few seconds to notice and reassign
 tablets. I have yet to notice this with HBase (which also could be due to
 lack of personal testing).


 Jerry He wrote:

 Hi, folks

 We have people that are evaluating HBase vs Accumulo.
 Security is an important factor.

 But I think after the Cell security was added in HBase, there is no more
 real gap compared to Accumulo.

 I know we have both HBase and Accumulo experts on this list.
 Could someone shred more light?
 I am looking for real gap comparing HBase to Accumulo if there is any so
 that I can be prepared to address them. This is not limited to the
 security
 area.

 There are differences in some features and implementations. But they
 don't
 see like real 'gaps'.

 Any comments and feedbacks are welcome.

 Thanks,

 Jerry






Re: [PROPOSAL] 1.7/2.0 branches and git workflow change

2014-10-07 Thread William Slacum
#1 would be nice. I would discourage the cherry-pick-back-from-master model
because it completely disregards git's history model and makes auditing
changes nearly impossible: for N patches, the same change exists N
times under different IDs. If we wanted that, we'd be back to subversion
without mergeinfo.

#2 and #3 is possible now with our branching strategy. Is there some
deficiency you notice with it?

While we're a big project, I think we might be able to benefit from a
review-then-commit process. It could allow us to review any patch to
master, and if we determine it is relevant in historical branches, we
commit it to the historical branch and then merge forward before publishing
to our public history.



On Tue, Oct 7, 2014 at 1:12 AM, Sean Busbey bus...@cloudera.com wrote:

 What if we start with what we want and work from there, instead of starting
 from our current model.

 I would really like:

 1) A *single* place where new contributors can base patches

 2) Stable planned release lines where a release manager can determine what
 does or does not get included

 3) a git history that makes it easy for me to tell what jiras impact a
 given release tag

 One way to achieve these goals is to adopt a commit-to-master and
 cherry-pick approach.

 * Master would be the default landing zone for new commits (unless they
 only apply to an older branch).
 * Master would have a version that represents unstable future work (so
 right now presumably 3.0 if Christopher wants to start solidifying 2.0)
 * We'd have a branch for each current dev branches
 * When a fix applies to an older branch a committer (and usually not a
 contributor) would cherry pick it from master
 * When the release manager for a new version was ready to start stabilizing
 things they'd make a new branch
 * Said release manager would determine what feature changes in master get
 pulled back to the new major release

 The big disadvantage with this approach is that in the event that there is
 a bad commit `git bisect` will only find it on a single development branch.
 On the plus side, the lack of merge commits means that it's easier to
 revert.

 On Mon, Oct 6, 2014 at 10:41 PM, Christopher ctubb...@apache.org wrote:

  True. Everything I'm thinking of would work with no master, but that
 might
  be confusing, and might break some tooling without extra effort (which
  branch is default when cloning?). We also kind of assume that the master
  branch is forward-moving only, but other branches are disposable and can
 be
  rebase'd, deleted, re-created, etc.
 
  Alternatively, if people understood that a 2.0 branch is a future
  branch when 1.7 (master) is the current, that'd work, too... I just
 worry
  that people will merge it poorly.
 
  I suppose the best option, then, is probably to keep the status quo, and
  use a branch name like ACCUMULO- which represents the overall work
  for a particular future release plan, instead of a name which looks like
 a
  maintenance branch.
 
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
  On Mon, Oct 6, 2014 at 10:59 PM, William Slacum 
  wilhelm.von.cl...@accumulo.net wrote:
 
   It seems to me you can get everything you want by merely getting rid of
   master or making master just be the 1.7 branch. I'm not really
 concerned
   about the name, because it's easy enough to figure out. Master
  duplicating
   a tag doesn't really seem useful to me, save for here's the highest
   version we have released, which is of limited utility when a user can
  just
   check the tags. I don't see the point in having master be something for
  the
   sake of having master.
  
  
  
   On Mon, Oct 6, 2014 at 9:19 PM, Josh Elser josh.el...@gmail.com
 wrote:
  
Christopher wrote:
   
What purpose does the master branch serve if it's just the same as
 the
last
  major release tag?


   
I think Josh had some specific opinions on this, but the general
 idea
   from
what I understood was that master is supposed to be stable...
representative of the latest, most modern release, because it's
 what a
   new
contributor would expect to fork to create a patch. That's hard to
 do
  if
the goalpost is moving a lot, and it makes feature merges more
complicated,
since contributors have to rebase or merge themselves in order to
   create a
patch that merges cleanly. Having a stable master makes it very easy
  to
contribute to the most recent release.
   
   
No, I don't really care for a stable-only master (I think I diverge
  from
the git-flow model in that regard). I like master to just be a
commits-go-here area more than anything.
   
  
 



 --
 Sean



Re: [PROPOSAL] 1.7/2.0 branches and git workflow change

2014-10-07 Thread William Slacum
Do we have a way to measure the efficacy of patches that exist in multiple
branches? By convention, each commit in an early branch will appear in
any later branch, so an existence check isn't sufficient, but it'd be cool
to see how much change, on average, a patch has to go through when being
merged forward.

But, on principle, I don't like the idea of divergent histories. Reverting
merges can be annoying, but we can also revert specific patches if need be.
Losing merge history is a big loss, and divergent history would mean we've
pushed auditing on change sets onto the developers-- we'd really be moving
backwards in terms of version control capabilities.

I think Christopher's real issue (re: #2) is that it's ambiguous what
bleeding-edge/trunk development should look like, because we don't have a
defined goal. I proposed getting rid of master, or treating the 1.7 branch
as master, because we really don't know what 2.0 will look like yet. Divergent
histories don't solve that.

As for tracking which issues are in a release, you do remove noise if you
have a fix that only goes in a historical branch. That's about it, because
it's still a function of good commit messages (which we're pretty awful at,
if you subscribe to kernel-style commit message convention) to even infer
which Jira issues are in some graph of history.

Sean, you keep mentioning a release manager opting out-- how would that
process go, in your mind? Would a release manager revert commits, or rewrite
history to remove/delete commits? Could release managers for 2.0 and 1.7
decide differently on whether or not they want to include a fix from 1.6?

On Tue, Oct 7, 2014 at 10:17 AM, Keith Turner ke...@deenlo.com wrote:

 On Tue, Oct 7, 2014 at 6:24 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  #1 would be nice. I would discourage the cherry-pick-back-from-master
 model
  because it completely disregards git's history model and makes auditing
  changes nearly impossible because for N patches, the same change exists N
  times under different IDs. If we wanted that, we'd be back to subversion
  without mergeinfo.
 
  #2 and #3 is possible now with our branching strategy. Is there some
  deficiency you notice with it?
 
  While we're a big project, I think we might be able to benefit from a
  review-then-commit process. It could allow us to review any patch to
  master, and if we determine it is relevant in historical branches, we
  commit it to the historical branch and then merge forward before
 publishing
  to our public history.
 

 We decided to try RTC on Fluo.  I love it.  We worked out a process using
 git and GH infrastructure that minimizes friction/overhead.  We are still
 refining the process and change it whenever something seems inefficient or
 isn't working well.  We are small team so we can be very agile in this
 regard.  We did not try define the process ahead of time and set in stone,
 rather we decided to experiment.  We started off with the simplest process
 possible and refined it as needed.

 The benefits I see are that I am more aware of other peoples work and
 together we are producing better quality code than anyone of us could
 alone.

 I may be wrong about this, but I feel with CTR there is no quid pro quo
 with reviews.  No one has to review to get their code commited :)


 
 
 
  On Tue, Oct 7, 2014 at 1:12 AM, Sean Busbey bus...@cloudera.com wrote:
 
   What if we start with what we want and work from there, instead of
  starting
   from our current model.
  
   I would really like:
  
   1) A *single* place where new contributors can base patches
  
   2) Stable planned release lines where a release manager can determine
  what
   does or does not get included
  
   3) a git history that makes it easy for me to tell what jiras impact a
   given release tag
  
   One way to achieve these goals is to adopt a commit-to-master and
   cherry-pick approach.
  
   * Master would be the default landing zone for new commits (unless they
   only apply to an older branch).
   * Master would have a version that represents unstable future work
 (so
   right now presumably 3.0 if Christopher wants to start solidifying 2.0)
   * We'd have a branch for each current dev branches
   * When a fix applies to an older branch a committer (and usually not a
   contributor) would cherry pick it from master
   * When the release manager for a new version was ready to start
  stabilizing
   things they'd make a new branch
   * Said release manager would determine what feature changes in master
 get
   pulled back to the new major release
  
   The big disadvantage with this approach is that in the event that there
  is
   a bad commit `git bisect` will only find it on a single development
  branch.
   On the plus side, the lack of merge commits means that it's easier to
   revert.
  
   On Mon, Oct 6, 2014 at 10:41 PM, Christopher ctubb...@apache.org
  wrote:
  
True. Everything I'm thinking of would work with no master

Re: [VOTE] Apache Accumulo 1.6.1 RC1

2014-09-26 Thread William Slacum
Ah, that's a thought to think about. The conclusion I came to was made
specifically because the vote had ended, so idk if it would've helped. Of
course, actually participating on my end would've been the best course of
action.

On Fri, Sep 26, 2014 at 8:12 AM, Christopher ctubb...@apache.org wrote:

 No, not after the vote closes. I was trying to say that the concerns you
 expressed might have had greatest impact if they were expressed with a -1
 while the vote was open.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Fri, Sep 26, 2014 at 12:40 AM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  Can you do that after the vote closed? Corey did some good stuff in
  documenting our release process, so I'm confident these releases can be
  iterated on faster now, which would mitigate this situation.
 
  On Thu, Sep 25, 2014 at 9:31 PM, Christopher ctubb...@apache.org
 wrote:
 
   Sorry, reply was to Bill. I know GMail doesn't thread well, so
  top-posting
   is problematic.
  
  
   --
   Christopher L Tubbs II
   http://gravatar.com/ctubbsii
  
   On Thu, Sep 25, 2014 at 9:28 PM, Corey Nolet cjno...@gmail.com
 wrote:
  
Christopher, are you referring to Keith's last comment or Bill
  Slacum's?
   
On Thu, Sep 25, 2014 at 9:13 PM, Christopher ctubb...@apache.org
   wrote:
   
 That seems like a reason to vote -1 (and perhaps to encourage
 others
  to
do
 so also). I'm not sure this can be helped so long as people have
different
 criteria for their vote, though. If we can fix those issues, I'm
  ready
   to
 vote on a 1.6.2 :)


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii

 On Thu, Sep 25, 2014 at 2:42 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  I'm a little concerned we had two +1's that mention failures. The
  one
 time
  when we're supposed to have a clean run through, we have 50% of
 the
  participators noticing failure. It doesn't instill much
 confidence
  in
me.
 
  On Thu, Sep 25, 2014 at 2:18 PM, Josh Elser 
 josh.el...@gmail.com
 wrote:
 
   Please make a ticket for it and supply the MAC directories for
  the
test
   and the failsafe output.
  
   It doesn't fail for me. It's possible that there is some edge
  case
that
   you and Bill are hitting that I'm not.
  
  
   Corey Nolet wrote:
  
    I'm seeing the behavior under Mac OS X and Fedora 19 and they
  have
 been
   consistently failing for me. I'm thinking ACCUMULO-3073. Since
others
  are
   able to get it to pass, I did not think it should fail the
 vote
solely
  on
   that but I do think it needs attention, quickly.
  
   On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki
  bhava...@clouderagovt.com
   wrote:
  
I haven't had an opportunity to try it again since my +1, but
   prior
 to
   that
   it has been consistently failing.
  
   - I tried extending the timeout on the test, but it would
 still
time
  out.
   - I see the behavior on Mac OS X and under CentOS. (I wonder
 if
it's
 a
   JVM
   thing?)
  
   On Wed, Sep 24, 2014 at 9:06 PM, Corey Nolet
 cjno...@gmail.com
  
  wrote:
  
Vote passes with 4 +1's and no -1's.
  
   Bill, were you able to get the IT to run yet? I'm still
 having
  timeouts
  
   on
  
   my end as well.
  
  
   On Wed, Sep 24, 2014 at 1:41 PM, Josh Elser
   josh.el...@gmail.com
  
   wrote:
  
   The crux of it is that both of the errors in the CRC where
   single
 bit
   variants.
  
   y instead of 9 and p instead of 0
  
   Both of these cases are a '1' in the most significant bit
 of
   the
 byte
   instead of a '0'. We recognized these because y and p are
   outside
 of
  
   the
  
   hex range. Fixing both of these fixes the CRC error
 (manually
  
   verified).
  
   That's all we know right now. I'm currently running
  memtest86. I
do
  not
   have ECC ram, so it *is* theoretically possible that was
 the
cause.
  
   After
  
   running memtest for a day or so (or until I need my desktop
 functional
   again), I'll go back and see if I can reproduce this again.
  
  
   Mike Drob wrote:
  
Any chance the IRC chats can make it only the ML for
   posterity?
  
   Mike
  
   On Wed, Sep 24, 2014 at 12:04 PM, Keith Turner
   ke...@deenlo.com

  
   wrote:
  
 On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks
  
   rwe...@newbrightidea.com
  
   wrote:
  
 Interesting that y (0x79) and 9 (0x39) are one bit
   away
  from
  
   each
  
   other. I blame cosmic rays!
  
 It is interesting, and thats

Re: [VOTE] Apache Accumulo 1.6.1 RC1

2014-09-25 Thread William Slacum
I'm a little concerned we had two +1's that mention failures. The one time
when we're supposed to have a clean run through, we have 50% of the
participants noticing failure. It doesn't instill much confidence in me.

On Thu, Sep 25, 2014 at 2:18 PM, Josh Elser josh.el...@gmail.com wrote:

 Please make a ticket for it and supply the MAC directories for the test
 and the failsafe output.

 It doesn't fail for me. It's possible that there is some edge case that
 you and Bill are hitting that I'm not.


 Corey Nolet wrote:

 I'm seeing the behavior under Mac OS X and Fedora 19 and they have been
 consistently failing for me. I'm thinking ACCUMULO-3073. Since others are
 able to get it to pass, I did not think it should fail the vote solely on
 that but I do think it needs attention, quickly.

 On Thu, Sep 25, 2014 at 10:43 AM, Bill Havanki bhava...@clouderagovt.com
 wrote:

  I haven't had an opportunity to try it again since my +1, but prior to
 that
 it has been consistently failing.

 - I tried extending the timeout on the test, but it would still time out.
 - I see the behavior on Mac OS X and under CentOS. (I wonder if it's a
 JVM
 thing?)

 On Wed, Sep 24, 2014 at 9:06 PM, Corey Nolet cjno...@gmail.com  wrote:

  Vote passes with 4 +1's and no -1's.

 Bill, were you able to get the IT to run yet? I'm still having timeouts

 on

 my end as well.


 On Wed, Sep 24, 2014 at 1:41 PM, Josh Elser josh.el...@gmail.com

 wrote:

 The crux of it is that both of the errors in the CRC where single bit
 variants.

 y instead of 9 and p instead of 0

 Both of these cases are a '1' in the most significant bit of the byte
 instead of a '0'. We recognized these because y and p are outside of

 the

 hex range. Fixing both of these fixes the CRC error (manually

 verified).

 That's all we know right now. I'm currently running memtest86. I do not
 have ECC ram, so it *is* theoretically possible that was the cause.

 After

 running memtest for a day or so (or until I need my desktop functional
 again), I'll go back and see if I can reproduce this again.


 Mike Drob wrote:

  Any chance the IRC chats can make it only the ML for posterity?

 Mike

 On Wed, Sep 24, 2014 at 12:04 PM, Keith Turner ke...@deenlo.com

 wrote:

   On Wed, Sep 24, 2014 at 12:44 PM, Russ Weeks

 rwe...@newbrightidea.com

 wrote:

   Interesting that y (0x79) and 9 (0x39) are one bit away from

 each

 other. I blame cosmic rays!

   It is interesting, and thats only half of the story.  Its been

 interesting
 chatting w/ Josh about this on irc and hearing about his findings.


   On Wed, Sep 24, 2014 at 9:05 AM, Josh Elser josh.el...@gmail.com
 wrote:

  The offending keys are:

 389a85668b6ebf8e 2ff6:4a78 [] 1411499115242

 3a10885b-d481-4d00-be00-0477e231ey65:8576b169:
 0cd98965c9ccc1d0:ba15529e

   The careful eye will notice that the UUID in the first
 component

 of

 the
 value has a different suffix than the next corrupt key/value (ends

 with

 ey65 instead of e965). Fixing this in the Value and re-running

 the

 CRC

  makes it pass.


and

  7e56b58a0c7df128 5fa0:6249 [] 1411499311578

 3a10885b-d481-4d00-be00-0477e231e965:p000872d60eb:
 499fa72752d82a7c:5c5f19e8





 --
 // Bill Havanki
 // Solutions Architect, Cloudera Govt Solutions
 // 443.686.9283
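
For the curious, the "one bit away" observation quoted above is easy to verify
with a few lines of plain Java (nothing Accumulo-specific):

    public class BitFlipCheck {
      public static void main(String[] args) {
        // 'y' (0x79) vs '9' (0x39) and 'p' (0x70) vs '0' (0x30)
        System.out.printf("y^9 = 0x%02x, differing bits = %d%n",
            'y' ^ '9', Integer.bitCount('y' ^ '9'));
        System.out.printf("p^0 = 0x%02x, differing bits = %d%n",
            'p' ^ '0', Integer.bitCount('p' ^ '0'));
      }
    }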





Re: [VOTE] Apache Accumulo 1.5.2 RC1

2014-09-18 Thread William Slacum
+1
- verified source dist hash
- built from tag
- ran koverse integration tests against 1.5.2

On Thu, Sep 18, 2014 at 5:36 PM, Josh Elser josh.el...@gmail.com wrote:

 Reminder that this closes in a few hours. We're currently at about 25% of
 PMC participating, would be much nicer to more activity...


 On 9/15/14, 12:24 PM, Josh Elser wrote:

 Devs,

 Please consider the following candidate for Apache Accumulo 1.5.2

 Tag: 1.5.2rc1
 SHA1: 039a2c28bdd474805f34ee33f138b009edda6c4c
 Staging Repository:
 https://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/

 Source tarball:
 http://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5.
 2/accumulo-1.5.2-src.tar.gz

 Binary tarball:
 http://repository.apache.org/content/repositories/
 orgapacheaccumulo-1014/org/apache/accumulo/accumulo/1.5.
 2/accumulo-1.5.2-bin.tar.gz

 (Append .sha1, .md5 or .asc to download the signature/hash for a
 given artifact.)

 Signing keys available at: https://www.apache.org/dist/accumulo/KEYS

 Over 1.5.1, we have 109 issues resolved
 https://git-wip-us.apache.org/repos/asf?p=accumulo.git;a=
 blob;f=CHANGES;h=c2892d6e9b1c6c9b96b2a58fc901a76363ece8b0;hb=
 039a2c28bdd474805f34ee33f138b009edda6c4c


 Testing: all unit and functional tests are passing and ingested 1B
 entries using CI w/ agitation over rc0.

 Vote will be open until Friday, August 19th 12:00AM UTC (8/18 8:00PM ET,
 8/18 5:00PM PT)

 - Josh




Re: AccumuloInputFormat getters

2014-07-16 Thread William Slacum
It's dubious to call it internals when it gets hamjammed into a map of
strings to other strings that's going to be passed around to many processes.

Maybe we can make our own serializable pojo that implements some interface
for consumers to use. That would at least let us hide internals and have a
single entry point into the Hadoop configuration.
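
A rough sketch of the kind of wrapper described above: a plain serializable
object that holds the job settings, with a single place that knows how to push
them into and pull them out of the Hadoop Configuration. Every name here is
hypothetical -- this is not an existing Accumulo class, and the property keys are
made up:

    public class InputConfig implements java.io.Serializable {
      private String tableName;
      private String principal;

      public String getTableName() { return tableName; }
      public void setTableName(String t) { this.tableName = t; }
      public String getPrincipal() { return principal; }
      public void setPrincipal(String p) { this.principal = p; }

      // Single entry point into the Configuration; key names and encoding
      // stay hidden behind these two methods.
      public void write(org.apache.hadoop.conf.Configuration conf) {
        conf.set("example.input.table", tableName);
        conf.set("example.input.principal", principal);
      }

      public static InputConfig read(org.apache.hadoop.conf.Configuration conf) {
        InputConfig c = new InputConfig();
        c.setTableName(conf.get("example.input.table"));
        c.setPrincipal(conf.get("example.input.principal"));
        return c;
      }
    }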


On Wed, Jul 16, 2014 at 9:59 PM, Christopher ctubb...@apache.org wrote:

 Well, you can subclass to introspect. And, if you feel the API can be
 improved by offering stronger getter/setter support with the stability
 guarantees that we care about for public API, go ahead. (It probably
 wouldn't change much anyway, since we now treat protected as public API,
 too, I think). I won't object to the improvements... just explaining why
 it's like that. My concern if you were to do this would be whether this
 would actually add too much bloat or not to consumers of the API who don't
 need to subclass, and the lack of 1-to-1 in many cases... but if you can
 address those things sufficiently, I wouldn't object.


 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Wed, Jul 16, 2014 at 9:55 PM, Josh Elser josh.el...@gmail.com wrote:

  Ultimately, I feel like there's a big problem when I, an experienced
  Accumulo developer, am getting frustrated with the API.
 
  As it stands right now, I have no way to introspect the contents of a
  Configuration to ensure that the state is as I expect it to be. I'm stuck
  dumping the entire configuration, and grep'ing it to see if the values I
  expect are in there with *some* key. If so, I then have to try to unravel
  what exactly is the appropriate key that the value should be paired with.
 
  I can understand the complexity in the storage of relevant data within
 the
  Configuration, but this seems unnecessarily complicated to me.
 
 
  On 7/16/14, 9:48 PM, Josh Elser wrote:
 
  The value of the name of the table that the AccumuloInputFormat is going
  to read is subject to change? Isn't the point of a getter that it can
  unwrap the specifics of the serialization within the configuration and
  present the high-level constructs (username, AuthenticationToken, table
  name, IteratorSettings, etc) that users expect?
 
  On 7/16/14, 9:46 PM, Christopher wrote:
 
  Because those things represent internals of the configuration that are
  subject to change, and we don't want end users becoming dependent on
  them.
  They are protected, because they may be needed for subclassing, where
 the
  subclass assumes some greater risk than an end user of the API.
 
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Wed, Jul 16, 2014 at 9:43 PM, Josh Elser josh.el...@gmail.com
  wrote:
 
   Why are all of the getters on the AccumuloInputFormat protected
 (really,
  InputFormatBase) instead of public?
 
  This has repeatedly infuriated me as it makes it impossible for me to
  verify that the Configuration actually has the data in it as needed.
 
  It seems intentional so I figured I would ask before making a ticket
 and
  changing it.
 
  - Josh
 
 
 



Re: Is the Column Family especially useful for iterators?

2014-07-10 Thread William Slacum
Like most things in the BigTable model, it really depends on your use case. The
column family can potentially control the location of a given key/value
pair on disk.

I wouldn't say it's necessarily more useful than any other part of the key
tuple. We have some built-ins that make searching for or suppressing column
families easier. They matter more when in a locality group, as there are
potential performance gains there.
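
A minimal sketch of the locality-group point above, using the standard
TableOperations and Scanner APIs; the group and column family names are
illustrative, and imports from java.util, org.apache.hadoop.io and the Accumulo
client API are assumed:

    // Put the "index" column family in its own locality group so it is stored
    // separately on disk, then fetch only that family at scan time.
    static void indexLocalityGroup(Connector conn, String table) throws Exception {
      Map<String,Set<Text>> groups = new HashMap<String,Set<Text>>();
      groups.put("indexGroup", Collections.singleton(new Text("index")));
      conn.tableOperations().setLocalityGroups(table, groups);

      Scanner scanner = conn.createScanner(table, Authorizations.EMPTY);
      scanner.fetchColumnFamily(new Text("index"));
    }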


On Thu, Jul 10, 2014 at 10:01 PM, David Medinets david.medin...@gmail.com
wrote:

 In a recent conversation, someone mentioned that the Column Family is
 especially useful in custom iterators. Unfortunately I was not able to
 follow up to get details. Can anyone share a use case showing how
 information in a Column Family is better inside an iterator than
 information in another part of the Accumulo key?



Re: Accumulo Scala

2014-06-30 Thread William Slacum
That's a really cool DSL, Kevin.

Any plans for adding in some iterator support? I see in the FAQ an iterator
is mentioned, but it'd be cool to be able to push the foreach declarations
out to the tservers, if possible.


On Sun, Jun 29, 2014 at 4:26 PM, Kevin Faro ke...@tetraconcepts.com wrote:

 If anybody is working with Accumulo using Scala, I have started to put
 together a thin DSL that might be useful.  It is more of a proof of concept
 than anything at this point, but as we use it internally at Tetra I am sure
 it will become more stable.  If anybody is interested in using it or
 contributing please feel free.  Also, if you have any recommendations or
 feature requests let us know.

 Here is the link: https://github.com/tetra-concepts-llc/accumulo-scala

 --Kevin

 --
 Kevin Faro
 Tetra Concepts



Re: Running Accumulo on the IBM JVM

2014-06-23 Thread William Slacum
Work on the oldest branch possible and merge forward, please.


On Mon, Jun 23, 2014 at 6:00 AM, Hayden Marchant hay...@il.ibm.com wrote:

 Josh (and all who commented),

 Thanks for the comments. I'll take them into account, and will create the
 JIRAs.

 I was not intending on removing the CMS options, but rather only including
 them in the JVM in which they are relevant, and including the equivalent
 in different JVMs (i.e. IBM ) - all through the bootstrap_config.sh.

 Here's my newbie question: Should I be making this patch based on 1.6.1,
 or should I always be working against the 'master' branch, and then
 backport the fix(es) to any desired older version?

 Regards,



 Hayden



 From:   Josh Elser josh.el...@gmail.com
 To: dev@accumulo.apache.org,
 Date:   19/06/2014 06:43 PM
 Subject:Re: Running Accumulo on the IBM JVM



 snip/

   b.
 
 
 

 org.apache.accumulo.core.security.crypto.BlockedIOStreamTest.testGiantWrite.
   This test assumes a max heap of about 1GB. This fails on IBM
 JRE,
  since the default max heap is not specified, and on IBM JRE this
 depends
  on the OS (see
 
 
 

 http://www-01.ibm.com/support/knowledgecenter/SSYKE2_6.0.0/com.ibm.java.doc.diagnostics.60/diag/appendixes/defaults.html?lang=en

  ).
   Proposal: add -Xmx1g to the surefire maven plugin reference
 in
  parent maven pom.
 
 
  This might be https://issues.apache.org/jira/browse/ACCUMULO-2774

 Yup! I actually bumped this up to 1G already after I started seeing
 failures (again) from the ACCUMULO-2774 patch which set a 768M heap.
 Pull the upstream changes and feel free to submit something to address
 any problem you still have.

 
 
 c. Both org.apache.accumulo.core.security.crypto.CryptoTest
 
  org.apache.accumulo.core.file.rfile.RFileTest have lots of failures
 due
  to
  calls to SecureRandom with Random Number Generator Provider hard-coded
 as
  Sun. The IBM JRE has its own built-in RNG Provider called IBMJCE. 2
  issues - hard-coded calls to SecureRandom.getInstance(algo,SUN)
 and
  also default value in Property class is SUN.
   Proposal: Add mechanism to override default Property through
  System property through new annotator in Property class. Only usage
 will
  be by Property.CRYPTO_SECURE_RNG_PROVIDER
 
 
 
  I'm not sure about adding new annotators to Property. However, the
  CryptoTest should be getting the value from the conf instead of
 hard-coding
  it. Then you can specify the correct value in accumulo-site.xml
 
  I think another part of the issue is in
  CryptoModuleFactory::fillParamsObjectFromStringMap because it looks like
  that ignores the default setting.
 

  2. Environment/Configuration
   a. The generated configuration files contain references to GC
  params that are specific to Sun JVM. In accumulo-env.sh, the
  ACCUMULO_TSERVER_OPTS contains -XX:NewSize and -XX:MaxNewSize , and
 also
  in ACCUMULO_GENERAL_OPTS,
  -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 are
 used.
   b. in bin/accumulo, get ClassNotFoundException due to
  specification of JAXP Doc Builder:
 
 
 

 -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
  .
   The Sun implementation of Document Builder Factory does not
  exists
  in IBM JDK, so a ClassNotFoundException is thrown on running accumulo
  script
 
   c. MiniAccumuloCluster - in the MiniAccumuloClusterImpl,
  Sun-speciifc GC params are passed as params to the java process
 (similar
  to section a. )
 
   Single proposal for solving all three above issues:
   Enhance bootstrap_config.sh with request to select Java
 vendor.
  Selecting this will set correct values for GC params (they differ
 between
  IBM and Sun), inclusion/ommision of JAXP setting. The
  MiniAccumuloClusterImpl can read the same env variable that was set in
  code for the GC Params, and use in the exec command.
 
 
  I don't know enough about the IBM JDK to comment on this part
  intelligently. Go ahead and generate a patch, and we can use that as a
  starting point for discussion.

 I'm a little hesitant to remove the CMS configuration (as it really
 helps). My first thought about how to address this is you can submit
 some example Accumulo configurations that work with IBM JDK or you can
 add something to the configuration template/script (conf/examples and
 conf/templates with bin/bootstrap_config.sh, respectively). I think
 you're on the right path.

 

So far, my work has been focused on getting unit tests working for
 all
  Java vendors in a clean manner. I have not yet run intensive testing
 of
  real clusters following these changes, and would be happy to get
 pointers
  to what else might need treatment.
 
 
 
  Unit tests is a good first pass. Integration tests (mvn verify) is
 probably
  the minimum that you want on your continuous integration once you have
  things set up.
 
  Accumulo 
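
A minimal sketch of the provider-selection idea from the quoted proposal: look
the JCE provider name up from configuration rather than hard-coding "SUN", so an
IBM JRE can supply "IBMJCE" instead. The class and the way the value is plumbed
through are illustrative, not the actual Accumulo code:

    import java.security.SecureRandom;

    public class ConfigurableRng {
      // e.g. algorithm = "SHA1PRNG"; provider = "SUN" or "IBMJCE", read from the
      // site configuration instead of a hard-coded constant.
      public static SecureRandom create(String algorithm, String provider)
          throws Exception {
        return SecureRandom.getInstance(algorithm, provider);
      }
    }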

Re: [DISCUSS] Should we support upgrading 1.4 - 1.6 w/o going through 1.5?

2014-06-16 Thread William Slacum
How much of this is a standalone utility? I think a magic button approach
would be good for this case.


On Mon, Jun 16, 2014 at 5:24 PM, Sean Busbey bus...@cloudera.com wrote:

 In an effort to get more users off of our now unsupported 1.4 release,
 should we support upgrading directly to 1.6 without going through a 1.5
 upgrade?

 More directly for those on user@: would you be more likely to upgrade off
 of 1.4 if you could do so directly to 1.6?

 We have this working locally at Cloudera as a part of our CDH integration
 (we shipped 1.4 and we're planning to ship 1.6 next).

 We can get into implementation details on a jira if there's positive
 consensus, but the changes weren't very complicated. They're mostly

 * forward porting and consolidating some upgrade code
 * additions to the README for instructions

 Personally, I can see the both sides of the argument. On the plus side,
 anything to get more users off of 1.4 is a good thing. On the negative
 side, it means we have the 1.4 related upgrade code sitting in a supported
 code branch longer.

 Thoughts?

 --
 Sean



Re: Accumulo shell remote debugger settings.

2014-06-15 Thread William Slacum
Putting the flag in the process/module OPTs is fine. It's what I normally
do when I want to debug. Are you suggesting we have remote debugging
enabled by default?


On Sun, Jun 15, 2014 at 9:11 AM, Vicky Kak vicky@gmail.com wrote:

 While trying to get the remote debugger running with accumulo I figured
 that for the accumulo shell command we need to introduce the following
 changes

 1) test -z $ACCUMULO_SHELL_OPTSexport ACCUMULO_SHELL_OPTS=-Xmx128m
 -Xms64m -Xrunjdwp:server=y,transport=dt_socket,address=4002,suspend=n

 in accumulo-env.sh

 2)
 include the additional case in accumulo.sh

 shell)  export ACCUMULO_OPTS=${ACCUMULO_GENERAL_OPTS}
 ${ACCUMULO_SHELL_OPTS} ;;

 We can't define the debugger port in the $ACCUMULO_OTHER_OPTS in the
 accumulo-env.sh as that would be bind when start-all.sh is called.

 Before I raise a JIRA for this and provide a patch, I would like to hear
 opinions from others on how they enable remote debugging for the shell
 command.

 Thanks,
 Vicky



Re: Email list search links

2014-06-13 Thread William Slacum
we oughta make our own search capability using Accumulo :)


On Fri, Jun 13, 2014 at 1:44 PM, Billie Rinaldi billie.rina...@gmail.com
wrote:

 It might be okay, as long as you note that it isn't the official mail
 archive.  I think some projects use Nabble.  I've had decent luck just
 doing a google search of the archive, e.g. site:
 mail-archives.apache.org/mod_mbox/accumulo-dev 1.6.0 release


 On Fri, Jun 13, 2014 at 10:27 AM, Bill Havanki bhava...@clouderagovt.com
 wrote:

  Hey everybody,
 
  I'd like to add search links to our mailing list page [1]. The ASF
 mailing
  list archives don't offer search, and the ASF's search capability [2] is
  only for ASF members (maybe - I can't even log in).
 
  Does anyone mind if I link to The Mail Archive? It is external to Apache,
  which might matter. You can check out a couple of their list pages [3][4]
  if you're curious.
 
  Thanks,
  Bill
 
  [1] http://accumulo.apache.org/mailing_list.html
  [2] https://mail-search.apache.org/
  [3] http://www.mail-archive.com/dev@accumulo.apache.org/
  [4] http://www.mail-archive.com/user@accumulo.apache.org
 
  --
  // Bill Havanki
  // Solutions Architect, Cloudera Govt Solutions
  // 443.686.9283
 



Re: Using ZooCache in unit tests

2014-06-11 Thread William Slacum
What about mocking that call?


On Wed, Jun 11, 2014 at 8:09 PM, Mike Drob mad...@cloudera.com wrote:

 When writing unit tests, I indirectly call code that invokes
 {{Tables.getZooCache(Instance)}} which sets up a connection to a zookeeper.
  However, there is not a server running, so this ends up looping forever
 until my test times out.

 It looks like my options are...
 1) Make it an IT that uses MAC, so that there is a ZK to connect to
 (probably overkill)
 2) Stand up my own ZK (error prone? duplicative?)
 3) Stand up a Curator Testing Server (introduces new dependencies).
 4) Not write tests.

 I'm leaning toward number 3, but was wondering if anybody had thoughts on
 this, since I feel like that will have a lasting impact on the code.
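
A minimal sketch of option 3 from the quoted list (Curator's TestingServer); the
dependency is org.apache.curator:curator-test, and the usage is roughly:

    import org.apache.curator.test.TestingServer;

    public class ZkTestSketch {
      public static void main(String[] args) throws Exception {
        TestingServer zk = new TestingServer();        // in-process ZooKeeper on a free port
        String connectString = zk.getConnectString();  // hand this to the code under test
        System.out.println("test ZooKeeper at " + connectString);
        zk.close();
      }
    }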



Re: Proposal for splitting ACCUMULO-1242 into subtasks.

2014-05-13 Thread William Slacum
Sounds good, Ed.

Just out of curiosity, are you planning on doing this with the goal of
being able to swap out log4j for logback? In personal projects, I like
slf4j solely for the message formatting feature.


On Mon, May 12, 2014 at 10:45 PM, Sean Busbey bus...@cloudera.com wrote:

 +1 LGTM

 Overall approach looks good, we can deal with details in review.

 --
 Sean
 On May 12, 2014 8:49 PM, Mike Drob md...@mdrob.com wrote:

  +1.
 
  You've spent more time thinking about this than the rest of us combined,
  probably, so if you think this is the best approach I recommend just
 going
  for it. If we discover other issues as they crop up, then we can deal
 with
  them at that point.
 
  Mike
 
 
  On Mon, May 12, 2014 at 9:15 PM, Ed Coleman d...@etcoleman.com wrote:
 
   I am willing to take another run at the Consistent Logging ticket,
   ACCUMULO-1242, but I'd like to achieve a consensus on an approach
 before
   starting.
  
   The tl;dr version - I would like to split ACCUMULO-1242 into subtasks.
   Target version would be 1.7.0 (or whatever it gets called, would not
 mind
   doing it for 1.6.1 too, to ease merges of bug fixes - esp. for the
 easy
   conversions.
  
   Now the novel-length version (and sorry for the length)
  
   I think that the ACCUMULO-1242 should be split into a number of
 subtasks
  -
   at least three or maybe four. This way individual subtasks can be
  committed
   independently to allow a thorough review of the more complex changes.
 The
   breakdown that I am thinking of would go from easy, mostly
 non-functional
   changes and progressively become more complex and could require
  rethinking
   the way certain things are done for the hardest ones.  The breakdown
   would
   also narrow the number of files effected as the subtasks progressed
 from
   easy to hard.  The easy changes would impact most files, while the
 most
   complex changes would impact relatively few.
  
   To be clear, with this approach some files may be changed multiple
 times
  by
   different sub-tasks - in case that influences anyone's opinion to this
   approach.
  
   The breakdown that I am suggesting as a starting point for discussion
 is:
  
   Subtask-1:
  
   a) Replace package statements and Logger.getLogger to
   LoggerFactory.getLogger
  
   b) Use parameterized messages ( {} ) instead of concatenation and
 remove
   any
   if level enabled tests (.isDebugEnabled(), .isInfoEnabled().)- this may
   provide a very slight performance gain.
  
   c) Add messages to all exceptions - required by slf4j and generally an
   accepted practice.
  
   d) Eliminate printStackTrace with log messages of an appropriate level
   (ACCUMULO-2628 covers this and could be done at the same time.)
  
   This is the low hanging fruit and should eliminate log4j dependencies
 in
   most classes - maybe 80% to 90% or more. [Because (c) and (d) will
  slightly
   change the log output, maybe they are more appropriate for subtask-2?]
   [Question: any issue with changing log statement wording in (b) if it
   improves clarity? - which would also slightly change log output which
  would
   break anyone that is doing log scraping.]
  
   Subtask-2:
  
   a) Remove FATAL level and replace with MARKER interface supported by
   logback
   and log4j-2 [a future effort could be to extend MARKER usage to allow
  finer
   grained log filtering, but probably not as part of this effort.]
  
   b) Remove dynamic manipulation of log levels in testing by using
   test-specific parameter files (if desired)
  
   Subtask-3:
  
   a) Rework TRACING and log forwarding so they do not have a log4j
  dependency
  
   Subtask-4:
  
   a) Rework shell debug command facility that dynamically changes the log
   level.
  
   With the current code base it may be impractical to completely remove
   direct
   log4j dependencies, but we should be able to isolate it to a few
  instances
   in the server-side code and completely remove it from the client-side
  code.
  
   Another thing to note is that many of the limitations of slf4j are
  present
   in log4j-2 -neither allow dynamic log level changes programmatically or
   through DOM manipulation but instead watch the property file and react
  when
   it is modified. So, even if you really don't care about slf4j, similar
   changes will be required to upgrade log4j-2.
  
   Once there is a consensus I (or Christopher ?) could make the sub-tasks
  and
   I'll get started.
  
  
  
  
 



Re: SQL layer over Accumulo?

2014-05-10 Thread William Slacum
So there may be a bit of confusion with storing index and data in the same
row. By "row" I just mean the logical Accumulo unit, not a row as in "a
thing in my relational table". Synonyms for "row" in this scheme are
"shard" and "document partition".

You can store multiple documents and indices for those documents in
different column families within the same row. You then have separate
readers for the indices and document data (sources in Iterator terms).
Point and range queries are still possible in this fashion, and are made
even easier if there's another level that maps terms to
rows/shards/partition. The wikisearch example is an (admittedly rough)
implementation of this.

I think looking at how "buddy regions" work may help clarify things, since
I imagine it works similarly. If the coprocessor is just reading from a
region I that contains index data for only region D, then that
maps pretty well to an iterator scanning index data from a column family
I and fetching documents from a column family D.
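
A minimal sketch of that layout -- index entries and document data in separate
column families within the same shard row -- loosely following the wikisearch
convention; the family names, qualifier format, and helper signature are
illustrative, and the usual Accumulo client/data imports are assumed:

    // One shard row holds both index entries ("i") and documents ("d").
    static void writeShardedDocument(BatchWriter writer, String shardId,
        String term, String docId, String document) throws Exception {
      Mutation m = new Mutation(new Text(shardId));
      // index entry: term -> docId within this shard
      m.put(new Text("i"), new Text(term + "\u0000" + docId), new Value(new byte[0]));
      // document payload in the same shard row
      m.put(new Text("d"), new Text(docId), new Value(document.getBytes("UTF-8")));
      writer.addMutation(m);
    }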



On Thu, May 8, 2014 at 1:09 AM, James Taylor jamestay...@apache.org wrote:

 Sorry for the delay in getting back to you - things got a bit crazy with
 our graduation and HBaseCon happening at the same time.

 @Josh  Bill - r.e. Co-locating indices within the same row simplifies this
 a bit.
 The secondary indexes need to be in row key order by the indexed columns,
 so co-locating them in the data table wouldn't allow the lookup and range
 scan abilities we'd need. The advantage of the index is that you don't need
 to look at all the data, but can do a point lookup or range scan based on
 the usage of the indexed columns in a query.

 @Josh - r.e. Assuming I understand properly, you don't need to be cognizant
 of the splits. You just specify the Ranges (where each Range is a start key
 and end key) and the Accumulo client API does the rest.

 Typically the Ranges are merge sorted on the client, so this might require
 an extension to the Accumulo client.
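
 For reference, the "specify the Ranges and the client API does the rest"
 part looks roughly like this on the Accumulo side (the table name, ranges
 and thread count below are made up):

   import java.util.Arrays;
   import java.util.Map;
   import org.apache.accumulo.core.client.BatchScanner;
   import org.apache.accumulo.core.client.Connector;
   import org.apache.accumulo.core.data.Key;
   import org.apache.accumulo.core.data.Range;
   import org.apache.accumulo.core.data.Value;
   import org.apache.accumulo.core.security.Authorizations;

   void scanSomeRanges(Connector conn) throws Exception {
     BatchScanner bs = conn.createBatchScanner("mytable", Authorizations.EMPTY, 8);
     try {
       bs.setRanges(Arrays.asList(new Range("a", "f"), new Range("m", "q")));
       for (Map.Entry<Key,Value> e : bs) {
         // results come back unordered across ranges; process each entry here
       }
     } finally {
       bs.close();
     }
   }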

 r.e. Next steps.

 We'd definitely need an expert on the Accumulo side to proceed. I'm happy
 to help on the Phoenix side - I'll post a note on our dev list too to see
 if there are other folks interested as well. Given the similarities between
 Accumulo and HBase and the abstraction Phoenix already has in place, I
 don't think the effort would be large to get something up and running.
 Maybe a phased approach, would make sense: first with query support and
 next with secondary index support?

 Not sure where this stacks up in terms of priority for you all. At
 Salesforce, we saw a specific need for this with HBase, the big data
 store on top of which we'd choose to standardize. We realized early on
 that we'd never get the adoption we wanted without providing a different,
 more familiar programming model: namely SQL. Since we were targeting
 supporting interactive web-based applications, anything map/reduce based
 wasn't a fit which led us to create Phoenix. Perhaps there are members in
 your community in the same boat?

 Thanks,
 James



 On Fri, May 2, 2014 at 1:44 PM, Josh Elser josh.el...@gmail.com wrote:

  On 5/1/14, 2:24 AM, James Taylor wrote:
 
  Thanks for the explanations, Josh. This sounds very doable. Few more
  comments inline below.
 
  James
 
 
  On Wed, Apr 30, 2014 at 8:37 AM, Josh Elser josh.el...@gmail.com
 wrote:
 
 
 
  On 4/30/14, 3:33 AM, James Taylor wrote:
 
   On Tue, Apr 29, 2014 at 11:57 AM, Josh Elser josh.el...@gmail.com
  wrote:
 
@Josh - it's less baked in than you'd think on the client where the
  query
 
 
   parsing, compilation, optimization, and orchestration occurs. The
  client/server interaction is hidden behind the
 ConnectionQueryServices
  interface, the scanning behind ResultIterator (in
  particular ScanningResultIterator), the DML behind MutationState,
 and
  KeyValue interaction behind KeyValueBuilder. Yes, though, it would
  require
  some more abstraction, but probably not too bad, though. On the
  server-side, the entry points would all be different and that's
 where
  I'd
  need your insights for what's possible.
 
 
   Definitely. I'm a little concerned about what's expected to be
  provided
  by
  the database (HBase, Accumulo) as I believe HBase is a little more
  flexible in allowing writes internally where Accumulo has thus far
 said
  you're gonna have a bad time.
 
 
 
  Tell me more about what you mean by allowing writes internally.
 
 
  Haha, sorry, that was a sufficiently ominous statement with
 insufficient
  context.
 
  For discussion sake, let's just say HBase coprocessors and Accumulo
  iterators are equivalent, purely in the scope of running server-side
  code
  (in the RegionServer/TabletServer). However, there is a notable
  difference
  in the pipeline where each of those are implemented.
 
  Coprocessors have built-in hooks that let you get updates on
  PUT/GET/DELETE/etc as well as pre and post each of those operations. In
  other words, they provide hooks at a high database level.
 

Re: [VOTE] end of life plan for 1.4 branch

2014-05-06 Thread William Slacum
+1 for EOL'ing 1.4. -0 for any follow on actions. I don't see any
particular value in doing anything beyond just not contributing to the 1.4
branch any more.


On Tue, May 6, 2014 at 2:45 PM, Sean Busbey bus...@cloudera.com wrote:

 On Tue, May 6, 2014 at 12:26 PM, John Vines vi...@apache.org wrote:

  +0
 
  I want to EOL 1.4.x but I am having difficulties following this
 discussion.
  If someone could provide a tldr; I will probably change my vote.
 
 

 tl;dr (sorry still long):

 Consensus is to update the 1.4.6-eol tag to be something like
 1.4-something-to-indicate-development-stopped (but not this verbosely
 ridiculous, natch)

 Christopher and Keith are still at -1 because there will be a window of
 time in which the tag 1.4.6-eol exists in the repository

 Drew is negative on the vote (though he has not voted -1) because he would
 like users of the 1.4 line to have an easier time with jira and downloads.

 --
 Sean



Re: verifying name suitability

2014-05-05 Thread William Slacum
Jerry O'Connell and his merry band from 1995 would like to have a word with
you.


On Mon, May 5, 2014 at 7:23 PM, Billie Rinaldi billie.rina...@gmail.comwrote:

 Oops, sorry Accumulo devs, I'm having trouble with my mailing list
 autocomplete.  I'll try to be more careful.


 On Mon, May 5, 2014 at 3:55 PM, Billie Rinaldi billie.rina...@gmail.com
 wrote:

  I created PODLINGNAMESEARCH-47 to track our investigation into whether
  Apache Slider is a suitable name.  I marked unfinished items as TODO, in
  case anyone is interested in helping out with the research.  There are
  instructions and examples here:
  http://www.apache.org/foundation/marks/naming.html
 



Re: Remove Row Data

2014-05-02 Thread William Slacum
I interpreted this as I want to delete an entire row based on specific
column family and qualifier value.


On Fri, May 2, 2014 at 12:31 PM, Christopher ctubb...@apache.org wrote:

 I think there's a terminology mismatch in your question. It sounds
 like you're trying to remove single entries (Entry = Key/Value pair),
 not entire rows. Or, perhaps worded another way, you're trying to
 remove specific column families or columns from some rows. Is that
 correct?

 To delete an entry, you need to use Mutation.putDelete() to insert a
 delete entry for a particular key you wish to remove. Typically you
 either know the key you wish to delete already, and can just insert
 the corresponding delete entry, or you have to scan to identify
 matching entries to delete, and issue deletes for each one that
 matches your delete criteria.
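
 For example, a delete for a key you already know looks roughly like this
 (the row and column names are made up; "writer" is an open BatchWriter on
 the table):

   Mutation m = new Mutation(new Text("row1"));
   m.putDelete(new Text("cf"), new Text("cq"));   // marks row1 cf:cq as deleted
   writer.addMutation(m);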

 The BatchDeleter helps you do the latter
 [Connector.createBatchDeleter()]. The BatchDeleter is like a scanner
 and a writer combined. You specify the scan criteria (which columns,
 ranges, iterators, etc.) to find the entries you wish to delete, and
 then you call its delete() method to scan and delete the matching
 entries. You can ensure your scan criteria is correct by issuing the
 same parameters to a BatchScanner that you would to the BatchDeleter,
 and ensuring the returned results are only those entries you wish to
 delete, before executing the BatchDeleter.
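
 Put together, a BatchDeleter sketch looks roughly like this (the table name,
 range bounds and column family are made up for illustration):

   import java.util.Collections;
   import org.apache.accumulo.core.client.BatchDeleter;
   import org.apache.accumulo.core.client.BatchWriterConfig;
   import org.apache.accumulo.core.client.Connector;
   import org.apache.accumulo.core.data.Range;
   import org.apache.accumulo.core.security.Authorizations;
   import org.apache.hadoop.io.Text;

   void deleteMatching(Connector conn) throws Exception {
     BatchDeleter bd = conn.createBatchDeleter("mytable", Authorizations.EMPTY,
         4 /* query threads */, new BatchWriterConfig());
     try {
       bd.setRanges(Collections.singleton(new Range("row_a", "row_z")));
       bd.fetchColumnFamily(new Text("cf"));   // same criteria a BatchScanner takes
       bd.delete();                            // scans and issues the deletes
     } finally {
       bd.close();
     }
   }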

 If you only have a few entries to delete, you can delete them using
 the shell, with either the "delete" or "deletemany" command. The
 latter takes iterator and column options, just like a scanner. See the
 shell's internal help ("help" command) for more details.



 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Fri, May 2, 2014 at 9:55 AM, Marko Escriba escribama...@gmail.com
 wrote:
  Hi,
 
  Is it possible to remove a number of rows from a table based on its Column
  Qualifier or Family?
  I have noticed that I can only remove one row, the latest inserted row (based
  on timestamp). In case you want to ask why I need to remove rows, it is
  because I need to revert the failed transaction made by the webapp. Any
  advice on this please.
 
  Thanks.
 
 
 
  --
  View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/Remove-Row-Data-tp9573.html
  Sent from the Developers mailing list archive at Nabble.com.



Re: SQL layer over Accumulo?

2014-05-01 Thread William Slacum
The wikisearch example provides something similar to a local index. Rather
than stuffing things into two tablets, a single row in Accumulo contains both
the index and the data, stored in separate column families. Iterator trees are
used to execute queries and retrieve data within that row.


On Thu, May 1, 2014 at 2:24 AM, James Taylor jamestay...@apache.org wrote:

 Thanks for the explanations, Josh. This sounds very doable. Few more
 comments inline below.

 James


 On Wed, Apr 30, 2014 at 8:37 AM, Josh Elser josh.el...@gmail.com wrote:

 
 
  On 4/30/14, 3:33 AM, James Taylor wrote:
 
  On Tue, Apr 29, 2014 at 11:57 AM, Josh Elser josh.el...@gmail.com
  wrote:
 
   @Josh - it's less baked in than you'd think on the client where the
 query
 
  parsing, compilation, optimization, and orchestration occurs. The
  client/server interaction is hidden behind the ConnectionQueryServices
  interface, the scanning behind ResultIterator (in
  particular ScanningResultIterator), the DML behind MutationState, and
  KeyValue interaction behind KeyValueBuilder. Yes, though, it would
  require
  some more abstraction, but probably not too bad, though. On the
  server-side, the entry points would all be different and that's where
  I'd
  need your insights for what's possible.
 
 
  Definitely. I'm a little concerned about what's expected to be provided
  by
  the database (HBase, Accumulo) as I believe HBase is a little more
  flexible in allowing writes internally where Accumulo has thus far said
  you're gonna have a bad time.
 
 
 
  Tell me more about what you mean by allowing writes internally.
 
 
  Haha, sorry, that was a sufficiently ominous statement with insufficient
  context.
 
  For discussion sake, let's just say HBase coprocessors and Accumulo
  iterators are equivalent, purely in the scope of running server-side
 code
  (in the RegionServer/TabletServer). However, there is a notable
 difference
  in the pipeline where each of those are implemented.
 
  Coprocessors have built-in hooks that let you get updates on
  PUT/GET/DELETE/etc as well as pre and post each of those operations. In
  other words, they provide hooks at a high database level.
 
  Iterators tend to be much closer to the data itself, only dealing with
  streams of data (other iterators stacked on one another). Iterators
  implement versioning, visibilities, and can even implement complex
  searches. The downside of this approach is that iterators lack any means
 to
  safely write data _outside of the sorted Key-Value pairs in the tablet
  currently being processed_. It's possible to make in-tablet updates, but
  sorted order within a large tablet might make this difficult as well.
 
  This is why I was thinking percolator would be a better solution, as it's
  meant for handling updates like this server-side. However, I imagine it
  would be possible, in the short-term, to make some separate process
 between
  Phoenix and Accumulo which handles writes.


 Another fallback might be to do global index maintenance on the client.
 It'd just be more expensive, especially if you want to handle out-of-order
 updates (which are particularly tricky, as you have to get multiple
 versions of the rows to work out all the different scenarios here).

 A second fallback might be to support only local indexing. Does Accumulo
 have the concept of a custom load balancer that would allow you to
 co-locate two regions from different tables? The local-index features has
 kind of driven some feature requests on that front for HBase - mainly
 callbacks when a region is split or re-located. The rows of the local index
 are prefixed with the region start key to keep them together and identify
 them.

 
 
 
 
 
@Eric - I agree about having txn support (probably through snapshot
 
  isolation) by controlling the timestamp, and then layering indexing on
  top
  of that. That's where we're headed. But I wouldn't let that stop the
  effort
  - it would just be layered on top of what's already there. FWIW,
 there's
  another interesting indexing model that has been termed local
  indexing(
  https://github.com/Huawei-Hadoop/hindex) which is being worked on
 right
  now
  (should be available in either our 4.1 or 4.2 release). In this model,
  the
  table data and index data are co-located on the same region server
  through
  a kind of buddy region mechanism. The advantage is that you take no
  hit
  at write time, as you're writing both the index and table data
 together.
  Not sure how/if this would transfer over to the Accumulo world.
 
 
  Interesting. Given that Accumulo doesn't have a fixed column family
  schema, this might make index generation even easier (maybe cleaner
 is
  the proper word). You could easily co-locate the indices with the data,
   giving them a proper name.
 
 
  With HBase, you can do something similar (though, you're right, you'd
 need
  to create the column family upfront or take the hit of creating it
  dynamically - that's a nice feature that 

Re: [VOTE] Accumulo 1.6.0-RC4

2014-04-28 Thread William Slacum
Do you think doing this on a Friday was a good idea? I know that point came
up earlier, and it was possibly due to already discovered issues that would
fail the release, but I think the lack of traffic on here is significant.


On Fri, Apr 25, 2014 at 8:37 PM, Christopher ctubb...@apache.org wrote:

 Correction on the vote end date. It's:
 Tue, 2014 April 29 01:00 UTC ... or
 Mon, 2014 April 28 21:00 EDT (9pm)

 The initial email had the wrong date (28 instead of 29).

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Fri, Apr 25, 2014 at 8:35 PM, Christopher ctubb...@apache.org wrote:
  Accumulo Developers,
 
  Please consider the following candidate for Accumulo 1.6.0.
 
  Git Commit: 95ddea99e120102ce3316efbbe4948b574e59bc3
  Branch: 1.6.0-RC4
 
  Staging repo:
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010
  Source:
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/org/apache/accumulo/accumulo/1.6.0/accumulo-1.6.0-src.tar.gz
  Binary:
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/org/apache/accumulo/accumulo/1.6.0/accumulo-1.6.0-bin.tar.gz
  (Append .sha1, .md5 or .asc to download the signature/hash for a
  given artifact.)
 
  All artifacts were built and staged with:
  mvn release:prepare && mvn release:perform
 
  Signing keys available at: https://www.apache.org/dist/accumulo/KEYS
 
  Release notes (in progress):
 http://accumulo.apache.org/release_notes/1.6.0
 
  Changes since RC3 (`git log 5678e51..origin/1.6.0-RC4`):
 
  https://issues.apache.org/jira/browse/ACCUMULO-1219
  https://issues.apache.org/jira/browse/ACCUMULO-2523
  https://issues.apache.org/jira/browse/ACCUMULO-2569
  https://issues.apache.org/jira/browse/ACCUMULO-2654
  https://issues.apache.org/jira/browse/ACCUMULO-2707
  https://issues.apache.org/jira/browse/ACCUMULO-2713
  https://issues.apache.org/jira/browse/ACCUMULO-2714
  https://issues.apache.org/jira/browse/ACCUMULO-2715
  https://issues.apache.org/jira/browse/ACCUMULO-2716
  https://issues.apache.org/jira/browse/ACCUMULO-2717
  https://issues.apache.org/jira/browse/ACCUMULO-2718
  https://issues.apache.org/jira/browse/ACCUMULO-2720
  https://issues.apache.org/jira/browse/ACCUMULO-2723
  https://issues.apache.org/jira/browse/ACCUMULO-2726
  https://issues.apache.org/jira/browse/ACCUMULO-2728
  https://issues.apache.org/jira/browse/ACCUMULO-2729
  https://issues.apache.org/jira/browse/ACCUMULO-2733
  https://issues.apache.org/jira/browse/ACCUMULO-2734
 
  This vote will remain open for 72 hours (3 days), until Tue, 2014
  April 28 01:00 UTC.
  (That's 9pm EDT on Monday.)
 
  [ ] +1 - I have verified and accept...
  [ ] +0 - I have reservations, but not strong enough to vote against...
  [ ] -1 - Because..., I do not accept...
  ... these artifacts as the 1.6.0 release of Apache Accumulo.
 
  Thanks.
 
  P.S. Hint: download the whole staging repo with
  wget -erobots=off -r -l inf -np -nH
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/
  # note the trailing slash is needed
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii



Re: [VOTE] Accumulo 1.6.0-RC4

2014-04-28 Thread William Slacum
I was concerned about the lack of activity. I don't have a personal need
for an extension, but I do recall a discussion about Friday RC's
potentially being problematic in the past, which is why I brought it up.


On Mon, Apr 28, 2014 at 11:21 AM, Christopher ctubb...@apache.org wrote:

 I don't know what everyone's schedules are. If the point of a vote was
 to begin performing testing, I'd say yes, or if this were RC1, I'd say
 yes (or extended it to 4 days so it's not a surprise). However, since
 we're already in the RC mindset, having had 3 prior ones already, an
 RC4 was already expected to be forthcoming. Since I don't think any of
 the issues since RC3 invalidate the previous testing, and because this
 is RC4, having gone through several previous candidates, I think a 3
 day vote starting on Friday is fine. That gives many people an
 opportunity to examine the release candidate's changes since the last
 one, whether they do so on a weekend or whether they do so on Monday.

 I'm not concerned about the lack of initial activity... that's usually
 the pattern for votes.

 Do you think you need extra time to evaluate the release candidate? Do
 we need to discuss an extension?

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Mon, Apr 28, 2014 at 8:24 AM, William Slacum
 wilhelm.von.cl...@accumulo.net wrote:
  Do you think doing this on a Friday was a good idea? I know that point
 came
  up earlier, and it was possibly due to already discovered issues that
 would
  fail the release, but I think the lack of traffic on here is significant.
 
 
  On Fri, Apr 25, 2014 at 8:37 PM, Christopher ctubb...@apache.org
 wrote:
 
  Correction on the vote end date. It's:
  Tue, 2014 April 29 01:00 UTC ... or
  Mon, 2014 April 28 21:00 EDT (9pm)
 
  The initial email had the wrong date (28 instead of 29).
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Apr 25, 2014 at 8:35 PM, Christopher ctubb...@apache.org
 wrote:
   Accumulo Developers,
  
   Please consider the following candidate for Accumulo 1.6.0.
  
   Git Commit: 95ddea99e120102ce3316efbbe4948b574e59bc3
   Branch: 1.6.0-RC4
  
   Staging repo:
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010
   Source:
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/org/apache/accumulo/accumulo/1.6.0/accumulo-1.6.0-src.tar.gz
   Binary:
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/org/apache/accumulo/accumulo/1.6.0/accumulo-1.6.0-bin.tar.gz
   (Append .sha1, .md5 or .asc to download the signature/hash for a
   given artifact.)
  
   All artifacts were built and staged with:
    mvn release:prepare && mvn release:perform
  
   Signing keys available at: https://www.apache.org/dist/accumulo/KEYS
  
   Release notes (in progress):
  http://accumulo.apache.org/release_notes/1.6.0
  
   Changes since RC3 (`git log 5678e51..origin/1.6.0-RC4`):
  
   https://issues.apache.org/jira/browse/ACCUMULO-1219
   https://issues.apache.org/jira/browse/ACCUMULO-2523
   https://issues.apache.org/jira/browse/ACCUMULO-2569
   https://issues.apache.org/jira/browse/ACCUMULO-2654
   https://issues.apache.org/jira/browse/ACCUMULO-2707
   https://issues.apache.org/jira/browse/ACCUMULO-2713
   https://issues.apache.org/jira/browse/ACCUMULO-2714
   https://issues.apache.org/jira/browse/ACCUMULO-2715
   https://issues.apache.org/jira/browse/ACCUMULO-2716
   https://issues.apache.org/jira/browse/ACCUMULO-2717
   https://issues.apache.org/jira/browse/ACCUMULO-2718
   https://issues.apache.org/jira/browse/ACCUMULO-2720
   https://issues.apache.org/jira/browse/ACCUMULO-2723
   https://issues.apache.org/jira/browse/ACCUMULO-2726
   https://issues.apache.org/jira/browse/ACCUMULO-2728
   https://issues.apache.org/jira/browse/ACCUMULO-2729
   https://issues.apache.org/jira/browse/ACCUMULO-2733
   https://issues.apache.org/jira/browse/ACCUMULO-2734
  
   This vote will remain open for 72 hours (3 days), until Tue, 2014
   April 28 01:00 UTC.
   (That's 9pm EDT on Monday.)
  
   [ ] +1 - I have verified and accept...
   [ ] +0 - I have reservations, but not strong enough to vote against...
   [ ] -1 - Because..., I do not accept...
   ... these artifacts as the 1.6.0 release of Apache Accumulo.
  
   Thanks.
  
   P.S. Hint: download the whole staging repo with
   wget -erobots=off -r -l inf -np -nH
  
 
 https://repository.apache.org/content/repositories/orgapacheaccumulo-1010/
   # note the trailing slash is needed
  
   --
   Christopher L Tubbs II
   http://gravatar.com/ctubbsii
 



Re: increasing balancing problems to WARN

2014-04-18 Thread William Slacum
We could consider the use of markers to throw in more metadata about the
relevance of a particular log message.


On Fri, Apr 18, 2014 at 10:46 PM, Sean Busbey bus...@cloudera.com wrote:

 I also try to limit what goes at higher warning levels.  One of my goals
 over the next few months is to improve our current logging. It sounds like
 this is a good time to make sure we're on the same page.

 We're going to have to train users on something (esp. since our current
 logging is very noisy). The short version I like is: info and more severe
 are for operators; less severe than info is for developers.

 Here's what I usually use as a guideline (constrained to slf4j levels):


 = ERROR

 Something is wrong and an operator needs to do something, preferably very
 soon. In other words, if I was on call I'd expect to get paged.

 = WARN

 Something is amiss, but not of immediate concern. An operator who is on
 call but not busy at the moment might want to investigate some kind of
 underlying issue, but the system will continue to function within some
 reasonable bound.

 = INFO

 Summary information about normal operations that is safe to ignore. GC
 information, throughput stats, that kind of thing.

 = DEBUG

 Low level information that is not normally useful, but will help determine
 the cause of a system malfunction. Usually something a developer or tier 3
 supporter would want when something was going wrong (e.g. stack traces).

 = TRACE

 Detailed low level information at a volume that probably can't be gathered
 in production.


 Eric, do those all sound reasonable? I want to make sure we have a common
 basis before I get into the specifics of this case.

 -Sean

 On Fri, Apr 18, 2014 at 8:21 PM, Eric Newton eric.new...@gmail.com
 wrote:

  -1
 
  I would hesitate to put *any* message at WARN. It is normal for balancing
  to take a little while, especially for some of my users who have their
 own
  balancing algorithm.
 
  Users feel the need to fix the problem; after all, it's there in big
 scary
  yellow on the monitor page.   I don't like training users to ignore scary
  yellow.  Is it a problem, or not?
 
  Alternatively, put the balance info into the master status, and display
 it.
   Like GC collection time... hey, I've been migrating these tablets for a
  long time... turn yellow/red.
 
  -Eric
 
 
 
 
  On Fri, Apr 18, 2014 at 4:03 PM, Sean Busbey bus...@cloudera.com
 wrote:
 
   At the moment all of our logs about problems balancing are at DEBUG.
  
   Given the impact to a cluster when this happens (skewing load onto few
   servers, in some case severely), I'd like to raise it to WARN so that
 it
   surfaces for operators in the Monitor and in the non-debug log.
  
   Thought I'd do a quick lazy consensus check before filing a jira and
  taking
   care of it.
  
   --
   Sean
  
 



 --
 Sean



Re: 1.6.0 RCs release manager?

2014-04-07 Thread William Slacum
I was under the impression that John Heard It Through The Grape Vines was
the release manager.


On Mon, Apr 7, 2014 at 7:15 PM, Christopher ctubb...@apache.org wrote:

 Who is the volunteer for creating 1.6.0 RCs?

 I'm willing to build them and start the vote, but I had thought that
 somebody else (maybe John Vines?) had volunteered to be the release
 manager for 1.6.0, though I can't find the thread at the moment (if
 there was one).

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii



Re: [DISCUSS] MiniAccumuloCluster goals and approach

2014-03-28 Thread William Slacum
I think this is better reserved for a version later than 1.6.0. It's an
11th hour change in addition to being a large overhaul of the interfaces to
support functionality we never intended for 1.6.0.


On Fri, Mar 28, 2014 at 4:04 PM, Josh Elser josh.el...@gmail.com wrote:

 Forgot to also add, that I would add the experimental annotation to
 alleviate confusion.

 The already mocked minimum set of methods on the interface which I posted
 to GitHub is a first pass. If we miss something that is in fact common, we
 can add it later, anything else is likely destined for the implementation.

 On Friday, March 28, 2014, Keith Turner ke...@deenlo.com wrote:

  On Fri, Mar 28, 2014 at 3:14 PM, Josh Elser josh.el...@gmail.com
 javascript:;
  wrote:
 
   Not even the addition of a new interface, Christopher? I'd very much
 like
   to have an interface that we can get in 1.6.0 at a minimum. I wouldn't
  even
   push for any deprecation of what's currently in place.
  
 
  W/o deprecation it seems very confusing.   The intent is that users
 should
  use the new one, but the old one is not deprecated.  If someone is
  completely new to this, how will they know which option to use?
 
  Once you get down in the weeds of working on this, do you think you might
  end up wanting something very different?
 
 
 
   On Mar 28, 2014 10:02 AM, Christopher ctubb...@apache.org wrote:
  
I don't think any of this should be done for 1.6.0, but I like the
idea of creating a separate cluster interface for testing. I think it
should be integrated into the accumulo-maven-plugin, also. I think
 the
idea should be hammered out, and tested as a separate thing, to
experiment with the options, and provided as a complete feature for
the next major release. If it would change packaging dependencies, it
shouldn't even be done for 1.6.x bugfix releases.
   
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
   
   
On Fri, Mar 28, 2014 at 12:24 PM, Josh Elser josh.el...@gmail.com
   wrote:
 Oh, I like that idea, Bill & Sean.

 Package: org.apache.accumulo.cluster
 Public API: org.apache.accumulo.cluster.AccumuloCluster
 MAC: org.apache.accumulo.cluster.mini.MiniAccumuloCluster
 (implements
 AccumuloCluster, allows for backwards compat)
 Yarn: org.apache.accumulo.cluster.yarn
 Docker: ...
 Mesos: ...

 etc etc etc.
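
 A minimal sketch of what that public interface might contain (the package is
 the one floated above, and the methods are just the ones MAC already exposes;
 purely illustrative):

   package org.apache.accumulo.cluster;

   public interface AccumuloCluster {
     void start() throws Exception;
     void stop() throws Exception;
     String getInstanceName();
     String getZooKeepers();
   }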

 One question in my mind, do we keep the maven module
'accumulo-minicluster'?
 I would imagine that if we struck the 'mini' portion from 1.6 that
   would
 create some confusion. Would it be worth the indirection to rename
 accumulo-minicluster to accumulo-cluster and then create a new
 accumulo-minicluster module that depends on accumulo-minicluster
 (but
 contains no code itself) to preserve the 1.4 and 1.5 poms to
  generally
work
 with a version bump? I'm not sure if Maven would be happy with that
  or
   do
 what I think it should.


 On 3/28/14, 6:26 AM, Bill Havanki wrote:

 I've been watching the conversation on the side, but I wanted to
   mention
 that it seems the focus isn't so much on mini clusters anymore.
   You're
 thinking of programmatic cluster management, whether one node or
  many.
The
 idea of a basic cluster management interface, with MAC as an
 implementation, is promising. A package name of just cluster
 could
work.

 Carry on :)

 Bill H


 On Fri, Mar 28, 2014 at 12:39 AM, Sean Busbey
 busbey+li...@cloudera.comwrote:

 If you decide to go the mapred/mapreduce way, you could go with
 the
 package
 name mini.

 alternatively, we can do a multi-stage change out

 1)  1.6.x:  introduce TestAccumuloCluster interface, @deprecate
 MiniAccumuloCluster class and make it implement
 TestAccumuloCluster

 2) 1.6 + major: change MiniAccumuloCluster to an interface that
   extends
 TestAccumuloCluster, @deprecate TestAccumuloCluster

 3) 1.6 + 2 major: remove TestAccumuloCluster

 Or just go with TestAccumuloCluster as the interface, have
 MiniAccumuloCluster as the local pseudo distributed
 implementation,
   and
 then call your new one something like YarnAccumuloCluster.

 In that case we could use the deprecation cycle to move the MAC
  class
out
 of the public api.


 On Thu, Mar 27, 2014 at 6:48 PM, Josh Elser 



Re: [VOTE] Accumulo 1.4.5 RC-1

2014-03-27 Thread William Slacum
I was under the impression that a functioning Wikisearch was a requirement
for 1.4.5, as it would be consistent with all previous 1.4.x releases.


On Wed, Mar 26, 2014 at 11:35 PM, Josh Elser josh.el...@gmail.com wrote:

 Thanks, Sean.


 On 3/26/14, 8:24 PM, Sean Busbey wrote:

 Filed ACCUMULO-2564 as BLOCKER against the 1.4.5 release[1].


 [1]: https://issues.apache.org/jira/browse/ACCUMULO-2564


 On Wed, Mar 26, 2014 at 10:06 PM, Josh Elser josh.el...@gmail.com
 wrote:

  I haven't seen any issues in 1.5 with Hadoop 1 and 2 compatibility, so
 I'm
 not sure where your feelings of risk are coming from.

 Also, I've already said that this is very important to me. I do not view
 this as a marginal benefit. I view this as half-porting Hadoop2 support
 to
 1.4.


 On 3/26/14, 6:53 PM, Sean Busbey wrote:

  The reflection stuff to obviate having a special build went into 1.5
 in a
 hurry between RCs. Frankly, I think including it would introduce
 unnecessary risk to the stability of the 1.4 line for marginal benefit.






Re: [DISCUSS] MiniAccumuloCluster goals and approach

2014-03-26 Thread William Slacum
[NOTE: I started this email when this thread was new, and it kind of blew
up on me while writing it and being distracted. Apologies in advance if
things were already covered or it's not relevant any more.]

Is this a design quality discussion or a functionality discussion?

The changes from 1.5-1.6 seem like a poor design decision, but they do aid
in functionality.

From 1.5:
  public MiniAccumuloCluster(File dir, String rootPassword) throws
IOException
  public MiniAccumuloCluster(MiniAccumuloConfig config) throws IOException
  public void start() throws IOException, InterruptedException
  public String getInstanceName()
  public String getZooKeepers()
  public void stop() throws IOException, InterruptedException

From 1.6:
  public MiniAccumuloCluster(File dir, String rootPassword) throws
IOException
  public MiniAccumuloCluster(MiniAccumuloConfig config) throws IOException
  public void start() throws IOException, InterruptedException
  public Set<Pair<ServerType,Integer>> getDebugPorts()
  public String getInstanceName()
  public String getZooKeepers()
  public void stop() throws IOException, InterruptedException
  public MiniAccumuloConfig getConfig()
  public Connector getConnector(String user, String passwd) throws
AccumuloException, AccumuloSecurityException
  public ClientConfiguration getClientConfig()

From a client perspective, I see a difference of #getDebugPorts,
#getConfig, #getConnector, #getClientConfig. The other methods are on the
Impl. There's nothing wrong with using aggregation in this case, since the
code would be the same regardless.

I don't quite understand what it means to extend generically. At this
point, the MiniAccumuloCluster's interface of the MiniAccumuloClusterImpl's
interface. The naming could, and should, be better, but I don't quite get
where we're losing functionality.



On Wed, Mar 26, 2014 at 12:06 PM, Keith Turner ke...@deenlo.com wrote:

 There were many changes made to MAC so Accumulo could test itself.  For
 example a method was added to return the internal threads that flush logs.
 Flushing the logs may be a useful feature to add.  However it could be
 offered in a way that does not expose these internal threads.   When
 working on  ACCUMULO-2151 I had no desire to reimplement things like this,
 I just wanted to hide it.  It was hidden from users so we do not have to
 support it and can change it at will when testing 1.7.0.

 As Sean said MAC was a class in 1.4.4, 1.5.0, and 1.5.1.  So making it an
 interface would break things for any users using it.  Any reorganizing of
 the implementation of MAC could easily be done after 1.6.0.  From a users
 perspective the MAC API has changed very little, even though the
 implementation has dramatically changed.




 On Wed, Mar 26, 2014 at 3:10 AM, Sean Busbey busbey+li...@cloudera.com
 wrote:

  ACCUMULO-2143 has developed a conversation about MiniAccumuloCluster's
  intended use and the way we currently implement the difference between
 MAC
  for external use and MAC for internal Accumulo testing[1].
 
  In particular, Josh had a few major concerns
 
  -
 
  It doesn't make sense to me why MiniAccumuloCluster is a concrete class
  which is, ultimately, still tied to a MiniAccumuloClusterImpl.
  MiniAccumuloCluster *requires* a MiniAccumuloClusterImpl or something
 that
  extends it. This is what's really chafing me about the separation of
  accumulo user and accumulo developer methods - you *always* have them
  both. Not to mention, this hierarchy is really obnoxious to create a new
  implementation of AccumuloMiniCluster(Impl) because I have to carry all
 of
  the cruft of the original implementation with me.
 
  Bringing this back around to this ticket, while I still don't agree with
  the reasoning that exposing the FileSystem or ZooKeeper object that
  MiniAccumuloClusterImpl is getting us anything other than the ability to
  say we didn't change this [arbitrary] API. For users who might not
 care
  what the underlying FileSystem or ZooKeeper connection, it's merely an
  extra two items in their editor's code-completion. For users who would
  care to use this information, we now make them jump through extra hoops
 to
  get it. That just doesn't make any sense to me for something we haven't
  even released.
 
  To be honest, I really want to re-open
  ACCUMULO-2151https://issues.apache.org/jira/browse/ACCUMULO-2151,
  make MiniAccumuloCluster an interface, MiniAccumuloClusterImpl an
  implementation of said interface, and create some factory class to make
  instances, ala Connector.tableOperations, Connector.securityOperations,
  etc. Right now there's a class we call an API that cannot be
 generically
  extended for the sake of saying we have an API.
 
  
 
  I wanted to avoid having a drawn out discussion on a jira, where folks my
  not notice it. Especially with things being late in 1.6.0 development and
  the potential this has to impact the API.
 
  Personally, I don't have much of a dog in the fight. 

Re: [DISCUSS] MiniAccumuloCluster goals and approach

2014-03-26 Thread William Slacum
Correction from my previous email:

At this point, the MiniAccumuloCluster's interface of the
MiniAccumuloClusterImpl's interface.

should read

At this point, the MiniAccumuloCluster's interface is a subset of the
MiniAccumuloClusterImpl's interface.


On Wed, Mar 26, 2014 at 1:10 PM, William Slacum 
wilhelm.von.cl...@accumulo.net wrote:

 [NOTE: I started this email when this thread was new, and it kind of blew
 up on me while writing it and being distracted. Apologies in advance if
 things were already covered or it's not relevant any more.]

 Is this a design quality discussion or a functionality discussion?

 The changes from 1.5-1.6 seem like a poor design decision, but they do
 aid in functionality.

 From 1.5:
   public MiniAccumuloCluster(File dir, String rootPassword) throws
 IOException
   public MiniAccumuloCluster(MiniAccumuloConfig config) throws IOException
   public void start() throws IOException, InterruptedException
   public String getInstanceName()
   public String getZooKeepers()
   public void stop() throws IOException, InterruptedException

 From 1.6:
   public MiniAccumuloCluster(File dir, String rootPassword) throws
 IOException
   public MiniAccumuloCluster(MiniAccumuloConfig config) throws IOException
   public void start() throws IOException, InterruptedException
   public Set<Pair<ServerType,Integer>> getDebugPorts()
   public String getInstanceName()
   public String getZooKeepers()
   public void stop() throws IOException, InterruptedException
   public MiniAccumuloConfig getConfig()
   public Connector getConnector(String user, String passwd) throws
 AccumuloException, AccumuloSecurityException
   public ClientConfiguration getClientConfig()

 From a client perspective, I see a difference of #getDebugPorts,
 #getConfig, #getConnector, #getClientConfig. The other methods are on the
 Impl. There's nothing wrong with using aggregation in this case, since the
 code would be the same regardless.

 I don't quite understand what it means to extend generically. At this
 point, the MiniAccumuloCluster's interface of the MiniAccumuloClusterImpl's
 interface. The naming could, and should, be better, but I don't quite get
 where we're losing functionality.



 On Wed, Mar 26, 2014 at 12:06 PM, Keith Turner ke...@deenlo.com wrote:

 There were many changes made to MAC so Accumulo could test itself.  For
 example a method was added to return the internal threads that flush logs.
 Flushing the logs may be a useful feature to add.  However it could be
 offered in a way that does not expose these internal threads.   When
 working on  ACCUMULO-2151 I had no desire to reimplement things like this,
 I just wanted to hide it.  It was hidden from users so we do not have to
 support it and can change it at will when testing 1.7.0.

 As Sean said MAC was a class in 1.4.4, 1.5.0, and 1.5.1.  So making it an
 interface would break things for any users using it.  Any reorganizing of
 the implementation of MAC could easily be done after 1.6.0.  From a users
 perspective the MAC API has changed very little, even though the
 implementation has dramatically changed.




 On Wed, Mar 26, 2014 at 3:10 AM, Sean Busbey busbey+li...@cloudera.com
 wrote:

  ACCUMULO-2143 has developed a conversation about MiniAccumuloCluster's
  intended use and the way we currently implement the difference between
 MAC
  for external use and MAC for internal Accumulo testing[1].
 
  In particular, Josh had a few major concerns
 
  -
 
  It doesn't make sense to me why MiniAccumuloCluster is a concrete class
  which is, ultimately, still tied to a MiniAccumuloClusterImpl.
  MiniAccumuloCluster *requires* a MiniAccumuloClusterImpl or something
 that
  extends it. This is what's really chafing me about the separation of
  accumulo user and accumulo developer methods - you *always* have
 them
  both. Not to mention, this hierarchy is really obnoxious to create a new
  implementation of AccumuloMiniCluster(Impl) because I have to carry all
 of
  the cruft of the original implementation with me.
 
  Bringing this back around to this ticket, while I still don't agree with
  the reasoning that exposing the FileSystem or ZooKeeper object that
  MiniAccumuloClusterImpl is getting us anything other than the ability to
  say we didn't change this [arbitrary] API. For users who might not
 care
  what the underlying FileSystem or ZooKeeper connection, it's merely an
  extra two items in their editor's code-completion. For users who would
  care to use this information, we now make them jump through extra hoops
 to
  get it. That just doesn't make any sense to me for something we haven't
  even released.
 
  To be honest, I really want to re-open
  ACCUMULO-2151https://issues.apache.org/jira/browse/ACCUMULO-2151,
  make MiniAccumuloCluster an interface, MiniAccumuloClusterImpl an
  implementation of said interface, and create some factory class to make
  instances, ala Connector.tableOperations, Connector.securityOperations,
  etc. Right now

Re: [DISCUSS] clarification of release guide

2014-03-21 Thread William Slacum
I agree with Chief Keith. Clarity in the docs would be good.


On Fri, Mar 21, 2014 at 1:03 PM, Keith Turner ke...@deenlo.com wrote:

 I think the intention is 1 24h w/ agitation AND 1 24h w/o agitation




 On Fri, Mar 21, 2014 at 12:54 PM, Sean Busbey busbey+li...@cloudera.com
 wrote:

  Hi!
 
  Our release guide[1] has some lines that I'd like to clarify.
 
 
 1. Two 24-hour periods of the randomwalk LongClean test with and
 without
 agitation need to be run successfully.
 2. Two 24-hour periods of continuous ingest with and without agitation
 need to be validated successfully.
 3. Two 72-hour periods of continuous ingest with and without
 agitation.
 No validation is necessary but the cluster should be checked to ensure
  it
 is still functional.
 
  Is the intention on each of these lines to have 4 total test periods?
 
  That is does number 1, for example, mean I need
 
  Period 1: 24 hr LongClean with agitation
  Period 2: 24 hr LongClean with agitation
  Period 3: 24 hr LongClean without agitation
  Period 4: 24 hr LongClean without agitation
 
  Or is the intention for number 1 to mean:
 
  Period 1: 24 hr LongClean with agitation
  Period 2: 24 hr LongClean without agitation
 
  Presuming we quickly have consensus, does anyone feel I'd need a [VOTE]
  thread to rewrite the section to match, or would the [DISCUSS] thread be
  sufficient?
 
 
  -Sean
 
  [1]: http://accumulo.apache.org/governance/releasing.html#cluster-based
 



Re: Accumulo site Bootstrapped

2014-03-05 Thread William Slacum
I'm a fan of Bootstrap and those pages are looking sexy. Not a big fan of how
the 1.4 / 1.5 links show up in the navigation bar on the left though.


On Wed, Mar 5, 2014 at 5:40 PM, Bill Havanki bhava...@clouderagovt.comwrote:

 Some folks in the IRC room were discussing how nice the Spark [1] and Hue
 [2] sites look compared to ours. While babysitting integration tests, I
 decided to prototype a rework of our site using Twitter Bootstrap [3], the
 front-end framework that both of those other sites use.

 Here are the pages that I converted.

 * http://people.apache.org/~bhavanki/accumulo-bootstrapped/
 *

 http://people.apache.org/~bhavanki/accumulo-bootstrapped/notable_features.html
 * http://people.apache.org/~bhavanki/accumulo-bootstrapped/source.html

 You can navigate between those pages using the left nav menu, but try
 anywhere else and you'll jump out to the production site.

 The pages use Bootstrap's own theme, with only very slight modifications to
 be close to our own theme. (I actually disabled around 90% of
 accumulo.css.) I kept the page organization like production, although we
 have many other whizbang options with Bootstrap. Some bits I left messy,
 like the nav items for the user manuals, but you should get the idea
 anyway.

 Beyond just how it looks, Bootstrap gives you many other capabilities,
 especially responsive display for mobile and tablets, so there's benefit to
 a switch beyond just pretty looking boxes.

 [1] spark.apache.org
 [2] gethue.com
 [3] getbootstrap.com

 --
 // Bill Havanki
 // Solutions Architect, Cloudera Govt Solutions
 // 443.686.9283



Re: [DISCUSS] Accumulo Bylaws

2014-02-18 Thread William Slacum
Mike, add the --all parameter to the log statement to go across the
entire repo:

git log --all --pretty=format:'%an' --since='6 months ago' | sort | uniq -c

This is slightly more portable for those of us on OSX w/ BSD date.


On Tue, Feb 18, 2014 at 4:56 PM, Mike Drob mad...@cloudera.com wrote:

 I would like to think that the ASF would prevent us from doing something
 incredibly stupid, because we have to refer removal votes to them anyway.
 What problem are you trying to address, Dave? Both unanimous votes to
 remove, and lazy consensus vote to re-instate can be ground to a halt by a
 single voice of reason.


 On Tue, Feb 18, 2014 at 4:53 PM, John Vines vi...@apache.org wrote:

  Because there may, someday (ideally never), be someone who needs to
  be removed
  who should not be granted access back.
 
 
  On Tue, Feb 18, 2014 at 4:46 PM, dlmar...@comcast.net wrote:
 
   We are not removing them as a committer, we are just revoking their
  commit
   access to the code repo due to inactivity. I agree with consensus for
   removing them as a committer in general, but not for revoking commit
  access
   due to inactivity. I would imagine that all they have to do to regain
  their
    access is send an email to the list saying, "I tried to commit a code change
    but could not log in."
  
   -Original Message-
   From: John Vines [mailto:vi...@apache.org]
   Sent: Tuesday, February 18, 2014 4:41 PM
   To: Accumulo Dev List
   Subject: Re: [DISCUSS] Accumulo Bylaws
  
   Because it should be hard to remove someone but easy to bring them
 back.
  
  
   On Tue, Feb 18, 2014 at 4:36 PM, dlmar...@comcast.net wrote:
  
 I do think it's in our interest to keep the committership and PMC
membership mostly active. For example, having many inactive
 committers
brings a higher risk of a compromised committer account causing
  trouble.
   
+1
   
Do we know which committers have not committed a change in 6 months?
   
I see that  Commit access can be revoked by a unanimous vote of all
 the active PMC members, but re-instatement is by lazy consensus. Why
are they different?
   
   
-Original Message-
From: Bill Havanki [mailto:bhava...@clouderagovt.com]
Sent: Tuesday, February 18, 2014 11:39 AM
To: dev@accumulo.apache.org
Subject: Re: [DISCUSS] Accumulo Bylaws
   
My comments and minor edits are in the doc, I'll bring up bigger
issues on this list.
   
Re emeritus status for committers: I'd like it not to constitute an
automatic kicking you off the island action. For example, I
 wouldn't
want to close off commit access on day 181. It can be a time when we
automatically check on the level of involvement an emeritus / emerita
wishes to keep. I'm fine with softening the bylaw verbiage in that
regard.
   
I do think it's in our interest to keep the committership and PMC
membership mostly active. For example, having many inactive
 committers
brings a higher risk of a compromised committer account causing
trouble.
Also, it'd be hard collecting a 2/3 majority of PMC members when many
are not paying any attention.
   
   
On Tue, Feb 18, 2014 at 11:35 AM, Joey Echeverria
joey...@clouderagovt.comwrote:
   
 Emeritus is not an official ASF designation. As far as the ASF is
 concerned, you're either a Committer, a PMC member, or both, or not
 at
all.

 The reason other projects use the emeritus designation is to avoid
 overstating active involvement. An emeritus member does not lose
 any privileges as far as ASF is concerned. If you want to remove
 privileges, I believe that the PMC has to vote to that effect.

 -Joey


 On Tue, Feb 18, 2014 at 11:06 AM, Sean Busbey
 busbey+li...@cloudera.com
 wrote:

  If people have substantive questions (as opposed to requests for
  edits / clarification), I'd rather they be here on the list.
 
  My main issue is the automatic transition to emeritus status for
 committers
  / PMCs at 6 months. That's a significant change. Do we know what
  the current impact of that would be?
 
 
  On Tue, Feb 18, 2014 at 9:04 AM, Bill Havanki
  bhava...@clouderagovt.com
  wrote:
 
   I have some minor edits and some questions about it, which I'll
   add as comments in the doc. I also agree that a weather
   allowance is a good
  idea.
  
  
   On Tue, Feb 18, 2014 at 9:49 AM, Mike Drob 
 mad...@cloudera.com
 wrote:
  
Thanks for putting it in a Google Doc, Arshak!
   
What issues do y'all see with this document in it's current
   state?
Personally, I think it looks fine and would be willing to
start a
 vote
  on
it, but I get the impression that east coast weather has
prevented
 some
folk from looking at it, so maybe another couple of days is
  fine.
   
   

Re: New committers!

2014-01-10 Thread William Slacum
Congrats!


On Fri, Jan 10, 2014 at 3:23 PM, Bill Havanki bhava...@clouderagovt.comwrote:

 Eric: :P

 ;)


 On Fri, Jan 10, 2014 at 2:37 PM, Eric Newton eric.new...@gmail.com
 wrote:

  Yay! No more patching their many contributions!  :-)
 
 
  On Fri, Jan 10, 2014 at 2:27 PM, Arshak Navruzyan arsh...@gmail.com
  wrote:
 
   Congrats Sean and Bill.  Great to see the community grow!
  
  
   On Fri, Jan 10, 2014 at 11:13 AM, Mike Drob mad...@cloudera.com
 wrote:
  
Congratulations, guys! Glad to have you on board!
   
   
On Fri, Jan 10, 2014 at 10:36 AM, Billie Rinaldi bil...@apache.org
wrote:
   
 I am pleased to announce that Bill Havanki and Sean Busbey have been
   voted
 to become new committers for Apache Accumulo.

 Welcome, Sean and Bill, and thanks for your ongoing contributions!
Feel
 free to say a few words about your development interests.

 Billie

   
  
 



 --
 | - - -
 | Bill Havanki
 | Solutions Architect, Cloudera Government Solutions
 | - - -



Re: [DISCUSS] API changes to provide resource cleanup

2014-01-02 Thread William Slacum
Voting for the hammer/hacksawjimdugging. I like the concept of being able to
track resources and clean them up, but the back end code isn't designed to
deal with an instance in the way we're trying to model it.


On Thu, Jan 2, 2014 at 2:46 PM, Josh Elser josh.el...@gmail.com wrote:

 Bill Slacum and I had talked about unexpected breakages in API for clients
 and internals by modifying ZooKeeperInstance (I think I might have mentioned
 it already on one of the tickets).

 Considering some of the other work that Mike has started on in regards to
 making an easier-to-use client API, Bill and I mused over an
 InstanceFactory notion which could wrap different Instance implementations
 for the various deployment requirements. We could leave the current ZKI
 (close to?) how it works now, lift the non thread-safe pieces to a common
 parent, and create some sort of ThreadsafeZKI.

 Obviously this is very hand-wavy, but I'm definitely leery to changing the
 default impl for something so prevalent as ZKI. Thinking about the problem
 with a clean slate seems best to me.


 On 1/2/14, 1:36 PM, Eric Newton wrote:

 All of our current code treats the Instance like a simple record:

 * immutable, and therefore
 * thread-safe
 * provides several fields that describe an instance

 When I tried to add calls to close() in our own code, I found that our
 disregard for the lifetime of an instance was implicit, and probably is in
 all our user's code, too.

 I think if we want to do something like #1, we'll have to do so through a
 new API, and not by changing Instance, and then deprecate Instance.  The
 mental model is just completely different.

 -Eric


 On Thu, Jan 2, 2014 at 12:47 PM, Sean Busbey busbey+li...@cloudera.com
 wrote:

  Hey Folks!

 We need to come to some conclusions on what we're going to do for
 resource
 clean up. I'll attempt to summarize the situation and various options.
 If I
 missed something from our myriad of tickets and mailing list threads,
 please bring it up.

 Brief Background:

 The existing client APIs presume that a large amount of global state will
 persist for the duration of a JVM instance. This is at odds with
 lifecycle
 management in application containers, where a JVM is very long lived and
 user provided applications are stood up and torn down. We have reports of
 this causing OOM on JBoss[1] and leaked threads on Tomcat[2].

 We have two possible solutions, both of which Jared Winick has kindly
 verified solve the problem, as seen on JBoss.

 
 = Proposed solution #1: Closeable Instance

 The first approach adds a .close method to Instance so that users can say
 when they are done with a given instance. Internally, reference counting
 determines when we tear down global resources.

 Advantages:
* States via code where a client should do lifecycle management.
* Allows shutting down just some of the resources used.
* Is already in the code base.

 Disadvantages:
* Since lifecycle is getting added post-hoc, we are more likely to
 have
 maintenance issues as we find other side effects we hadn't considered,
 like
 the multithreaded issue that already came up[3].
* Changes Instance from representing static configuration to shared
 state
* Doesn't work with the fluent style some of our APIs encourage.
* closed semantics probably aren't consistently enforced (e.g. users
 trying to use a BatchWriter that came from a now-closed instance should
 fail)

 To finish, we'd need to
* Verify multithreaded handling is done without too much of a
 performance
 impact[3]
* Finish making our internal use consistent with the lifecycle we're
 telling others to use[4]
* Possibly add tests to verify consistent enforcement of closing on
 objects derived from Instance
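
 To make the reference-counting idea concrete, a rough sketch (the class and
 method names are made up; this is not the actual patch):

   import java.util.concurrent.atomic.AtomicInteger;

   public class CloseableInstanceSketch {
     private static final AtomicInteger OPEN = new AtomicInteger();

     public CloseableInstanceSketch() {
       OPEN.incrementAndGet();
     }

     public void close() {
       if (OPEN.decrementAndGet() == 0) {
         // last open instance closed: tear down the shared ZooKeeper
         // sessions and background threads here
       }
     }
   }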

 = Proposed solution #2: Global cleanup utility, aka The Hammer

 As a band-aid to allow for unload resources without making changes to
 the
 API we instead provide a utility method that cleans up all global
 resources.

 Advantages:
* Doesn't change API or meaning for Instance
* Can be used on older Accumulo deployments w/o patch/rebuild cycle

 Disadvantages:
* Only allows all-or-nothing cleanup
* Doesn't address our underlying lack of lifecycle
* Requires reverts

 To finish, we'd need to
* revert commits from old solution (I haven't checked how many
 commits,
 but it's 6 tickets :/ )
* port code from PoC to main codebase (asf grants, etc) [6]
* add some kind of test (functional/IT?)

 -

 We need to decide what we're going to provide as a placeholder for
 releases
 already frozen on API (i.e. 1.4, 1.5, 1.6*) as well as longer term.

 Personally, my position is that we should use the simplest change to
 handle
 the published versions (solution #2).

 Obviously there are outstanding issues with how we deal with global state
 and shared resources in the current client APIs. I'd like to see that
 addressed as a part of a more coherent client lifecycle rather than
 

Re: Resource leak warnings

2013-12-30 Thread William Slacum
At best the javadoc is incomplete and at worst incorrect. If it were just
representing configuration information, it would be a structure containing
only fixed data like the zookeeper list and timeout. Instead, it creates
resources and has a direct handle to those resources via its own ZooCache
property and it contains convenience methods to create other resources like
connectors. A javadoc comment isn't enough to warrant ignoring resource
management.

Storing state statically is one thing, not cleaning up after ourselves is
another. We don't need a whole new API to do that because we've already
done that with the addition of `close()`. Keeping a list of
ZooKeeperInstances to close already provides the same functionality as
just shutting down everything with the utility, as well as the ability to
free a subset of the resources.
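
For illustration, the usage the added close() enables looks roughly like this
(the instance name, ZooKeeper hosts and credentials are made up):

  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.ZooKeeperInstance;
  import org.apache.accumulo.core.client.security.tokens.PasswordToken;

  void example() throws Exception {
    ZooKeeperInstance instance = new ZooKeeperInstance("myInstance", "zkhost:2181");
    try {
      Connector conn = instance.getConnector("user", new PasswordToken("secret"));
      // ... use the connector ...
    } finally {
      instance.close();   // frees this instance's hold on the shared ZooCache
    }
  }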

That being said, has anyone started on the utility so we can at least have
a comparison/bake off? I assume this is going to block 1.6.0/1.5.1.



On Fri, Dec 27, 2013 at 6:52 PM, Christopher ctubb...@apache.org wrote:

 The javadoc for Instance says: This class represents the information
 a client needs to know to connect to an instance of accumulo.

 There's no mention of connection resources or shared state, or any
 indication that it is used for anything other than a one-time method
 to get a connection... it seems to be defined as configuration
 information. The fact that we're talking about it representing
 connection resources (which aren't even stored in ZooKeeperInstance
 itself, but happens to use some of the shared state we're talking
 about for its own implementation), is a bit confusing in the context
 of the declared semantics from the javadoc.

 The fact is, we store state statically, as global resources, in the
 JVM, and (I think) changing the definition of Instance to represent
 this statically stored state, is very confusing. I think a static
 utility makes a lot more sense to clean up static shared state hidden
 deep in the implementation... until we can invent (in a new API) an
 actual ConnectionResources object to represent connection resources,
 with a well-defined lifetime (not for the duration of the JVM's
 lifetime, as it currently is defined in released versions) where the
 cleanup of these resources makes sense.

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Fri, Dec 27, 2013 at 2:23 PM, William Slacum
 wilhelm.von.cl...@accumulo.net wrote:
  We need to actually define the usage pattern and lifetime of a
  ZooKeeperInstance. Looking at the code, it's really masking a singleton
  usage pattern. The resources backing a given set of zookeepers+timeout
 pair
  all share a ZooCache, and we hand-rolled reference counting for
  ZooKeeperInstances themselves. That indicates that a ZooKeeperInstance is
  basically a global variable, and we have to be careful about the
 resources
  it allocates, directly or indirectly, because their lifetimes are opaque
  from the perspective of the client.
 
  I'm a fan of the close method, because it puts, in code, how an instance
  tidies up after itself. We didn't have any cleanup before because the
  ZooCache for a given zookeeper+timeout lived on until the process died.
  Since the side effects of our API aren't documented or made clear to the
  client, it's on us to handle and manage them. Making it optional for a
 user
  is a benefit, because maybe they don't care and someone else (gc, another
  management thread) will call close() on the instance, or maybe they want
 to
  force a close at class unloading.
 
  The utility seems to be brute forcing shutdown; is it possible to get
  something finer grained for specific instances? Shutting down everything
  will handle the clean-up-at-unload-time issue, but not necessarily
  anything involving closing down a subset of ZooSessions.
 
 
 
  On Thu, Dec 26, 2013 at 2:48 PM, Sean Busbey bus...@clouderagovt.com
 wrote:
 
  On Dec 26, 2013 12:27 PM, Mike Drob md...@cloudera.com wrote:
  
   I'm willing to stipulate that this solves the thread leak from web
   containers - I haven't verified it, but I am ever hopeful. Does this
   solution imply that we should nix the close() methods just added in
 the
   snapshot branches?
  
  
 
  If we can verify that it solves the leaks for web containers, I would
 say
  yes.
 
  We can do proper life cycle for persistent state when we provide an
 updated
  client API.
 



Re: Resource leak warnings

2013-12-27 Thread William Slacum
We need to actually define the usage pattern and lifetime of a
ZooKeeperInstance. Looking at the code, it's really masking a singleton
usage pattern. The resources backing a given set of zookeepers+timeout pair
all share a ZooCache, and we hand-rolled reference counting for
ZooKeeperInstances themselves. That indicates that a ZooKeeperInstance is
basically a global variable, and we have to be careful about the resources
it allocates, directly or indirectly, because their lifetimes are opaque
from the perspective of the client.

I'm a fan of the close method, because it puts, in code, how an instance
tidies up after itself. We didn't have any cleanup before because the
ZooCache for a given zookeeper+timeout lived on until the process died.
Since the side effects of our API aren't documented or made clear to the
client, it's on us to handle and manage them. Making it optional for a user
is a benefit, because maybe they don't care and someone else (gc, another
management thread) will call close() on the instance, or maybe they want to
force a close at class unloading.

The utility seems to be brute forcing shutdown; is it possible to get
something finer grained for specific instances? Shutting down everything
will handle the clean-up-at-unload-time issue, but not necessarily
anything involving closing down a subset of ZooSessions.



On Thu, Dec 26, 2013 at 2:48 PM, Sean Busbey bus...@clouderagovt.com wrote:

 On Dec 26, 2013 12:27 PM, Mike Drob md...@cloudera.com wrote:
 
  I'm willing to stipulate that this solves the thread leak from web
  containers - I haven't verified it, but I am ever hopeful. Does this
  solution imply that we should nix the close() methods just added in the
  snapshot branches?
 
 

 If we can verify that it solves the leaks for web containers, I would say
 yes.

 We can do proper life cycle for persistent state when we provide an updated
 client API.



Re: Resource leak warnings

2013-12-23 Thread William Slacum
We're pretty clear on commit-then-review and lazy consensus, so I don't
really have an issue with regards to the commits.

That said, I still think ignoring the warnings is the best course of
action. I compiled with warnings on from the command line and don't see a
resource leak warning with Java 6. We voted not to use Java 7, so this
shouldn't be an issue until we make that move.

This is what I did to check if those warnings were present when building
from the command line. If this isn't sufficient, please let me know.

1) `git revert 335f693a4045d2c2501e2ed6ece0493734093143`
2) Added the following to the configuration block for the
maven-compiler-plugin:

  <compilerArgument>-Xlint:all</compilerArgument>
  <showWarnings>true</showWarnings>
  <showDeprecation>true</showDeprecation>
3) `mvn clean compile | grep -i leak`



On Sun, Dec 22, 2013 at 10:28 PM, Christopher ctubb...@apache.org wrote:

 On Sun, Dec 22, 2013 at 2:23 PM, Bill Havanki bhava...@clouderagovt.com
 wrote:
 [snip]
  Although there was no intention of circumventing consensus, looking at
 the
  email exchange, consensus was clearly not reached.

 It is my understanding that typically, in CtR, consensus is needed to
 resolve issues after they are committed, where there is
 conflict/objections. Perhaps it was my misunderstanding of the
 responses, but it was my understanding that while there was no
 consensus on the final solution, there was no objection that would
 have prevented the interim action taken.

  The short time span did
  not give others the chance to work on eliminating the warnings, as they
  offered, or to instead come around to just dropping Closeable.

 True... the timespan was short. My goal, as stated in the original
 email, was to commit first (just like I might commit any improvement
 to the current state of the code), and I intended the email to just be
 an explanation of the reasoning, as it related to the prior commits,
 and a prompt for discussion of further action. The fact that I
 submitted the email chronologically first was a bit arbitrary. I
 accept blame for the confusion of that, and any inciting wording the
 email may have caused... I probably could have prepped things a bit
 better... I have many personal lessons learned from this. :)

  Personally,
  I am ambivalent about it. In any event, -1923 now exists to
 comprehensively
  tackle the issue, and I eagerly welcome input and help on it.
 
  Removing Closeable did not undo all the work done, but it did undo some
 of
  it. It's OK to call it that. Sometimes undoing is fine. That part of the
  commit for -2010 is a minimal change. We all agree Closeable should be
  there eventually, which is more important. We'll get it back.

 undo or improve upon is probably a semantic difference... but
 yeah, my intent was to make it trivial to re-introduce if we decided
 it was best to keep it.

 However, I'm not sure we all agree that Closeable should be there
 eventually. I cannot speak for Keith Turner (hopefully, he'll chime in
 at some point), but he and I have discussed this a bit, and I get the
 distinct impression that he thinks it should not be there.

  I never saw any compiler warnings because I don't use Eclipse. I can
  appreciate wanting to kill annoying warnings, but it would have been
 better
  to tell Eclipse to STFU about them, until we could come around to
 resolving
  them. If and when we do introduce some pertinent bylaws, the
 peculiarities
  of an IDE should not drive them. Tools are there to help us, not tell us
  what to do.

 It's my understanding that these aren't Eclipse warnings, these are
 default JDK1.6 compiler warnings. I could be wrong here... they may
 need javac -Xlint:all, or some other flag, to show up. In any case,
 whether it is Eclipse, or FindBugs, or some other tool reporting
 potential problems, I'm not concerned about them for aesthetics... I'm
 concerned because they hint at potential areas of improvements or
 bugs, that we should inspect with due diligence, and when they become
 numerous, it's hard to actually tell the difference between a non-bug
 warning that we've ignored and an actual bug warning that we've not
 examined yet.

 In any case, the point is moot here, because even if it didn't produce
 a warning, the current implementation does not warrant giving
 incorrect information to the API consumer that it can/should be
 closed, in accordance with Closeable's semantics (as in the case of
 the currently broken MapReduce configuration code... See comment on
 ACCUMULO-1923, which affects our code, and any subclasses of the
 Input/OutputFormat). I would even go so far as to say that this
 warning actually reflects an API bug: Instance does not actually
 conform to Closeable's semantics... because it doesn't free resources
 held by Instance... it frees static resources held elsewhere, and that
 becomes obvious when we actually try to close it in accordance with
 the semantics of Closeable, so it shouldn't be 

Re: Resource leak warnings

2013-12-13 Thread William Slacum
Voting for #1.


On Fri, Dec 13, 2013 at 3:44 PM, Christopher ctubb...@apache.org wrote:

 What should we do about all these additional resource leak warnings
 added as a result of ACCUMULO-1984? (ACCUMULO-2010)

 As I see it, there's a few options:

 0. Revert the previous patch for ACCUMULO-1984
 1. Ignore them
 2. Suppress them
 3. Fix them
 4. Remove Closeable from the interface, but leave the close method

 I don't like the idea of reverting the patch.

 1 is not really an option for me, because they're creating noise
 that's getting in the way of me debugging stuff I'm working on.

 Given that by making the interface Closeable, we're in effect
 recommending that users close it, we should probably follow our own
 recommendation, so 2 is probably not a good idea, and 3 is
 probably better. I don't have time to go back and do 3, though.

 4 might be a good option, at least for 1.4.5-SNAPSHOT and
 1.5.1-SNAPSHOT, so we don't convey the idea (which represents a change
 in API semantics) that you *should* close the Instance. Rather, it
 conveys the idea that it's optional, which I think is more consistent
 with those previous versions, and is suitable for the vast majority of
 use cases.
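
 To illustrate what option 4 leaves callers with (the instance name, user, and
 password below are made up, and this assumes the close() from the snapshot
 branches), closing stays available but is clearly optional:

   import org.apache.accumulo.core.client.Connector;
   import org.apache.accumulo.core.client.ZooKeeperInstance;

   public class OptionalClose {
     public static void main(String[] args) throws Exception {
       ZooKeeperInstance inst = new ZooKeeperInstance("myInstance", "zkhost:2181");
       try {
         Connector conn = inst.getConnector("user", "secret".getBytes());
         // ... scans and writes with conn ...
       } finally {
         inst.close(); // harmless to call, but no longer advertised as required
       }
     }
   }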

 All of this is completely overshadowing the real issue, though, which
 is that the close method doesn't actually prevent the resources from
 being opened again. It's a superficial fix, that doesn't really
 enforce it. Our API looks like it's stateless, with factory methods...
 but it's not actually stateless. We can close the instance, but the
 resources that were left open aren't isolated to the instance... they
 are used inside the Connector and below. Closing the instance may free
 up resources, but it doesn't stop new ones from being opened again
 inside the connector and below. The problem is that the Instance
 object does not fully represent the resources used inside client code,
 so closing it is semantically unintuitive, incorrect, and functionally
 broken if not used in a very specific way.

 For the time being, I'm going to pursue option 4, so I can proceed
 with working on things I need to work on, without all the noise.

 Loosely related comments, but probably separate points for discussion:

 A. It'd be nice to require that contributions do not introduce
 compiler warnings (or malformed javadocs) before applying them.
 B. The option to revert is much harder to consider seriously when
 we're simultaneously developing 3 releases, because of the merge
 nightmare: you not only have to revert the patch, but also revert the
 merges, which is not a quick action, because it could result in
 conflicts. Reverting is much more daunting in this scenario. Merge
 windows might help, by providing scheduled times for merging work to a
 common branch, which means that reverts can be considered in a more
 timely manner, because we'll know that new code only shows up during a
 predictable window.

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii



Re: [accumulo-wikisearch] git workflow for accumulo wikisearch contrib

2013-12-06 Thread William Slacum
Regarding git workflows,
http://cdn.memegenerator.net/instances/500x/43613593.jpg


On Fri, Dec 6, 2013 at 6:28 PM, Josh Elser josh.el...@gmail.com wrote:

 I think Bill (ujustgotbi...@apache.org) is the component lead. He'd
 probably be a good start.

 But in all honesty, I don't know if anyone has really touched it in many
 months. You may be fine just making something that works.


 On 12/6/13, 6:10 PM, Sean Busbey wrote:

 Hi!

 I'm getting started on making sure we have a wikisearch example that can
 run on Accumulo 1.4.x versions[1].

 I was wondering who the likely maintainers for the wikisearch example are
 and if they intend to follow the same git workflow that the main project
 uses[2]?


 [1]: https://issues.apache.org/jira/browse/ACCUMULO-1977
 [2]: http://accumulo.apache.org/git.html




Re: Hadoop 2.0 Support for Accumulo 1.4 Branch

2013-11-12 Thread William Slacum
A user of 1.4.a should be able to move to 1.4.b without any major
infrastructure changes, such as swapping out HDFS or installing extra
add-ons.

I don't find much merit in debating local WAL vs HDFS WAL cost/benefit
since the only quantifiable evidence we have supported the move.

I should note, Sean, that if you see merit in the work, you don't need
community approval for forking and sharing. However, I do not think it is
in the community's best interest to continue to upgrade 1.4.



On Tue, Nov 12, 2013 at 2:12 PM, Josh Elser josh.el...@gmail.com wrote:


 Based on recent feedback on ACCUMULO-1792 and ACCUMULO-1795, I want to
 resurrect this thread to make sure everyone's concerns are addressed.

 For context, here's a link to the start of the last thread:

 http://bit.ly/1aPqKuH

  From ACCUMULO-1792, ctubbsii:

 I'd be reluctant to support any Hadoop 2.x support in the 1.4 release
 line that breaks compatibility with 0.20. I don't think breaking 0.20
 and then possibly fixing it again as a second step is acceptable (because
 that subsequent work may not ever be done, and I don't think
 we should break the compatibility contract that we've established with
 1.4.0).

 Chris, I believe keeping all of the work in a branch under the umbrella
 jira of ACCUMULO-1790 will ensure that we don't end up with a 1.4 release
 that doesn't have proper support for 0.20.203.

 Is there something beyond making sure the branch passes a full set of
 release tests on 0.20.203 that you'd like to see? In the event that the
 branch only ever contains the work for adding Hadoop 2, it's a simple
 matter to abandon without rolling into the 1.4 development line.

  From ACCUMULO-1795, bills (and +1ed by elserj and ctubbsii):

 I'm very uncomfortable with risking breaking continuity in such an old
 release, and I don't think managing two lines of 1.4 releases is
 worth the effort. Though we have no official EOL policy, 1.3 was
 practically dead in the water once 1.4 was around, and I hope we start
 encouraging more adoption of 1.5 (and soon 1.6) versus continually
 propping up 1.4.

 I'd love to get people to move off of 1.4. However, I think adding Hadoop
 2
 support to 1.4 encourages this more than leaving it out.


 I'm not sure I agree that adding Hadoop2 support to 1.4 encourages people
 to upgrade Accumulo. My gut reaction would be that it allows people to
 completely ignore Accumulo updates (ignoring moving to 1.4.5 which would
 allow them to do hadoop2 with your proposed changes)


  Accumulo 1.5.x places a higher burden on HDFS than 1.4 did, and I'm not
 surprised people find relying on 0.20 for the 1.5 WAL intimidating.
 Upgrading both HDFS and Accumulo across major versions at once is asking
 them to take on a bunch of risk. By adding in Hadoop 2 support to 1.4 we
 allow them to break the risk up into steps: they can upgrade HDFS versions
 first, get comfortable, then upgrade Accumulo to 1.5.


 Personally, maintaining 0.20 compatibility is not a big concern on my
 radar. If you're still running an 0.20 release, I'd *really* hope that you
 have an upgrade path to 1.2.x (if not 2.2.x) scheduled.

  I think claiming that 1.5 has a higher burden than 1.4 is a bit of a
  fallacy. There were many problems and pains regarding WALs in <=1.4 that
 are very difficult to work with in a large environment (try finding WALs in
 server failure cases). I think the increased I/O on HDFS is a much smaller
 cost than the completely different I/O path that the old loggers have.

 I also think upgrading Accumulo is much less scary than upgrading HDFS,
 but that's just me.

 To me, it seems like the argument may be coming down to whether or not we
 break 0.20 hadoop compatibility on a bug-fix release and how concerned we
 are about letting users lag behind the upstream development.


  I think the existing tickets under the umbrella of ACCUMULO-1790 should
 ensure that we end up with a single 1.4 line that can work with either the
 existing 0.20.203.0 claimed in releases or against 2.2.0.

 Bill (or Josh or Chris), is there stronger language you'd like to see
 around docs / packaging (area #3 in the original plan and currently
 ACCUMULO-1796)? Maybe expressly only doing a binary convenience package
 for
 0.20.203.0? Are you looking for something beyond a full release suite to
 ensure 1.4 is still maintaining compatibility on Hadoop 0.20.203?


 Again, my biggest concern here is not following our own guidelines of
 breaking changes across minor releases, but I'd hope 0.20 users have an
 upgrade path outlined for themselves.



Re: Hadoop 2.0 Support for Accumulo 1.4 Branch

2013-11-12 Thread William Slacum
The language of ACCUMULO-1795 indicated that an acceptable state was
something that wasn't binary compatible. That's my #1 thing to avoid.

 Maybe expressly only doing a binary convenience package for
 0.20.203.0?

If we need an extra package, doesn't that mean a user can't just upgrade
Accumulo?

As a side note, 0.20.203.0 is 1.4,

On Tue, Nov 12, 2013 at 3:28 PM, Sean Busbey busbey...@clouderagovt.com wrote:

 On Tue, Nov 12, 2013 at 1:28 PM, William Slacum 
 wilhelm.von.cl...@accumulo.net wrote:

  A user of 1.4.a should be able to move to 1.4.b without any major
  infrastructure changes, such as swapping out HDFS or installing extra
  add-ons.
 
 

 Right, exactly. Hopefully no part of the original plan contradicts this. Is
 there something that appears to?


 --
 Sean



Re: Accumulo Community Meeting Notes from Strata NYC

2013-11-09 Thread William Slacum
Thanks, Drew!


On Thu, Nov 7, 2013 at 10:22 PM, Drew Farris d...@apache.org wrote:

 On October 29, a number of people got together prior to the Accumulo Meetup
 to present the work they've done with Accumulo and discuss a number of
 other topics.

 In the interests of tracking off-list discussions I've finally managed to
 get around to getting the notes from these sessions into markdown:


 https://github.com/drewfarris/strata2013nyc/blob/master/AccumuloCommunityMeetingsNotes.md

 If anyone has anything to add or revise, feel free to fork the repo and
 issue a pull request.

 Drew



Re: [VOTE] add mvn dependency:analyze to release process

2013-11-08 Thread William Slacum
+1


On Fri, Nov 8, 2013 at 1:45 PM, Josh Elser josh.el...@gmail.com wrote:

 +1


 On 11/8/13, 1:35 PM, Billie Rinaldi wrote:

 I would like to add a dependency clean up step (which can be assisted by
 running mvn dependency:analyze) to our release process for major and minor
 releases, to make sure direct dependencies are declared and any stale
 dependencies are removed.

 This vote will end in 72 hours.

 [ ] +1 Add dependency clean up step
 [ ] +0
 [ ] -1 Do not add dependency clean up because ...




Re: [DISCUSS] Hadoop 2 and Accumulo 1.6.0

2013-10-23 Thread William Slacum
There wasn't any discussion in those tickets as to what Hadoop 2 provides
Accumulo. If we're going to still support 1, then any new features only
possible with 2 have to become optional until we ditch support for 1. Is
there anything people have in mind, feature-wise, that Hadoop 2 would help
with?


On Wed, Oct 23, 2013 at 7:05 PM, Josh Elser josh.el...@gmail.com wrote:

 To ensure that we get broader community interaction than only on a Jira
 issue [1], I want to get community feedback about the version of Hadoop
 which the default, deployed Accumulo artifacts will be compiled against.

 Currently, Accumulo builds against a Hadoop-1 series release
 (1.5.1-SNAPSHOT and 1.6.0-SNAPSHOT build against 1.2.1, and 1.5.0 builds
 against 1.0.4). Last week, the Apache Hadoop community voted to release
 2.2.0 as GA (general availability) -- in other words, the Apache Hadoop
 community is calling Hadoop-2.2.0 stable.

 As has been discussed across various issues on Jira, this means a few
 different things for Accumulo. Most importantly, this serves as a
 recommendation by us that users should be trying to use Hadoop-2.2.0 with
 Accumulo 1.6.0. This does *not* mean that we do not support Hadoop1 ([2]
 1.2.1 specifically). Hadoop-1 support would still be guaranteed by us for
 1.6.0.

 - Josh

 [1] https://issues.apache.org/jira/browse/ACCUMULO-1419
 [2] https://issues.apache.org/jira/browse/ACCUMULO-1643



Re: [VOTE] 1.6.0 Feature freeze.

2013-09-28 Thread William Slacum
Plus One


On Fri, Sep 27, 2013 at 5:02 PM, Mike Drob md...@mdrob.com wrote:

 +1


 On Fri, Sep 27, 2013 at 4:02 PM, Brian Loss bfl...@praxiseng.com wrote:

  +1
 
  On Sep 27, 2013, at 1:39 PM, John Vines vi...@apache.org
   wrote:
 
    Please vote on a feature freeze date of Nov 1 23:59 PDT for the master
    branch.  Shortly after this time we will branch 1.6.0-SNAPSHOT from master
    and increment the version in master.  Feature Freeze means only bug fixes
    and documentation updates happen after the date, which implies major code
    additions and changes are already in place with appropriate tests.

    If a committer thinks a new feature in 1.6.0-SNAPSHOT is not ready for
    release, they should bring it up on the dev list.  If agreement can not be
    reached on the dev list within 72 hours, then the committer can call for a
    vote on reverting the feature from 1.6.0-SNAPSHOT.  The vote must pass with
    majority approval[1].  If the vote passes, any committer can revert the
    feature from 1.6.0-SNAPSHOT.

    This vote will remain open for 72 hours and must have consensus approval[2]
    to pass.
 
 



Re: How do I use scan

2013-08-19 Thread William Slacum
You could use an indexing strategy such as a term index or a sharded index.
I know there's an example for the sharded index packaged with Accumulo.
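
As a rough sketch of the term-index approach (the table name, column layout,
and helper names here are all made up for illustration), the idea is to write
one index entry per phone number alongside the record table, then answer a
phone-number lookup with a single-row scan of the index:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Map.Entry;

  import org.apache.accumulo.core.client.BatchWriter;
  import org.apache.accumulo.core.client.Connector;
  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Mutation;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.security.Authorizations;
  import org.apache.hadoop.io.Text;

  public class PhoneTermIndex {

    // Index side: row = phone number, cf = "callingPhone" or "calledPhone",
    // cq = cdr-id, value = empty. One entry per phone number per record.
    static void writeIndexEntry(Connector conn, String phone, String role, String cdrId)
        throws Exception {
      BatchWriter bw = conn.createBatchWriter("cdrIndex", 1000000L, 1000L, 2);
      Mutation m = new Mutation(new Text(phone));
      m.put(new Text(role), new Text(cdrId), new Value(new byte[0]));
      bw.addMutation(m);
      bw.close();
    }

    // Query side: all cdr-ids for a phone number come back from one row of the
    // index; each id is then the row to fetch from the record table.
    static List<String> lookup(Connector conn, String phone) throws Exception {
      Scanner s = conn.createScanner("cdrIndex", new Authorizations());
      s.setRange(new Range(new Text(phone)));
      List<String> cdrIds = new ArrayList<String>();
      for (Entry<Key,Value> e : s)
        cdrIds.add(e.getKey().getColumnQualifier().toString());
      return cdrIds;
    }
  }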


On Mon, Aug 19, 2013 at 4:28 PM, Richard DeVita rdev...@us.ibm.com wrote:

 I have Accumulo version 1.4.3

 I wrote a java program to create an accumulo table from a csv file of call
 data records

 columns in the csv file are:  callingPhone, calledPhone, startTime, crd-id
 The crd-id is unique. There are multiple records for callingPhone and
 calledPhone


 Created a table with :

 row ID  = crd-id
 family = attribute
 qualifiers are  callingPhone,  calledPhone, startTime
 value is the  phone number or start time

 each line in the csv file has three records in the accumulo table

 I have a java program that reads the table.  How do I retrieve the records
 for a specific phone number, that is, all records where callingPhone = 123 456 7890,
 and get all three parts?

 Thank You






 --
 View this message in context:
 http://apache-accumulo.1065345.n5.nabble.com/How-do-I-use-scan-tp5152.html
 Sent from the Developers mailing list archive at Nabble.com.



Re: github mirror

2013-08-01 Thread William Slacum
IIRC I don't believe a process is actually in place to accept pull requests
off the chub. I'm open to being corrected by someone with better info,
however.


On Thu, Aug 1, 2013 at 5:17 PM, Michael Berman mber...@sqrrl.com wrote:

 Oh, actually I was looking at the wrong branch.  It's only a month old, not
 a year old...but my question remains.


 On Thu, Aug 1, 2013 at 5:15 PM, Michael Berman mber...@sqrrl.com wrote:

  Does anyone know what the process is for getting the github mirror
 pointed
  to the real git repo? Right now it's pointing at a year-old revision off
  SVN.  I'd like to fork a branch for my SSL work so I can submit pull
  requests back, but it doesn't seem like there's a point if it's so far
  diverged from master.
 



Re: Java 6 EOLed

2013-06-20 Thread William Slacum
I think in the discussion previously, someone (John Vines?) mentioned
RedHat was picking up the slack.


On Thu, Jun 20, 2013 at 4:43 PM, Michael Allen mich...@sqrrl.com wrote:

 Here's another data point in the move to Java 7 debate: Oracle apparently
 just EOLed Java 6.  Read the Slashdot article here:

 http://developers.slashdot.org/story/13/06/20/1819245/java-6-eold-by-oracle

 While I realize many many many users of Accumulo will continue to use Java
 6, they now will do so at their increasing peril.

 - Mike



Re: Is C++ code still part of 1.5 release?

2013-05-17 Thread William Slacum
I think of the native maps as an add-on and they should probably be treated
as such. I think we should consider building a different package and
installing them separately. Personally, for development and testing, I
don't use them.

Since we're building RPMs and debian packages, the steps to install an
add-on are roughly 20 keystrokes.


On Fri, May 17, 2013 at 2:22 PM, Josh Elser josh.el...@gmail.com wrote:

 I believe I already voiced my opinion on this, but let me restate it since
 the conversation is happening again.

 Bundling the native library built with a common library is easiest and I
 believe makes the most sense. My opinion is that source files should be
 included in a source release and that a bin release doesn't include source
 files. Since we're specifically making this distinction by making these
 releases, it doesn't make sense to me why we would decide oh, well in this
 one case, the bin dist will actually have _some_ src files too.

 Is it not intuitive that if people need to rebuild something, that they
 download a src dist (and bin) to start? :shrug:


 On 5/17/13 2:04 PM, Adam Fuchs wrote:

 Chris,

 I like the idea of including the most widely used library, but empirical
 evidence tells me that roughly half of the users of Accumulo will still
 need to compile/recompile to get native map support. There is no reason
 not
 to make that as easy as possible by including the cpp code in the
 -bin.tar.gz -- at least I haven't heard a reason not to do that yet.

 Adam



 On Fri, May 17, 2013 at 11:53 AM, Christopher ctubb...@apache.org
 wrote:

  Adam, I didn't make any changes on this, because there were only a few
 opinions, and it didn't seem like there was a consensus. I can make
 this change, though, if a consensus is established. It's very small,
 and easy to do.

 Billie, any of those options would work. I'm not sure we need to
 recommend a particular one over the other, as long as users know how
 to get there.

 An option that Keith and I were discussing is possibly packaging
 against glibc-2.5 by default, which should reduce the impact on people
 using RHEL/CentOS 5, but should still work for RHEL/CentOS 6 or
 anything newer (though they may have to install compat-glibc-2.5). I'm
 not sure the appropriate modifications to make to get this to work,
 though.

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii


 On Fri, May 17, 2013 at 10:49 AM, Billie Rinaldi
 billie.rina...@gmail.com wrote:

 On Fri, May 17, 2013 at 7:26 AM, Adam Fuchs afu...@apache.org wrote:

  Folks,

 Sorry to be late to the party, but did we come to a consensus on this?
 Seems like we still have opinions both ways as to whether the cpp code
 should be packaged with the binary distribution. I would argue that cpp
 code is a special case, since the build is so platform dependent. It's
 generally hard to distribute the right .so files to cover all
 platforms,
 and we have run into many cases in practice where the native maps don't
 work out of the box. While downloading the source and untarring it over the
 same directory is not too much extra work,



 I'm neutral on whether the source files should be included in the binary
 artifacts.  However, I wanted to point out that it sounds like untarring
 the source over binaries is not the recommended procedure.  So what is the
 recommended procedure?  Untar the source, navigate to the c++ directory,
 build, and drop the resulting .so file into an existing binary
 installation?  Or just build your own binary tarball from source?

 Billie


 it seems like the only argument

 not to package the native source code with the binary distribution is a
 dogmatic one. Are there any practical reasons why it would be bad to
 add
 the cpp file to the bin distribution?


  Adam




 On Mon, May 13, 2013 at 10:48 PM, Eric Newton eric.new...@gmail.com
 wrote:

  Rumor has it that one of the core developers is irrationally hostile

 to

 perl.

 And octal.

 And xml.

 He's just old and cranky.

 -Eric


 On Mon, May 13, 2013 at 5:29 PM, David Medinets 

 david.medin...@gmail.com

 wrote:


  How come perl is getting no love?


 On Mon, May 13, 2013 at 10:40 AM, Josh Elser josh.el...@gmail.com

 wrote:


  On 5/12/13 11:45 PM, Christopher wrote:

  1) we don't need to include java bindings for the proxy; compiled
  versions are already in the proxy jar,
  2) not all packagers will even have installed thrift with the ability
  to produce ruby and python bindings,
  3) these may or may not be helpful to any particular end user (though
  it's probably safe to assume ruby and python will be the most common),
  4) we're not including the proxy.thrift file, which is perhaps the
  most important file for the proxy, and including it should be
  sufficient.


  1) That works. I should've caught that when I was in the proxy last
  and I didn't. Thanks for that.
  2) Do you mean packagers as in people who might make an official
  release?
  I would think these are the

Re: Is C++ code still part of 1.5 release?

2013-05-17 Thread William Slacum
 to solidify the decision that Chris is already leaning towards, let me
 try to clarify my position:
 1. The only reason not to add the native library source code in the
 -bin.tar.gz distribution is that src != bin. There is no measurable
 negative effect of putting the cpp files and Makefile into the
 -bin.tar.gz.
 2. At least one person wants the native library source code in the
 -bin.tar.gz to make their life easier.

 This is a very simple decision. It really doesn't matter how easy it is
 to include prebuilt native code in some other way or build the code and
 copy it in using some other method. Those are all tangential arguments.

 Adam




 On Fri, May 17, 2013 at 2:49 PM, William Slacum
 wilhelm.von.cl...@accumulo.net wrote:

  I think of the native maps as an add-on and they should probably be
  treated as such. I think we should consider building a different package
  and installing them separately. Personally, for development and testing,
  I don't use them.

  Since we're building RPMs and debian packages, the steps to install an
  add-on are roughly 20 keystrokes.


  On Fri, May 17, 2013 at 2:22 PM, Josh Elser josh.el...@gmail.com wrote:


   I believe I already voiced my opinion on this, but let me restate it
   since the conversation is happening again.

   Bundling the native library built with a common library is easiest and I
   believe makes the most sense. My opinion is that source files should be
   included in a source release and that a bin release doesn't include
   source files. Since we're specifically making this distinction by making
   these releases, it doesn't make sense to me why we would decide oh, well
   in this one case, the bin dist will actually have _some_ src files too.

   Is it not intuitive that if people need to rebuild something, that they
   download a src dist (and bin) to start? :shrug:



Re: peformance

2013-05-03 Thread William Slacum
Does sqrrl provide an example framework to play around with?


On Fri, May 3, 2013 at 2:20 PM, Adam Fuchs afu...@apache.org wrote:

 Hey Drew,

 This could be a very broad question, so I'll give a partial answer and
 encourage you to come back for more details.

  Impala is a mechanism that sits on top of HBase or HDFS that is designed to
 filter and process large quantities of data. People generally like Impala
 because it supports a subset of SQL and because it is optimized to reduce
 the latency that might be incurred by starting up a job in a bulk
 synchronous processing framework. Instead, it uses a series of daemon
 processes and a custom API to reduce overhead.

 With Accumulo, our approach to low-latency queries is generally to use a
 table structure that incorporates some type of index. With appropriate
 indexing techniques, Accumulo can achieve sub-second query latencies even
 over multi-petabyte sized corpuses. Some of these table designs are
 described in the manual:
 http://accumulo.apache.org/1.4/user_manual/Table_Design.html

 Regarding the SQL piece, Accumulo does not natively support an SQL
 interface. For that you would need to wrap it in a processing framework,
 like Hive (https://issues.apache.org/jira/browse/ACCUMULO-143). To make a
 shameless plug, Sqrrl (www.sqrrl.com) also offers that functionality.

 Cheers,
 Adam



 On Fri, May 3, 2013 at 12:39 PM, Drew Pierce drewpie...@live.com wrote:

  does anyone have any anecdotal results (nothing formal) for queries to
  speak to the likes of impala and near low-latency.
  Sent from my Android
 
  Sorry if brief
 
 



Re: JIRA Patch Conventions

2013-04-24 Thread William Slacum
Leave the tickets on there. I'm not trying to romance you Mike, I want more
history and less mystery.


On Wed, Apr 24, 2013 at 11:22 AM, Corey Nolet cno...@texeltek.com wrote:

 #2 as well.


 On Wed, Apr 24, 2013 at 11:08 AM, John Vines vi...@apache.org wrote:

  I too am in favor of the patch history being available.
 
 
  On Wed, Apr 24, 2013 at 11:07 AM, Billie Rinaldi
  billie.rina...@gmail.comwrote:
 
   I like #2 as well. Here's a quote from the incubator list confirming
 that
   we don't need ICLAs for patches.
  
Under the terms of the AL, any contribution made back to the ASF on
ASF infrastructure, such as via a mailing list, JIRA, or Bugzilla, is
licensed to the foundation. The JIRA checkbox existed to give people
an easy way to _avoid_ contributing something. There is no need to
 ask
casual patchers for ICLAs.
   On Apr 24, 2013 10:05 AM, Josh Elser josh.el...@gmail.com wrote:
  
   
On 4/24/13 9:32 AM, Keith Turner wrote:
   
On Tue, Apr 23, 2013 at 11:51 PM, Mike Drob md...@mdrob.com wrote:

 Accumulo Devs,

 Are there any conventions that we'd like to follow for attaching updated
 patches to issues? There are two lines of thought applicable here:

 1) Remove the old one and attach the new patch. This has the advantage of
 being immediately obvious to future google searchers what the patch was,
 especially in case of back porting issues.
 2) Leave all patches attached to the ticket, and use a one-up identifier
 for each subsequent patch. This preserves context from comments, and might
 be useful in other ways.

 I've seen both approaches used on Accumulo tickets, and don't have a strong
 preference outside of a desire for consistency. I think I'd lean towards
 option #2, if only because that means I get one fewer email notification.

I agree I would like consistency.   I lean towards 2 also, but I do not
have a good reason, its just my preference.  We should probably put
together a page outlining how to submit a patch.  I have seen other
projects do this.

Ditto.

 As an aside, what is the IP status of submitted patches? I think I remember
 hearing that they immediately become part of the Apache Foundation, so
 removing them might be a bad idea from that perspective.

 Does someone who is submitting patches need to submit an ICLA?

I believe they just need to be capable of assigning the copyright to the
ASF (as in, an employer does not hold rights to the patch). I believe the
ICLA is more for the case of a committer being able to use SVN (and not
having the jira checkbox).

 Mike
   
   
   
  
 



 --
 Corey Nolet
 Senior Software Engineer
 TexelTek, inc.
 [Office] 301.880.7123
 [Cell] 410-903-2110



Re: [VOTE] release 1.4.3?

2013-03-10 Thread William Slacum
+1 for a 1.4.3

On Sun, Mar 10, 2013 at 6:21 PM, Brian Loss bfl...@praxiseng.com wrote:

 +1

 On Mar 9, 2013, at 8:14 PM, Josh Elser josh.el...@gmail.com wrote:

  Ditto. In favor. I can help with the release process, as well.
 
  On 03/08/2013 02:50 PM, John Vines wrote:
  Looking over the tickets, I'm strongly in favor of a 1.4.3 release.
 
 
  On Fri, Mar 8, 2013 at 2:42 PM, Eric Newton eric.new...@gmail.com
 wrote:
 
  Here you go:
 
 
 
 https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=truejqlQuery=project+%3D+ACCUMULO+AND+fixVersion+%3D+%221.4.3%22+ORDER+BY+priority+DESC
 
 
  On Fri, Mar 8, 2013 at 12:48 PM, John Vines vi...@apache.org wrote:
 
  What's the best way to do a view of the potential changelog?
 
 
  On Fri, Mar 8, 2013 at 11:29 AM, Eric Newton eric.new...@gmail.com
  wrote:
 
  Putting out a release and testing it requires significant effort.  It
  will
  delay the 1.5.0 release unless we can get some additional resources
 to
  perform the work in parallel.
 
  I'm for releasing 1.4.3 because I'm supporting customers using 1.4.2
  with
  patches, which is not ideal.
 
  If you vote +1, consider moving the process ahead by volunteering
 your
  time
  to build and test the release candidates.
 
  -Eric
 
 
  On Fri, Mar 8, 2013 at 11:15 AM, Christopher ctubb...@apache.org
  wrote:
  It might be useful to just roll out a release candidate (because
 that
  could really be done at any time), and vote on that, rather than
 vote
  twice, once on the idea of releasing, then again on the release
  candidate.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Mar 8, 2013 at 11:13 AM, Christopher ctubb...@apache.org
  wrote:
  +1 for putting together a 1.4.3 release.
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Fri, Mar 8, 2013 at 10:56 AM, Eric Newton 
 eric.new...@gmail.com
  wrote:
  There has been some offline discussion of releasing 1.4.3.  This
  discussion
  needs to be held with the full team.
 
  Please discuss and vote.
 
  -Eric
 
 




Re: LICENSE and NOTICE

2013-02-13 Thread William Slacum
We were so close to "Good news, everyone!"

On Wed, Feb 13, 2013 at 6:02 PM, Keith Turner ke...@deenlo.com wrote:

 Thats awesome.  I remember when were initially constructing these
 files we were trying to figure this out.  We looked at what other
 Apache projects did and could not find a clear pattern, different
 projects did different things.  It was so confusing.

 On Wed, Feb 13, 2013 at 5:47 PM, Billie Rinaldi bil...@apache.org wrote:
  Exciting news!  There is now a detailed guide on what needs to go in the
  LICENSE and NOTICE files: http://www.apache.org/dev/licensing-howto.html
 .
 
  Billie



Re: Add Damon Brown to contributors list

2013-02-12 Thread William Slacum
Thanks, Damon!

On Tue, Feb 12, 2013 at 1:18 PM, Keith Turner ke...@deenlo.com wrote:

 Damon,

 Thanks for your recent patches.  I am going to add you to the
 contributors list on the web page.   If you would like an org and
 timezone listed also, just shoot me an email

 Keith



Re: ACCUMULO-958 - Pluggable encryption in walogs

2013-01-30 Thread William Slacum
Bottom line, the patch adds no value for general users. The idea of
pushing back a release date to stuff in non-functional code is very bad
practice. It sets a precedent for not considering alternative approaches
while simultaneously having no justification for choosing the approach we
did. If a specific customer/group/person wants a feature, and that feature
does not exist yet, the code is freely available to be modified,
distributed and open to public review. Adam, I strongly disagree that
forking the code is bad, considering the progress that other projects make
specifically because they have experimental forks (HBase).

On Wed, Jan 30, 2013 at 10:40 AM, Adam Fuchs afu...@apache.org wrote:

 Let me attempt to make another argument for why the 958 patch should be
 included in 1.5.0. What this patch represents is not an encryption solution
 for WAL, but an experimental extension point that will be used for building
 an encryption solution as a pluggable module. We need to judge its merit
 based on whether it is a successful experimental extension point or not.
 There are three main reasons for including the patch in 1.5.0:
 1. Test the performance impact of the null cipher solution (default
 configuration) in all the performance tests we will be running for the
 1.5.0 release. If it causes problems there then we can roll it back.
 2. Enable the use of this extension after 1.5 is released. External
 experiments have dependencies on this extension point. Without the
 extension point we will have to test with unreleased versions of Accumulo,
 which would be less than ideal.
 3. It is not harmful and somebody wants it. The reason for wanting this
 code in is well documented, so you need a very strong reason to throw it
 out. Otherwise you will encourage forking of the project (which would be
 bad).

 Adam




 On Wed, Jan 30, 2013 at 10:09 AM, Eric Newton eric.new...@gmail.com
 wrote:

  Some comments about the comments in ACCUMULO-958:
 
  Josh writes:
 
  We still have the ability to review this even after the feature freeze
  happens, it's just frustrating from my point of view in generating the best
  1.5.0 candidate possible (we tend to go through x.y.0 releases pretty darn
  quick).

  John writes:

  Yes, but we get stuck on x.y.* for a year or so, so it does become a race
  to get all the features you want to see in the next year.

  As Accumulo matures, we will need to start thinking a little more flexibly
  about what goes into minor releases.  We have implemented new (small)
  features in minor releases before.

  I would have no problem including ACCUMULO-958 into 1.5.1 after a test
  phase, and after some basic experience with the feature.  However I'm very
  uncomfortable including this in 1.5.0 because there is not a single test,
  and no real implementation behind the factory that anyone would use In Real
  Life.  Is this an appropriate API?  I have no idea.  Comments in the code
  about the stability of the interface basically admit that the author isn't
  completely comfortable with it, either.

  Let's not rush it, and when it is done right, I'm all for putting it into
  the next release.  For now, I would hold back incorporating these changes
  until they are more fully implemented. After we branch 1.5, commit this to
  trunk, and back-port it to the 1.5 branch when experience and tests show it
  is ready to be released.
 
  -Eric
 
 
 
  On Wed, Jan 30, 2013 at 9:13 AM, Josh Elser josh.el...@gmail.com
 wrote:
 
   All,
  
   It's been a few days and I haven't seen much chatter at all on
   ACCUMULO-958 [1] since the patch was applied. There are a couple of
   concerns I have that I definitely want to see addressed before a 1.5.0
   release.
  
    - It worries me that the provided patch is fail-open (when we can't load
    the configured encryption strategies/modules, we don't decrypt anything). I
    think for a security-minded database, we should probably be defaulting to
    fail-close; but, that brings up an issue, what happens when we can't
    encrypt a WAL? Do minor compactions fail gracefully? What does Accumulo do?

    - John said he had been reviewing the patch before he applied it; it
    bothers me that there was a version of this patch that had been reviewed
    privately for some amount of time when we had already pushed back the
    feature freeze date by a week waiting for features that weren't done.

    - The author noted himself with the deprecation of the CryptoModule
    interface that "we anticipate changing [this] in non-backwards compatible
    ways as we explore requirements for encryption in Accumulo." This tells
    me that implementation of WAL encryption overall hasn't been properly
    thought out.

    Given all of this, it gives me great pause to knowingly include this patch
    into a 1.5.0 release. I see no signs that this has been truly thought out,
    there is no default provided encryption strategy for 1.5.0 with this
  

Re: Accumulo 1.6 and beyond feature summit

2013-01-29 Thread William Slacum
I gave this a bit of thought too, and I think the easiest thing is to break
the interface and wrap all instances of non-closable iterators in a
closable one. That way we can delegate close down to the sources like
deepCopy does. I think Josh created a ticket for this; if not I will so we
don't derail this.
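
For concreteness, one shape the wrapper could take (ClosingWrapper is a
made-up name and this is only a sketch of the idea, not something in the
codebase): it forwards every SKVI call and only adds a close() that is passed
down to the source when the source knows how to clean up.

  import java.io.Closeable;
  import java.io.IOException;
  import java.util.Collection;
  import java.util.Map;

  import org.apache.accumulo.core.data.ByteSequence;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Range;
  import org.apache.accumulo.core.data.Value;
  import org.apache.accumulo.core.iterators.IteratorEnvironment;
  import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

  public class ClosingWrapper implements SortedKeyValueIterator<Key,Value>, Closeable {
    private final SortedKeyValueIterator<Key,Value> source;

    public ClosingWrapper(SortedKeyValueIterator<Key,Value> source) {
      this.source = source;
    }

    public void init(SortedKeyValueIterator<Key,Value> src, Map<String,String> options,
        IteratorEnvironment env) throws IOException {
      source.init(src, options, env);
    }

    public boolean hasTop() { return source.hasTop(); }

    public void next() throws IOException { source.next(); }

    public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive)
        throws IOException {
      source.seek(range, columnFamilies, inclusive);
    }

    public Key getTopKey() { return source.getTopKey(); }

    public Value getTopValue() { return source.getTopValue(); }

    public SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env) {
      return new ClosingWrapper(source.deepCopy(env));
    }

    // close() is delegated down the stack the same way deepCopy() is: if the
    // wrapped iterator can clean up, let it; otherwise this is a no-op.
    public void close() throws IOException {
      if (source instanceof Closeable)
        ((Closeable) source).close();
    }
  }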

Also, in regards to the doodle thing, are we trying to set up like a cam
show or something? Personally I don't see the issue with us just listing
stuff here and having a discussion about it.

On Tue, Jan 29, 2013 at 12:15 PM, Keith Turner ke...@deenlo.com wrote:

 On Mon, Jan 28, 2013 at 7:12 PM, William Slacum
 wilhelm.von.cl...@accumulo.net wrote:
  I'd like to see:
 
  - Data triggers on insertion
  - REST interface for looking up ranges of keys
  - A DSL or some other interpreted language for crafting iterators
- there's the clojure iterator, but something like python (via jython)
 or
  javascript (via rhino) would be more adoptable
  - Adding a clean up hook to iterators

 I was thinking about this.   If we added a close() method to the SKVI
 interface then it would break existing iterators.  Another option
  would be to support closing iterators that implement Closeable.  So if
  an iterator is an instanceof Closeable then the framework could
  close it when it's finished with the iterator.   I wish there had been
 a 1.5 ticket for this, I think it would have been fairly simple to
 implement.

  - Allowing iterators to launch connections to other services (caching,
  other tservers) to retrieve or write data
  - Merging of the batch scanner and scanner implementations
- a batch scanner with 1 thread has the same behavior as a scanner
- scanners have a close() method on them
  - Adding some builder interface for creating and introspecting iterator
  stacks
  - Clients being able to scan to specific keys using the scan command



Re: Accumulo 1.6 and beyond feature summit

2013-01-28 Thread William Slacum
I'd like to see:

- Data triggers on insertion
- REST interface for looking up ranges of keys
- A DSL or some other interpreted language for crafting iterators
  - there's the clojure iterator, but something like python (via jython) or
javascript (via rhino) would be more adoptable
- Adding a clean up hook to iterators
- Allowing iterators to launch connections to other services (caching,
other tservers) to retrieve or write data
- Merging of the batch scanner and scanner implementations
  - a batch scanner with 1 thread has the same behavior as a scanner
  - scanners have a close() method on them
- Adding some builder interface for creating and introspecting iterator
stacks
- Clients being able to scan to specific keys using the scan command


Re: Accumulo 1.6 and beyond feature summit

2013-01-28 Thread William Slacum
Currently it's not recommended to launch a batch scanner from an iterator
and retrieve new information, due to the possibility of a deadlock. Other
services may alleviate that concern, but due to lifecycle management issues
(related to adding a clean up hook to iterators), it's not foolproof
to clean up connections from it.

On Mon, Jan 28, 2013 at 7:21 PM, Dave Marion dlmar...@comcast.net wrote:

 - Allowing iterators to launch connections to other services (caching,
 other tservers) to retrieve or write data

   What does allow mean in this context? I don't think its disallowed (I
 know
 of an iterator that does this).

 -Original Message-
 From: William Slacum [mailto:wilhelm.von.cl...@accumulo.net]
 Sent: Monday, January 28, 2013 7:13 PM
 To: dev@accumulo.apache.org
 Subject: Re: Accumulo 1.6 and beyond feature summit

 I'd like to see:

 - Data triggers on insertion
 - REST interface for looking up ranges of keys
 - A DSL or some other interpreted language for crafting iterators
   - there's the clojure iterator, but something like python (via jython) or
 javascript (via rhino) would be more adoptable
 - Adding a clean up hook to iterators
 - Allowing iterators to launch connections to other services (caching,
 other
 tservers) to retrieve or write data
 - Merging of the batch scanner and scanner implementations
   - a batch scanner with 1 thread have the same behavior as a scanner
   - scanners have a close() method on them
 - Adding some builder interface for creating and introspecting iterator
 stacks
 - Clients being able to scan to specific keys using the scan command




Re: Contributing Organizations

2013-01-03 Thread William Slacum
I support it and a PMC vote.

On Wed, Jan 2, 2013 at 6:42 PM, Dave Marion dlmar...@comcast.net wrote:

 I see 3 proponents and 0 opponents of this idea. Can we put it to a vote?

 Dave

 -Original Message-
 From: Dave Marion [mailto:dlmar...@comcast.net]
 Sent: Wednesday, December 19, 2012 6:30 PM
 To: dev@accumulo.apache.org
 Subject: Re: Contributing Organizations

 +1

 Dave Marion


 Sent from my Motorola ATRIX™ 4G on ATT

 -Original message-
 From: Christopher Tubbs ctubb...@gmail.com
 To: dev@accumulo.apache.org
 Sent: Wed, Dec 19, 2012 22:41:07 GMT+00:00
 Subject: Contributing Organizations

 All-

 Many other projects list the organizations where their developers /
 contributors are from.
 See, for example:

 http://zookeeper.apache.org/credits.html
 http://hadoop.apache.org/who.html
 http://gora.apache.org/credits.html

 We can, and probably should, do the same, if this is agreeable to a
 sufficient number of us. (If we do this, it should probably be understood
 that it is a voluntary extra column on our page, and that it is the
 responsibility of committers to add themselves.)

 For reference, our current credits page looks like this:
 http://accumulo.apache.org/people.html

 --
 Christopher L Tubbs II
 http://gravatar.com/ctubbsii





Re: ingest performance oscillations and Xceivers

2013-01-03 Thread William Slacum
Have you also been tracking compactions? Did you have a query load?


On Wed, Jan 2, 2013 at 7:25 PM, Kepner, Jeremy - 0553 - MITLL 
kep...@ll.mit.edu wrote:

 Hmmm, that's interesting, because in the past I didn't see this behavior.
  It might be worth having someone look into because it seems to have a 2x
 impact on sustained ingest.

 Regards.  -Jeremy

 On Jan 2, 2013, at 2:23 PM, Keith Turner wrote:

  On Wed, Jan 2, 2013 at 2:11 PM, Jeremy Kepner kep...@ll.mit.edu wrote:
  So what mechanism causes the number of Xceivers to increase?
 
  Its been a while since I looked at the data node source code.   When I
  last look at it an Xceiver was just a thread created to handle a
  datanode request.   The thread went away after the request was
  processed.   So major and minor compactions running would cause more
  Xceivers to be created to read and write data.
 
  Newer datanode code may use a thread pool instead of creating a
  thread/xceiver for each request.   I am not sure.
 
  I am carefully controlling the number of ingestors and the data isn't
 varying too much.
  I would expect the number of Xceivers to remain constant.
 
  Regards.  -Jeremy
 
  On Tue, Jan 01, 2013 at 09:45:20PM -0500, Eric Newton wrote:
  Hey Jeremy,
 
  Can you compare the ingest rate to the number of tablets, too?
 
  I've found, that if I have 20-80 tablets per server (on similar
 hardware) I
  get the best performance.
 
  # of Xceivers == number of writers when ingest is the primary target.
 
  Also, is this 1.4 or trunk?
 
  -Eric
 
 
 
  On Tue, Jan 1, 2013 at 9:19 PM, Kepner, Jeremy - 1010 - MITLL 
  kep...@ll.mit.edu wrote:
 
  Accumulo Colleagues,
   I am trying to optimize my ingest into a single node Accumulo
 instance
  running on a 32 core node with 96 GB of RAM.  I am seeing the follow
 ingest
  variations as a I change the number of ingest processes (see
 attached):
 
  -
  Ingestors, Ingest rate
  -
  1, 60K inserts/sec (stable)
  2, 120K inserts/sec (stable)
  3, 60K to 180K inserts/sec
  4, 90K to 220K inserts/sec
  8, 80K to 280K inserts/sec
  12, 80K to 280K inserts/sec
  -
 
  The only thing I can see that correlates with the ingest rate is the
  number of Xceivers.  When the ingest rate is high the number of
 Xceivers is
  usually low.  Likewise, when the ingest rate drops, the number of
 Xceivers
  usually increases significantly.
 
   Question: What role do Xceivers play in ingest?
 
  Request: It would be great to add a plot showing the number of
 Xceivers
  over time to the diagnostics.
 
  Regards.  -Jeremy
 
 




Re: ingest performance oscillations and Xceivers

2013-01-02 Thread William Slacum
How many disks do you have? That can bottleneck throughput, since the
number of Xceivers is related to the number of resources (threads, sockets:
http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/) used at once
to perform operations.

On Tue, Jan 1, 2013 at 6:45 PM, Eric Newton eric.new...@gmail.com wrote:

 Hey Jeremy,

 Can you compare the ingest rate to the number of tablets, too?

 I've found, that if I have 20-80 tablets per server (on similar hardware) I
 get the best performance.

 # of Xceivers == number of writers when ingest is the primary target.

 Also, is this 1.4 or trunk?

 On Tue, Jan 1, 2013 at 9:19 PM, Kepner, Jeremy - 1010 - MITLL 
 kep...@ll.mit.edu wrote:

  Accumulo Colleagues,
I am trying to optimize my ingest into a single node Accumulo instance
  running on a 32 core node with 96 GB of RAM.  I am seeing the follow
 ingest
  variations as a I change the number of ingest processes (see attached):
 
  -
  Ingestors, Ingest rate
  -
  1, 60K inserts/sec (stable)
  2, 120K inserts/sec (stable)
  3, 60K to 180K inserts/sec
  4, 90K to 220K inserts/sec
  8, 80K to 280K inserts/sec
  12, 80K to 280K inserts/sec
  -
 
  The only thing I can see that correlates with the ingest rate is the
  number of Xceivers.  When the ingest rate is high the number of Xceivers
 is
  usually low.  Likewise, when the ingest rate drops, the number of
 Xceivers
  usually increases significantly.
 
   Question: What role do Xceivers play in ingest?
 
  Request: It would be great to add a plot showing the number of Xceivers
  over time to the diagnostics.
 
  Regards.  -Jeremy
 
 



Re: problems running accumuo

2012-12-28 Thread William Slacum
Did you run `accumulo init`? Do you have a `/accumulo` directory in HDFS?

On Fri, Dec 28, 2012 at 9:54 AM, Tim Piety timpi...@gmail.com wrote:

 I have installed CDH3 and ZooKeeper on a CENTOS 6.3. VM (4G memory).
 Hadoop and ZooKeeper appear to run fine. I installed accumulo-1.4.2 and
 believe I configured it correctly. I used the accumulo-env.sh in the
 conf/examples/1GB/native-standalone as my template.



Re: SplitLarge Utility

2012-11-13 Thread William Slacum
If it's used by RFile during a system invoked task, then I'd say leave it.
If you want to make a shell friendly interface for invoking it, I'm all for
it.

On Tue, Nov 13, 2012 at 5:59 AM, David Medinets david.medin...@gmail.com wrote:

 It is out of place, to me, because the Accumulo Shell should be the
 primary mechanism for Accumulo system administration. Why have a
 utility that can't be invoked from the Shell? Any objections to moving
 it?

 On Tue, Nov 13, 2012 at 5:59 AM, Eric Newton eric.new...@gmail.com
 wrote:
  Yes, I've had users accidentally ingest key/values that were so large
 that
  they could not be compacted (the tablet server would run out of memory
 and
  crash).  This utility allowed me to remove the large key/values and
  preserve them for analysis.  Why is it that you want to move it?  Could
 you
  be a little more specific about why it seems out of place?
 
 
  On Mon, Nov 12, 2012 at 10:19 PM, David Medinets
  david.medin...@gmail.comwrote:
 
  Is this utility something useful in a production shop? If so, should
  it be integrated into the shell? Maybe it should be moved to the
  contribs directory? I seems out of place in the current
  org.apache.accumulo.core.file.rfile package.
 



Re: Key.getColumnFamilyAsBytes - comments about suggested new method?

2012-11-13 Thread William Slacum
For efficiency reasons, I'd leave the methods that take a Text object
as-is. This avoids a third copy of the data when a user actually wants it
in Text form.
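
To illustrate the point (assuming a Scanner that is already configured), a
caller who wants Text can keep reusing one buffer for an entire scan, so the
only copy made per entry is the one into the caller's own object:

  import java.util.Map;

  import org.apache.accumulo.core.client.Scanner;
  import org.apache.accumulo.core.data.Key;
  import org.apache.accumulo.core.data.Value;
  import org.apache.hadoop.io.Text;

  public class ReuseText {
    static void visitFamilies(Scanner scanner) {
      Text cf = new Text(); // one buffer, reused for every entry
      for (Map.Entry<Key,Value> entry : scanner) {
        entry.getKey().getColumnFamily(cf); // fills the caller's Text, no new object per key
        // ... inspect cf here ...
      }
    }
  }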

On Tue, Nov 13, 2012 at 12:25 PM, David Medinets
david.medin...@gmail.com wrote:

 In Key.java, I see this:

   public Text getColumnFamily(Text cf) {
 cf.set(colFamily, 0, colFamily.length);
 return cf;
   }

   public Text getColumnFamily() {
 return getColumnFamily(new Text());
   }

 in TabletServerBatchDeleter, I see this:

 Mutation m = new Mutation(k.getRow());
 m.putDelete(k.getColumnFamily(), k.getColumnQualifier(), new
 ColumnVisibility(k.getColumnVisibility()), k.getTimestamp());

 The change I recently committed would allow using byte arrays as
 arguments to putDelete. It seems adding a method to Key like the
 following would eliminate creating the Text object:

    public byte[] getColumnFamilyAsBytes() {
      byte[] buffer = new byte[colFamily.length];
      System.arraycopy(colFamily, 0, buffer, 0, colFamily.length);
      return buffer;
    }

 I don't want to head down a twisty windy path removing Text objects
 but does it make sense to reduce reliance on them?
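
As a rough illustration of where the proposed accessor would pay off, here is a minimal sketch of the TabletServerBatchDeleter snippet quoted above rewritten against it. The getColumnFamilyAsBytes/getColumnQualifierAsBytes names and the byte[]-based putDelete overload are assumptions drawn from David's proposal, not released API:

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class DeleteMutationSketch {
      // Hypothetical rewrite of the quoted snippet, assuming byte[] accessors on Key
      // and the byte[] overload of putDelete that David describes.
      static Mutation deleteMutationFor(Key k) {
        Mutation m = new Mutation(k.getRow());
        m.putDelete(k.getColumnFamilyAsBytes(),        // no intermediate Text copy
            k.getColumnQualifierAsBytes(),             // assumed companion accessor
            new ColumnVisibility(k.getColumnVisibility()),
            k.getTimestamp());
        return m;
      }
    }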



Re: IteratorSetting and priorities

2012-10-31 Thread William Slacum
It's because you're building a stack of iterators: the order you set on the
scanner is the order in which sources are created and passed to init() for
each iterator in the stack when the scan executes on a TServer. Albeit
deprecated, the filtering API in 1.3 does allow you to set multiple filters
at the same priority, though it is broken in certain cases.

The semantics of set up and call order are such that I want my KVs coming
out of iterator A at priority N to be handled by iterator B at priority N +
1. If you want function composition of your predicates, then increasing
priorities/positions in the stack is the correct approach. I think most, if
not all, of what you want can be accomplished via client side helpers.
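
To make the stacking concrete, here is a minimal client-side sketch; it assumes a Scanner has already been built from a Connector, and the choice of the stock AgeOffFilter and RegExFilter, the names, and the priorities are illustrative only:

    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.iterators.user.AgeOffFilter;
    import org.apache.accumulo.core.iterators.user.RegExFilter;

    public class IteratorStackSketch {
      // Chain two stock filters: the lower priority runs closer to the data and the
      // next priority consumes its output (the A-at-N, B-at-N+1 composition above).
      static void configureFilters(Scanner scanner) {
        IteratorSetting ageOff = new IteratorSetting(10, "ageoff", AgeOffFilter.class);
        AgeOffFilter.setTTL(ageOff, 3600L * 1000L);   // keep only entries from the last hour

        IteratorSetting pattern = new IteratorSetting(11, "rowPattern", RegExFilter.class);
        RegExFilter.setRegexs(pattern, "row.*", null, null, null, false);

        scanner.addScanIterator(ageOff);    // priority 10: applied first on the tserver
        scanner.addScanIterator(pattern);   // priority 11: sees only what priority 10 emits
      }
    }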

For an example of where a tree model is used (and you'll be able to see that
trees can't actually be defined on the client side), check out the
IntersectingIterator.

On Wed, Oct 31, 2012 at 7:52 AM, Patrone, Dennis S. 
dennis.patr...@jhuapl.edu wrote:

  The issue with giving multiple iterators the same priority is that the
 API specifies that during the call to init(), one source is given the
 iterator.

 I fail to see how this is an issue.  I don't really want a tree of
 iterators (I'm not sure how you'd combine the multiple results moving back
 up the tree).  I still want a straight line of iterators, I just don't want
 to have to worry about ordering within a set of them at the same priority
 level.

 So right now if I add I1 @ priority 1, I2 @ priority 2, and I3 @ priority
 3, then basically (as I understand it, at least) the output of I1 is fed
 into I2.  Then the output of I2 is fed into I3.

 What I want is the API to allow me to add I2 and I3 at priority 2.  Then
 the system has two choices to process my request:

 I1 - I2 - I3

 ...OR...

 I1 - I3 - I2

 Based on my priority values, I don't care which processing chain is
 followed; either is correct.

 What I'm NOT asking for is this:

       I2
      /
   I1
      \
       I3

 Am I missing something?

 Billie- I also looked at ACCUMULO-759.  I need some time later to read
 through it and follow the discussion but then I will try to add something
 coherent.

 Thanks,
 Dennis




Re: IteratorSetting and priorities

2012-10-30 Thread William Slacum
The issue with giving multiple iterators the same priority is that the API
specifies that during the call to init(), one source is given the iterator.
Now, that iterator can make multiple copies of that source via deepCopy()
to make a tree of iterators, but by default it's given one source.
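
For a feel of what that server-side tree looks like, here is a minimal sketch built on the stock WrappingIterator base class; the class name is made up, and a real iterator would also override seek()/next() to advance and merge the two branches:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.IteratorEnvironment;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
    import org.apache.accumulo.core.iterators.WrappingIterator;

    // Hypothetical iterator showing how deepCopy() yields a second, independently
    // seekable view of the single source handed to init().
    public class TwoBranchIterator extends WrappingIterator {
      private SortedKeyValueIterator<Key,Value> secondBranch;

      @Override
      public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options,
          IteratorEnvironment env) throws IOException {
        super.init(source, options, env);     // first branch: the source we were given
        secondBranch = source.deepCopy(env);  // second branch: an independent copy
        // seek()/next() would advance both branches and merge their results
      }
    }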

In the absence of a more convenient API for tracking priorities, you could
create a QueueIteratorSetting, push the filters you want onto it, and
iteratively apply each IteratorSetting to the Scanner after you're done.

Personally, I have kicked around the idea of client helpers that keep track
of priorities and provide queue- or stack-like interfaces for setting up
iterators. This doesn't solve the disparity between being able to create
trees of iterators on the server side versus only being able to create a
stack on the client side.
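
A minimal sketch of such a client-side helper, using only the public IteratorSetting/ScannerBase API; the class name, method names, and starting priority are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ScannerBase;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

    // Hypothetical helper along the lines described above: callers push iterators in
    // the order they want them applied, and consecutive priorities are handed out
    // automatically when the settings are attached to a scanner.
    public class IteratorQueue {
      private final List<IteratorSetting> settings = new ArrayList<IteratorSetting>();
      private int nextPriority;

      public IteratorQueue(int startingPriority) {
        this.nextPriority = startingPriority;
      }

      // Queue an iterator; priorities follow the order in which iterators are pushed.
      public IteratorSetting push(String name,
          Class<? extends SortedKeyValueIterator<Key,Value>> iteratorClass) {
        IteratorSetting setting = new IteratorSetting(nextPriority++, name, iteratorClass);
        settings.add(setting);
        return setting;   // caller can still call setting.addOption(...) before applying
      }

      // Attach everything queued so far to a scanner.
      public void applyTo(ScannerBase scanner) {
        for (IteratorSetting setting : settings) {
          scanner.addScanIterator(setting);
        }
      }
    }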

On Tue, Oct 30, 2012 at 11:02 AM, Patrone, Dennis S. 
dennis.patr...@jhuapl.edu wrote:

 Hi all,

 Is there a reason that ScannerOptions only allows a single iterator per
 priority value?  It seems that multiple iterators added at the same
 priority could just be executed in an arbitrary order by the system.

 I have a ScannerBase that gets passed around through several classes.
  These classes add different filters (for different reasons) to the scanner
 based on the particular request being processed and user configuration.
  Requiring only one filter per priority imposes a dependency among the
 different classes managing the filters.  They have to coordinate to make
 sure no one reuses the same priority.

 I'd rather be able to set priorities based on the (expected) selectivity
 of the filter only within the class adding a subset of the filters, and let
 the cross-'domain' filtering priorities be managed automatically by
 Accumulo.

 Even worse, the ScannerBase API does not provide access to the
 already-added IteratorSettings or even the min/max iterator priority, so I
 have no way AFAICT to ensure via the API that my iterator priority is not
 in conflict with an existing priority.  I have to manage the priority value
 through an unenforceable convention... and wait for a RuntimeException(!)
 to tell me when the convention is violated.

 I think minimally an accessor method needs to be added so I can ensure my
 priority isn't going to clash and cause an IllegalArgumentException.

 Ideally, I'd like to see filters added at the same priority allowed and
 just executed in some arbitrary order (or some well-defined order within
 the priority, e.g., in order they were added?).

 I'd be willing to contribute some updates for this, but before I started I
 wanted to see if this is reasonable, if anyone else thinks it is a good
 idea, or if there are real valid reasons only one iterator per priority is
 allowed.

 Thanks,
 Dennis


 Dennis Patrone
 The Johns Hopkins University / Applied Physics Laboratory
 240-228-2285 / Washington
 443-778-2285 / Baltimore
 443-220-7190 / Cell
 dennis.patr...@jhuapl.edu




Re: Setting Charset in getBytes() call.

2012-10-29 Thread William Slacum
Isn't it easier to just set the JVM property `file.encoding`?

On Sun, Oct 28, 2012 at 3:18 PM, Ed Kohlwey ekohl...@gmail.com wrote:

 If you use a private static field in each class for the charset, it will
 basically be a singleton, because charsets are cached by Charset.forName().
 IMHO this is a somewhat cleaner approach than having lots of static imports
 to utility classes with lots of constants in them.
 On Oct 28, 2012 5:50 PM, David Medinets david.medin...@gmail.com
 wrote:

 
 
  https://issues.apache.org/jira/browse/ACCUMULO-241?focusedCommentId=13449680&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13449680
 
  In this comment, John mentioned that all getBytes() method calls
  should be changed to use UTF8. There are about 1,800 getBytes() calls
  and not all of them involve String objects. I am working on ways to
  identify a subset of these calls to change.
 
  I have created https://issues.apache.org/jira/browse/ACCUMULO-836 to
  track this issue.
 
  Should we create one static Charset object?
 
    class AccumuloDefaultCharset {
      public static Charset UTF8 = Charset.forName("UTF8");
    }
 
  Should we use a static constant?
 
    public static String UTF8 = "UTF8";
 
  I have found one instance of getBytes() in InputFormatBase:
 
    protected static byte[] getPassword(Configuration conf) {
      return Base64.decodeBase64(conf.get(PASSWORD, "").getBytes());
    }
 
  Are there any reasons why I can't start specifying the charset? Is
  UTF8 the right Charset to use? I am not an expert in non-English
  charsets, so guidance would be welcome.
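
As a concrete (if hypothetical) sketch of the single-constant, explicit-charset idea discussed above: the class and method names here are made up for illustration, and "passwordKey" stands in for the PASSWORD constant referenced in the quoted snippet; only Charset.forName, Configuration.get, String.getBytes(Charset), and commons-codec's Base64.decodeBase64 are real APIs.

    import java.nio.charset.Charset;

    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: one shared Charset constant plus an explicit-charset
    // rewrite of the getPassword() example quoted above.
    public final class CharsetExample {
      // Charset.forName caches instances, so this is effectively a singleton.
      public static final Charset UTF8 = Charset.forName("UTF-8");

      private CharsetExample() {}

      static byte[] getPassword(Configuration conf, String passwordKey) {
        return Base64.decodeBase64(conf.get(passwordKey, "").getBytes(UTF8));
      }
    }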
 



Re: Unapproved License Message From assemble/build.sh)

2012-10-22 Thread William Slacum
Billie-- any way around the issue with different versions of rat considering
the odp files binary? I'm noticing they're getting marked for me on OSX
10.7.5, and they seem to be the difference in the file counts.

On Mon, Oct 22, 2012 at 4:55 PM, Michael Flester fles...@gmail.com wrote:

 I've checked the rat source code[1] and followed the trail to the codehaus
 DirectoryScanner[2]
 and don't find any code that reads or obeys an exclude list other than some
 defaults
 that can be turned on or off. These default excludes seem to be hard coded
 everywhere,
 e.g. a set of excludes for maven, a set for eclipse, nothing project
 specific.


 [1]

  https://svn.apache.org/repos/asf/creadur/rat/trunk/apache-rat-plugin/src/main/java/org/apache/rat/mp/AbstractRatMojo.java (line 302)

 [2]

 http://plexus.codehaus.org/plexus-utils/apidocs/org/codehaus/plexus/util/AbstractScanner.html#DEFAULTEXCLUDES

 On Mon, Oct 22, 2012 at 9:35 AM, Billie Rinaldi billie.rina...@gmail.com
 wrote:

  I've noticed that different systems (perhaps different versions of the
 rat
  plugin) can show different numbers of files as missing licenses. For
  example some know that odp files are binary and some do not. So we need a
  better method than counting them. I didn't know that the exclusion lists
  weren't working. We should look into that.
 
  Billie
 
 
 
  On Oct 21, 2012, at 8:54 PM, David Medinets david.medin...@gmail.com
  wrote:
 
   Thanks. I checked the rat.txt file and saw 56 files listed with
   question marks. Then I commented out the exclusion section of the rat
  configuration in pom.xml. There was no change in rat.txt. It seems
  like the exclusion list is ignored. There is a dearth of information
  about this topic on Google. I'd like to help resolve this issue but I
  don't know how.
 
  If someone verifies this issue (56 files unapproved by rat), then I'll
  create a JIRA ticket.
 
  On Sun, Oct 21, 2012 at 12:46 AM, Christopher Tubbs ctubb...@gmail.com
 
  wrote:
 
  If you look in the target directory, after running mvn rat:check,
  you can see a rat.txt file, which has the details. You can also run
  mvn rat:rat (the goal executed for the site build when the rat
  plugin is included as a site report) if you want to see it in a
  prettier format (target/site/rat-report.html).
 
  --
  Christopher L Tubbs II
  http://gravatar.com/ctubbsii
 
 
  On Sat, Oct 20, 2012 at 11:50 PM, David Medinets
  david.medin...@gmail.com wrote:
 
  Apache Rat is reporting unapproved licenses. I don't know anything
  about Rat. Is there someway for Rat to report which files are missing
  licenses?
 
   [INFO] Building accumulo
   [INFO]    task-segment: [rat:check]
   [INFO] ------------------------------------------------------------------------
   [INFO] [rat:check {execution: default-cli}]
   [INFO] ------------------------------------------------------------------------
   [ERROR] BUILD FAILURE
   [INFO] ------------------------------------------------------------------------
   [INFO] Too many unapproved licenses: 291
   [INFO] ------------------------------------------------------------------------
   [INFO] For more information, run Maven with the -e switch
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time: 3 seconds
   [INFO] Finished at: Sat Oct 20 23:44:28 EDT 2012
   [INFO] Final Memory: 21M/52M
   [INFO] ------------------------------------------------------------------------
 expected 53 files missing licenses, but saw 56
 
 



Re: Running Examples Within Eclipse (Missing Class)

2012-10-19 Thread William Slacum
You need to add the zookeeper jar to the run/debug profile for the class
you're executing.

On Thu, Oct 18, 2012 at 10:01 PM, David Medinets
david.medin...@gmail.com wrote:

 I imported the Accumulo project into the Spring Tool Suite (which is
 Eclipse-based) as a Maven project. Everything seemed fine, but I ran into an
 issue when I tried to run the RowOperations example inside Eclipse.
 Should this work? Below is the exception:

 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/zookeeper/KeeperException
 at org.apache.accumulo.core.client.ZooKeeperInstance.<init>(ZooKeeperInstance.java:99)
 at org.apache.accumulo.core.client.ZooKeeperInstance.<init>(ZooKeeperInstance.java:81)
 at org.apache.accumulo.examples.simple.client.RowOperations.main(RowOperations.java:59)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.zookeeper.KeeperException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 ... 3 more



Re: JIRA Etiquette / Hackathon Projects

2012-10-08 Thread William Slacum
At some point we had default assignees, so I believe they should be fair
game. If there aren't any patches and it's been open for a while, I think
that's an even stronger case to work on it (speaking of which, I think I
have a ticket or two I need to finish up!).

On Mon, Oct 8, 2012 at 11:13 AM, Drew Farris d...@apache.org wrote:

 Hi All,

 A question about JIRA etiquette in the context of the Accumulo project:

 If an issue is assigned, but has a state of 'open' (as opposed to 'in
 progress'), is it considered rude for someone else to begin work on
  the issue? Are only those tickets marked as 'unassigned' fair game?

  I'm reviewing tickets to develop a list of issues people could work on at
  the upcoming Accumulo Hackathon. I want to be sure that I do not
  suggest anything that someone is either in the middle of working on or is
  substantially invested in.

 Are there particular tickets that any of you think would be good
 candidates to work during the Hackathon?

 Thanks,

 Drew



Re: new committers!

2012-08-06 Thread William Slacum
Thanks guys!

I hope to contribute to as many areas as possible, but I'm really
interested in helping make Accumulo an easy tool to set up, throw some data
at, and pull out data in some meaningful way. To start, I may be giving the
Wikipedia example some TLC :)

On Mon, Aug 6, 2012 at 1:13 PM, David Medinets david.medin...@gmail.com wrote:

 +2!

 On Mon, Aug 6, 2012 at 1:11 PM, Jim Klucar klu...@gmail.com wrote:
  Congrats Josh and Bill!
 
  On Mon, Aug 6, 2012 at 1:08 PM, Billie J Rinaldi
  billie.j.rina...@ugov.gov wrote:
  I am pleased to announce that Josh Elser and Bill Slacum have been
 voted to become new committers for Apache Accumulo.
 
  Welcome, Josh and Bill!  Feel free to say a few words about your
 development interests.
 
  Billie



[jira] [Commented] (ACCUMULO-702) build on ubuntu hangs without required dependencies

2012-07-25 Thread William Slacum (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422844#comment-13422844
 ] 

William Slacum commented on ACCUMULO-702:
-

I'd prefer it if things that make external calls were turned on by a user flag or 
possibly a build profile. The Thrift code, in the past, was already generated 
and I don't really see a need to build the documentation when all I want to do 
is get a running environment set up. 

 build on ubuntu hangs without required dependencies
 ---

 Key: ACCUMULO-702
 URL: https://issues.apache.org/jira/browse/ACCUMULO-702
 Project: Accumulo
  Issue Type: Bug
  Components: docs
Affects Versions: 1.5.0
 Environment: Ubuntu 12.04 LTS
Reporter: Dave Marion
Assignee: David Medinets
Priority: Minor
 Fix For: 1.5.0

 Attachments: build.sh.patch


 build hangs when correct packages are not installed.
 [INFO] --- exec-maven-plugin:1.2.1:exec (user-manual) @ accumulo ---
 This is pdfTeX, Version 3.1415926-1.40.10 (TeX Live 2009/Debian)
 entering extended mode
 (./accumulo_user_manual.tex
 LaTeX2e 2009/09/24
 Babel v3.8l and hyphenation patterns for english, usenglishmax, dumylang,
 nohyphenation, loaded.
 (/usr/share/texmf-texlive/tex/latex/base/report.cls
 Document Class: report 2007/10/19 v1.4h Standard LaTeX document class
 (/usr/share/texmf-texlive/tex/latex/base/size11.clo))
 (/usr/share/texmf-texlive/tex/latex/base/alltt.sty)
 ! LaTeX Error: File `multirow.sty' not found.
 Type X to quit or RETURN to proceed,
 or enter new name. (Default extension: sty)
 Enter file name: 





[jira] [Commented] (ACCUMULO-703) Add PrintInfo shortcut to bin/accumulo

2012-07-25 Thread William Slacum (JIRA)

[ 
https://issues.apache.org/jira/browse/ACCUMULO-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422851#comment-13422851
 ] 

William Slacum commented on ACCUMULO-703:
-

Keith, how/where would I go about adding it to the user manual? I did a grep 
for PrintInfo to see where I could replace documentation, but didn't find 
anything for the user manual. 

 Add PrintInfo shortcut to bin/accumulo
 --

 Key: ACCUMULO-703
 URL: https://issues.apache.org/jira/browse/ACCUMULO-703
 Project: Accumulo
  Issue Type: Improvement
Affects Versions: 1.5.0-SNAPSHOT
 Environment: OSX
Reporter: William Slacum
Assignee: William Slacum
Priority: Trivial
 Attachments: rfile-info.2.patch, rfile-info.patch

   Original Estimate: 1m
  Remaining Estimate: 1m

 Accumulo has a utility, org.apache.accumulo.core.file.rfile.PrintInfo, that 
  will summarize an RFile and even print out all of the keys for you. It'd be 
 nice to run it via {{$ACCUMULO_HOME/bin/accumulo rfile-info}}.




