Re: Tracking cardinality in Accumulo

2014-05-16 Thread Marc Parisi
On Fri, May 16, 2014 at 6:04 PM, Corey Nolet wrote: > What's the expected size of your unique key set? Thousands? Millions? > Billions? > > You could probably use a table structure similar to > https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store but > just have it emit

Re: In v1.5.0 CLI, how can you scan for empty CF with a CQ value?

2014-05-16 Thread David Medinets
I have only 1.5.0. Perhaps I need to expend the effort to upgrade. Time being precious I've been procrastinating. On Fri, May 16, 2014 at 11:59 AM, Josh Elser wrote: > On 5/16/14, 10:38 AM, David Medinets wrote: > >> I tried both of the following ways: >> >> scan -c :name >> > > This worked for

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Marc Parisi
Whoops, sorry for the empty response, but I'm new to E-mail. The bitset within HLL supports union and intersection. You should be able to estimate cardinality without re-reading the data. In effect, you can segment your estimation and keep the error below about 2%. Union is straightforward, whereas int
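
A minimal sketch of the union idea, not the stream-lib or Accumulo implementation: two HyperLogLog sketches built over separate segments of the data can be merged by taking the register-wise maximum, so the combined cardinality is estimated without re-reading either segment. The class name `HyperLogLog`, the precision `p=12` (4096 registers, roughly 1.6% standard error), and the SHA-1 based hash are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (illustrative; names and parameters are assumptions)."""

    def __init__(self, p=12):
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash of the value.
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other):
        # Merging is a register-wise max; both sketches must share the same p.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Usage: sketch two overlapping segments separately, then merge them.
segment_a = HyperLogLog()
segment_b = HyperLogLog()
for i in range(0, 6000):
    segment_a.add(f"user-{i}")
for i in range(4000, 10000):
    segment_b.add(f"user-{i}")
print(round(segment_a.union(segment_b).estimate()))  # close to 10000, not exact
```

Because the merge only touches the registers, each segment's sketch could be stored and combined later; the estimate stays approximate.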

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Corey Nolet
What's the expected size of your unique key set? Thousands? Millions? Billions? You could probably use a table structure similar to https://github.com/calrissian/accumulo-recipes/tree/master/store/metrics-store but just have it emit 1's instead of summing them. I'm thinking maybe your mappings cou

Re: MR Data Locality with AccumuloInputFormat?

2014-05-16 Thread Russ Weeks
Thanks, Josh. I'll take a look through the Hadoop web UI. -Russ On Fri, May 16, 2014 at 1:37 PM, Josh Elser wrote: > Hi Russ, > > I believe that the AccumuloInputFormat will use the splits on the table > you're reading to generate the MR InputSplits. The InputFormat should be > trying to run th

Re: MR Data Locality with AccumuloInputFormat?

2014-05-16 Thread Corey Nolet
Has the table been compacted since loading the data? Hi Russ, I believe that the AccumuloInputFormat will use the splits on the table you're reading to generate the MR InputSplits. The InputFormat should be trying to run the Mappers on the same machine as the tserver serving the data is located.

Document-Partitioned Indexing - Optimizing Mutation Size

2014-05-16 Thread Slater, David M.
Hi, quick question, I’m attempting to optimize the ingest rates for a document-partitioned table. I am currently presplitting the tables and have even spread of data across tablet servers. However, I was wondering if changing the size of mutations would have a major impact on the ingest rates.

Re: Tracking cardinality in Accumulo

2014-05-16 Thread David Medinets
Yes, the data has not yet been ingested. I can control the table structure; hopefully by integrating (or extending) the D4M schema. I'm leaning towards using https://github.com/addthis/stream-lib as part of the ingest process. Upon start up, existing tables would be analyzed to find cardinality. T

Re: MR Data Locality with AccumuloInputFormat?

2014-05-16 Thread Josh Elser
Hi Russ, I believe that the AccumuloInputFormat will use the splits on the table you're reading to generate the MR InputSplits. The InputFormat should be trying to run the Mappers on the same machine as the tserver serving the data is located. If you're only getting a few mappers, adding mor

Re: Tracking cardinality in Accumulo

2014-05-16 Thread Corey Nolet
Can we assume this data has not yet been ingested? Do you have control over the way in which you structure your table? On Fri, May 16, 2014 at 1:54 PM, David Medinets wrote: > If I have the following simple set of data: > > NAME John > NAME Jake > NAME John > NAME Mary > > I want to end up with

Re: In v1.5.0 CLI, how can you scan for empty CF with a CQ value?

2014-05-16 Thread Josh Elser
On 5/16/14, 10:38 AM, David Medinets wrote: I tried both of the following ways: scan -c :name This worked for me with 1.6.0. Does it fail with 1.5.1? scan -c "":name Neither worked. Is there a way?

Re: Pagination in Accumulo (also D4M Data Explorer!)

2014-05-16 Thread David Medinets
Josh, this morning I woke up and remembered that I wrote http://affy.blogspot.com/2012/11/how-can-i-use-reverse-sort-on-generic.html about 18 months ago. I can easily add a reverse index in order to extend the D4M schema. I'm glad to see that reverse scanning is possible in HBase. On Thu, May 15
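
One common way to get reverse ordering out of a store that only scans in ascending key order is to write a second, byte-complemented copy of each key. The sketch below is an assumption about how such a reverse index could be encoded; it is not taken from the linked blog post or from D4M, and `reverse_key` is a hypothetical helper.

```python
def reverse_key(key: str) -> bytes:
    """Encode a key so that ascending byte order yields descending key order.

    Each byte is complemented and a 0xFF terminator is appended so that a key
    which is a prefix of another still sorts after it. Keys containing a NUL
    byte are rejected because their complement would collide with the terminator.
    """
    raw = key.encode("utf-8")
    if b"\x00" in raw:
        raise ValueError("keys containing NUL bytes are not supported")
    return bytes(0xFF - b for b in raw) + b"\xff"


# Usage: sorting by the encoded key walks the original keys in reverse order.
names = ["John", "Jake", "Mary"]
print(sorted(names, key=reverse_key))  # ['Mary', 'John', 'Jake']
```

The cost of this design is a second copy of every key, in exchange for reverse scans that need no server-side support.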

MR Data Locality with AccumuloInputFormat?

2014-05-16 Thread Russ Weeks
Hi, folks, When I execute an MR job with AccumuloInputFormat, are there any guarantees about which mappers process which rows? I'm trying to minimize crosstalk in my cluster but either I haven't split my table properly or I'm expecting too much, because I'm only seeing 1 or 2 nodes running MR task

Re: Tracking cardinality in Accumulo

2014-05-16 Thread William Slacum
Yes. It will be less useful if you can't scan only the newest data, as you'll be recombining the same pieces of data on subsequent runs. On Fri, May 16, 2014 at 1:54 PM, David Medinets wrote: > If I have the following simple set of data: > > NAME John > NAME Jake > NAME John > NAME Mary > > I wa

Re: Pagination in Accumulo (also D4M Data Explorer!)

2014-05-16 Thread Josh Elser
Reverse scanning isn't necessarily infeasible: https://issues.apache.org/jira/browse/HBASE-4811 This might be something cool that could be implemented to make this sort of thing easier. The pagination isolation you mention in Approach B is interesting. I'm curious as to how cloning tables

Tracking cardinality in Accumulo

2014-05-16 Thread David Medinets
If I have the following simple set of data: NAME John NAME Jake NAME John NAME Mary I want to end up with the following: NAME 3 I'm thinking that perhaps a HyperLogLog approach should work. See http://en.wikipedia.org/wiki/HyperLogLog for more information. Has anyone done this before in Accumu
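
For reference, the target result in the example above can be computed exactly with one set per field; this is only a baseline sketch (the function name `exact_cardinality` is hypothetical), and it is the per-field set that a HyperLogLog sketch would replace with a fixed-size approximation when the number of unique values is large.

```python
from collections import defaultdict


def exact_cardinality(pairs):
    """Return the number of distinct values seen for each field."""
    seen = defaultdict(set)
    for field, value in pairs:
        seen[field].add(value)          # duplicates collapse inside the set
    return {field: len(values) for field, values in seen.items()}


rows = [("NAME", "John"), ("NAME", "Jake"), ("NAME", "John"), ("NAME", "Mary")]
print(exact_cardinality(rows))  # {'NAME': 3}
```

Memory here grows with the number of unique values, which is why the expected size of the unique key set matters.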

In v1.5.0 CLI, how can you scan for empty CF with a CQ value?

2014-05-16 Thread David Medinets
I tried both of the following ways: scan -c :name scan -c "":name Neither worked. Is there a way?

Re: Trace log

2014-05-16 Thread Josh Elser
Just to be clear, if the warning message eventually goes away (should be within seconds, maybe minutes), then it's probably just asynchronous delay. If a substantial time later you're still getting the warning, that's probably a sign that the tracing is being used incorrectly (opened and not closed as Eric said