Counting Bloom Filters

2014-04-06 Thread Arshak Navruzyan
Can the bloom filter functionality in Accumulo (1.5.x) be adapted to become a counting bloom filter? I would like to use a counting bloom filter (which uses an array of counting bins rather than a single bit for each array position) to get the number of matches that are encountered for each corres
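
For readers unfamiliar with the structure being asked about, here is a minimal, standalone sketch of a counting Bloom filter: each array position holds an integer counter instead of a single bit, and the smallest counter across an item's positions gives an upper-bound estimate of how many times it was added. This is not Accumulo's bloom filter code or API. The class name, the parameters `num_bins` and `num_hashes`, and the use of seeded SHA-256 for hashing are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter (hypothetical, not Accumulo's implementation)."""

    def __init__(self, num_bins=1024, num_hashes=3):
        self.num_bins = num_bins
        self.num_hashes = num_hashes
        # One integer counter per position instead of a single bit.
        self.bins = [0] * num_bins

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bins

    def add(self, item):
        # Increment every counter the item hashes to.
        for pos in self._positions(item):
            self.bins[pos] += 1

    def estimate(self, item):
        # The minimum counter never undercounts the true number of adds,
        # but collisions with other items can inflate it.
        return min(self.bins[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(num_bins=64, num_hashes=3)
for _ in range(3):
    cbf.add("apple")
apple_estimate = cbf.estimate("apple")  # at least 3; collisions can only raise it
```

The estimate is one-sided: it can be too high but never too low, which is the same trade-off a plain Bloom filter makes for membership tests.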

Re: Accumulo and OSGi

2014-04-06 Thread Corey Nolet
Geoffrey, My quick answer is that I needed to adjust my container (Karaf in my case) to export the JAAS packages because they come in the JRE. Then I needed to make the hadoop bundle import them. Also before I forget, Hadoop packages its default xml configurations (core-site.xml, core-default.xml

Re: Accumulo and OSGi

2014-04-06 Thread Geoffry Roberts
All, To what extent does the Accumulo Client rely on the Hadoop Client? I apologize if the question is a bit obtuse, but I got into the dependency weeds trying to get the Hadoop Client to work in OSGi. (See Hadoop Client woes below.) I am now wondering: if I OSGified Accumulo's client, would I encou

Re: RowID format tradeoffs

2014-04-06 Thread Christopher
You could try sharding: if your RowID is the ingest date (so that you can scan over recently ingested data, as you describe), you could use a RowID of "ShardID_IngestDate" instead, where ShardID = hash(row) % numShards. This will result in numShards rows for each IngestDate, and is cho
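
The "ShardID_IngestDate" scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `row_key` argument (standing in for whatever value is hashed as `row`), the CRC32 hash, and the zero-padded shard prefix are hypothetical choices, not something the thread or Accumulo prescribes.

```python
import zlib


def shard_width(num_shards):
    # Zero-pad shard IDs so they sort consistently as strings.
    return len(str(num_shards - 1))


def sharded_row_id(row_key, ingest_date, num_shards=16):
    # ShardID = hash(row) % numShards, using CRC32 as a stable hash
    # (Python's built-in hash() is randomized per process for strings).
    shard_id = zlib.crc32(row_key.encode("utf-8")) % num_shards
    return f"{shard_id:0{shard_width(num_shards)}d}_{ingest_date}"


def rows_for_date(ingest_date, num_shards=16):
    # Reading one ingest date now means visiting one row per shard.
    width = shard_width(num_shards)
    return [f"{shard:0{width}d}_{ingest_date}" for shard in range(num_shards)]


row_id = sharded_row_id("record-42", "20140406")
rows_to_scan = rows_for_date("20140406")
```

Writes for a single date spread across up to `num_shards` rows, at the cost of one lookup per shard when reading that date back.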

Re: RowID format tradeoffs

2014-04-06 Thread Ariel Valentin
Russ, I experienced the same problem. In the end, we decided to take another property, use it as a prefix, and then presplit the tables, e.g. apples\0454316778. We still have situations where nodes run hot during peak usage, but we are able to live with it. Thanks, Ariel

RowID format tradeoffs

2014-04-06 Thread Russ Weeks
Hi, I'm looking for advice re. the best way to structure my row IDs. Monotonically increasing IDs have the very appealing property that I can quickly scan all recently-ingested unprocessed rows, particularly because I maintain a "checkpoint" of the most-recently processed row. Of course, the prob