I did use the inverted index, but I ran into trouble because I used a batch
scan, which returns unsorted data. Also, I need to do some computation
afterwards. Here is my problem definition:

The data is of the form:
studentID course|courseID [ ]  count
.
.
.
.
.
studentID np2| [ ]  count

So a student is registered in multiple courses. The query has the following
parameters:
Input: List of course IDs
Output: Computation on records that contain a course from the input list
Algo:
Step1: Select rows that contain a course matching courses in the list
Step2: Count the number of such courses for each student
Step3: Do some computation

Approach1(Naive):
1. Designed a RowFilter that checks every rowId in the DB to see whether
the course is in the course list
2. Designed an iterator to count the number of such courses within each
student
3. Designed an iterator to do the computation

Problem: Complexity = O(n), where n = number of records in the DB, which is
BAD.

Approach2(Better Lookup):
1. Created an inverted Index with:
courseID student|studentID [ ] count
.
.
.
.
2. Looked up students for courses in the list
3. Accessed records for the studentIDs and courseIDs obtained in step 2,
using Range objects in a batch scan
4. Designed an iterator to count courseIds within a student record
5. Designed an iterator to do the computation
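
One observation about Approach 2: if every index entry has the form
courseID -> studentID, the per-student count of matching courses might be
obtainable from the index lookups alone, before going back to the main
table. Below is a minimal in-memory sketch of that idea in plain Java. The
class name IndexTally, the method matchesPerStudent, and the use of a
SortedMap as a stand-in for the index table are all assumptions made for
illustration; none of this is Accumulo API.

```java
import java.util.*;

// Hypothetical in-memory stand-in for the inverted index table:
// key = courseID (the index row), value = studentIDs registered in it.
class IndexTally {

    // For each student, count how many of the requested courses they appear under.
    static Map<String, Integer> matchesPerStudent(
            SortedMap<String, SortedSet<String>> index,
            Collection<String> courseIds) {
        Map<String, Integer> tally = new TreeMap<>();
        // A HashSet drops duplicate course IDs so no course is counted twice.
        for (String course : new HashSet<>(courseIds)) {
            SortedSet<String> students =
                    index.getOrDefault(course, Collections.emptySortedSet());
            for (String student : students) {
                tally.merge(student, 1, Integer::sum);
            }
        }
        return tally;
    }

    public static void main(String[] args) {
        SortedMap<String, SortedSet<String>> index = new TreeMap<>();
        index.put("c1", new TreeSet<>(Arrays.asList("s1", "s2")));
        index.put("c2", new TreeSet<>(Arrays.asList("s1")));
        index.put("c3", new TreeSet<>(Arrays.asList("s3")));
        // Only c1 and c2 exist in the index; c9 matches nothing.
        System.out.println(matchesPerStudent(index, Arrays.asList("c1", "c2", "c9")));
    }
}
```

Whether this is enough depends on what step 5 needs: if the computation
requires values that live only in the main table, a second lookup by
studentID is still necessary.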

Problem: A batch scan does not return records in sorted order, hence step 4
does not give me the required results :\
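
A counting iterator that relies on all entries for one student arriving
consecutively will break on unordered results. One possible workaround,
sketched here under assumptions, is to do the per-student count on the
client with a hash map, which does not depend on arrival order. The class
UnorderedTally, the method coursesPerStudent, and the {studentID, courseID}
string-array layout are hypothetical stand-ins for whatever the batch scan
actually returns.

```java
import java.util.*;

class UnorderedTally {

    // Each entry is {studentID, courseID}; entries may arrive in any order.
    // Returns, per student, the number of distinct wanted courses seen.
    static Map<String, Integer> coursesPerStudent(
            Iterable<String[]> entries, Set<String> wanted) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (String[] e : entries) {
            String student = e[0];
            String course = e[1];
            if (wanted.contains(course)) {
                // Group by student regardless of where the entry appears in the stream.
                seen.computeIfAbsent(student, k -> new HashSet<>()).add(course);
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        seen.forEach((student, courses) -> counts.put(student, courses.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"s2", "c1"});
        entries.add(new String[] {"s1", "c2"});
        entries.add(new String[] {"s1", "c1"});
        entries.add(new String[] {"s3", "c3"});
        Set<String> wanted = new HashSet<>(Arrays.asList("c1", "c2"));
        System.out.println(coursesPerStudent(entries, wanted));
    }
}
```

The trade-off is that the client holds one map entry per matching student,
so this only suits queries whose result set fits in client memory.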

I am not sure how to proceed now.



Best regards,
Yamini Joshi

On Thu, Oct 20, 2016 at 6:04 PM, Dylan Hutchison <dhutc...@cs.washington.edu
> wrote:

> Hi Yamini,
>
> If you have a finite, known list of column families, you can use locality
> groups
> <https://accumulo.apache.org/1.8/accumulo_user_manual#_locality_groups> to
> store them in separate files in Hadoop.   Scans that only reference the
> column families within a locality group need not open data in other
> locality groups' files.
>
> Apart from locality groups, setting "fetch column families and/or
> qualifiers" on the scanner sets up a standard Filter iterator on the scan.
> If you need to obtain these columns from every row, then the whole table is
> scanned and filtered server-side.  (Seeking will occur during the scan if
> the selected columns are far apart in the table.)  I guess that is too
> inefficient for your use case.  For reference, these iterators are here
> for families
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnFamilySkippingIterator.java>
> and here for qualifiers
> <https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/iterators/system/ColumnQualifierFilter.java>
> .
>
> If locality groups are not an option and you must filter on families and
> columns, then you may want to consider maintaining an index table, in which
> the columns are stored as rows, or otherwise moving the columns into the
> rows.
>
> Regards, Dylan
>
> On Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi <yamini.1...@gmail.com>
> wrote:
>
>> Hello all
>>
>> Is it possible to configure an iterator that works as a filter? As per
>> Accumulo docs:
>> As such, the `Filter` class functions well for filtering small amounts of
>> data, but is
>> inefficient for filtering large amounts of data. The decision to use a
>> `Filter` strongly
>> depends on the use case and distribution of data being filtered.
>>
>> I have a huge corpus to be filtered with a small amount of data selected.
>> I want to select column families from a list of col families. I have a
>> rough idea of using 'seek' to bypass cfs that don't exist in the list. I
>> was hoping I could exploit the 'seek'ing in iterator and go to the range in
>> the list of cf and check if it exists. I am not sure if this will work or
>> if it is a good approach. Any feedback is much appreciated.
>>
>> Best regards,
>> Yamini Joshi
>>
>
>
