Re: Iterator as a Filter

dlmarion Fri, 21 Oct 2016 04:37:37 -0700

So if I understand this correctly, for this use case, you could do the 
following:


courseId studentId <list of courseIds> 

For either of your queries (1 and 2 below) you could use a BatchScanner with 
the set of Ranges being the course ids from input C. In your client you would 
add the resulting columnFamily (studentId) and columnQualifier (list of 
courses) to a map of studentId -> list of courses. For #1, you just need the 
size of the list of courses for each student. For #2, you can do the 
intersection for each student. 
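
The client-side step described above can be sketched roughly as follows. This is only an illustration using plain Java collections: the `Entry` class stands in for the key/value pairs a BatchScanner would return, and the comma-separated encoding of the course list in the column qualifier is an assumption, as are all of the class and method names. Because the results are collected into a map, the order in which entries arrive does not matter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class EnrollmentAggregator {

    /** Hypothetical stand-in for one scanned entry:
     *  row = courseId, column family = studentId,
     *  column qualifier = comma-separated course list (assumed encoding). */
    static final class Entry {
        final String courseId;
        final String studentId;
        final String courseList;

        Entry(String courseId, String studentId, String courseList) {
            this.courseId = courseId;
            this.studentId = studentId;
            this.courseList = courseList;
        }
    }

    /** Builds studentId -> set of courses (Y); entry order is irrelevant. */
    static Map<String, Set<String>> coursesByStudent(List<Entry> entries) {
        Map<String, Set<String>> result = new HashMap<>();
        for (Entry e : entries) {
            result.computeIfAbsent(e.studentId, k -> new HashSet<>())
                  .addAll(Arrays.asList(e.courseList.split(",")));
        }
        return result;
    }

    /** Query #1: total number of courses for one student, |Y|. */
    static int totalCourses(Set<String> studentCourses) {
        return studentCourses.size();
    }

    /** Query #2: |Y intersection C| for one student. */
    static int overlap(Set<String> studentCourses, Set<String> inputCourses) {
        Set<String> common = new HashSet<>(studentCourses);
        common.retainAll(inputCourses); // keep only courses that are also in C
        return common.size();
    }
}
```

In a real client, the entries would come from iterating the BatchScanner, and the map would be filled inside that loop rather than from a prebuilt list.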

Now, this does not work if you want to be able to update the student 
information in an online fashion. This should work though if you are able to 
simply reload the information when it is updated. 

----- Original Message -----

From: "Yamini Joshi" <[email protected]> 
To: [email protected] 
Sent: Thursday, October 20, 2016 9:53:34 PM 
Subject: Re: Iterator as a Filter 

I have an input C, which is the list of courses a student x is enrolled in. 
I am trying to do some computation which requires 2 things: 
For each student enrolled in at least one of the courses in C: 
1. Total number of classes the student is enrolled in (Y) 
2. Number of courses the student is enrolled in which belong to the list C, i.e. 
cardinality(Y intersection C) 


Best regards, 
Yamini Joshi 

On Thu, Oct 20, 2016 at 7:16 PM, Dave < [email protected] > wrote: 




I'm a little confused as to the use case here. Are you trying to find courses that 
students are taking where the students are in a particular class? The table 
design is going to depend on the set of questions that you want to answer. 

On Oct 20, 2016 7:19 PM, Yamini Joshi < [email protected] > wrote: 

<blockquote>

I did use the inverted index, but I ran into trouble because I used a batch 
scan and it returns unsorted data. Also, I need to do some computation afterwards. 
Here is my problem definition: 

The data is of the form: 
studentID course|courseID [ ] count 
. 
. 
. 
. 
. 
studentID np2| [ ] count 

So a student is registered in multiple courses. The query has the following 
parameters: 
Input: List of course Ids 
Output: Computation on records that contain a course from the input list 
Algo: 
Step1: Select rows that contain a course matching courses in the list 
Step2: Count the number of such courses for each student 
Step3: Do some computation 

Approach1(Naive): 
1. Designed a RowFilter that checks all the rowIds in the DB to check if the 
course is in the course List 
2. Designed an iterator to count the number of such courses within each student 
3. Designed an iterator to do the computation 

Problem: Complexity = O(n) where n= number of records in the DB which is BAD. 

Approach2(Better Lookup): 
1. Created an inverted Index with: 
courseID student|studentID [ ] count 
. 
. 
. 
. 
2. Looked up students for courses in the list 
3. Accessed records with studentIDs, courseID generated from step1 using Range 
Object in batch scan 
4. Designed an iterator to count courseIds within a student record 
5. Designed an iterator to do the computation 

Problem: Batch scan does not return records in a sorted manner hence step 4 
does not give me the required results :\ 

I am not sure how to proceed now. 



Best regards, 
Yamini Joshi 

On Thu, Oct 20, 2016 at 6:04 PM, Dylan Hutchison < [email protected] > 
wrote: 

<blockquote>

Hi Yamini, 

If you have a finite, known list of column families, you can use locality 
groups to store them in separate files in Hadoop. Scans that only reference the 
column families within a locality group need not open data in other locality 
groups' files. 

Apart from locality groups, setting "fetch column families and/or qualifiers" 
on the scanner sets up a standard Filter iterator on the scan. If you need to 
obtain these columns from every row, then the whole table is scanned and 
filtered server-side. (Seeking will occur during the scan if the selected 
columns are far apart in the table.) I guess that is too inefficient for your 
use case. For reference, these iterators are here for families and here for 
qualifiers. 

If locality groups are not an option and you must filter on families and 
columns, then you may want to consider maintaining an index table, in which the 
columns are stored as rows, or otherwise moving the columns into the rows. 

Regards, Dylan 

On Thu, Oct 20, 2016 at 3:45 PM, Yamini Joshi < [email protected] > wrote: 

<blockquote>

Hello all 

Is it possible to configure an iterator that works as a filter? As per Accumulo 
docs: 
As such, the `Filter` class functions well for filtering small amounts of data, 
but is inefficient for filtering large amounts of data. The decision to use a 
`Filter` strongly depends on the use case and distribution of data being 
filtered. 

I have a huge corpus to be filtered with a small amount of data selected. I 
want to select column families from a list of col families. I have a rough idea 
of using 'seek' to bypass cfs that don't exist in the list. I was hoping I 
could exploit the 'seek'ing in iterator and go to the range in the list of cf 
and check if it exists. I am not sure if this will work or if it is a good 
approach. Any feedback is much appreciated. 
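
To make the seeking idea concrete, here is a small in-memory analogy in plain Java. It is not the Accumulo iterator API: a `NavigableSet` stands in for the sorted column families stored in a row, `ceiling()` plays the role of a seek, and the class and method names are made up for illustration. It assumes both sets use the same sort order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;

class SeekingFamilyMatcher {

    /**
     * Returns the wanted column families that are present in the stored set.
     * Instead of visiting every stored family and testing it (filtering),
     * it jumps straight to each wanted family in sorted order (seeking).
     */
    static List<String> matchBySeeking(NavigableSet<String> stored,
                                       SortedSet<String> wanted) {
        List<String> found = new ArrayList<>();
        for (String cf : wanted) {
            // "Seek": first stored family that is >= the wanted family.
            String hit = stored.ceiling(cf);
            if (hit == null) {
                // Past the end of the stored data; since wanted is sorted,
                // no later wanted family can match either.
                break;
            }
            if (hit.equals(cf)) {
                found.add(cf);
            }
        }
        return found;
    }
}
```

The number of lookups here grows with the size of the wanted list rather than with the number of stored families, which is the property the seeking approach is hoping to exploit.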

Best regards, 
Yamini Joshi 





</blockquote>



</blockquote>



</blockquote>
