[ 
https://issues.apache.org/jira/browse/BLUR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474744#comment-13474744
 ] 

Aaron McCurry commented on BLUR-18:
-----------------------------------

So I would approach the InputFormat and the OutputFormat as separate issues; 
perhaps we should create two sub-tasks, one for each.

How the InputFormat works is really up for debate.  The easiest implementation 
could be a simple Lucene document to BlurRecord/Row converter that opens each 
shard of the table in a separate mapper and just reads through the index in a 
brute-force scan.  This approach has a few problems.  The first is that if the 
index is being updated by the shard servers, then the segment files need to be 
protected/held so that they are not deleted out from underneath the mapper.  
The second is that a brute-force scan doesn't really allow Blur/Lucene queries 
to be executed against the index without opening the index in the mapper for 
querying.  The problem with opening in the mapper is that there typically 
isn't enough extra memory in the mapper to have an effective block cache for 
any kind of performance.
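
To make the brute-force idea concrete, here is a minimal sketch of the 
RecordReader loop it implies: one reader per shard, walking every document in 
order and converting each one to a record.  The BlurRecord and shard types 
below are plain-Java stand-ins for illustration, not Blur's or Lucene's real 
API (a real version would wrap an open IndexReader and implement Hadoop's 
RecordReader interface).

```java
import java.util.*;

// Stand-in for a BlurRecord; the real class lives in Blur's API.
class BlurRecord {
    final String rowId;
    final Map<String, String> columns;
    BlurRecord(String rowId, Map<String, String> columns) {
        this.rowId = rowId;
        this.columns = columns;
    }
}

// Mimics the RecordReader contract: one instance per shard, advancing
// one document at a time -- the "brute-force scan" described above.
class BruteForceShardReader {
    private final List<Map<String, String>> shardDocs; // stand-in for an open index
    private int docId = -1;
    private BlurRecord current;

    BruteForceShardReader(List<Map<String, String>> shardDocs) {
        this.shardDocs = shardDocs;
    }

    // Equivalent of RecordReader.nextKeyValue(): move to the next doc,
    // converting it to a BlurRecord on the way out.
    boolean nextKeyValue() {
        docId++;
        if (docId >= shardDocs.size()) {
            return false;
        }
        Map<String, String> doc = shardDocs.get(docId);
        current = new BlurRecord(doc.get("rowid"), doc);
        return true;
    }

    BlurRecord getCurrentValue() {
        return current;
    }

    // Equivalent of RecordReader.getProgress(), clamped to 1.0.
    float getProgress() {
        if (shardDocs.isEmpty()) {
            return 1.0f;
        }
        return Math.min(1.0f, (float) (docId + 1) / shardDocs.size());
    }
}
```

Note that nothing in this loop holds the underlying segment files open 
against deletion; that protection would have to come from the shard servers, 
which is exactly the first problem above.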

An alternate solution is to build the InputFormat against the shard server 
thrift API, but I don't think it can handle iterating over large amounts of 
Blur records.  My suggestion is to put off the InputFormat until the 
new-api-prototype is in place, or we can start integrating into that branch 
now.  My reasoning for this is that the thrift API and the new server are 
designed to iterate over the entire result set.  I'm getting pretty good 
performance with it right now, but it's not set up to be distributed yet.  We 
can work on that together if you would like.
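
To illustrate why iterating the entire result set through one server is the 
sticking point: the work only becomes MapReduce-friendly once the result set 
can be carved into independent slices, one per split, so each mapper makes 
its own bounded fetches.  The sketch below models that slicing with a plain 
list; the class and method names are hypothetical, not Blur's thrift API.

```java
import java.util.*;

// Toy model of paging a large result set in fixed-size batches, where
// each batch stands in for one bounded round-trip to a shard server.
// A distributed InputFormat would hand each batch range to a separate split.
class PagedFetcher {
    static <T> List<List<T>> fetchInBatches(List<T> serverResults, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < serverResults.size(); start += batchSize) {
            int end = Math.min(start + batchSize, serverResults.size());
            // Copy the slice so each "fetch" is independent of the others.
            batches.add(new ArrayList<>(serverResults.subList(start, end)));
        }
        return batches;
    }
}
```

A single-server API that only streams from the start of the result set can't 
be split this way, which is why distributing it is the open piece of work.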

For the OutputFormat, porting the functionality in the BlurReducer to run in 
the OutputFormat should be fairly straightforward.
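
The shape of that port would be moving the reducer's work into a 
RecordWriter's write()/close() pair: write() accumulates records per row, and 
close() is where the real version would flush and commit the index, as 
BlurReducer does today.  All the types below are simplified stand-ins, not 
the actual Blur classes.

```java
import java.util.*;

// Sketch of BlurReducer-style indexing recast as a RecordWriter: buffer
// records per row in write(), "commit" in close(). A real port would
// drive a Lucene IndexWriter here instead of an in-memory map.
class BlurRecordWriter {
    private final Map<String, List<Map<String, String>>> rows = new LinkedHashMap<>();
    private boolean committed = false;

    // Equivalent of RecordWriter.write(key, value).
    void write(String rowId, Map<String, String> record) {
        if (committed) {
            throw new IllegalStateException("writer already closed");
        }
        rows.computeIfAbsent(rowId, k -> new ArrayList<>()).add(record);
    }

    // Equivalent of RecordWriter.close(context): the real version would
    // call IndexWriter.commit() so the task's output becomes durable.
    void close() {
        committed = true;
    }

    int rowCount() {
        return rows.size();
    }

    boolean isCommitted() {
        return committed;
    }
}
```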

Now that I think about it more, it might be prudent to go ahead and start 
working on both the Input and Output Formats in the new-api-branch instead of 
trying to get them working with the 0.1.x API.  The API and data structures 
are so much simpler in the new-api-branch.

What do you think?
                
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>
>                 Key: BLUR-18
>                 URL: https://issues.apache.org/jira/browse/BLUR-18
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>
> Currently the only way to implement indexing is to use the BlurReducer.  A 
> better way to implement this would be to support Hadoop input/output 
> formats in both the new and old APIs.  This would allow an easier 
> integration with other Hadoop projects such as Hive and Pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
