[
https://issues.apache.org/jira/browse/BLUR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474744#comment-13474744
]
Aaron McCurry commented on BLUR-18:
-----------------------------------
So I would approach the InputFormat and OutputFormats as separate issues;
perhaps we should create 2 sub-tasks, one for each.
How the InputFormat works is really up for debate. The easiest implementation
could be a simple Lucene document to BlurRecord/Row converter that opens each
shard of the table in a separate mapper and just reads through the index in a
brute force scan. This approach has a few problems. The first is that if
the index is being updated by the shard servers then the segment files need to
be protected/held so that they are not deleted out from underneath the mapper.
The second is that it is a brute force approach that doesn't really allow for
blur/lucene queries to be executed against the index without opening the index
in the mapper for querying. The problem with opening in the mapper is that
there isn't typically enough extra memory in the mapper to have an effective
block cache for any kind of performance.
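To make the brute-force converter idea concrete, here is a minimal sketch of the document-to-record conversion the mapper would perform. LuceneDoc, BlurRecord, and the reserved "rowid"/"recordid" field names below are simplified stand-ins I've assumed for illustration; they are not Blur's or Lucene's actual classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a Lucene Document: a flat map of stored field name -> value.
class LuceneDoc {
    final Map<String, String> fields = new LinkedHashMap<>();
    LuceneDoc add(String name, String value) { fields.put(name, value); return this; }
}

// Stand-in for a BlurRecord: row id, record id, and the remaining columns.
class BlurRecord {
    String rowId;
    String recordId;
    final List<String[]> columns = new ArrayList<>();
}

public class DocToRecordConverter {
    // Assumed reserved field names; Blur's real row/record id fields may differ.
    static final String ROW_ID = "rowid";
    static final String RECORD_ID = "recordid";

    // Pull the ids out of the stored fields and treat everything else as a column.
    public static BlurRecord convert(LuceneDoc doc) {
        BlurRecord record = new BlurRecord();
        for (Map.Entry<String, String> e : doc.fields.entrySet()) {
            if (ROW_ID.equals(e.getKey())) {
                record.rowId = e.getValue();
            } else if (RECORD_ID.equals(e.getKey())) {
                record.recordId = e.getValue();
            } else {
                record.columns.add(new String[] { e.getKey(), e.getValue() });
            }
        }
        return record;
    }
}
```

The RecordReader would walk the segment's doc ids in order and run each stored document through a converter like this, which is why the segment files must be held for the life of the mapper.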
An alternate solution is to build the InputFormat against the shard server
thrift api, but I don't think that it can handle iterating over large amounts
of blur records. My suggestion is to put off the InputFormat until the
new-api-prototype is in place, or we can start integrating into that branch
now. My reasoning for this is that the thrift api and the new server is
designed to iterate over the entire result set. I'm getting pretty good
performance with it right now, but it's not setup to be distributed yet. We
can work on that together if you would like.
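For reference, a thrift-backed InputFormat would essentially have to page through results like the sketch below. ShardClient and its start/fetch method are hypothetical stand-ins, not Blur's real Thrift interface; the point is just that every record flows through repeated fetch calls, which is the part the current api isn't built to sustain over an entire shard.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical paging client: returns up to `fetch` records starting at `start`.
interface ShardClient {
    List<String> fetch(int start, int fetch);
}

public class PagingReader {
    // Drain a shard by advancing the start offset until an empty page comes back.
    public static List<String> readAll(ShardClient client, int pageSize) {
        List<String> all = new ArrayList<>();
        int start = 0;
        while (true) {
            List<String> page = client.fetch(start, pageSize);
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
            start += page.size();
        }
        return all;
    }
}
```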
For the OutputFormat, porting the functionality in the BlurReducer to run in
the OutputFormat should be fairly straightforward.
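The port mostly means moving the index-and-commit loop into a RecordWriter. Here is a minimal sketch of that shape; IndexWriterLike stands in for Lucene's IndexWriter, and the class and method names are illustrative assumptions, not Blur's actual API.

```java
// Stand-in for the indexing side (e.g. Lucene's IndexWriter in the real code).
interface IndexWriterLike {
    void addDocument(String rowId, String payload);
    void commit();
}

public class BlurRecordWriter {
    private final IndexWriterLike writer;
    private final int commitInterval;
    private int pending = 0;

    public BlurRecordWriter(IndexWriterLike writer, int commitInterval) {
        this.writer = writer;
        this.commitInterval = commitInterval;
    }

    // Equivalent of RecordWriter.write(key, value): index each record as it
    // arrives, committing periodically the way BlurReducer does in its reduce loop.
    public void write(String rowId, String payload) {
        writer.addDocument(rowId, payload);
        if (++pending >= commitInterval) {
            writer.commit();
            pending = 0;
        }
    }

    // Equivalent of RecordWriter.close(): flush anything still uncommitted.
    public void close() {
        if (pending > 0) {
            writer.commit();
        }
    }
}
```

With this shape, the reducer (or even a map-only job) just emits records and the OutputFormat owns the index lifecycle, which is what makes Hive/Pig integration straightforward.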
Now that I think about it more, it might be prudent to go ahead and start working
on both the Input and Output Formats in the new-api-branch instead of trying to
get them working with the 0.1.x api. The api and data structures are so much
simpler in the new-api-branch.
What do you think?
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>
> Key: BLUR-18
> URL: https://issues.apache.org/jira/browse/BLUR-18
> Project: Apache Blur
> Issue Type: Improvement
> Reporter: Aaron McCurry
>
> Currently the only way to implement indexing is to use the BlurReducer. A
> better way to implement this would be to support Hadoop input/output formats
> in both the new and old APIs. This would allow an easier integration with
> other Hadoop projects such as Hive and Pig.