[
https://issues.apache.org/jira/browse/LUCENE-6536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579210#comment-14579210
]
Greg Bowyer commented on LUCENE-6536:
-------------------------------------
bq. Questions:
bq. What will be done to deal with the bugginess of this thing? I see many
reports of user corruption issues. By committing it, we take responsibility for
this and it becomes "our problem". I don't want to see the code committed to
lucene just for this reason.
Fix its bugs ;). Joking aside, is it the directory or the block cache that is the
source of most of the corruptions?
bq. What will be done about the performance? I am not really sure the entire
technique is viable.
My use case is a bit odd: I have many small (2*HDFS block) indexes that get run
over by map jobs in Hadoop. The performance I got last time I did this (with a
dirty-hack Directory that copied the files in and out of HDFS :S) was pretty
good.
It's a throughput-oriented usage; I think if you tried to use this to back an
online searcher you would see poor performance.
bq. Personally, I think if someone wants to do this, a better integration point
is to make it a java 7 filesystem provider. That is really how such a
filesystem should work anyway.
That is awesome; I didn't know such an SPI existed in Java. I have found a few
people who are trying to make a provider for Hadoop.
I also don't have the greatest love for this path; the more test manipulations I
did, the less it felt like a simple feature that belongs in Lucene. I might try
either to strip the block cache out of this patch, or to use an HDFS filesystem
SPI in Java 7.
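For reference, the Java 7 (NIO.2) filesystem SPI mentioned above works by registering a `java.nio.file.spi.FileSystemProvider` via `ServiceLoader`; any URI whose scheme matches an installed provider then resolves to a `Path` through that provider. A minimal sketch of the discovery mechanism, using only the JDK's built-in `file` provider (an actual `hdfs://` provider would be a hypothetical third-party registration, not something the JDK ships):

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;

public class ProviderDemo {
    public static void main(String[] args) {
        // NIO.2 discovers providers via ServiceLoader; "file" and "jar"
        // ship with the JDK, so they are always installed.
        for (FileSystemProvider p : FileSystemProvider.installedProviders()) {
            System.out.println("installed scheme: " + p.getScheme());
        }

        // Resolving a URI routes through whichever provider claims its
        // scheme. A hypothetical HDFS provider would let URIs such as
        // hdfs://namenode/index resolve the same way; here we use the
        // built-in "file" scheme.
        Path path = Paths.get(URI.create("file:///tmp"));
        System.out.println("resolved: " + path);
    }
}
```

Since Lucene's FSDirectory API is built on `java.nio.file.Path`, a working HDFS provider could in principle back a Directory without a custom Directory implementation at all, which is the appeal of this integration point.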
> Migrate HDFSDirectory from solr to lucene-hadoop
> ------------------------------------------------
>
> Key: LUCENE-6536
> URL: https://issues.apache.org/jira/browse/LUCENE-6536
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Greg Bowyer
> Labels: hadoop, hdfs, lucene, solr
> Attachments: LUCENE-6536.patch
>
>
> I am currently working on a search engine that is throughput-oriented and
> works entirely in Apache Spark.
> As part of this, I need a Directory implementation that can operate on HDFS
> directly. This got me thinking: can I take the one that was worked on so hard
> for Solr's Hadoop integration?
> As such, I migrated the HDFS and block-cache directories out to a lucene-hadoop
> module.
> Having done this work, I am not sure it is actually a good change; it
> feels a bit messy, and I don't like how the Metrics class gets extended and
> abused.
> Thoughts, anyone?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]