[jira] Commented: (LUCENE-1377) Add HTMLStripReader and WordDelimiterFilter from SOLR

Grant Ingersoll (JIRA) Thu, 10 Dec 2009 06:49:41 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788731#action_12788731
 ]


Grant Ingersoll commented on LUCENE-1377:
-----------------------------------------

(kind of rambling, but...)
We probably should also include Nutch.  I think they have their own analyzers 
too.

While I think it is reasonable to consolidate Analyzers, it is a slippery slope 
(next up Function queries - Solr users love function queries and they get a lot 
of love both from users and devs while Lucene's function queries languish and 
were never brought over to Solr b/c no one did the work and Solr's moved on as 
well - then faceting, then the schema, ZooKeeper integration, etc.).  One could 
easily make the case that all of Solr should be a part of Lucene or that all 
the guts of Solr should be pulled into Lucene other than the "scaffolding" part 
at which point Solr simply becomes the Lucene search server.  At the same time, 
you could make the case that Solr could be its own TLP: Solr often has 
different goals from Lucene, not to mention different release cycles, etc. (for 
instance, we don't ask Compass, DBSight, et al. to donate back their 
analyzers, right?)  Besides, there are many things that are just 
easier to do in Solr b/c it provides the framework around Lucene that _ALL OF 
US HAVE BUILT_ in one form or another over the years.  That being said, Solr 
and Lucene have a special relationship in that they both fly under the Lucene 
flag and many Solr committers are also Lucene committers, so we should try to 
coordinate a bit more, while maintaining independence.  As I've said before, 
though, it often seems like a one-way street for those Lucene committers who 
are not Solr users.  Namely, they grab stuff from Solr and put it into Lucene, 
but then they don't make Solr "whole" again (function queries are "exhibit A").

At the same time, when I see talk of Lucene adding schema-like features or 
other stuff that is already in Solr and that I don't think belongs in Lucene, I 
think "why not just use Solr", yet the Lucene community, at times, seems intent 
on duplication, too, so the knife cuts both ways, as they say.

At any rate, the reason I'm pro Analyzer consolidation (I could even see it 
being a standalone sub-project) is that Analyzers are useful in their own right 
outside of search altogether.  For instance, Mahout currently has a dep. on 
both core and contrib/analyzers for the sole purpose of using the Analyzers 
(well, we also have a utility dependency on Lucene core to create feature 
vectors from a Lucene index, but the core has a dep on Analyzers).  So, 
perhaps, the PMC and broader community should take this up on general@ and we 
should create an Analyzers project and all Lucene ecosystem committers have 
commit rights on it _WITH THE VERY RESTRICTIVE_ approach that changes need to 
be tested across projects.  Of course, this will, as Yonik points out, become a 
bottleneck for each project and it is far from perfect too.  However, we could 
still allow projects to create their own and then they get promoted when deemed 
useful by others.  

On the other hand, I kind of am not thrilled about a whole other subproject 
with its own lists, JIRA, etc. just for Analyzers.  Maybe we could just have 
separate SVN (so that we can change commit access, builds, etc.) and java-dev 
is still the place for discussion.  Then each project could have its 
dependency on that code as needed and the Analyzers code could be released 
separately, etc.  So, pretty much a full project, but not the overhead.

> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -----------------------------------------------------
>
>                 Key: LUCENE-1377
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1377
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.3.2
>            Reporter: Jason Rutherglen
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very 
> useful for a wide variety of use cases.  It would be good to place them into 
> core Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
