[
https://issues.apache.org/jira/browse/LUCENE-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788731#action_12788731
]
Grant Ingersoll commented on LUCENE-1377:
-----------------------------------------
(kind of rambling, but...)
We probably should also include Nutch. I think they have their own analyzers
too.
While I think it is reasonable to consolidate Analyzers, it is a slippery slope
(next up function queries: Solr users love function queries and they get a lot
of love from both users and devs, while Lucene's function queries languish and
were never brought over to Solr b/c no one did the work and Solr has moved on
as well; then faceting, then the schema, ZooKeeper integration, etc.). One could
easily make the case that all of Solr should be a part of Lucene or that all
the guts of Solr should be pulled into Lucene other than the "scaffolding" part
at which point Solr simply becomes the Lucene search server. At the same time,
you could make the case that Solr could be its own TLP: Solr often has
different goals from Lucene, not to mention different release cycles, etc. (for
instance, we don't ask Compass, DBSight, et al. to donate back their
analyzers, right?). Besides, there are many things that are just
easier to do in Solr b/c it provides the framework around Lucene that _ALL OF
US HAVE BUILT_ in one form or another over the years. That being said, Solr
and Lucene have a special relationship in that they both fly under the Lucene
flag and many Solr committers are also Lucene committers, so we should try to
coordinate a bit more, while maintaining independence. As I've said before,
though, it often seems like a one way street for those Lucene committers who
are not Solr users. Namely, they grab stuff from Solr and put it into Lucene,
but then they don't make Solr "whole" again (function queries are "exhibit A").
At the same time, when I see talk of Lucene adding schema like features or
other stuff that is already in Solr and that I don't think belongs in Lucene, I
think "why not just use Solr", yet the Lucene community at times seems intent
on duplication, too, so the knife cuts both ways, as they say.
At any rate, the reason I'm pro Analyzer consolidation (I could even see it
being a standalone sub-project) is that Analyzers are useful in their own right
outside of search altogether. For instance, Mahout currently has a dep. on
both core and contrib/analyzers for the sole purpose of using the Analyzers
(well, we also have a utility dependency on Lucene core to create feature
vectors from a Lucene index, but the core has a dep on Analyzers). So,
perhaps, the PMC and broader community should take this up on general@ and we
should create an Analyzers project and all Lucene ecosystem committers have
commit rights on it _WITH THE VERY RESTRICTIVE_ approach that changes need to
be tested across projects. Of course, this will, as Yonik points out, become a
bottleneck for each project and it is far from perfect too. However, we could
still allow projects to create their own and then they get promoted when deemed
useful by others.
On the other hand, I kind of am not thrilled about a whole other subproject
with its own lists, JIRA, etc. just for Analyzers. Maybe we could just have
separate SVN (so that we can change commit access, builds, etc.) and java-dev
is still the place for discussion. Then each project could have its
dependency on that code as needed and the Analyzers code could be released
separately, etc. So, pretty much a full project, but not the overhead.
> Add HTMLStripReader and WordDelimiterFilter from SOLR
> -----------------------------------------------------
>
> Key: LUCENE-1377
> URL: https://issues.apache.org/jira/browse/LUCENE-1377
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 2.3.2
> Reporter: Jason Rutherglen
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> SOLR has two classes HTMLStripReader and WordDelimiterFilter which are very
> useful for a wide variety of use cases. It would be good to place them into
> core Lucene.