On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:

This is related to something I must have only day dreamed (dreamt?) about, but not actually mentioned on solr-dev. My feeling is we are moving Solr in a direction of a more general web service that can host various NLP and ML components, and no longer only do IR/Lucene. We see that with a few patches that Grant is cooking, I think we'll see that in the Solr+Mahout marriage down the road, and so on.

I somewhat agree, but I hesitate to go so far as saying a "general web service". I see Solr as a pretty nice platform for doing things like NLP/ML (see the AnalysisRequestHandler, TermVectorComponent, ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.), but I mostly view them as enhancing search/navigation. That is, things like clustering/faceting (they are closely related), named entity recognition, search, etc. all act as organizing components for structured and unstructured data. My vision for Solr (and actually, the Lucene TLP, too, if I put on my PMC hat) is one that aims to bring coherence to (structured and unstructured) content. This starts with search as a foundation, since the indexing process creates much of the information that empowers the others. I think once the flexible indexing stuff is added to Lucene Java, we'll see even more opportunity for making Solr more powerful in these regards.



Is it time to start thinking about Solr as a server for IR and ML and NLP tasks and see how the tightly coupled Lucene can be made more... pluggable?

Yeah, this is what the Solr 2.0 thread that Yonik started a few weeks ago aims to discuss, along with scalability/fault tolerance. More important, for me anyway, is the decoupling of the configuration. For instance, I see no reason why IndexSchema needs to know anything about an InputStream. As for Lucene, it's really quite good at serving as the backend store/enabler for all these tasks.


At any rate, the question still remains as to how best to handle the QueryComponent :-)




Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, October 20, 2008 7:56:32 PM
Subject: Must QueryComponent always be on and other Design Questions

I've run into this a couple of times now and I feel like it warrants a
discussion.

For both the SpellCheckComponent (SCC) and now for the new
ClusteringComponent (SOLR-769) I think there are cases where the
QueryComponent (QC) is not required.  In the SpellCheckComponent case
it is when building the spelling index.  In the ClusteringComponent,
it is possible to ask for document clusters without running any query
(it will also be possible to get clusters _with_ a query, and this is
distinct from the handling of search results clustering).  Thus, it
seems really weird to have to pass in a dummy query, yet that is what
one has to do in order to avoid getting an NPE in the QC.

Now, I suppose these pieces could be modeled as something else or it's
possible to split the two functionalities into separate things (1
ReqHandler, 1 SearchComp).  In fact, the said functionality is not
really "search" functionality, or SearchComponent functionality, yet
much of the rest of the functionality in the code in question is
"search" functionality and logically belongs as a SearchComponent.  In
the case of the SCC build, it's akin to an indexing operation.  In the
clustering case, it's a query, albeit a non-traditional one.  In some
sense, this kind of document clustering is like non-query based
faceting which leads to more navigation/browsing instead of searching.

The quick fix is to just put in null checks into the QC or pass in a
dummy query with rows=0, but I'm not sure if there isn't a slightly
bigger picture here that needs adjusting in terms of
SearchComponents.  Namely, must the QC always be on?  And, should we
think a little more about components that don't require a query in
order to function and how they play in the scheme of things?
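
To make the null-check option concrete, here is a minimal sketch of a component chain in which the query stage simply skips itself when no query is supplied, so a later stage can still run. It is only an illustration under stated assumptions: Component, QueryStage, ClusteringStage, handle() and the "clustering" parameter are hypothetical stand-ins, not Solr's actual SearchComponent API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a component chain; these are NOT Solr's real classes.
public class OptionalQuerySketch {

    /** Stand-in for a search component: reads request params, writes to the response. */
    interface Component {
        void process(Map<String, String> params, Map<String, Object> response);
    }

    /** Query stage that skips itself when no "q" parameter is present, instead of throwing an NPE. */
    static class QueryStage implements Component {
        @Override
        public void process(Map<String, String> params, Map<String, Object> response) {
            String q = params.get("q");
            if (q == null || q.trim().isEmpty()) {
                return; // no query supplied: nothing to search, leave the response untouched
            }
            response.put("results", "hits for: " + q);
        }
    }

    /** Clustering stage that works with or without a preceding query. */
    static class ClusteringStage implements Component {
        @Override
        public void process(Map<String, String> params, Map<String, Object> response) {
            if (!"true".equals(params.get("clustering"))) {
                return; // clustering was not requested
            }
            String scope = response.containsKey("results") ? "search results" : "whole index";
            response.put("clusters", "clusters over " + scope);
        }
    }

    /** Runs every stage in order and returns the accumulated response. */
    public static Map<String, Object> handle(Map<String, String> params) {
        List<Component> chain = new ArrayList<>();
        chain.add(new QueryStage());
        chain.add(new ClusteringStage());
        Map<String, Object> response = new LinkedHashMap<>();
        for (Component c : chain) {
            c.process(params, response);
        }
        return response;
    }

    public static void main(String[] args) {
        Map<String, String> noQuery = new LinkedHashMap<>();
        noQuery.put("clustering", "true");
        System.out.println(handle(noQuery));   // clustering only, no dummy query needed

        Map<String, String> withQuery = new LinkedHashMap<>(noQuery);
        withQuery.put("q", "lucene");
        System.out.println(handle(withQuery)); // query results plus clusters over them
    }
}
```

The sketch only covers the quick fix; whether the query stage should be in the chain at all for such requests is the bigger question raised above.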

Thoughts?  Recommendations?

-Grant

