On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:
This is related to something I must have only day dreamed (dreamt?) about, but not actually mentioned on solr-dev. My feeling is we are moving Solr in a direction of a more general web service that can host various NLP and ML components, and no longer only do IR/Lucene. We see that with a few patches that Grant is cooking, I think we'll see that in the Solr+Mahout marriage down the road, and so on.
I somewhat agree, but I hesitate to go so far as saying a "general web service". I see Solr as a pretty nice platform for doing things like NLP/ML (see the AnalysisRequestHandler, TermVectorComponent, ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.), but I mostly view them as enhancing search/navigation. That is, things like clustering/faceting (they are closely related), named entity recognition, search, etc. all act as organizing components for structured and unstructured data. My vision for Solr (and actually, for the Lucene TLP, too, if I put on my PMC hat) is one that aims to bring coherence to (structured and unstructured) content. This starts with search as a foundation, since the indexing process creates much of the information that empowers the others. I think once the flexible indexing stuff is added to Lucene Java, we'll see even more opportunity for making Solr more powerful in these regards.
Is it time to start thinking about Solr as a server for IR and ML and NLP tasks and see how the tightly coupled Lucene can be made more... pluggable?
Yeah, this is what the Solr 2.0 thread that Yonik started a few weeks ago aims to discuss, along with scalability/fault tolerance. More important, for me anyway, is the decoupling of the configuration. For instance, I see no reason why IndexSchema needs to know anything about an InputStream. As for Lucene, it's really quite good at serving as the backend store/enabler for all these tasks.
At any rate, the question still remains as to how best to handle the QueryComponent :-)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, October 20, 2008 7:56:32 PM
Subject: Must QueryComponent always be on and other Design Questions

I've run into this a couple of times now and I feel like it warrants a discussion.

For both the SpellCheckComponent (SCC) and now for the new ClusteringComponent (SOLR-769) I think there are cases where the QueryComponent (QC) is not required. In the SpellCheckComponent case it is when building the spelling index. In the ClusteringComponent, it is possible to ask for document clusters without running any query (it also will be possible to get clusters _with_ a query as well, and it also is distinguished from the handling of search results clustering, too). Thus, it seems really weird to have to pass in a dummy query, yet that is what one has to do in order to avoid getting an NPE in the QC.

Now, I suppose these pieces could be modeled as something else or it's possible to split the two functionalities into separate things (1 ReqHandler, 1 SearchComp). In fact, the said functionality is not really "search" functionality, or SearchComponent functionality, yet much of the rest of the functionality in the code in question is "search" functionality and logically belongs as a SearchComponent. In the case of the SCC build, it's akin to an indexing operation. In the clustering case, it's a query, albeit a non-traditional one. In some sense, this kind of document clustering is like non-query based faceting which leads to more navigation/browsing instead of searching.

The quick fix is to just put in null checks into the QC or pass in a dummy query with rows=0, but I'm not sure if there isn't a slightly bigger picture here that needs adjusting in terms of SearchComponents. Namely, must the QC always be on? And, should we think a little more about components that don't require a query in order to function and how they play in the scheme of things?

Thoughts? Recommendations?

-Grant
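For illustration only, here is one way the "must the QC always be on?" question could be modeled. This is a minimal Java sketch, not Solr code: the Component interface, the requiresQuery() flag, and the QueryStep and ClusterStep classes are all hypothetical names made up for this example. The idea is that each component declares whether it needs a query, and the handler skips query-dependent components when no "q" parameter is present, rather than relying on a dummy query or scattered null checks.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: every type here is hypothetical and is not Solr's API. */
public class ComponentChainSketch {

    /** Hypothetical stand-in for a search component. */
    interface Component {
        /** True if the component cannot do useful work without a "q" parameter. */
        boolean requiresQuery();

        void process(Map<String, String> params, Map<String, Object> response);
    }

    /** Plays the role of the QC: it dereferences "q", so a missing query would throw an NPE. */
    static final class QueryStep implements Component {
        public boolean requiresQuery() { return true; }

        public void process(Map<String, String> params, Map<String, Object> response) {
            response.put("query", "results for: " + params.get("q").trim());
        }
    }

    /** Plays the role of a query-less component, such as document clustering. */
    static final class ClusterStep implements Component {
        public boolean requiresQuery() { return false; }

        public void process(Map<String, String> params, Map<String, Object> response) {
            if ("true".equals(params.get("clustering"))) {
                response.put("clusters", "document clusters");
            }
        }
    }

    static List<Component> defaultChain() {
        return Arrays.<Component>asList(new QueryStep(), new ClusterStep());
    }

    /** Runs the chain, skipping components that need a query when none was supplied. */
    static Map<String, Object> handle(List<Component> chain, Map<String, String> params) {
        Map<String, Object> response = new LinkedHashMap<>();
        String q = params.get("q");
        boolean hasQuery = q != null && !q.trim().isEmpty();
        for (Component component : chain) {
            if (component.requiresQuery() && !hasQuery) {
                continue; // skip instead of forcing the caller to send a dummy query
            }
            component.process(params, response);
        }
        return response;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("clustering", "true"); // note: no "q" parameter at all
        System.out.println(handle(defaultChain(), params).keySet()); // prints [clusters]
    }
}
```

The design choice in this sketch is to put the decision in the handler loop rather than inside each component, so a component that genuinely needs a query never has to guard against a missing one. Whether that fits Solr's actual component lifecycle is an open question for the thread.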
