Re: Must QueryComponent always be on and other Design Questions
For completeness, here's the NPE:

SEVERE: java.lang.NullPointerException
    at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)
    at org.apache.solr.search.OldLuceneQParser.parse(LuceneQParserPlugin.java:104)
    at org.apache.solr.search.QParser.getQuery(QParser.java:88)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:82)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:149)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.handler.clustering.ClusteringComponentTest.testComponent(ClusteringComponentTest.java:70)

Don't worry about the ClusteringComponentTest yet, I haven't posted that code yet.

On Oct 20, 2008, at 7:56 PM, Grant Ingersoll wrote:

> I've run into this a couple of times now and I feel like it warrants a discussion.
>
> For both the SpellCheckComponent (SCC) and now for the new ClusteringComponent (SOLR-769) I think there are cases where the QueryComponent (QC) is not required. In the SpellCheckComponent case it is when building the spelling index. In the ClusteringComponent, it is possible to ask for document clusters without running any query (it also will be possible to get clusters _with_ a query as well, and it also is distinguished from the handling of search results clustering, too). Thus, it seems really weird to have to pass in a dummy query, yet that is what one has to do in order to avoid getting an NPE in the QC.
>
> Now, I suppose these pieces could be modeled as something else or it's possible to split the two functionalities into separate things (1 ReqHandler, 1 SearchComp). In fact, the said functionality is not really "search" functionality, or SearchComponent functionality, yet much of the rest of the functionality in the code in question is "search" functionality and logically belongs as a SearchComponent. In the case of the SCC build, it's akin to an indexing operation. In the clustering case, it's a query, albeit a non-traditional one. In some sense, this kind of document clustering is like non-query based faceting which leads to more navigation/browsing instead of searching.
>
> The quick fix is to just put in null checks into the QC or pass in a dummy query with rows=0, but I'm not sure if there isn't a slightly bigger picture here that needs adjusting in terms of SearchComponents. Namely, must the QC always be on? And, should we think a little more about components that don't require a query in order to function and how they play in the scheme of things?
>
> Thoughts? Recommendations?
>
> -Grant
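The two quick fixes mentioned above (a null check before parsing, or a dummy query with rows=0) might look roughly like the sketch below. This is not the QueryComponent code: `QuickFixSketch`, `queryToParse`, and `withDummyQuery` are hypothetical names, the parameter map is a plain `Map` standing in for Solr's request parameters, and using `*:*` as the dummy query is an assumption (any parseable query would do).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the quick fixes discussed in the thread; not the real Solr API.
final class QuickFixSketch {

    /** Null check: returns the query string to parse, or empty if the request has no usable "q". */
    static Optional<String> queryToParse(Map<String, String> params) {
        String q = params.get("q");
        if (q == null || q.trim().isEmpty()) {
            return Optional.empty(); // nothing to parse, so the caller can skip the query step
        }
        return Optional.of(q);
    }

    /** Dummy-query workaround: make sure a parseable query is present and request no rows. */
    static Map<String, String> withDummyQuery(Map<String, String> params) {
        Map<String, String> copy = new HashMap<>(params);
        copy.putIfAbsent("q", "*:*"); // assumption: a match-all query as the placeholder
        copy.put("rows", "0");        // no documents are returned to the client
        return copy;
    }
}
```

The first method avoids the NPE without the client doing anything; the second keeps the server unchanged but pushes the workaround onto every caller, which is the awkwardness the message complains about.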
Re: Must QueryComponent always be on and other Design Questions
This is related to something I must have only day dreamed (dreamt?) about, but not actually mentioned on solr-dev.

My feeling is we are moving Solr in a direction of a more general web service that can host various NLP and ML components, and no longer only do IR/Lucene. We see that with a few patches that Grant is cooking, I think we'll see that in the Solr+Mahout marriage down the road, and so on.

Is it time to start thinking about Solr as a server for IR and ML and NLP tasks and see how the tightly coupled Lucene can be made more pluggable?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Must QueryComponent always be on and other Design Questions
On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:

> This is related to something I must have only day dreamed (dreamt?) about, but not actually mentioned on solr-dev. My feeling is we are moving Solr in a direction of a more general web service that can host various NLP and ML components, and no longer only do IR/Lucene. We see that with a few patches that Grant is cooking, I think we'll see that in the Solr+Mahout marriage down the road, and so on.

I somewhat agree, but I hesitate to go so far as saying a "general web service". I see Solr as a pretty nice platform for doing things like NLP/ML (see the AnalysisRequestHandler, TermVectorComponent, ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.), but I mostly view them as enhancing search/navigation. That is, things like clustering/faceting (they are closely related), named entity recognition, search, etc. all act as organizing components for structured and unstructured data. My vision for Solr (and actually, for the Lucene TLP too, if I put on my PMC hat) is one that aims to bring coherence to (structured and unstructured) content. This starts with search as a foundation, since the indexing process creates much of the information that empowers the others. I think once you see the flexible indexing stuff added to Lucene Java, we'll see even more opportunity for making Solr even more powerful in these regards.

> Is it time to start thinking about Solr as a server for IR and ML and NLP tasks and see how the tightly coupled Lucene can be made more pluggable?

Yeah, this is what the Solr 2.0 thread that Yonik started a few weeks ago aims to discuss, along with scalability/fault tolerance. More important, for me anyway, is the decoupling of the configuration. For instance, I see no reason why IndexSchema needs to know anything about an InputStream. As for Lucene, it's really quite good at serving as the backend store/enabler for all these tasks.

At any rate, the question still remains as to how best to handle the QueryComponent :-)
Re: Must QueryComponent always be on and other Design Questions
unrelated to your question, but we should give a better error... https://issues.apache.org/jira/browse/SOLR-435

On Oct 20, 2008, at 8:01 PM, Grant Ingersoll wrote:

> For completeness, here's the NPE: SEVERE: java.lang.NullPointerException at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37) [...]
Re: Must QueryComponent always be on and other Design Questions
On Oct 21, 2008, at 8:17 AM, Grant Ingersoll wrote:

> On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:
>
>> This is related to something I must have only day dreamed (dreamt?) about, but not actually mentioned on solr-dev. My feeling is we are moving Solr in a direction of a more general web service that can host various NLP and ML components, and no longer only do IR/Lucene. We see that with a few patches that Grant is cooking, I think we'll see that in the Solr+Mahout marriage down the road, and so on.
>
> I somewhat agree, but I hesitate to go so far as saying a "general web service".

I won't suggest that solr is (or should be) a general web service, but wt=json/xml/python + RequestHandler makes a pretty nice cross platform interface all on its own.

> I see Solr as a pretty nice platform for doing things like NLP/ML (see the AnalysisRequestHandler, TermVectorComponent, ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.), but I mostly view them as enhancing search/navigation. That is, things like clustering/faceting (they are closely related), named entity recognition, search, etc. all act as organizing components for structured and unstructured data. Expressing my vision for Solr (and actually, the Lucene TLP, too, if I put on my PMC hat) it's one that aims to bring coherence to (structured and unstructured) content. This starts with search as a foundation, since the indexing process creates much of the information that empowers the others. I think once you see the flexible indexing stuff added to Lucene Java, we'll see even more opportunity for making Solr even more powerful in these regards.

agree.

>> Is it time to start thinking about Solr as a server for IR and ML and NLP tasks and see how the tightly coupled Lucene can be made more pluggable?
>
> Yeah, this is what the Solr 2.0 thread that Yonik started a few weeks ago aims to discuss, along with scalability/fault tolerance. More important, for me anyway, is the decoupling of the configuration. For instance, I see no reason why IndexSchema needs to know anything about an InputStream.

also agree. The biggest challenge for 2.0 is decoupling configuration

> As for Lucene, it's really quite good at serving as the backend store/enabler for all these tasks.

I have not messed with it yet, but perhaps also HBase...

> At any rate, the question still remains as to how best to handle the QueryComponent :-)

aaah, your question!

I see two options:
1. If no other component needs docList or docSet and the query is empty, then skip the QueryComponent
2. add a 'runQuery' param (or something like that) and default to true. It can be turned off when not necessary.

I like option 1 better.

ryan
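The two options can be compared side by side with a small sketch. Everything here is hypothetical and independent of the Solr codebase: `RunQueryDecision`, the `Component.needsDocs()` method, and the plain `Map` of parameters are illustration-only names, and `runQuery` is simply the parameter name floated in the message.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two proposals; none of these names come from Solr.
final class RunQueryDecision {

    /** A component declares whether it consumes the query's docList/docSet. */
    interface Component {
        boolean needsDocs();
    }

    /** Option 1: skip the query step only when the query is empty AND nobody needs its results. */
    static boolean shouldRunOption1(String q, List<Component> otherComponents) {
        boolean emptyQuery = (q == null || q.trim().isEmpty());
        boolean someoneNeedsDocs = otherComponents.stream().anyMatch(Component::needsDocs);
        return !emptyQuery || someoneNeedsDocs;
    }

    /** Option 2: an explicit 'runQuery' parameter that defaults to true when absent. */
    static boolean shouldRunOption2(Map<String, String> params) {
        String value = params.get("runQuery");
        return value == null || Boolean.parseBoolean(value);
    }
}
```

Option 1 needs no new request parameter but requires every component to declare what it consumes; option 2 is trivial to implement but relies on the client knowing when the query is unnecessary.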
Re: Must QueryComponent always be on and other Design Questions
+1
I can foresee a lot of components which do not need the QueryComponent. SOLR-706 being one.

On Tue, Oct 21, 2008 at 8:39 PM, Ryan McKinley wrote:
> I see two options:
> 1. If no other component needs docList or docSet and the query is empty, then skip the QueryComponent
> 2. add a 'runQuery' param (or something like that) and default to true. It can be turned off when not necessary.
>
> I like option 1 better.
>
> ryan

--
--Noble Paul
Re: Must QueryComponent always be on and other Design Questions
FWIW, my last patch on SOLR-769 adds a check to see if QC is enabled, with the default param set to true. Thus, you can send in &query=false and it skips it.

On Oct 21, 2008, at 11:21 AM, Noble Paul നോബിള് नोब्ळ् wrote:

> +1
> I can foresee a lot of components which do not need the QueryComponent. SOLR-706 being one.

--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Must QueryComponent always be on and other Design Questions
hi Grant,
There may be cases where the user is not interested in the documents, but other components are interested in the search results. 'tvrh' is an example. How do we take care of that?

On Tue, Oct 21, 2008 at 8:59 PM, Grant Ingersoll wrote:
> FWIW, my last patch on SOLR-769 adds a check to see if QC is enabled, with the default param set to true. Thus, you can send in &query=false and it skips it.

--
--Noble Paul
Re: Must QueryComponent always be on and other Design Questions
Don't turn off the query component in those cases. In these cases, the QC identifies what docs are to be used, just as in a user-based query. Just think of those other components as clients of the QC output, and I think it makes sense. The application will know whether it needs to deal with results or not. I suppose we could have something that says "run the query and make the results available to other components, but don't bother writing them out".

On Oct 21, 2008, at 11:33 AM, Noble Paul നോബിള് नोब्ळ् wrote:

> hi Grant, There may be cases where the user may not be interested in the documents but there may be other components which are interested in the search results. 'tvrh' is an example. How do we take care of that?

--
Grant Ingersoll
Re: Must QueryComponent always be on and other Design Questions
: For both the SpellCheckComponent (SCC) and now for the new ClusteringComponent
: (SOLR-769) I think there are cases where the QueryComponent (QC) is not
: required. In the SpellCheckComponent case it is when building the spelling
: index. In the ClusteringComponent, it is possible to ask for document
: clusters without running any query (it also will be possible to get clusters
: _with_ a query as well, and it also is distinguished from the handling of
: search results clustering, too). Thus, it seems really weird to have to pass
: in a dummy query, yet that is what one has to do in order to avoid getting an
: NPE in the QC.

In my opinion the "right" way to deal with these use cases is to have separate request handlers configured for the different use cases ... if you want to cluster "stuff" unrelated to a query, register a handler that uses the ClusteringComponent but doesn't use the QueryComponent ... likewise register a handler that knows about the SpellCheckComponent for triggering rebuilds of the spellcheck index independent of doing queries.

We could treat QueryComponent just like the FacetComponent and HighlightComponent and say that it short-circuits and does nothing unless "doQuery=true" (which it would be by default), but I'm not convinced that's the best way to approach things -- the typical case is that QueryComponent is the meat of the request handling, so it's probably okay that it be put on a bit of a pedestal such that if your handler uses QueryComponent then QueryComponent is "on" -- but it should absolutely be possible to have a handler that doesn't use QueryComponent.

On a related topic: if we have any other Components that assume QueryComponent has been executed, they should be changed. Components should have contracts about pre-conditions and post-conditions relating to data in the request, not how the data got there, ala...

https://issues.apache.org/jira/browse/SOLR-760

-Hoss
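A rough sketch of both ideas in this message, assuming a much-simplified model rather than Solr's actual SearchHandler/SearchComponent classes: a handler is just an ordered list of components, and each component states which data it requires in the request context instead of assuming a particular component ran before it. All names (`HandlerSketch`, `QueryStep`, `TermVectorStep`, `ClusterAllStep`, the `docList` and `clusters` keys) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of "handlers choose their components" plus data-based pre-conditions.
final class HandlerSketch {

    interface Component {
        /** Keys that must already be present in the context before process() runs. */
        List<String> requires();
        void process(Map<String, Object> context);
    }

    /** Stand-in for the QueryComponent: needs "q", publishes "docList". */
    static final class QueryStep implements Component {
        public List<String> requires() { return Collections.singletonList("q"); }
        public void process(Map<String, Object> context) {
            context.put("docList", Arrays.asList("doc1", "doc2")); // fake results
        }
    }

    /** Needs a docList, but does not care which component produced it. */
    static final class TermVectorStep implements Component {
        public List<String> requires() { return Collections.singletonList("docList"); }
        public void process(Map<String, Object> context) {
            List<?> docs = (List<?>) context.get("docList");
            context.put("termVectorsFor", docs.size());
        }
    }

    /** Clusters the whole collection, so it has no pre-conditions at all. */
    static final class ClusterAllStep implements Component {
        public List<String> requires() { return Collections.emptyList(); }
        public void process(Map<String, Object> context) {
            context.put("clusters", "clusters over all documents");
        }
    }

    /** Runs the handler's components in order, checking each pre-condition first. */
    static Map<String, Object> handle(List<Component> components, Map<String, Object> request) {
        Map<String, Object> context = new HashMap<>(request);
        for (Component c : components) {
            for (String key : c.requires()) {
                if (!context.containsKey(key)) {
                    throw new IllegalStateException(
                        c.getClass().getSimpleName() + " needs '" + key + "' in the request");
                }
            }
            c.process(context);
        }
        return context;
    }
}
```

In this model a clustering-only handler is simply `handle(Arrays.asList(new ClusterAllStep()), request)`: no dummy query is needed because no query step is registered, and a missing pre-condition produces a descriptive error instead of an NPE.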