Re: Must QueryComponent always be on and other Design Questions

2008-10-20 Thread Grant Ingersoll

For completeness, here's the NPE:
SEVERE: java.lang.NullPointerException
	at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)
	at org.apache.solr.search.OldLuceneQParser.parse(LuceneQParserPlugin.java:104)
	at org.apache.solr.search.QParser.getQuery(QParser.java:88)
	at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:82)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:149)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
	at org.apache.solr.handler.clustering.ClusteringComponentTest.testComponent(ClusteringComponentTest.java:70)


Don't worry about the ClusteringComponentTest; I haven't posted that
code yet.


On Oct 20, 2008, at 7:56 PM, Grant Ingersoll wrote:

> I've run into this a couple of times now and I feel like it warrants
> a discussion.
>
> For both the SpellCheckComponent (SCC) and now for the new
> ClusteringComponent (SOLR-769), I think there are cases where the
> QueryComponent (QC) is not required.  In the SpellCheckComponent
> case, it is when building the spelling index.  In the
> ClusteringComponent, it is possible to ask for document clusters
> without running any query (it will also be possible to get clusters
> _with_ a query, and this is distinct from search results
> clustering).  Thus, it seems really weird to have to pass in a dummy
> query, yet that is what one has to do in order to avoid getting an
> NPE in the QC.
>
> Now, I suppose these pieces could be modeled as something else, or
> the two functionalities could be split into separate things
> (1 ReqHandler, 1 SearchComp).  In fact, the functionality in question
> is not really "search" functionality, or SearchComponent
> functionality, yet much of the rest of the code in question is
> "search" functionality and logically belongs in a SearchComponent.
> In the case of the SCC build, it's akin to an indexing operation.
> In the clustering case, it's a query, albeit a non-traditional one.
> In some sense, this kind of document clustering is like
> non-query-based faceting, which leads to more navigation/browsing
> instead of searching.
>
> The quick fix is to just put null checks into the QC or pass in a
> dummy query with rows=0, but I wonder whether there is a slightly
> bigger picture here that needs adjusting in terms of
> SearchComponents.  Namely, must the QC always be on?  And should we
> think a little more about components that don't require a query in
> order to function, and how they fit into the scheme of things?
>
> Thoughts?  Recommendations?
>
> -Grant





Re: Must QueryComponent always be on and other Design Questions

2008-10-20 Thread Otis Gospodnetic
This is related to something I must have only daydreamed about, but 
not actually mentioned on solr-dev.
My feeling is we are moving Solr in the direction of a more general web service 
that can host various NLP and ML components, and no longer only do IR/Lucene.  
We see that with a few patches that Grant is cooking, I think we'll see that in 
the Solr+Mahout marriage down the road, and so on.

Is it time to start thinking about Solr as a server for IR and ML and NLP tasks 
and see how the tightly coupled Lucene can be made more pluggable?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Grant Ingersoll


On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:

> This is related to something I must have only daydreamed about, but
> not actually mentioned on solr-dev.
> My feeling is we are moving Solr in the direction of a more general
> web service that can host various NLP and ML components, and no
> longer only do IR/Lucene.  We see that with a few patches that Grant
> is cooking, I think we'll see that in the Solr+Mahout marriage down
> the road, and so on.

I somewhat agree, but I hesitate to go so far as saying a "general web
service".  I see Solr as a pretty nice platform for doing things like
NLP/ML (see the AnalysisRequestHandler, TermVectorComponent,
ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.),
but I mostly view them as enhancing search/navigation.  That is,
things like clustering/faceting (they are closely related), named
entity recognition, search, etc. all act as organizing components for
structured and unstructured data.  My vision for Solr (and actually
for the Lucene TLP too, if I put on my PMC hat) is one that aims to
bring coherence to (structured and unstructured) content.  This starts
with search as a foundation, since the indexing process creates much
of the information that empowers the others.  I think once the
flexible indexing stuff is added to Lucene Java, we'll see even more
opportunity for making Solr more powerful in these regards.

> Is it time to start thinking about Solr as a server for IR and ML
> and NLP tasks and see how the tightly coupled Lucene can be made
> more pluggable?

Yeah, this is what the Solr 2.0 thread that Yonik started a few weeks
ago aims to discuss, along with scalability/fault tolerance.  More
important, for me anyway, is the decoupling of the configuration.  For
instance, I see no reason why IndexSchema needs to know anything about
an InputStream.  As for Lucene, it's really quite good at serving as
the backend store/enabler for all these tasks.

At any rate, the question still remains as to how best to handle the
QueryComponent :-)












Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Ryan McKinley

unrelated to your question, but we should give a better error...
https://issues.apache.org/jira/browse/SOLR-435









Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Ryan McKinley


On Oct 21, 2008, at 8:17 AM, Grant Ingersoll wrote:

> On Oct 20, 2008, at 11:35 PM, Otis Gospodnetic wrote:
>
>> This is related to something I must have only daydreamed about, but
>> not actually mentioned on solr-dev.
>> My feeling is we are moving Solr in the direction of a more general
>> web service that can host various NLP and ML components, and no
>> longer only do IR/Lucene.  We see that with a few patches that
>> Grant is cooking, I think we'll see that in the Solr+Mahout
>> marriage down the road, and so on.
>
> I somewhat agree, but I hesitate to go so far as saying a "general
> web service".

I won't suggest that solr is (or should be) a general web service, but
wt=json/xml/python + RequestHandler makes a pretty nice cross-platform
interface all on its own.

> I see Solr as a pretty nice platform for doing things like NLP/ML
> (see the AnalysisRequestHandler, TermVectorComponent,
> ClusteringComponent, LukeReqHandler, FacetingComp., Payloads, etc.),
> but I mostly view them as enhancing search/navigation.  That is,
> things like clustering/faceting (they are closely related), named
> entity recognition, search, etc. all act as organizing components
> for structured and unstructured data.  My vision for Solr (and
> actually for the Lucene TLP too, if I put on my PMC hat) is one that
> aims to bring coherence to (structured and unstructured) content.
> This starts with search as a foundation, since the indexing process
> creates much of the information that empowers the others.  I think
> once the flexible indexing stuff is added to Lucene Java, we'll see
> even more opportunity for making Solr more powerful in these regards.

agree.

>> Is it time to start thinking about Solr as a server for IR and ML
>> and NLP tasks and see how the tightly coupled Lucene can be made
>> more pluggable?
>
> Yeah, this is what the Solr 2.0 thread that Yonik started a few
> weeks ago aims to discuss, along with scalability/fault tolerance.
> More important, for me anyway, is the decoupling of the
> configuration.  For instance, I see no reason why IndexSchema needs
> to know anything about an InputStream.

also agree.  The biggest challenge for 2.0 is decoupling configuration.

> As for Lucene, it's really quite good at serving as the backend
> store/enabler for all these tasks.

I have not messed with it yet, but perhaps also HBase...

> At any rate, the question still remains as to how best to handle the
> QueryComponent :-)

aaah, your question!

I see two options:
1.  If no other component needs docList or docSet and the query is
empty, then skip the QueryComponent.
2.  Add a 'runQuery' param (or something like that) and default it to
true.  It can be turned off when not necessary.


I like option 1 better.
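
As a rough illustration of the two options (this is not Solr code; SketchComponent, QuerySkipSketch and the needsDocs flag are made-up names for this sketch, and only the 'runQuery' parameter name comes from the text above), the decision could look something like this:

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a search component; NOT Solr's SearchComponent API.
final class SketchComponent {
    final String name;
    final boolean needsDocs; // true if it consumes a docList/docSet

    SketchComponent(String name, boolean needsDocs) {
        this.name = name;
        this.needsDocs = needsDocs;
    }
}

final class QuerySkipSketch {
    // Option 1: skip the query step only when the query is empty AND
    // no other component needs a docList/docSet. If the query is empty
    // but something needs docs, this returns true, so the caller still
    // has to reject the empty query with a clear error.
    static boolean runQueryOption1(String q, List<SketchComponent> others) {
        boolean emptyQuery = (q == null || q.trim().isEmpty());
        if (!emptyQuery) {
            return true;
        }
        for (SketchComponent c : others) {
            if (c.needsDocs) {
                return true;
            }
        }
        return false;
    }

    // Option 2: an explicit request parameter that defaults to true
    // when it is absent.
    static boolean runQueryOption2(Map<String, String> params) {
        String v = params.get("runQuery");
        return v == null || Boolean.parseBoolean(v);
    }
}
```

Option 1 needs every component to declare whether it consumes the query output, which is the cost of not adding a new parameter; option 2 leaves that judgment to the caller.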

ryan




Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
+1
I can foresee a lot of components which do not need the
QueryComponent, SOLR-706 being one.






-- 
--Noble Paul


Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Grant Ingersoll
FWIW, my last patch on SOLR-769 adds a check to see if QC is enabled,  
with the default param set to true.  Thus, you can send in  
&query=false and it skips it.




--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Noble Paul നോബിള്‍ नोब्ळ्
hi Grant,
There may be cases where the user is not interested in the
documents but other components are interested in
the search results. 'tvrh' is an example. How do we take care of
that?




-- 
--Noble Paul


Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Grant Ingersoll
Don't turn off the query component in those cases.  There, the QC
identifies which docs are to be used, just as in a user-based query.
Just think of those other components as clients of the QC output, and
I think it makes sense.  The application will know whether it needs to
deal with results or not.  I suppose we could have something that says
"run the query and make the results available to other components, but
don't bother writing them out".
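
To make that last idea concrete, here is a minimal sketch. It assumes a made-up per-request holder (SharedResultsSketch, with "shared" and "response" maps and a writeDocs flag); none of these names are Solr APIs, and the "tv-for-doc-" strings are placeholders for whatever a downstream component would really produce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-request state; NOT Solr's ResponseBuilder.
final class SharedResultsSketch {
    // Visible to later components in the chain.
    final Map<String, Object> shared = new LinkedHashMap<String, Object>();
    // What would be written back to the client.
    final Map<String, Object> response = new LinkedHashMap<String, Object>();

    // Stand-in for the query step: always publishes the matching doc ids
    // to later components, but only echoes them to the client when asked.
    void runQuery(List<Integer> matchingDocIds, boolean writeDocs) {
        shared.put("docList", matchingDocIds);
        if (writeDocs) {
            response.put("docs", matchingDocIds);
        }
    }

    // Stand-in for a downstream client of the query output: it reads the
    // shared doc ids and writes its own section of the response.
    void runDownstream() {
        @SuppressWarnings("unchecked")
        List<Integer> ids = (List<Integer>) shared.get("docList");
        List<String> out = new ArrayList<String>();
        for (Integer id : ids) {
            out.add("tv-for-doc-" + id);
        }
        response.put("termVectors", out);
    }
}
```

With writeDocs set to false, the downstream component still sees the doc ids, while the response carries only the downstream output.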















Re: Must QueryComponent always be on and other Design Questions

2008-10-21 Thread Chris Hostetter

: For both the SpellCheckComponent (SCC) and now for the new ClusteringComponent
: (SOLR-769) I think there are cases where the QueryComponent (QC) is not
: required.  In the SpellCheckComponent case it is when building the spelling
: index.  In the ClusteringComponent, it is possible to ask for document
: clusters without running any query (it also will be possible to get clusters
: _with_ a query as well, and it also is distinguished from the handling of
: search results clustering, too).  Thus, it seems really weird to have to pass
: in a dummy query, yet that is what one has to do in order to avoid getting an
: NPE in the QC.

In my opinion the "right" way to deal with these use cases is to have 
separate request handlers configured for the different use cases ... if you 
want to cluster "stuff" unrelated to a query, register a handler that uses 
the ClusteringComponent but doesn't use the QueryComponent ... likewise 
register a handler that knows about the SpellCheckComponent for triggering 
rebuilds of the spellcheck index independent of doing queries.

we could treat QueryComponent just like the FacetComponent and 
HighlightComponent and say that it short circuits and does nothing unless 
"doQuery=true" (which it would be by default) but i'm not convinced that's 
the best way to approach things -- the typical case is that QueryComponent 
is the meat of the request handling, it's probably okay that it be put on 
a bit of a pedestal such that if your handler uses QueryComponent then 
QueryComponent is "on" -- But it should absolutely be possible to have a 
handler that doesn't use QueryComponent.

On a related topic: if we have any other Components that assume 
QueryComponent has been executed, they should be changed.  Components 
should have contracts about pre-conditions and post-conditions relating to 
data in the request, not how the data got there, ala...

https://issues.apache.org/jira/browse/SOLR-760
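
One way to picture such contracts is the sketch below. It is only an assumption about how they might be expressed, not what SOLR-760 proposes: ContractSketch, its requires/provides sets and the firstUnsatisfied helper are all hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical contract: what a component needs in the request state before
// it runs (pre-conditions) and what it guarantees to add (post-conditions).
final class ContractSketch {
    final String name;
    final Set<String> requires;
    final Set<String> provides;

    ContractSketch(String name, List<String> requires, List<String> provides) {
        this.name = name;
        this.requires = new HashSet<String>(requires);
        this.provides = new HashSet<String>(provides);
    }

    // Returns the name of the first component whose pre-conditions are not
    // met by the initial state plus everything earlier components provide,
    // or null if the whole chain is satisfiable. Note that it only checks
    // which data is present, never which component put it there.
    static String firstUnsatisfied(Set<String> initial, List<ContractSketch> chain) {
        Set<String> available = new HashSet<String>(initial);
        for (ContractSketch c : chain) {
            if (!available.containsAll(c.requires)) {
                return c.name;
            }
            available.addAll(c.provides);
        }
        return null;
    }
}
```

Under this kind of check, a handler chain without any query step is valid as long as nothing in it requires a docList, and a component that does require one is satisfied no matter which earlier step supplied it.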


-Hoss