implemented StandardRequestHandler to show top-results per facet-value. Is this the fastest way?

2007-10-11 Thread Britske

Since the title of my original post may not have been so clear, here is a
repost. 
//Geert-Jan


Britske wrote:
> 
> First of all, I just wanted to say that I just started working with Solr
> and really like the results I'm getting from Solr (in terms of
> performance, flexibility) as well as the good responses I'm getting from
> this group. Hopefully I will be able to contribute in one way or another
> to this wonderful application in the future!
> 
> The current issue that I'm having is the following ( I tried not to be
> long-winded, but somehow that didn't work out :-)   ):
> 
> I'm extending StandardRequestHandler to not only show the counts per
> facet-value but also the top-N results per facet-value (where N is
> configurable). 
> (See
> http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630 for
> where I got the idea from). 
> I quickly implemented this by fetching a doclist for each of my
> facet-values and appending these to the result as suggested in the referenced
> post, no problems there. 
> 
> However, I realized that for calculating the count for each of the
> facetvalues, the original StandardRequestHandler already loops the doclist
> to check for matches. Therefore my implementation actually does double
> work, since it gets doclists for each of the facetvalues again. 
> 
> My question: 
> is there a way to get to the already calculated doclist per facetvalue
> from a subclassed StandardRequestHandler, and so get a nice speedup?  This
> facet-calculation seems to go deep into the core of Solr
> (SimpleFacets.getFacetTermEnumCounts) and does not seem sensible to alter
> for just this requirement. Opinions appreciated. 
> 
> Some additional info:
> 
> I have a  requirement to be able to limit the result to explicitly
> specified facet-values. For that I do something like: 
> select?
>  qt=toplist
> &q=name:A OR name:B OR  name:C 
> &sort=sortfield asc 
> &facet=true
> &facet.field=name
> &facet.limit=1
> &rows=2
> 
> This all works okay and results in a faceting/grouping by field: 'name', 
> where for each facetvalue (A, B, C)
> 2 results are shown (ordered by sortfield). 
> 
> The relevant code from the subclassed StandardRequestHandler is below. As
> can be seen, I alter the query by adding the facetvalue to FQ (which is
> almost guaranteed to already exist in FQ, btw.) 
> 
> Therefore a second question is: 
> will there be a noticeable speedup when pursuing the above, since the
> request that is done per facet-value is nothing more than giving the
> ordered result of the intersection of the overall query (which is in the
> querycache) and the facetvalue itself (which is almost certainly in the
> filtercache). 
> 
> As a last and somewhat related question: 
> is there a way to explicitly specify facet-values that I want to include in
> the faceting without (ab)using Q? This is relevant for me since the
> perfect solution would be to have the ability to orthogonally get multiple
> toplists in 1 query. Given the current implementation, this orthogonality
> is now 'corrupted' as injection of a fieldvalue in Q for one facetfield
> influences the outcome of another facetfield. 
> 
> kind regards, 
> Geert-Jan
> 
> 
> 
> ---
> if (true) // TODO: this needs facetInfo as a precondition.
> {
>   NamedList facetFieldList = (NamedList) facetInfo.get("facet_fields");
>   for (int i = 0; i < facetFieldList.size(); i++)
>   {
>     NamedList facetValList = (NamedList) facetFieldList.getVal(i);
>     for (int j = 0; j < facetValList.size(); j++)
>     {
>       NamedList facetValue = new SimpleOrderedMap();
>       // facetValue.add("count", facetValList.getVal(j));
> 
>       // Re-query: restrict the original query to this facet value and
>       // fetch the ordered top-N docs for it.
>       DocListAndSet resultList = new DocListAndSet();
>       Query facetq = QueryParsing.parseQuery(
>           facetFieldList.getName(i) + ":" + facetValList.getName(j),
>           req.getSchema());
>       resultList.docList = s.getDocList(query, facetq, sort,
>           p.getInt(CommonParams.START, 0),
>           p.getInt(CommonParams.ROWS, 3));
> 
>       // Replace the plain count with a structure holding the top docs.
>       facetValue.add("results", resultList.docList);
>       facetValList.setVal(j, facetValue);
>     }
>   }
>   rsp.add("facet_results", facetFieldList);
> }
> 




Tomcat Solr Problem

2007-10-11 Thread Nishant Soni
Unable to get Solr up in Tomcat. I'm getting the following log:

INFO: Using JNDI solr.home: E:/test/workspace/reviewGist/solr/home
Oct 11, 2007 1:48:13 PM org.apache.solr.core.Config setInstanceDir
INFO: Solr home set to 'E:/test/workspace/reviewGist/solr/home/'
Oct 11, 2007 1:48:13 PM org.apache.solr.core.SolrConfig initConfig
INFO: Loaded SolrConfig: solrconfig.xml
Oct 11, 2007 1:48:13 PM org.apache.solr.servlet.SolrDispatchFilter init
INFO: user.dir=C:\Program Files\eclipse
- Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:75)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:223)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:304)
at
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:77)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3600)
at
org.apache.catalina.core.StandardContext.start(StandardContext.java:4193)
at
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:718)
at
org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
at
org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442)
at
org.apache.catalina.core.StandardService.start(StandardService.java:450)
at
org.apache.catalina.core.StandardServer.start(StandardServer.java:709)
at org.apache.catalina.startup.Catalina.start(Catalina.java:551)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432)
- Error filterStart
- Context [solr] startup failed due to previous errors
- org.apache.webapp.balancer.BalancerFilter: init(): ruleChain:
[org.apache.webapp.balancer.RuleChain:
[org.apache.webapp.balancer.rules.URLStringMatchRule: Target string:
News / Redirect URL: http://www.cnn.com],
[org.apache.webapp.balancer.rules.RequestParameterRule: Target param
name: paramName / Target param value: paramValue / Redirect URL:
http://www.yahoo.com],
[org.apache.webapp.balancer.rules.AcceptEverythingRule: Redirect URL:
http://jakarta.apache.org]]
- ContextListener: contextInitialized()
- SessionListener: contextInitialized()
- ContextListener: contextInitialized()
- SessionListener: contextInitialized()
- Starting Coyote HTTP/1.1 on http-
- JK: ajp13 listening on /0.0.0.0:8009
- Jk running ID=0 time=0/62  config=null
- Find registry server-registry.xml at classpath resource
- Server startup in 2562 ms

Any help will be greatly appreciated.


  





Re: getting number of stored documents via rest api

2007-10-11 Thread Stefan Rinner


On Oct 10, 2007, at 6:49 PM, Chris Hostetter wrote:



: I think search for "*:*" is the optimal code to do it. I don't
: think you can do anything faster.

FYI: getting the data from the xml returned by stats.jsp is definitely
faster in the case where you really want all docs.

if you want the total number from some other query however, don't "count"
them yourself in the client ... use the numFound attribute returned with
the results.


for my current use-case I'm going to use the numFound property because I
can just use the solrj client for this, and don't have to add another
http-fetching & xml-parsing method.
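
(A minimal solrj sketch of that, for illustration only -- it assumes the
2007-era CommonsHttpSolrServer client and the usual example URL; imports and
exception handling are omitted. rows=0 skips fetching documents while
numFound still reports the total count:)

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrQuery query = new SolrQuery("*:*");   // match all documents
query.setRows(0);                         // only the count is needed, not the docs
QueryResponse rsp = server.query(query);
long total = rsp.getResults().getNumFound();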


I had overlooked the numFound property up to now, but it's good to know.

stefan


Re: Spell Check Handler

2007-10-11 Thread climbingrose
Hi all,

I've been so busy the last few days that I haven't replied to this email. I
modified SpellCheckerHandler a while ago to include support for multiword
queries. To be honest, I didn't have time to write unit tests for the code.
However, I deployed it in a production environment and it has been working
for me so far. My version, however, has two assumptions:

1) I assume that when a user enters a misspelled multiword query, we should
only check the words that are actually misspelled. For example, if the user
enters "life expectancy calculatar", which has "calculator" misspelled, we
should only spellcheck "calculatar".
2) I only return the best string for a misspelled query.

I guess I can just directly paste the code here so that others can adapt it for
their own purposes. If you have any questions, just send me an email. I'll be
happy to help you.

StringBuffer buf = null;
if (null != words && !"".equals(words.trim())) {
    Analyzer analyzer = req.getSchema().getField(field).getType().getAnalyzer();
    TokenStream source = analyzer.tokenStream(field, new StringReader(words));
    Token t;
    boolean hasSuggestion = false;
    boolean termExists = false;
    while (true) {
        try {
            t = source.next();
        } catch (IOException e) {
            t = null;
        }
        if (t == null)
            break;

        String termText = t.termText();
        // last argument 'true' = onlyMorePopular
        String[] suggestions = spellChecker.suggestSimilar(termText,
                numSug, req.getSearcher().getReader(), restrictToField, true);
        if (suggestions != null && suggestions.length > 0) {
            if (!suggestions[0].equals(termText)) {
                hasSuggestion = true;
            }
            if (buf == null) {
                buf = new StringBuffer(suggestions[0]);
            } else {
                buf.append(" ").append(suggestions[0]);
            }
        } else if (spellChecker.exist(termText)) {
            // no suggestions, but the term itself is in the spelling index
            termExists = true;
            if (buf == null) {
                buf = new StringBuffer(termText);
            } else {
                buf.append(" ").append(termText);
            }
        } else {
            // unknown term with no suggestions: give up on the whole query
            hasSuggestion = false;
            termExists = false;
            break;
        }
    }
    try {
        source.close();
    } catch (IOException e) {
        // ignore
    }
    // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
    //         nullReader, restrictToField, onlyMorePopular);
    if (hasSuggestion || (!hasSuggestion && termExists))
        rsp.add("suggestions", buf.toString());
    else
        rsp.add("suggestions", null);
}



On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> Hoss,
>
> I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
>
> For now, this is done.
>
> I created the changes, created JavaDoc comments on the various settings
> and their expected output, created a JUnit test for the
> SpellCheckerRequestHandler
> which tests various components of the handler, and I also created the
> supporting configuration files for the JUnit tests (schema and solrconfig
> files).
>
> I attached the patch to the JIRA issue so now we just have to wait until
> it gets
> added back in to the main code stream.
>
> For anyone who is interested, here is a link to the JIRA:
> https://issues.apache.org/jira/browse/SOLR-375
>
> Could someone please drop me a hint on how to update the wiki or any other
> documentation that could benefit from being updated; I'd like to help out as
> much as possible, but first I need to know "how". ;-)
>
> When these changes do get committed back in to the daily build, please
> review the generated JavaDoc for information on how to utilize these new
> features.
> If anyone has any questions, or comments, please do not hesitate to ask.
>
> As a general note of self-critique on these changes, I am not 100% sure of
> the way I implemented the "nested" structure when the "multiWords" parameter
> is used.  My interest is that it should work smoothly with some other
> technology such as Prototype using the JSON output type.  Unfortunately, I
> will not be getting a chance to start on that coding until next week, so it
> is up in the air as to whether this structure will be conducive or not.  I am
> planning on providing more details in the documentation as far as how to
> utilize these modifications in Prototype and Ajax when I get a chance (and
> even provide links to a production site so you can see it in action and view
> the source if interested).  So stay tuned...
>
>Thanks for everyones time,
>   Scott Tabar
>
>  Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : If you like, I can p

Re: Spell Check Handler

2007-10-11 Thread climbingrose
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.

On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I've been so busy the last few days so I haven't replied to this email. I
> modified SpellCheckerHandler a while ago to include support for multiword
> query. To be honest, I didn't have time to write unit test for the code.
> However, I deployed it in a production environment and it has been working
> for me so far. My version, however, has two assumptions:
>
> 1) I assumpt that when user enter a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if user
> enter "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".
> 2) I only return the best string for a mispelled query.
>
> I guess I can just directly paste the code here so that others can adapt
> for their own purposes. If you have any question, just send me an email.
> I'll happy to help  you.
>
> StringBuffer buf = null;
> if (null != words && !"".equals(words.trim())) {
> Analyzer analyzer = req.getSchema
> ().getField(field).getType().getAnalyzer();
>
> TokenStream source = analyzer.tokenStream(field, new
> StringReader(words));
> Token t;
> boolean hasSuggestion = false;
> boolean termExists = false;
> while (true) {
> try {
> t = source.next();
> } catch (IOException e) {
> t = null;
> }
> if (t == null)
> break;
>
> String termText = t.termText();
> String[] suggestions = spellChecker.suggestSimilar(termText,
> numSug, req.getSearcher().getReader(), restrictToField, true);
> if (suggestions != null && suggestions.length > 0) {
> if (!suggestions[0].equals(termText)) {
> hasSuggestion = true;
> }
> if (buf == null) {
> buf = new StringBuffer(suggestions[0]);
> } else
> buf.append(" ").append(suggestions[0]);
> } else if (spellChecker.exist(termText)){
> termExists = true;
> if (buf == null) {
> buf = new StringBuffer(termText);
> } else
> buf.append(" ").append(termText);
> } else {
> hasSuggestion = false;
> termExists= false;
> break;
> }
> }
> try {
> source.close();
> } catch (IOException e) {
> // ignore
> }
> // String[] suggestions = spellChecker.suggestSimilar(words,
> numSug,
> // nullReader, restrictToField, onlyMorePopular);
> if (hasSuggestion || (!hasSuggestion && termExists))
> rsp.add("suggestions", buf.toString());
> else
> rsp.add("suggestions", null);
>
>
>
> On 10/11/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> >
> > Hoss,
> >
> > I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)
> >
> > For now, this is done.
> >
> > I created the changes, created JavaDoc comments on the various settings
> > and their expected output, created a JUnit test for the
> > SpellCheckerRequestHandler
> > which tests various components of the handler, and I also created the
> > supporting configuration files for the JUnit tests (schema and
> > solrconfig files).
> >
> > I attached the patch to the JIRA issue so now we just have to wait until
> > it gets
> > added back in to the main code stream.
> >
> > For anyone who is interested, here is a link to the JIRA:
> > https://issues.apache.org/jira/browse/SOLR-375
> >
> > Could someone please drop me a hint on how to update the wiki or any
> > other
> > documentation that could benefit to being updated; I'll like to help out
> > as much
> > as possible, but first I need to know "how". ;-)
> >
> > When these changes do get committed back in to the daily build, please
> > review the generated JavaDoc for information on how to utilize these new
> > features.
> > If anyone has any questions, or comments, please do not hesitate to ask.
> >
> >
> > As a general note of a self-critique on these changes, I am not 100%
> > sure of the way I
> > implemented the "nested" structure when the "multiWords" parameter is
> > used.  My interest
> > is that it should work smoothly with some other technology such as
> > Prototype using the
> > JSon 

Re: Availability Issues

2007-10-11 Thread Norberto Meijome
On Tue, 9 Oct 2007 10:12:51 -0400
"David Whalen" <[EMAIL PROTECTED]> wrote:

> So, how would you build it if you could?  Here are the specs:
> 
> a) the index needs to hold at least 25 million articles
> b) the index is constantly updated at a rate of 10,000 articles
> per minute
> c) we need to have faceted queries

Hi David,
Others with more experience than I have given you good answers, so I won't go
there.

One thing you want to consider when you have lots of ongoing updates is
how fast you want your latest changes to show up in your results. 

Yes, everyone wants the latest to be live the second it hits the index, but 
balancing that with having a responsive search within certain budget (and
architectural, maybe?) constraints isn't always that easy. 

In all seriousness, not everyone is in a situation where every one of their
users would really need (or benefit hugely from) having each of the 200 docs
posted in the last second come up the millisecond they hit "Search". Can they
tell if it was posted within the last 3, 5 or 10 minutes?

I think that tuning the values for cache warming should yield some good
results. You probably don't want to have all your searches held until your
cache fully warms... or have to warm too often. 
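
(For illustration only -- a solrconfig.xml sketch of the knobs meant here;
the sizes and the warm-up query are made-up values to adapt to your own data:)

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="1024" initialSize="512" autowarmCount="256"/>

<!-- send a few representative queries whenever a new searcher is opened,
     so real users don't pay the full warming cost after a commit -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">some popular query</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>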

I was thinking that you could even split your indexes: have the latest entries
in a smaller, faster index, and the rest of your 25M in another index which gets
updated, say, hourly. But if you have 10K updates (not new docs, but
changes), then maybe the idea of splitting the index is not that useful...

anyway, there are many ways to skin a cat :)

good luck,
B
_
{Beto|Norberto|Numard} Meijome

"Everything is interesting if you go into it deeply enough"
  Richard Feynman

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: index size

2007-10-11 Thread Ravish Bhagdev
Hi All,

I'm facing a similar problem.  I want to index the entire document as a
field, but I also want to be able to retrieve snippets (like
Google/Nutch return in the results page below the links).

To achieve this I have to keep the document field "stored", right?
When I do this my index becomes huge (a 10 GB index), because I have 10K
docs but each is a very lengthy HTML page.  Is there any better solution?
Why is the index created by Nutch so small in comparison (about 27 MB)
while it still returns snippets?

Ravish

On 10/9/07, Kevin Lewandowski <[EMAIL PROTECTED]> wrote:
> Late reply on this but I just wanted to say thanks for the
> suggestions. I went through my whole schema and was storing things
> that didn't need to be stored and indexing a lot of things that didn't
> need to be indexed. Just completed a full reindex and it's a much more
> reasonable size now.
>
> Kevin
>
> On 8/20/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:
> >
> > > Are there any tips on reducing the index size or what factors most
> > > impact index size?
> > >
> > > My index has 2.7 million documents and is 200 gigabytes and growing.
> > > Most documents are around 2-3kb and there are about 30 indexed fields.
> >
> > An "ls -sh" will tell you roughly where the the space is being
> > occupied.  There is something strange going on: 2.5kB * 2.7m is only
> > 6GB, and I have trouble imagining where the 30-fold index size
> > expansion is coming from.
> >
> > -Mike
> >
>


Re: WebException (ServerProtocolViolation) with SolrSharp

2007-10-11 Thread Filipe Correia
Jeff,

Thanks! Your suggestion worked. Instead of invoking ToString() on
float values I've used ToString's other signature, which takes an
IFormatProvider:

CultureInfo MyCulture = CultureInfo.InvariantCulture;
this.Add(new IndexFieldValue("weight",
weight.ToString(MyCulture.NumberFormat)));
this.Add(new IndexFieldValue("price", price.ToString(MyCulture.NumberFormat)));

This made me think of a related issue though. In this case it was the
client that was using a non-invariant number format, but can this also
happen on Solr's side? If so, I guess I may need to configure it
somewhere...

Cheers,
Filipe Correia

On 10/10/07, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> Hi Felipe -
>
> The issue you're encountering is a problem with the data format being passed
> to the solr server.  If you follow the stack trace that you posted, you'll
> notice that the solr field is looking for a value that's a float, but the
> passed value is "1,234".
>
> I'm guessing this is caused by one of two possibilities:
>
> (1) there's a typo in your example code, where "1,234" should actually be "
> 1.234", or
> (2) there's a culture settings difference on your server that's converting "
> 1.234" to "1,234"
>
> Assuming it's the latter, add this line in the ExampleIndexDocument
> constructor:
>
> CultureInfo MyCulture = new CultureInfo("en-US");
>
> Please let me know if this fixes the issue, I've been looking at this
> previously and would like to confirm it.
>
> thanks,
> jeff r.
>
>
> On 10/10/07, Filipe Correia <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I am trying to run SolrSharp's example application but am getting a
> > WebException with a ServerProtocolViolation status message.
> >
> > After some debugging I found out this is happening with a call to:
> > http://localhost:8080/solr/update/
> >
> > And using fiddler[1] found out that solr is actually throwing the
> > following exception:
> > org.apache.solr.core.SolrException: Error while creating field
> > 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
> > from value '1,234'
> > at org.apache.solr.schema.FieldType.createField(FieldType.java
> > :173)
> > at org.apache.solr.schema.SchemaField.createField(SchemaField.java
> > :94)
> > at org.apache.solr.update.DocumentBuilder.addSingleField(
> > DocumentBuilder.java:57)
> > at org.apache.solr.update.DocumentBuilder.addField(
> > DocumentBuilder.java:73)
> > at org.apache.solr.update.DocumentBuilder.addField(
> > DocumentBuilder.java:83)
> > at org.apache.solr.update.DocumentBuilder.addField(
> > DocumentBuilder.java:77)
> > at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(
> > XmlUpdateRequestHandler.java:339)
> > at org.apache.solr.handler.XmlUpdateRequestHandler.update(
> > XmlUpdateRequestHandler.java:162)
> > at
> > org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(
> > XmlUpdateRequestHandler.java:84)
> > at org.apache.solr.handler.RequestHandlerBase.handleRequest(
> > RequestHandlerBase.java:77)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> > at org.apache.solr.servlet.SolrDispatchFilter.execute(
> > SolrDispatchFilter.java:191)
> > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> > SolrDispatchFilter.java:159)
> > at
> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
> > ApplicationFilterChain.java:235)
> > at org.apache.catalina.core.ApplicationFilterChain.doFilter(
> > ApplicationFilterChain.java:206)
> > at org.apache.catalina.core.StandardWrapperValve.invoke(
> > StandardWrapperValve.java:233)
> > at org.apache.catalina.core.StandardContextValve.invoke(
> > StandardContextValve.java:175)
> > at org.apache.catalina.core.StandardHostValve.invoke(
> > StandardHostValve.java:128)
> > at org.apache.catalina.valves.ErrorReportValve.invoke(
> > ErrorReportValve.java:102)
> > at org.apache.catalina.core.StandardEngineValve.invoke(
> > StandardEngineValve.java:109)
> > at org.apache.catalina.connector.CoyoteAdapter.service(
> > CoyoteAdapter.java:263)
> > at org.apache.coyote.http11.Http11Processor.process(
> > Http11Processor.java:844)
> > at
> > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
> > Http11Protocol.java:584)
> > at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(
> > JIoEndpoint.java:447)
> > at java.lang.Thread.run(Unknown Source)
> > Caused by: java.lang.NumberFormatException: For input string:
> > "1,234"
> > at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
> > at java.lang.Float.parseFloat(Unknown Source)
> > at org.apache.solr.util.NumberUtils.float2sortableStr(
> > NumberUtils.java:80)
> > at org.apache.solr.schema.SortableFloatField.toInternal(
> > SortableFloatField.java:50)
> > at org.apache.solr.schema.FieldType.

Re: getting number of stored documents via rest api

2007-10-11 Thread Walter Underwood
This even works if you request 0 results. --wunder

On 10/11/07 1:56 AM, "Stefan Rinner" <[EMAIL PROTECTED]> wrote:

> 
> On Oct 10, 2007, at 6:49 PM, Chris Hostetter wrote:
> 
>> 
>> : I think search for "*:*" is the optimal code to do it. I don't
>> : think you can do anything faster.
>> 
>> FYI: getting the data from the xml returned by stats.jsp is definitely
>> faster in the case where you really want all docs.
>> 
>> if you want the total number from some other query however, don't "count"
>> them yourself in the client ... use the numFound attribute returned with
>> the results.
> 
> for my current use-case I gonna use the numFound property because I
> can just use the solrj client for this, and don't have to add another
> http-fetching & xmlparsing method.
> 
> I overlooked the numFound property up to now but good to know.
> 
> stefan



Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-11 Thread Martin Grotzke
Hi Tom,

On Wed, 2007-10-10 at 12:28 +0200, Thomas Traeger wrote:
> in short: use stemming
ok :)

> 
> Try the SnowballPorterFilterFactory with German2 as language attribute 
> first and use synonyms for combined words i.e. "Herrenhose" => "Herren", 
> "Hose".
so you use a combined approach?

> 
> By using stemming you will maybe have some "interesting" results, but it 
> is much better living with them than having no or much less results ;o)
Do you have an example what "interesting" results I can expect, just to
get an idea?

> 
> Find more infos on the Snowball stemming algorithms here:
> 
> http://snowball.tartarus.org/
Thanx! I also had a look at this site already, but what is missing is a
demo where one can see what's happening. I think I'll play a little with
stemming to get a feeling for this.

> 
> Also have a look at the StopFilterFactory, here is a sample stopwordlist 
> for the german language:
> 
> http://snowball.tartarus.org/algorithms/german/stop.txt
Our application handles products; do you think such stopwords are useful
in this scenario also? I wouldn't expect a user to search for "keine
hose" or something like this :)

Thanx && cheers,
Martin

> 
> Good luck,
> 
> Tom
> 
> 
> Martin Grotzke schrieb:
> > Hello,
> >
> > with our application we have the issue, that we get different
> > results for singular and plural searches (german language).
> >
> > E.g. for "hose" we get 1.000 documents back, but for "hosen"
> > we get 10.000 docs. The same applies to "t-shirt" or "t-shirts",
> > of e.g. "hut" and "hüte" - lots of cases :)
> >
> > This is absolutely correct according to the schema.xml, as right
> > now we do not have any stemming or synonyms included.
> >
> > Now we want to have similar search results for these singular/plural
> > searches. I'm thinking of a solution for this, and want to ask, what
> > are your experiences with this.
> >
> > Basically I see two options: stemming and the usage of synonyms. Are
> > there others?
> >
> > My concern with stemming is, that it might produce unexpected results,
> > so that docs are found that do not match the query from the users point
> > of view. I asume that this needs a lot of testing with different data.
> >
> > The issue with synonyms is, that we would have to create a file
> > containing all synonyms, so we would have to figure out all cases, in
> > contrast to a solutions that is based on an algorithm.
> > The advantage of this approach is IMHO, that it is very predictable
> > which results will be returned for a certain query.
> >
> > Some background information:
> > Our documents contain products (id, name, brand, category, producttype,
> > description, color etc). The singular/plural issue basically applied to
> > the fields name, category and producttype, so we would like to restrict
> > the solution to these fields.
> >
> > Do you have suggestions how to handle this?
> >
> > Thanx in advance for sharing your experiences,
> > cheers,
> > Martin
> >
> >
> >   
> 
-- 
Martin Grotzke
http://www.javakaffee.de/blog/




RE: quick allowDups questions

2007-10-11 Thread Charlie Jackson
Cool, thanks for the clarification, Ryan.


-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 10, 2007 5:28 PM
To: solr-user@lucene.apache.org
Subject: Re: quick allowDups questions

the default solrj implementation should do what you need.

> 
> As for Solrj, you're probably right, but I'm not going to take any
> chances for the time being. The server.add method has an optional
> Boolean flag named "overwrite" that defaults to true. Without knowing
> for sure what it does, I'm not going to mess with it. 
> 

direct solr update allows a few extra fields: allowDups, 
overwritePending, overwriteCommited -- the future of overwritePending and 
overwriteCommited is in doubt (SOLR-60), so I did not want to bake that 
into the solrj API.

internally,

  allowDups = !overwrite; (the one field you can set)
  overwritePending = !allowDups;
  overwriteCommited = !allowDups;
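
(To make the mapping concrete -- an illustrative raw update message, not from
Ryan's mail; the <doc> body is elided:)

<!-- solrj's default overwrite=true corresponds to -->
<add allowDups="false">
  <doc>...</doc>
</add>

<!-- overwrite=false corresponds to -->
<add allowDups="true">
  <doc>...</doc>
</add>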


ryan


Re: Spell Check Handler

2007-10-11 Thread scott.tabar
Climbingrose,

I think you make a valid point.  Each person may have a different concept of 
how something should work with their application.

My thought on the subject of spell checking multiple words:
  - the parameter "multiWords" enables spell checking on each word in "q" 
parameter instead of on the whole field
  - each word is then represented in its own entry in a list of all words that 
are checked
  - to identify each word that is being checked within that entry, it is 
identified by the key "words"
  - to identify if the word was found exactly as it is within the spell 
checker's index, the "exist" key contains this information
  - Since there can be suggestions for both misspelled words and words that are 
spelled correctly, the list of suggestions is also included for both correctly 
spelled and misspelled words, even if the suggestion list is empty.

  - My vision is that if a user has a search query of multiple words and they 
want to perform a check on the words, the use of "multiWords" will check 
all words at one time, independently from each other, and return the list.  The 
presenting web app can then identify visually to the user which words are 
misspelled and which ones have suggestions too.  The user can then work with 
the various lists of suggestions without having to re-hit Solr.  Naturally, if 
the user manually changes a word, then Solr will have to be re-hit, but 
providing a single list of all words, including suggestions for correct words 
along with incorrect words, will help simplify applications (by reducing 
iterating over each word) and will help reduce the number of hits to the Solr 
server.
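
(To make that concrete, a hypothetical request against the patched handler --
the "spellchecker" handler name and the parameter values are illustrative;
multiWords comes from this patch, while suggestionCount and onlyMorePopular
are the existing parameters mentioned below:)

select?qt=spellchecker
      &q=life+expectancy+calculatar
      &multiWords=true
      &suggestionCount=4
      &onlyMorePopular=false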


> 1) I assumpt that when user enter a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if user
> enter "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".

I think I understand what you mean in the above statement, but you must admit, 
it does sound funny.  After all, how do you identify that a word is misspelled 
by NOT using the spelling checker?  Correct me if I am wrong, but I think you 
intended to say that when a word is identified as being misspelled, then you 
should only include the suggestions for misspelled words.  If this is the case, 
then I would have to disagree with you.  The user may be interested in finding 
words that might mean the same, but are more popular (appears in more indexed 
documents within the Lucene index).  Hence the reason why I added the result 
field "exist" to identify that a word is spelled correctly even if there is a 
list of suggestions.  Please note, the situation can also exist where a word is 
misspelled and there are no suggestions, so one cannot use the suggestion list 
as an indicator of the correctness of the individual word(s).
 

> 2) I only return the best string for a mispelled query.

You can also use the parameter "suggestionCount=1" to control how many words 
are returned.  In this case, it will do what your code is doing, but still 
allow the client to dynamically change this value without the need to hard code 
it within the main source code.


As far as only including terms that are more popular than the word that is 
being checked, there is already a parameter "onlyMorePopular" that you can use 
to dynamically control this feature from the client side so it does not have to 
be hard coded within the spelling checker.

Review these parameter options on the wiki, but keep in mind I have not updated 
the wiki with my changes or the new parameter and result fields:
http://wiki.apache.org/solr/SpellCheckerRequestHandler

   Thanks Climbingrose,

 Scott Tabar




 climbingrose <[EMAIL PROTECTED]> wrote: 
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.

On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I've been so busy the last few days so I haven't replied to this email. I
> modified SpellCheckerHandler a while ago to include support for multiword
> query. To be honest, I didn't have time to write unit test for the code.
> However, I deployed it in a production environment and it has been working
> for me so far. My version, however, has two assumptions:
>
> 1) I assumpt that when user enter a misspelled multiword query, we should
> only check for words that are actually misspelled. For example, if user
> enter "life expectancy calculatar", which has "calculator" misspelled, we
> should only spellcheck "calculatar".
> 2) I only return the best string for a mispelled query.
>
> I guess I can just directly paste the code here so that others can adapt
> for their own purposes. If you have any question, just send me an email.
> I'll 

Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-11 Thread Martin Grotzke
Hi Daniel,

thanx for your suggestions; being able to export a large synonyms.txt
sounds very good!

Thx && cheers,
Martin


On Wed, 2007-10-10 at 23:38 +0200, Daniel Naber wrote:
> On Wednesday 10 October 2007 12:00, Martin Grotzke wrote:
> 
> > Basically I see two options: stemming and the usage of synonyms. Are
> > there others?
> 
> A large list of German words and their forms is available from a Windows 
> software called Morphy 
> (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). You can use 
> it for mapping fullforms to base forms (Häuser -> Haus).
> You can also have 
> a look at www.languagetool.org which uses this data in a Java software. 
> LanguageTool also comes with jWordSplitter, which can find a compound's 
> parts (Autowäsche -> Auto + Wäsche).
> 
> Regards
>  Daniel
> 





Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-11 Thread Thomas Traeger

Martin Grotzke schrieb:
>> Try the SnowballPorterFilterFactory with German2 as language attribute 
>> first and use synonyms for combined words i.e. "Herrenhose" => "Herren", 
>> "Hose".
>
> so you use a combined approach?

Yes, we define the relevant parts of compounded words (keywords only) as 
synonyms and feed them into a special field that is used for searching and 
for the product index. I hope there will be a filter that can split 
compounded words sometime in the future...

>> By using stemming you will maybe have some "interesting" results, but it 
>> is much better living with them than having no or much less results ;o)
>
> Do you have an example what "interesting" results I can expect, just to
> get an idea?
>
>> Find more infos on the Snowball stemming algorithms here:
>>
>> http://snowball.tartarus.org/
>
> Thanx! I also had a look at this site already, but what is missing is a
> demo where one can see what's happening. I think I'll play a little with
> stemming to get a feeling for this.

I think the Snowball stemmer is very good, so I have no practical example 
for you. Maybe this is of value to see what happens:

http://snowball.tartarus.org/algorithms/german/diffs.txt

If you have mixed languages in your content, which sometimes happens in 
product data, you might get into some trouble.

>> Also have a look at the StopFilterFactory, here is a sample stopwordlist 
>> for the german language:
>>
>> http://snowball.tartarus.org/algorithms/german/stop.txt
>
> Our application handles products, do you think such stopwords are useful
> in this scenario also? I wouldn't expect a user to search for "keine
> hose" or s.th. like this :)

I have seen much worse queries, so you never know ;o)

think of a query like this: "Hose in blau für Herren"

You will definitely want to remove "in" and "für" during searching, and it 
reduces index size when removed during indexing. Maybe you will even get 
better scores when only relevant terms are used. You should optimize the 
stopword list based on your data.
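
(For illustration, a schema.xml analyzer sketch combining the pieces discussed
in this thread -- the field type name and the stopword/synonym file names are
made up; synonyms_de.txt would hold entries like "herrenhose => herren, hose":)

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>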


Regards,

Tom



Re: Internal Server Error and waitSearcher="false" for commit/optimize

2007-10-11 Thread Yonik Seeley
On 10/10/07, Jason Rennie <[EMAIL PROTECTED]> wrote:
> We're using solr 1.2 and a nightly build of the solrj client code.  We very
> occasionally see things like this:
>
> org.apache.solr.client.solrj.SolrServerException: Error executing query
> at org.apache.solr.client.solrj.request.QueryRequest.process(
> QueryRequest.java:86)
> at org.apache.solr.client.solrj.impl.BaseSolrServer.query(
> BaseSolrServer.java:99)
> ...
> Caused by: org.apache.solr.common.SolrException: Internal Server Error

Is there a longer stack trace somewhere concerning the internal server error?

> We also occasionally see solr taking too long to respond.  We currently make
> our commit/optimize calls without any arguments.  I'm wondering whether
> setting waitSearcher="false" might allow search queries to be served while a
> commit/optimize is being run.  I found this in an old message from this
> list:

While commit/optimize is being run, requests are served using the old
searcher - there shouldn't be any blocking.

> Is waitSearcher="false" designed to
> allow queries to be processed while a commit/optimize is being run?

No, waitSearcher="true" was designed such that a client could do a
commit, and wait for a new searcher to be registered such that a new
query request is guaranteed to see the changes.
waitSearcher=true/false only affects the thread calling commit... it
has no effect on other query requests which will continue to use the
previous searcher until the new one is registered.
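
(For reference, a raw update message can set the flag directly -- a sketch,
not from Yonik's mail; whether your solrj build exposes an equivalent
commit(waitFlush, waitSearcher) overload depends on the version you use:)

<commit waitFlush="true" waitSearcher="false"/>
<optimize waitSearcher="false"/>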

-Yonik


Re: index size

2007-10-11 Thread Kevin Lewandowski
> To achieve this I have to keep the document field to "stored" right?

Yes, the field needs to be stored to return snippets.


> When I do this my index becomes huge 10 GB index, cause I have 10K
> docs but each is very lengthy HTML.  Is there any better solution?
> Why is index created by nutch so small in comparison (about 27 mb
> approx) but it still returns snippets!

Are you storing the complete html? If so, I think you should strip out
the html and then index the document.
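
(A minimal sketch of that idea, not from the original mail -- a real HTML
parser is more robust than regexes, but this shows the intent; the field
name below is made up:)

public static String stripHtml(String rawHtml) {
    return rawHtml
        .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")  // drop script/style blocks
        .replaceAll("(?s)<[^>]+>", " ")                           // drop remaining tags
        .replaceAll("\\s+", " ")                                  // collapse whitespace
        .trim();
}

// then index only the visible text, e.g.:
// doc.addField("content", stripHtml(rawHtml));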




>
> On 10/9/07, Kevin Lewandowski <[EMAIL PROTECTED]> wrote:
> > Late reply on this but I just wanted to say thanks for the
> > suggestions. I went through my whole schema and was storing things
> > that didn't need to be stored and indexing a lot of things that didn't
> > need to be indexed. Just completed a full reindex and it's a much more
> > reasonable size now.
> >
> > Kevin
> >
> > On 8/20/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> > >
> > > On 17-Aug-07, at 2:03 PM, Kevin Lewandowski wrote:
> > >
> > > > Are there any tips on reducing the index size or what factors most
> > > > impact index size?
> > > >
> > > > My index has 2.7 million documents and is 200 gigabytes and growing.
> > > > Most documents are around 2-3kb and there are about 30 indexed fields.
> > >
> > > An "ls -sh" will tell you roughly where the the space is being
> > > occupied.  There is something strange going on: 2.5kB * 2.7m is only
> > > 6GB, and I have trouble imagining where the 30-fold index size
> > > expansion is coming from.
> > >
> > > -Mike
> > >
> >
>


Re: Spell Check Handler

2007-10-11 Thread Matthew Runo
Where does the index come from in the first place? Do we have to  
enter the words, or are they entered as documents enter the SOLR index?


I'd love to be able to use my own documents as the spell check index  
of "correctly spelled words".


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 11, 2007, at 7:08 AM, <[EMAIL PROTECTED]>  
<[EMAIL PROTECTED]> wrote:



Climbingrose,

I think you make a valid point.  Each person may have a different  
concept of how something should work with their application.


My thought on the subject of spell checking multiple words:
  - the parameter "multiWords" enables spell checking on each word  
in "q" parameter instead of on the whole field
  - each word is then represented in its own entry in a list of all  
words that are checked
  - to identify each word that is being checked within that entry,  
it is identified by the key "words"
  - to identify if the word was found exactly as it is within the  
spell checker's index, the "exist" key contains this information
  - Since there can be suggestions for both misspelled words and  
words that are spelled correctly, the list of suggestions is also  
included for both correctly spelled and misspelled words, even if  
the suggestion list is empty.


  - My vision is that if a user has a search query of multiple  
words and they are wanting to perform a check on the words, the use  
of "multiWords" will check all words at one time, independently  
from each others and return the list.  The presenting web app can  
then identify visually to the user which words are misspelled and  
which ones have suggestions too.  The user can then work with the  
various lists of suggestions without having to re-hit Solr.   
Naturally, if the user manually changes a word, then Solr will have  
to be re-hit, but providing a single list of all words, including  
suggestions for correct words along with incorrect words, will help  
simplify applications (by reducing iterating over each word) and  
will help reduce the number of hits to the Solr server.



1) I assumpt that when user enter a misspelled multiword query, we  
should
only check for words that are actually misspelled. For example, if  
user
enter "life expectancy calculatar", which has "calculator"  
misspelled, we

should only spellcheck "calculatar".


I think I understand what you mean in the above statement, but you  
must admit, it does sound funny.  After all, how do you identify  
that a word is misspelled by NOT using the spelling checker?   
Correct me if I am wrong, but I think you intended to say that when  
a word is identified as being misspelled, then you should only  
include the suggestions for misspelled words.  If this is the case,  
then I would have to disagree with you.  The user may be interested  
in finding words that might mean the same, but are more popular  
(appears in more indexed documents within the Lucene index).  Hence  
the reason why I added the result field "exist" to identify that a  
word is spelled correctly even if there is a list of suggestions.   
Please note, the situation can exist too where a word is misspelled  
and there are no suggestions so one cannot use the suggestion list  
as an indicator to the correctness of the individual word(s).




2) I only return the best string for a mispelled query.


You can also use the parameter "suggestionCount=1" to control how  
many words are returned.  In this case, it will do what your code  
is doing, but still allow the client to dynamically change this  
value without the need to hard code it within the main source code.



As far as only including terms that are more popular than the word  
that is being checked, there is already a parameter  
"onlyMorePopular" that you can use to dynamically control this  
feature from the client side so it does not have to be hard coded  
within the spelling checker.


Review these parameter options on the wiki, but keep in mind I have  
not updated the wiki with my changes or the new parameter and  
result fields:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

   Thanks Climbingrose,

 Scott Tabar




 climbingrose <[EMAIL PROTECTED]> wrote:
Just to clarify this line of code:

String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
req.getSearcher().getReader(), restrictToField, true);

I only return suggestions if they are more popular than termText. You
probably need to use code in Scott's patch to make this behaviour
configurable.

On 10/11/07, climbingrose <[EMAIL PROTECTED]> wrote:


Hi all,

I've been so busy the last few days so I haven't replied to this  
email. I
modified SpellCheckerHandler a while ago to include support for  
multiword
query. To be honest, I didn't have time to write unit test for the  
code.
However, I deployed it in a production environment 

Re: WebException (ServerProtocolViolation) with SolrSharp

2007-10-11 Thread Jeff Rodenburg
Good to know, I think this needs to be a configurable value in the library
(overridable, at a minimum.)

What's outstanding for me on this is understanding the Solr side of the
equation, and whether culture variance comes into play.  What makes this
even more interesting/confusing is how culture scenarios may differ across
platforms.  I do most of my production work against a solr farm running on
RHEL4, but often do side development work against Win2K3.

Thanks for confirming the culture issue, this will make its way into the
source as a fix in the future.

cheers,
jeff



On 10/11/07, Filipe Correia <[EMAIL PROTECTED]> wrote:
>
> Jeff,
>
> Thanks! Your suggestion worked, instead of invoking ToString() on
> float values I've used ToString's other signature, which takes a an
> IFormatProvider:
>
> CultureInfo MyCulture = CultureInfo.InvariantCulture;
> this.Add(new IndexFieldValue("weight",
> weight.ToString(MyCulture.NumberFormat)));
> this.Add(new IndexFieldValue("price", price.ToString(
> MyCulture.NumberFormat)));
>
> This made me think on a related issue though. In this case it was the
> client that was using a non-invariant number format, but can this also
> happen on Solr's side? If so, I guess I may need to configure it
> somewhere...
>
> Cheers,
> Filipe Correia
>
> On 10/10/07, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:
> > Hi Felipe -
> >
> > The issue you're encountering is a problem with the data format being
> passed
> > to the solr server.  If you follow the stack trace that you posted,
> you'll
> > notice that the solr field is looking for a value that's a float, but
> the
> > passed value is "1,234".
> >
> > I'm guessing this is caused by one of two possibilities:
> >
> > (1) there's a typo in your example code, where "1,234" should actually
> be "
> > 1.234", or
> > (2) there's a culture settings difference on your server that's
> converting "
> > 1.234" to "1,234"
> >
> > Assuming it's the latter, add this line in the ExampleIndexDocument
> > constructor:
> >
> > CultureInfo MyCulture = new CultureInfo("en-US");
> >
> > Please let me know if this fixes the issue, I've been looking at this
> > previously and would like to confirm it.
> >
> > thanks,
> > jeff r.
> >
> >
> > On 10/10/07, Filipe Correia <[EMAIL PROTECTED]> wrote:
> > >
> > > Hello,
> > >
> > > I am trying to run SolrSharp's example application but am getting a
> > > WebException with a ServerProtocolViolation status message.
> > >
> > > After some debugging I found out this is happening with a call to:
> > > http://localhost:8080/solr/update/
> > >
> > > And using fiddler[1] found out that solr is actually throwing the
> > > following exception:
> > > org.apache.solr.core.SolrException: Error while creating field
> > >
> 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
> > > from value '1,234'
> > > at org.apache.solr.schema.FieldType.createField(FieldType.java
> > > :173)
> > > at org.apache.solr.schema.SchemaField.createField(
> SchemaField.java
> > > :94)
> > > at org.apache.solr.update.DocumentBuilder.addSingleField(
> > > DocumentBuilder.java:57)
> > > at org.apache.solr.update.DocumentBuilder.addField(
> > > DocumentBuilder.java:73)
> > > at org.apache.solr.update.DocumentBuilder.addField(
> > > DocumentBuilder.java:83)
> > > at org.apache.solr.update.DocumentBuilder.addField(
> > > DocumentBuilder.java:77)
> > > at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(
> > > XmlUpdateRequestHandler.java:339)
> > > at org.apache.solr.handler.XmlUpdateRequestHandler.update(
> > > XmlUpdateRequestHandler.java:162)
> > > at
> > > org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(
> > > XmlUpdateRequestHandler.java:84)
> > > at org.apache.solr.handler.RequestHandlerBase.handleRequest(
> > > RequestHandlerBase.java:77)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> > > at org.apache.solr.servlet.SolrDispatchFilter.execute(
> > > SolrDispatchFilter.java:191)
> > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> > > SolrDispatchFilter.java:159)
> > > at
> > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
> > > ApplicationFilterChain.java:235)
> > > at org.apache.catalina.core.ApplicationFilterChain.doFilter(
> > > ApplicationFilterChain.java:206)
> > > at org.apache.catalina.core.StandardWrapperValve.invoke(
> > > StandardWrapperValve.java:233)
> > > at org.apache.catalina.core.StandardContextValve.invoke(
> > > StandardContextValve.java:175)
> > > at org.apache.catalina.core.StandardHostValve.invoke(
> > > StandardHostValve.java:128)
> > > at org.apache.catalina.valves.ErrorReportValve.invoke(
> > > ErrorReportValve.java:102)
> > > at org.apache.catalina.core.StandardEngineValve.invoke(
> > > StandardEngineValve.java:109)
> > > at org.apache.catalina.connector.CoyoteAda

Re: Internal Server Error and waitSearcher="false" for commit/optimize

2007-10-11 Thread Jason Rennie
Many thanks for your reply, Yonik.

On 10/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> > Caused by: org.apache.solr.common.SolrException: Internal Server Error
>
> Is there a longer stack trace somewhere concerning the internal server
> error?


I shouldn't have bothered you with this.  We've discovered two causes: (1) a
malformed query string, and (2) an OOME.  I should have done more homework...

No, waitSearcher="true" was designed such that a client could do a
> commit, and wait for a new searcher to be registered such that a new
> query request is guaranteed to see the changes.
> waitSearcher=true/false only affects the thread calling commit... it
> has no effect on other query requests which will continue to use the
> previous searcher until the new one is registered.
>

Thanks for the explanation.  This is exactly what we needed to know.  Our
query threads are separate from the commit/optimize thread, so this option
would not affect operations.

In case you're curious, we use solr as the search engine for
www.stylefeeder.com.  It has served us very well so far, handling over 3000
queries/day.

Thanks,

Jason

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/


Re: showing results per facet-value efficiently

2007-10-11 Thread Mike Klaas

On 10-Oct-07, at 4:16 AM, Britske wrote:



However, I realized that for calculating the count for each of the
facetvalues, the original standardrequesthandler already loops the  
doclist
to check for matches. Therefore my implementation actually does  
double work,

since it gets doclists for each of the facetvalues again.


Well, not quite.  Once you get into the faceting code, everything is
in terms of DocSets, which are unordered collections of doc ids.
Also, faceting employs efficient algorithms for counting the
cardinality of intersections without actually materializing them,
which is another difficulty in reusing the code.



My question:
is there a way to get to the already calculated doclist per  
facetvalue from

a subclassed StandardRequestHandler, and so get a nice speedup?  This
facet-falculation seems to go deep into the core of Solr
(SimpleFacets.getFacetTermEnumCounts) and seems not very sensible  
to alter

for just this requirement. opinions appreciated.


Solr never really materializes much of the DocList for a query-- 
almost all docs are dropped as soon as it is clear that they are not  
in the top N results.


It should be possible to produce an approximation which is more  
efficient, like collecting the DocList for the top 1000 docs,  
converting it to a DocSet, find the set intersections (instead of  
using SimpleFacets), and re-order the resulting sets in terms of the  
original DocList.


It would take a bit of work to implement, however.
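
(A rough sketch of that approach inside a custom handler, reusing the variable
names from the code earlier in this thread ("s", "query", "sort", "req");
treat the exact SolrIndexSearcher/HashDocSet signatures as approximate rather
than a drop-in implementation:)

// 1. Ordered top-N docs for the base query (the approximation window).
DocList top = s.getDocList(query, (DocSet) null, sort, 0, 1000);

// 2. Convert the ordered DocList into an unordered DocSet once.
int[] ids = new int[top.size()];
DocIterator di = top.iterator();
for (int i = 0; di.hasNext(); i++) ids[i] = di.nextDoc();
DocSet topSet = new HashDocSet(ids, 0, ids.length);

// 3. Per facet value: intersect with the cached filter DocSet instead of
//    running a whole new query.
Query facetq = QueryParsing.parseQuery("name:A", req.getSchema()); // example value
DocSet hitsForValue = topSet.intersection(s.getDocSet(facetq));    // getDocSet uses the filterCache

// 4. Re-order by walking the original DocList, which preserves the sort order.
List<Integer> orderedTopForValue = new ArrayList<Integer>();
DocIterator it = top.iterator();
while (it.hasNext()) {
    int id = it.nextDoc();
    if (hitsForValue.exists(id)) orderedTopForValue.add(id);
}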



As a last and somewhat related question:
is there a way to explicity specify facet-values that I want to  
include in
the faceting without (ab)using Q? This is  relevant for me since  
the perfect
solution would be to have the ability to orthogonally get multiple  
toplists
in 1 query. Given the current implementation, this orthoganality is  
now
'corrupted' as injection of a fieldvalue in Q for one facetfield  
influences

the outcome of another facetfield.


I'm not quite sure what you are asking here.  You can specify  
arbitrary facet values using facet.query or facet.prefix.  If you  
want to facet multiple doclists from different queries in one  
request, just write your own request handler that takes a multi- 
valued q param and facets on each.
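
(For example -- an illustrative request only; facet.query gives a count for
each explicitly listed value without touching q, though for the top-N docs
per value you would still fetch a DocList per value as in your handler:)

select?qt=toplist
      &q=*:*
      &sort=sortfield asc
      &rows=0
      &facet=true
      &facet.query=name:A
      &facet.query=name:B
      &facet.query=name:C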


I didn't answer all the questions in your email, but I hope this  
clarifies things a bit.  Good luck!


-Mike


Re: showing results per facet-value efficiently

2007-10-11 Thread Britske

yup that clarifies things a lot, thanks.



Mike Klaas wrote:
> 
> On 10-Oct-07, at 4:16 AM, Britske wrote:
> 
>>
>> However, I realized that for calculating the count for each of the
>> facetvalues, the original standardrequesthandler already loops the  
>> doclist
>> to check for matches. Therefore my implementation actually does  
>> double work,
>> since it gets doclists for each of the facetvalues again.
> 
> Well, not quite.  Once you get into the faceting code, everything is  
> in terms of DocSets, which are undordered collections of doc ids.   
> Also, faceting employs efficient algorithms for counting the  
> cardinality of intersections without actually materializing them,  
> which is another difficulty to reusing the code.
> 
>> My question:
>> is there a way to get to the already calculated doclist per  
>> facetvalue from
>> a subclassed StandardRequestHandler, and so get a nice speedup?  This
>> facet-falculation seems to go deep into the core of Solr
>> (SimpleFacets.getFacetTermEnumCounts) and seems not very sensible  
>> to alter
>> for just this requirement. opinions appreciated.
> 
> Solr never really materializes much of the DocList for a query-- 
> almost all docs are dropped as soon as it is clear that they are not  
> in the top N results.
> 
> It should be possible to produce an approximation which is more  
> efficient, like collecting the DocList for the top 1000 docs,  
> converting it to a DocSet, find the set intersections (instead of  
> using SimpleFacets), and re-order the resulting sets in terms of the  
> original DocList.
> 
> It would take a bit of work to implement, however.
> 
>>
>> As a last and somewhat related question:
>> is there a way to explicity specify facet-values that I want to  
>> include in
>> the faceting without (ab)using Q? This is  relevant for me since  
>> the perfect
>> solution would be to have the ability to orthogonally get multiple  
>> toplists
>> in 1 query. Given the current implementation, this orthoganality is  
>> now
>> 'corrupted' as injection of a fieldvalue in Q for one facetfield  
>> influences
>> the outcome of another facetfield.
> 
> I'm not quite sure what you are asking here.  You can specify  
> arbitrary facet values using facet.query or facet.prefix.  If you  
> want to facet multiple doclists from different queries in one  
> request, just write your own request handler that takes a multi- 
> valued q param and facets on each.
> 
> I didn't answer all the questions in your email, but I hope this  
> clarifies things a bit.  Good luck!
> 
> -Mike
> 
> 

-- 
View this message in context: 
http://www.nabble.com/showing-results-per-facet-value-efficiently-tf4600154.html#a13163943
Sent from the Solr - User mailing list archive at Nabble.com.



quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-11 Thread Britske

say I have the following (partial)
querystring:...&facet=true&facet.field=country
field 'country' is not tokenized, not multi-valued, and not boolean, so the
field-cache approach is used.

Moreover, the following (partial) querystring is used as well:
..fq=country:france

do these queries share cached items in the fieldcache? (in this example:
country:france) or do they somehow live as separate entities in the cache?
The latter would explain my fieldcache having evictions at the moment.

Geert-Jan



-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13164249
Sent from the Solr - User mailing list archive at Nabble.com.



doubled/halved performance?

2007-10-11 Thread Mike Klaas
I'm seeing some interesting behaviour when doing benchmarks of query  
and facet performance.  Note that the query cache is disabled, and  
the index is entirely in the OS disk cache.  filterCache is fully  
primed.


Often when repeatedly measuring the same query, I'll see pretty  
consistent results (within a few ms), but occasionally a run which is  
almost exactly half the time:


240ms vs. 120ms:

solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 239
solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 120
solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 237
solr: DEBUGINFO: /select/ facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dismax&version=2.2&rows=1 0 238


The strange thing is that the execution time is halved across _all_  
parts of query processing:


101.0   total time
  1.0   setup/query parsing
  68.0  main query
  30.0  faceting
  0.0   pre fetch
  2.0   debug

201.0   total time
  1.0   setup/query parsing
  138.0 main query
  58.0  faceting
  0.0   pre fetch
  4.0   debug

I can't really think of a plausible explanation.  Fortuitous  
instruction pipelining?  It is hard to imagine a cause that wouldn't  
exhibit consistency.


-Mike


Re: doubled/halved performance?

2007-10-11 Thread Yonik Seeley
On 10/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> I'm seeing some interesting behaviour when doing benchmarks of query
> and facet performance.  Note that the query cache is disabled, and
> the index is entirely in the OS disk cache.  filterCache is fully
> primed.
>
> Often when repeatedly measuring the same query, I'll see pretty
> consistent results (within a few ms), but occasionally a run which is
> almost exactly half the time:
>
> 240ms vs. 120ms:
>
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 239
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 237
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 120
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 120
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 237
> solr: DEBUGINFO: /select/
> facet=true&debugQuery=true&indent=on&start=0&q=www&facet.field=t&qt=dism
> ax&version=2.2&rows=1 0 238
>
> The strange thing is that the execution time is halved across _all_
> parts of query processing:
>
> 101.0   total time
>1.0  setup/query parsing
>68.0 main query
>30.0 faceting
>0.0  pre fetch
>2.0  debug
>
> 201.0   total time
>1.0  setup/query parsing
>138.0main query
>58.0 faceting
>0.0  pre fetch
>4.0  debug
>
> I can't really think of a plausible explanation.  Fortuitous
> instruction pipelining?  It is hard to imagine a cause that wouldn't
> exhibit consistency.

So the queries are one at a time, the index isn't changing, and
nothing else happening in the system?
It would be easier to explain an occasional long query than an
occasional short one.
It's weird how the granularity seems to be on the basis of a request
(if the speedup sometimes happened half way through, then you'd get an
average of the times).

You could try -Xbatch to see if it's hotspot somehow, but I doubt that's it.

-Yonik


Instant deletes without committing

2007-10-11 Thread BrendanD

Hi,

Is it possible to send a command to have Solr flush the deletesPending
documents without doing a commit? I know there's a setting in solrconfig.xml
for setting a threshold value, but I'd like to somehow kick it off on
demand. We need this to be able to remove merchants from our product
listings if their accounts run out of funds.

Thanks,

Brendan
-- 
View this message in context: 
http://www.nabble.com/Instant-deletes-without-committing-tf4610169.html#a13165413
Sent from the Solr - User mailing list archive at Nabble.com.



Re: doubled/halved performance?

2007-10-11 Thread Mike Klaas

On 11-Oct-07, at 2:37 PM, Yonik Seeley wrote:


On 10/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:

I'm seeing some interesting behaviour when doing benchmarks of query
and facet performance.  Note that the query cache is disabled, and
the index is entirely in the OS disk cache.  filterCache is fully
primed.

Often when repeatedly measuring the same query, I'll see pretty
consistent results (within a few ms), but occasionally a run which is
almost exactly half the time:

I can't really think of a plausible explanation.  Fortuitous
instruction pipelining?  It is hard to imagine a cause that wouldn't
exhibit consistency.


So the queries are one at a time, the index isn't changing, and
nothing else happening in the system?


The index is completely static, and I'm just sending the same query  
sequentially.  I provoke a slowdown by sending a few requests in  
parallel.  There is the odd bit of db activity on the system, but it  
is rare and covered by the other processor regardless.



It would be easier to explain an occasional long query than an
occasional short one.
It's weird how the granularity seems to be on the basis of a request
(if the speedup sometimes happened half way through, then you'd get an
average of the times).


Indeed, this is what surprised me the most.

You could try -Xbatch to see if it's hotspot somehow, but I doubt  
that's it.


With -Xbatch everything runs more slowly, but the 2x effect is still
present.


Anyway, it is strange but not really a problem.  This is on a dev box  
only; I might run a similar test on a production box sometime.


Thanks,
-Mike


Re: Instant deletes without committing

2007-10-11 Thread Mike Klaas

On 11-Oct-07, at 2:47 PM, BrendanD wrote:



Hi,

Is it possible to send a command to have Solr flush the deletesPending
documents without doing a commit? I know there's a setting in  
solrconfig.xml

for setting a threshold value, but I'd like to somehow kick it off on
demand. We need this to be able to remove merchants from our product
listings if their accounts run out of funds.


 means "make my edits visible now".  Can you explain why  
this isn't exactly what you need?  Are there performance problems?


The flushing deletes parameter is a performance knob only: it doesn't
actually make the deletes visible until <commit/>.
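
For example, posting something like the following two messages to the
update handler (the merchant id here is made up) removes the documents
and then makes the change visible:

  <delete><query>merchant_id:12345</query></delete>
  <commit/>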


-Mike


Fwd: solr, snippets and stored field in nutch...

2007-10-11 Thread Ravish Bhagdev
Hey guys,

Checkout this thread I opened on nutch mailing list.  Looks like Solr
can benefit from reusing Nutch's "segment" based storage strategy for
efficiency in returning snippets, summaries etc without using Lucene
stored fields?

Was this considered before?

Ravish

-- Forwarded message --
From: Dennis Kubes <[EMAIL PROTECTED]>
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]


The reason it is stored in the segments instead of the index is to allow
summarizers to be run on the content of hits to produce the summaries
that appear in the search results.  Summarizers are pluggable and the
actual content used to produce the summary can change.  And summaries
can be changed without re-fetching or re-indexing.  If a summary were
stored in the index, re-indexing would have to occur to make changes.

Also the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped and the best x
number (usually 10) returned.  For only these 10 best hits, hit details
(fields in the index) and summaries are retrieved.  So there is
something to be said about the amount of data being pushed over the network.

Dennis Kubes

Ravish Bhagdev wrote:
> Ah, I see, didn't know that, Thanks!
>
> Interesting that nutch stores it in a different structure (segments)
> and doesn't reuse Lucene strategy of storing within index.  Any
> particular reason why?  Is there any other use of "Segments" data
> structure except to return snippets?
>
> Cheers,
> Ravish
>
> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
>> Hi Ravish.
>>
>> You are correct that Nutch does not store document content in the
>> Lucene index. The content *is* stored in the Nutch segment, which is
>> where snippets come from.
>>
>> Hope this helps.
>>
>> -J
>>
>>
>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
>>
>>> Hey All,
>>>
>>> Am I right in believing that in Lucene/Nutch, to be able to return
>>> content or snippet to a search query, the field to be returned has to
>>> be stored?
>>>
>>> AFAIK, by default, Nutch dose not store the document field, am I
>>> right?  If so, how does it manage to return snippets?  Wouldn't the
>>> index be quite huge if nutch were storing document field by default?
>>>
>>> I will appreciate any help/comments as I'm bit lost with this.
>>>
>>> Ravi
>>


Re: Instant deletes without committing

2007-10-11 Thread BrendanD

Yes, we have some huge performance issues with non-cached queries. So doing a
commit is very expensive for us. We have our autowarm count for our
filterCache and queryResultCache both set to 4096. But I don't think that's
near high enough. We did have it as high as 16384 before, but it took over
an hour to warm. Some of our queries take 30-60 seconds to complete if
they're not cached.

For example:
Oct 11, 2007 4:51:50 PM org.apache.solr.core.SolrCore execute
INFO: /select/
rows=20&start=40&facet.query=attribute_id:113&facet.query=attribute_id:1005336&facet.query=attribute_id:1005697&facet.query=attribute_id:1006192&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001294"&fq=attribute_id_value_en_pair:"1005697\:CCM"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1
0 48061

As you can see, that query took 48 seconds. Usually after a commit and the
caches are flushed, we get a lot of queries that take a long time. It
usually brings our site to its knees for about 15 minutes until it catches
up.

If I could commit more often, that would be ideal, but unfortunately that's
not possible at the moment.

Does anyone know how DieselPoint can do near real-time updates to their
index? They must be using a different technique, obviously. Still, we did
consider using DieselPoint before finding out about Solr. However,
DieselPoint is way too expensive for us at this point in time.

I'm sure we're doing lots wrong with our schema and queries to make them so
slow. We do have over 3 million documents in our index.

Thanks,

Brendan


Mike Klaas wrote:
> 
> On 11-Oct-07, at 2:47 PM, BrendanD wrote:
> 
>>
>> Hi,
>>
>> Is it possible to send a command to have Solr flush the deletesPending
>> documents without doing a commit? I know there's a setting in  
>> solrconfig.xml
>> for setting a threshold value, but I'd like to somehow kick it off on
>> demand. We need this to be able to remove merchants from our product
>> listings if their accounts run out of funds.
> 
>  means "make my edits visible now".  Can you explain why  
> this isn't exactly what you need?  Are there performance problems?
> 
> The flushing deletes parameter is a performance knob only: it doesn't  
> actually make the deletes visible until 
> 
> -Mike
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Instant-deletes-without-committing-tf4610169.html#a13166522
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr, snippets and stored field in nutch...

2007-10-11 Thread Mike Klaas
First, it should be noted that I am not an expert in Nutch's
architecture.  I do think I understand what is being said there, however.


Nutch is a distributed web search engine, and uses Lucene as an
indexing component.  It is free to use external data structures to
store data, and can store the index on a different machine than the
contents are stored.  They can be updated independently.


One reason why this is more efficient is that in a distributed  
architecture, more documents are retrieved over the system than are  
eventually summarized and output.  It makes no sense to shovel around  
the contents of all these documents if summaries are only being  
returned for the top 10 over the whole system.


But Nutch is still storing the contents _somewhere_.  They haven't  
found a magical technique that makes this need disappear.


So, does an external store make sense for Solr? Well, unlike Nutch,  
Solr is a solitary unit.  If you ask for 10 docs returned, with  
summaries, all of their contents are going to have to be retrieved.   
There aren't any advantages to storing the contents in a separate  
data structure (which will be the same size).


Now, if you are using Solr in a large-scale distributed federated  
way, then you can replicate Nutch's strategy by storing the index in  
one Solr index, and the contents in another.  This could also yield  
benefits in a single-machine context if your code accesses many more
documents than it wants summarized.


Keep in mind also that Solr has facilities to help you manage the  
size of the content store.  Are you stripping your contents to their  
bare minima (removing HTML, etc)?  Are you using a compressed text  
field (highly recommended for this kind of data)?
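
(For reference, and from memory, a stored field can be marked as
compressed in schema.xml with something like the following -- the
field name and type here are just placeholders:

  <field name="content" type="text" indexed="true" stored="true" compressed="true"/>

which trades a bit of CPU at index/retrieval time for a much smaller
stored-field footprint.)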


Believe me, if I found that there was a way of providing summaries  
without storing doc contents, I would pee my pants with happiness and  
it would be in Solr faster than you can say "diaper".


cheers,
-Mike

On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:


Hey guys,

Checkout this thread I opened on nutch mailing list.  Looks like Solr
can benefit from reusing Nutch's "segment" based storage strategy for
efficiency in returning snippets, summaries etc without using Lucene
stored fields?

Was this considered before?

Ravish

-- Forwarded message --
From: Dennis Kubes <[EMAIL PROTECTED]>
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]


The reason it is stored in the segments instead of index to allow
summarizers to be run on the content of hits to produce the summaries
that appear in the search results.  Summarizers are pluggable and the
actual content used to produce the summary can change.  And summaries
can be changed without re-fetching or re-indexing.  If a summary were
stored in the index, re-indexing would have to occur to make changes.

Also the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped and the best x
number (usually 10) returned.  For only these 10 best hits, hit  
details

(fields in the index) and summaries are retrieved.  So there is
something to be said about the amount of data being pushed over the  
network.


Dennis Kubes

Ravish Bhagdev wrote:

Ah, I see, didn't know that, Thanks!

Interesting that nutch stores it in a different structure (segments)
and doesn't reuse Lucene strategy of storing within index.  Any
particular reason why?  Is there any other use of "Segments" data
structure except to return snippets?

Cheers,
Ravish

On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:

Hi Ravish.

You are correct that Nutch does not store document content in the
Lucene index. The content *is* stored in the Nutch segment, which is
where snippets come from.

Hope this helps.

-J


On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:


Hey All,

Am I right in believing that in Lucene/Nutch, to be able to return
content or snippet to a search query, the field to be returned  
has to

be stored?

AFAIK, by default, Nutch dose not store the document field, am I
right?  If so, how does it manage to return snippets?  Wouldn't the
index be quite huge if nutch were storing document field by  
default?


I will appreciate any help/comments as I'm bit lost with this.

Ravi






Field name filter

2007-10-11 Thread Debra

When searching data, I need to process field names similarly to how terms
are processed:
lower-case the field name so it is not case sensitive (all field
names are lower case when indexed), and
have "synonyms" for field names. For example, if the user types article:abc or
types content:abc, in both cases it would search the article field for abc.
How do you suggest I go about it?
-- 
View this message in context: 
http://www.nabble.com/Field-name-filter-tf4610551.html#a13166568
Sent from the Solr - User mailing list archive at Nabble.com.



Add fields to query when processing

2007-10-11 Thread Debra

How can I add a field name to the query dynamically?
Example: If the user types "in stock", replace it with "quantity:[1 TO *]"
-- 
View this message in context: 
http://www.nabble.com/Add-fields-to-query-when-processing-tf4610578.html#a1317
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr, snippets and stored field in nutch...

2007-10-11 Thread Ravish Bhagdev
Hi Mike,

Thanks for your reply :)

I am not an expert in either! But I understand that Nutch stores
contents, albeit in a separate data structure (the segment, as
discussed in the thread); what I meant was that this seems like a
much more efficient way of presenting summaries or snippets (of course
for apps that need these only) than using a stored field, which is the only
option in Solr -- not only resulting in a huge index size but reducing
speed of retrieval because of this increase in size (this is
admittedly a guess; I would like to know if that's not the case).  Also, for
queries only requesting ids/urls, the segments would never be touched
even for the first n results...

Cheers.
Ravish

On 10/12/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> First, it should be noted that I am not an expert in Nutch's
> architure.  I do think I understand what is being said there, however.
>
> Nutch is a distributed web search engine, and uses lucene as a
> indexing component.  It is free to use external data structures to
> store data, and can store the index on a different machine than the
> contents are stored.  They can be updated independently.
>
> One reason why this is more efficient is that in a distributed
> architecture, more documents are retrieved over the system than are
> eventually summarized and output.  It makes no sense to shovel around
> the contents of all these documents if summaries are only being
> returned for the top 10 over the whole system.
>
> But Nutch is still storing the contents _somewhere_.  They haven't
> found a magical technique that makes this need disappear.
>
> So, does an external store make sense for Solr? Well, unlike Nutch,
> Solr is a solitary unit.  If you ask for 10 docs returned, with
> summaries, all of their contents are going to have to be retrieved.
> There aren't any advantages to storing the contents in a separate
> data structure (which will be the same size).
>
> Now, if you are using Solr in a large-scale distributed federated
> way, then you can replicate Nutch's strategy by storing the index in
> one Solr index, and the contents in another.  This could also yield
> benefits in a single-machine context if your code access many more
> documents than it wants summarized.
>
> Keep in mind also that Solr has facilities to help you manage the
> size of the content store.  Are you stripping your contents to their
> bare minima (removing HTML, etc)?  Are you using a compressed text
> field (highly recommended for this kind of data)?
>
> Believe me, if I found that there was a way of providing summaries
> without storing doc contents, I would pee my pants with happiness and
> it would be in Solr faster than you can say "diaper".
>
> cheers,
> -Mike
>
> On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:
>
> > Hey guys,
> >
> > Checkout this thread I opened on nutch mailing list.  Looks like Solr
> > can benefit from reusing Nutch's "segment" based storage strategy for
> > efficiency in returning snippets, summaries etc without using Lucene
> > stored fields?
> >
> > Was this considered before?
> >
> > Ravish
> >
> > -- Forwarded message --
> > From: Dennis Kubes <[EMAIL PROTECTED]>
> > Date: Oct 11, 2007 11:27 PM
> > Subject: Re: snippets and stored field in nutch...
> > To: [EMAIL PROTECTED]
> >
> >
> > The reason it is stored in the segments instead of index to allow
> > summarizers to be run on the content of hits to produce the summaries
> > that appear in the search results.  Summarizers are pluggable and the
> > actual content used to produce the summary can change.  And summaries
> > can be changed without re-fetching or re-indexing.  If a summary were
> > stored in the index, re-indexing would have to occur to make changes.
> >
> > Also the way the search process works, Nutch returns hits (basically
> > document ids).  These hits are then sorted and deduped and the best x
> > number (usually 10) returned.  For only these 10 best hits, hit
> > details
> > (fields in the index) and summaries are retrieved.  So there is
> > something to be said about the amount of data being pushed over the
> > network.
> >
> > Dennis Kubes
> >
> > Ravish Bhagdev wrote:
> >> Ah, I see, didn't know that, Thanks!
> >>
> >> Interesting that nutch stores it in a different structure (segments)
> >> and doesn't reuse Lucene strategy of storing within index.  Any
> >> particular reason why?  Is there any other use of "Segments" data
> >> structure except to return snippets?
> >>
> >> Cheers,
> >> Ravish
> >>
> >> On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:
> >>> Hi Ravish.
> >>>
> >>> You are correct that Nutch does not store document content in the
> >>> Lucene index. The content *is* stored in the Nutch segment, which is
> >>> where snippets come from.
> >>>
> >>> Hope this helps.
> >>>
> >>> -J
> >>>
> >>>
> >>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
> >>>
>  Hey All,
> 
>  Am I right in believing that in Lucene/Nutch, to be able to return
>  conte

Re: solr, snippets and stored field in nutch...

2007-10-11 Thread Mike Klaas

On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:


Hi Mike,

Thanks for your reply :)

I am not an expert of either! But, I understand that Nutch stores
contents albeit in a separate data structure (they call segment as
discussed in the thread), but what I meant was that this seems like
much more efficient way of presenting summaries or snippets (of course
for apps that need these only) than using a stored field which is only
option in solr -  not only resulting in a huge index size but reducing
speed of retrieval because of this increase in size (this is
admittedly a guess, would like to know if not the case).  Also for
queries only requesting ids/urls, the segments would never be touched
even for first n results...


It doesn't slow down querying, but it does slow down document
retrieval (if you are never going to request the summaries for those
documents).  That is the case I was referring to below.


One option that has been kicked around is to have solr support  
dividing the stored fields into multiple lucene indices.  This would  
accomplish the same result as running two Solr servers for the  
purpose, but would be quite complicated to implement.


I could be wrong, though.  Feel free to give it a shot!

-Mike



Cheers.
Ravish

On 10/12/07, Mike Klaas <[EMAIL PROTECTED]> wrote:

First, it should be noted that I am not an expert in Nutch's
architure.  I do think I understand what is being said there,  
however.


Nutch is a distributed web search engine, and uses lucene as a
indexing component.  It is free to use external data structures to
store data, and can store the index on a different machine than the
contents are stored.  They can be updated independently.

One reason why this is more efficient is that in a distributed
architecture, more documents are retrieved over the system than are
eventually summarized and output.  It makes no sense to shovel around
the contents of all these documents if summaries are only being
returned for the top 10 over the whole system.

But Nutch is still storing the contents _somewhere_.  They haven't
found a magical technique that makes this need disappear.

So, does an external store make sense for Solr? Well, unlike Nutch,
Solr is a solitary unit.  If you ask for 10 docs returned, with
summaries, all of their contents are going to have to be retrieved.
There aren't any advantages to storing the contents in a separate
data structure (which will be the same size).

Now, if you are using Solr in a large-scale distributed federated
way, then you can replicate Nutch's strategy by storing the index in
one Solr index, and the contents in another.  This could also yield
benefits in a single-machine context if your code access many more
documents than it wants summarized.

Keep in mind also that Solr has facilities to help you manage the
size of the content store.  Are you stripping your contents to their
bare minima (removing HTML, etc)?  Are you using a compressed text
field (highly recommended for this kind of data)?

Believe me, if I found that there was a way of providing summaries
without storing doc contents, I would pee my pants with happiness and
it would be in Solr faster than you can say "diaper".

cheers,
-Mike

On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:


Hey guys,

Checkout this thread I opened on nutch mailing list.  Looks like  
Solr
can benefit from reusing Nutch's "segment" based storage strategy  
for

efficiency in returning snippets, summaries etc without using Lucene
stored fields?

Was this considered before?

Ravish

-- Forwarded message --
From: Dennis Kubes <[EMAIL PROTECTED]>
Date: Oct 11, 2007 11:27 PM
Subject: Re: snippets and stored field in nutch...
To: [EMAIL PROTECTED]


The reason it is stored in the segments instead of index to allow
summarizers to be run on the content of hits to produce the  
summaries
that appear in the search results.  Summarizers are pluggable and  
the
actual content used to produce the summary can change.  And  
summaries
can be changed without re-fetching or re-indexing.  If a summary  
were
stored in the index, re-indexing would have to occur to make  
changes.


Also the way the search process works, Nutch returns hits (basically
document ids).  These hits are then sorted and deduped and the  
best x

number (usually 10) returned.  For only these 10 best hits, hit
details
(fields in the index) and summaries are retrieved.  So there is
something to be said about the amount of data being pushed over the
network.

Dennis Kubes

Ravish Bhagdev wrote:

Ah, I see, didn't know that, Thanks!

Interesting that nutch stores it in a different structure  
(segments)

and doesn't reuse Lucene strategy of storing within index.  Any
particular reason why?  Is there any other use of "Segments" data
structure except to return snippets?

Cheers,
Ravish

On 10/11/07, John H. Lee <[EMAIL PROTECTED]> wrote:

Hi Ravish.

You are correct that Nutch does not store document content in the
Lucene index. The 

Re: quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-11 Thread Chris Hostetter

: ..fq=country:france
: 
: do these queries share cached items in the fieldcache? (in this example:
: country:france) or do they somehow live as seperate entities in the cache?
: The latter would explain my fieldcache having evictions at the moment.

FieldCache can't have evictions.  It's a really low level "cache" where
the key is a field name and the value is an array containing a value for
every document (you can think of it as an inverted-inverted-index) that
Lucene maintains directly.  Items are never removed; they just get garbage
collected when the IndexReader is no longer used.  It's primarily for
sorting, but the SimpleFacets code also leverages it for facets in some
cases -- Solr has no way of showing you what's in the FieldCache, because
Lucene doesn't expose any inspection APIs to query it (it's a Heisenberg
cache .. once you ask if something is in it, it's in it).
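
(Roughly speaking, in Lucene 2.x terms, what the FieldCache holds for a
field is just an uninverted array -- illustration only, with 'reader'
being whatever IndexReader is open:

  String[] countryByDoc = FieldCache.DEFAULT.getStrings(reader, "country");
  // countryByDoc[docId] == "france", "germany", ... one value per document

so there is one slot per document and nothing that could be evicted.)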

Are you referring to the "filterCache"?

The filterCache contains records whose key is a "query" and whose value is a
DocSet (an unordered collection of all docs matching a query) ... it's
used whenever you use an "fq" param, for faceting on some fields (when the
TermEnum method is used, a filterCache entry is added for each term
tested), and even for some sorted queries if the
<useFilterForSortedQuery> config option is set to true.

The easiest way to know whether your faceting is using the FieldCache is
to start your server cold (no newSearcher warming) and then send it a
simple query with a single facet.field.  Depending on the query, you might
get 0 or 1 entries in the filterCache if SimpleFacets is using the
FieldCache -- but if it's using the TermEnums, and generating a DocSet per
term, you'll see *lots* of inserts into the filterCache.



-Hoss


query syntax performance difference?

2007-10-11 Thread BrendanD

Hi,

Is there a difference in the performance for the following 2 variations on
query syntax? The first query below is from Solr's response when using a single fq
parameter in the URL. The second is from the response when using separate fq
parameters in the URL, one for each field.


product_is_active:true AND product_status_code:complete AND
category_id:"1001570" AND attribute_id_value_en_pair:"1005758\:Elvis
Presley"


vs:

   product_is_active:true
   product_status_code:complete
   category_id:"1001570"
   attribute_id_value_en_pair:"1005758\:Elvis Presley"


I'm just wondering if the queries get executed differently and whether it's
better to split out each individual query into its own statement or combine
them using the AND operator.

I've tested them against our production server, but I didn't want to clear
the cache and compare the results. They've already been cached, so they come
back fairly quickly (within 1ms). Although originally the first query came
back in 660 ms. The second query had already been cached.


Thanks,

Brendan


-- 
View this message in context: 
http://www.nabble.com/query-syntax-performance-difference--tf4610975.html#a13167830
Sent from the Solr - User mailing list archive at Nabble.com.



autowarm static queries

2007-10-11 Thread BrendanD

Hi,

I have the following query that I've found in my production logs:

INFO: /select/
rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=category_id&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001644"&fq=attribute_id_value_en_pair:"1005585\:Wedding"&fq=merchant_id:"1000156"&qt=sti_dismax_en
0 40734

It took almost 41 seconds to run. ick!

I'd like to pre-warm that query in the solrconfig.xml, but I'd like to leave
off a parameter or two. For example:

rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=category_id&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001644"&qt=sti_dismax_en

I've taken off the attribute_id_value_en_pair and merchant_id fields.

The query without those two parameters is never actually executed in
production. Will it help the performance of the full query at all if I cache
the partial query when a newSearcher is created?

The reason I ask is that the attribute_id_value_en_pair and merchant_id are
dynamically added to the query at runtime based on results returned from our
database. I can easily generate a static list of queries for each unique
category_id and put them into solrconfig.xml in the newSearcher
listener area. I'm just not sure if it will help at all if Solr uses the
entire set of fields as a key into the queryResultCache.

Thanks,

Brendan





-- 
View this message in context: 
http://www.nabble.com/autowarm-static-queries-tf4611014.html#a13167933
Sent from the Solr - User mailing list archive at Nabble.com.



Re: autowarm static queries

2007-10-11 Thread Mike Klaas

On 11-Oct-07, at 6:47 PM, BrendanD wrote:



Hi,

I have the following query that I've found in my production logs:

INFO: /select/
rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=cat 
egory_id&fq=product_is_active:true&fq=product_status_code:complete&fq= 
category_id:"1001644"&fq=attribute_id_value_en_pair:"1005585 
\:Wedding"&fq=merchant_id:"1000156"&qt=sti_dismax_en

0 40734

It took almost 41 seconds to run. ick!

I'd like to pre-warm that query in the solrconfig.xml, but I'd like  
to leave

off a parameter or two. For example:

rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=cat 
egory_id&fq=product_is_active:true&fq=product_status_code:complete&fq= 
category_id:"1001644"&qt=sti_dismax_en


This time is almost certainly being spent in faceting on  
category_id.  In that case, any facet query on that field will warm  
the cache just fine.  (Since you don't care about the results, use  
facet.limit=1, not unlimited).  It'll also warm all the fq  
parameters, which can be shared by future queries.
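
For example (untested, and the field/param names are simply taken from
your log line), a newSearcher entry in solrconfig.xml along these lines
would do that warming automatically on each commit:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">product_is_active:true</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.field">category_id</str>
        <str name="facet.limit">1</str>
      </lst>
    </arr>
  </listener>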


Any reason why you are faceting on a field that you are restricting?
Clearly, the answer will be '1001644' --> numFound, (all other
categories) -> 0.  Just use numFound.


Also, if there can only be one category per doc, make sure you are  
using the fieldCache method for category_id.


-Mike




Re: getting number of stored documents via rest api

2007-10-11 Thread Erik Hatcher
Another route to getting the number of documents is to get it from  
the LukeRequestHandler:


	http://localhost:8983/solr/admin/luke?numTerms=0  (numTerms=0 to get  
the fastest response possible)


  Erik





On Oct 10, 2007, at 10:19 AM, Stefan Rinner wrote:


Hi

for some tests I need to know how many documents are stored in the  
index - is there a fast & easy way to retrieve this number (instead  
of searching for "*:*" and counting the results)?
I already took a look at the stats.jsp code - but there the number  
of documents is retrieved via an api call to SolrInfoRegistry and  
not the webservice.


thanks

- stefan




Re: autowarm static queries

2007-10-11 Thread BrendanD


Mike Klaas wrote:
> 
> On 11-Oct-07, at 6:47 PM, BrendanD wrote:
> 
>>
>> Hi,
>>
>> I have the following query that I've found in my production logs:
>>
>> INFO: /select/
>> rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=cat 
>> egory_id&fq=product_is_active:true&fq=product_status_code:complete&fq= 
>> category_id:"1001644"&fq=attribute_id_value_en_pair:"1005585 
>> \:Wedding"&fq=merchant_id:"1000156"&qt=sti_dismax_en
>> 0 40734
>>
>> It took almost 41 seconds to run. ick!
>>
>> I'd like to pre-warm that query in the solrconfig.xml, but I'd like  
>> to leave
>> off a parameter or two. For example:
>>
>> rows=0&start=0&f.category_id.facet.limit=-1&facet=true&facet.field=cat 
>> egory_id&fq=product_is_active:true&fq=product_status_code:complete&fq= 
>> category_id:"1001644"&qt=sti_dismax_en
> 
> This time is almost certainly being spent in faceting on  
> category_id.  In that case, any facet query on that field will warm  
> the cache just fine.  (Since you don't care about the results, use  
> facet.limit=1, not unlimited).  It'll also warm all the fq  
> parameters, which can be shared by future queries.
> 
> Any reason why you are faceting on a field that you are restricting?   
> Clearly, the answer will be '1001644' --> , (all other  
> categories) -> 0.  Just use numFound.
> 
> Also, if there can only be one category per doc, make sure you are  
> using the fieldCache method for category_id.
> 
> -Mike
> 

Thanks Mike, that's good to hear it will help.

Unfortunately pretty much ALL of our fields are multi-valued. A product can
exist in multiple categories, be sold by multiple merchants, and have
multiple attributes with multiple attribute values assigned to it. E.g. an
iPod in a special Gifts category and also in the MP3 Players category whose
audio format support attribute values are MP3, AAC, WAV, and Apple Lossless.

I'm not sure what you mean by "just use numFound"? I will have to speak with
my developer on this one as I'm not the one who originally wrote the code.

Thanks,

Brendan


-- 
View this message in context: 
http://www.nabble.com/autowarm-partial-queries-in-solrconfig.xml-tf4611014.html#a13168240
Sent from the Solr - User mailing list archive at Nabble.com.



Re: autowarm static queries

2007-10-11 Thread Brian Wipf

On 11-Oct-07, at 8:38 PM, BrendanD wrote:

Mike Klaas wrote:

Any reason why you are faceting on a field that you are restricting?
Clearly, the answer will be '1001644' --> , (all other
categories) -> 0.  Just use numFound.

Also, if there can only be one category per doc, make sure you are
using the fieldCache method for category_id.

-Mike



Thanks Mike, that's good to hear it will help.

Unfortunately pretty much ALL of our fields are multi-valued. A  
product can

exist in multiple categories, be sold by multiple merchants, and have
multiple attributes with multiple attribute values assigned to it.  
E.g. an
iPod in a special Gifts category and also in the MP3 Players  
category who's
audio format support attribute values are MP3, AAC, WAV, and Apple  
Lossless.


I'm not sure what you mean by "just use numFound"?
I think he means that if items can only be in one category and you're  
filtering on category, the number of results returned is the number  
found; there is no need to facet on category as well.


This doesn't apply to you if items can be in more than one category  
and you need counts for the other categories as well.


Brian Wipf
<[EMAIL PROTECTED]>



Re: autowarm static queries

2007-10-11 Thread Mike Klaas

On 11-Oct-07, at 7:38 PM, BrendanD wrote:



Unfortunately pretty much ALL of our fields are multi-valued. A  
product can

exist in multiple categories, be sold by multiple merchants, and have
multiple attributes with multiple attribute values assigned to it.  
E.g. an
iPod in a special Gifts category and also in the MP3 Players  
category who's
audio format support attribute values are MP3, AAC, WAV, and Apple  
Lossless.


Ah.  Yes, your query will warm the cache.

I'm not sure what you mean by "just use numFound"? I will have to  
speak with
my developer on this one as I'm not the one who originally wrote  
the code.


Nevermind.  I was thinking of category as being single-valued.  For  
multi-valued category, it is still necessary to do faceting to find  
sub-categories.  Sorry!


-Mike



Re: quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-11 Thread Britske

Yeah, I meant the filterCache, thanks.
It seemed that the particular field (cityname) was using a keywordtokenizer
(which doesn't show at the front), which is why I missed it, I guess :-S. This
means the field is treated as tokenized, so the termEnums approach is used. This
results in about 10.000 inserts on facet.field=cityname on a cold searcher,
which matches the nr of different terms in that field. At least that
explains that.

So if I understand correctly if I use that same field in a FQ-param, say
fq=cityname:amsterdam, and amsterdam is a term of field cityname, then the
FQ-query can utilize the cached 'query': cityname:amsterdam which is already
put into the filtercache by the query facet.field=cityname right?

The thing that I still don't get is why my filtercache starts to have
evictions although its size is 16.000+.  This shouldn't be happening given
that:
I currently only use faceting on cityname and use this field in FQ as well,
as already said (which adds +/- 1 item to the filtercache, given that
faceting and fq share cached items).
Moreover, I use FQ on about 2500 different fields (named _ddp*), but only
check to see if a value exists by doing, for example, fq=_ddp1234:[* TO *]. I
sometimes add them together like so: fq=_ddp1234:[* TO *]&fq=_ddp2345:[* TO *],
but never like so: fq=_ddp1234:[* TO *] +_ddp2345:[* TO *]. Which means
each _ddp*-field is only added once to the filtercache.

Wouldn't this mean that at a maximum I can only have 12500 items in the
filtercache?
Still, my filtercache starts to have evictions although its size is 16.000+.

What am I missing here?
Geert-Jan


hossman wrote:
> 
> 
> : ..fq=country:france
> : 
> : do these queries share cached items in the fieldcache? (in this example:
> : country:france) or do they somehow live as seperate entities in the
> cache?
> : The latter would explain my fieldcache having evictions at the moment.
> 
> FieldCache can't have evicitions.  it's a really low level "cache" where 
> the key is field name and the value is an array containing a value for 
> every document (you cna think of it as an inverted-inverted-index) that 
> Lucene maintains directly.  items are never removed they just get garbage 
> collected when the IndexReader is no longer used.  It's primarily for 
> sorting, but the SimpleFacets code also leveragies it for facets in some 
> cases -- Solr has no way of showing you what's in the FieldCache, because 
> Lucene doesn't expose any inspection APIs to query it (it's a heisenberg 
> cache .. once you ask if something is in it, it's in it)
> 
> are you refering to the "filterCache" ?  
> 
> filterCache contains records whose key is a "query" and whose value is a 
> DocSet (an unordered collection of all docs matching a query) ... it's 
> used whenever you use an "fq" param, for faceting on some fields (when the 
> TermEnum method is used, a filterCache entry is added for each term 
> tested), and even for some sorted queries if the 
>  config option is set to true.
> 
> the easiest way to know whether your faceting is using the FieldCache is 
> to start your server cold (no newSearcher warming) and then send it a 
> simple query with a single facet.field.  depending on the query, you might 
> get 0 or 1 entries in the filterCache if SimpleFacets is using the 
> FieldCache -- but if it's using the TermEnums, and generating a DocSet per 
> term, you'llsee *lots* of inserts into the filterCache.
> 
> 
> 
> -Hoss
> 
> 

-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13169935
Sent from the Solr - User mailing list archive at Nabble.com.