Re: range queries on string field with millions of values

2008-11-29 Thread Naomi Dushay

Hi Hoss,

Thanks for this.

The terms component approach, if I understand it correctly, will be
problematic.  I need to present not only the next X call numbers in
sequence, but also other fields from those documents (e.g. title,
author).  I assume the Terms Component approach will only give me the
next X call number values, not the documents.


It sounds like Glen Newton's suggestion of mapping the call numbers to  
a float number is the most likely solution.
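
For illustration, a minimal sketch of that kind of mapping (assuming
Java and LC-style call numbers; a real implementation would have to
handle cutter numbers, decimal class numbers, volume suffixes, and
much more):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CallNumberKey {
        // Collapse an LC-style call number like "ML 1500 .V48" into a
        // sortable number: the class letters become base-27 digits,
        // then the class number is appended.  Purely illustrative.
        private static final Pattern LC =
            Pattern.compile("^([A-Z]{1,3})\\s*(\\d+)");

        static double toSortKey(String callNumber) {
            Matcher m = LC.matcher(callNumber.trim().toUpperCase());
            if (!m.find()) {
                return Double.MAX_VALUE;  // unparseable values sort last
            }
            double key = 0;
            for (char c : m.group(1).toCharArray()) {
                key = key * 27 + (c - 'A' + 1);
            }
            return key * 10000 + Integer.parseInt(m.group(2));
        }
    }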


I know it sounds ridiculous to do all this for a "call number browse,"
but our faculty have explicitly asked for this.  For humanities
scholars especially, they know the call numbers that are of interest
to them, and they browse the stacks that way (ML 1500s are opera, V35
is Verdi ...).  They are using the research methods that have been
successful for their entire careers.  Plus, library materials are
moving to off-site, high-density storage, so the only way for them to
browse all materials, regardless of location, by call number is
online.  I doubt they'll find this feature as useful as they expect,
but it behooves us to give the users what they ask for.


So yeah, our user needs are perhaps a little outside of your  
expectations.  :-)


- Naomi





Naomi Dushay
[EMAIL PROTECTED]





boosting certain terms within one field?

2008-11-29 Thread Peter Wolanin
I've recently started working on the Drupal integration module for
Solr, and we are looking for suggestions on how to address this
question: how do we boost the importance of a subset of terms within
a field?

For example, we are using the standard request handler for queries,
and the default field for keyword searches is a concatenation of the
title, body, taxonomy terms, etc.

One "hackish" way I can imagine is that terms we want to boost (for
example the title, or text inside h2 tags) could be concatenated on
multiple times.  Would this be effective and reasonable?

The alternative seems to be to switch to using the dismax handler,
storing the terms we want boosted differently in separate fields, all
of which appear in the list of query fields?
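
For illustration (the field names here are hypothetical), a dismax
handler configured along these lines weights the separate fields in a
single query:

    <requestHandler name="dismax" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="qf">title^5.0 h2_text^2.0 body^1.0</str>
      </lst>
    </requestHandler>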

Thanks in advance for your suggestions.

-Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
[EMAIL PROTECTED]


Returning function values from query

2008-11-29 Thread tonypayne

I have a boost query that combines several functions to affect the score.
For logging, I would like to be able to see the individual values of the
functions as well as the base relevancy score for each document in the
result set. So far, I've been able to work around the issue because all of
my functions operate on data that is local to the document. However, I'd
also like to include a weighting factor for recency (i.e.
recip(rord(date_val),1,x,x)). Now, because rord is not strictly dependent
on the document's field values, I seem to be stuck reverse-engineering the
boost functions to back out the relevancy score.

Is there some way to return the value of a function from the query so that I
can log appropriately?
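
For reference, a hedged sketch of the kind of request described above
(this assumes the dismax bf parameter and a hypothetical date_val
field); debugQuery=on is one way to see the per-document score
explanation, including function-query contributions, though it is
verbose:

    q=some+keywords&qt=dismax&bf=recip(rord(date_val),1,1000,1000)&debugQuery=on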



Re: range queries on string field with millions of values

2008-11-29 Thread Chris Hostetter

: The results are correct.  But the response time sucks.
: 
: Reading the docs about caches, I thought I could populate the query result
: cache with an autowarming query and the response time would be okay.  But that
: hasn't worked.  (See excerpts from my solrConfig file below.)
: 
: A repeated query is very fast, implying caching happens for a particular
: starting point ("42" above).
: 
: Is there a way to populate the cache with the ENTIRE sorted list of values for
: the field, so any arbitrary starting point will get results from the cache,
: rather than grabbing all results from (x) to the end, then sorting all these
: results, then returning the first 10?

there are two "caches" that come into play for something like this...

the first cache is a low-level Lucene cache called the "FieldCache" that 
is completely hidden from you (and for the most part: from Solr).  
anytime you sort on a field, it gets built and reused for all sorts on 
that field.  My original concern was that it wasn't getting warmed on 
"newSearcher" (because you have to be explicit about that).

the second cache is the queryResultsCache, which caches a "window" of an 
ordered list of documents based on a query and a sort.  you can see this 
cache in your Solr stats, and yes: these two requests result in different 
cache keys for the queryResultsCache...

q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
q=yourField:[52+TO+*]&sort=yourField+asc&rows=10

...BUT! ... the two queries below will result in the same cache key, and 
the second will be a cache hit, provided a sufficient value for 
the "queryResultWindowSize" ...

q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10

so perhaps the key to your problem is to just make sure that once the user 
gives you an id to start with, you "scroll" by increasing the start param 
(not altering the id) ... the first query might be "slow" but every query 
after that should be a cache hit (depending on your page size, and how far 
you expect people to scroll, you should consider increasing 
queryResultWindowSize)
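
(an illustrative solrconfig.xml excerpt -- the values are examples only:

    <queryResultWindowSize>200</queryResultWindowSize>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">yourField:[* TO *]</str>
          <str name="sort">yourField asc</str>
        </lst>
      </arr>
    </listener>

the listener is the "explicit" newSearcher warming mentioned above: it 
sorts on the field once so the FieldCache is built before user queries 
arrive)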

But as Yonik said: the new TermsComponent may actually be a better option 
for you -- doing two requests for every page (the first to get the N terms 
in your id field starting with your input, the second to do a query for 
docs matching any of those N ids) might actually be faster even though 
there won't likely even be any cache hits.
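
A sketch of the two requests (the TermsComponent is brand new, so the 
terms.* parameter names may differ in your version, and the term list in 
the second query is purely illustrative):

    /terms?terms.fl=yourField&terms.lower=42&terms.limit=10
    /select?q=yourField:(42 OR 42.1 OR ...)&rows=10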


My opinion:  Your use case sounds like a waste of effort.  I can't imagine 
anyone using a library catalog system ever wanting to look up a call number, 
and then scroll through all possible books with similar call numbers -- it 
seems much more likely that I'd want to look at other books with similar 
authors, or keywords, or tags ... all things that are actually *easier* to 
do with Solr.  (but then again: I don't work in a library.  I trust that 
you know something I don't about what your users want.)


-Hoss



Re: Using Solr with Hadoop ....

2008-11-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Sat, Nov 29, 2008 at 7:26 PM, Jon Baer <[EMAIL PROTECTED]> wrote:
> HadoopEntityProcessor for the DIH?
Reading data from Hadoop with DIH could be really cool.
There are a few more EntityProcessors that are badly needed; the most
useful would be a TikaEntityProcessor.

But I do not see it solving the scalability problem (the original post).
>
> I've wondered about this, as they make HadoopCluster LiveCDs and EC2 has
> images, but the best way to make use of them is always a challenge.
>
> - Jon
>
> On Nov 29, 2008, at 3:34 AM, Erik Hatcher wrote:
>
>>
>> On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:
>>>
>>> Or, it would be relatively trivial to write a Lucene program
>>> to merge the indexes.
>>
>> FYI, such a tool exists in Lucene's API already:
>>
>>
>>  
>> 
>>
>>Erik
>>
>
>



-- 
--Noble Paul


Re: Regex Transformer Error

2008-11-29 Thread Ahmed Hammad
OK, I contributed it at:
https://issues.apache.org/jira/browse/SOLR-887

I changed it to use Solr class org.apache.solr.analysis.HTMLStripReader

Thank you all.

Ahmed
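
For anyone curious, a minimal sketch of what such a DIH transformer can
look like (this is only an illustration, not the actual SOLR-887 patch):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Map;

    import org.apache.solr.analysis.HTMLStripReader;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    // Illustrative sketch: strip HTML markup from every String value in
    // the row using Solr's HTMLStripReader, so tags never reach the index.
    public class HtmlStripTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            for (Map.Entry<String, Object> e : row.entrySet()) {
                if (e.getValue() instanceof String) {
                    e.setValue(strip((String) e.getValue()));
                }
            }
            return row;
        }

        private String strip(String html) {
            StringBuilder out = new StringBuilder();
            try {
                HTMLStripReader reader =
                    new HTMLStripReader(new StringReader(html));
                char[] buf = new char[1024];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    out.append(buf, 0, n);
                }
            } catch (IOException ex) {
                return html;  // on failure, fall back to the raw value
            }
            return out.toString();
        }
    }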



On Tue, Nov 18, 2008 at 5:49 AM, Noble Paul നോബിള്‍ नोब्ळ् <
[EMAIL PROTECTED]> wrote:

> On Tue, Nov 18, 2008 at 2:49 AM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >
> > Although the HTMLStripStandardTokenizerFactory will remove HTML tags
> > from the indexed tokens, the tags will still be stored in the index
> > and need to be removed while searching. In my case the HTML tags are
> > not needed at all. So I created an HTMLStripTransformer for the DIH
> > to remove the HTML tags and save space in the index. I have used the
> > HTML parser included with Lucene (org.apache.lucene.demo.html). It
> > performs well and worked for me (while working with Lucene before
> > moving to Solr).
> >
> > What do you think? Is it worth contributing?
> Yes. You can contribute this new transformer as an enhancement.
> >
> > My best wishes,
> >
> > Regards,
> > Ahmed
> >
> > On Thu, Nov 6, 2008 at 2:39 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> >
> >> There is a nice HTML stripper inside Solr.
> >> "solr.HTMLStripStandardTokenizerFactory"
> >>
> >> -Original Message-
> >> From: Ahmed Hammad [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, November 05, 2008 10:43 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Regex Transformer Error
> >>
> >> Hi,
> >>
> >> It works with the attribute regex="&lt;(.|\n)*?&gt;"
> >>
> >> Sorry for the disturbance.
> >>
> >> Regards,
> >>
> >> ahmd
> >>
> >>
> >> On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am using the Solr 1.3 data import handler. One of my table fields
> >> > has HTML tags, and I want to strip them from the field text. So
> >> > obviously I need the Regex Transformer.
> >> >
> >> > I added a transformer="RegexTransformer" attribute to my entity and
> >> > a new field with:
> >> >
> >> > <field column="..." regex="..." replaceWith="X"/>
> >> >
> >> > Everything works fine. The text is replaced without any problem. The
> >> > problem happened with my regular expression to strip HTML tags. I
> >> > want to use regex="<(.|\n)*?>", but of course the characters '<' and
> >> > '>' are not allowed in XML attribute values. I tried the escaped
> >> > forms regex="&lt;(.|\n)*?&gt;" and regex="&#x3C;(.|\n)*?&#x3E;" but
> >> > I get the following error:
> >> >
> >> > The value of attribute "regex" associated with an element type
> >> > "field" must not contain the '<' character.
> >> >   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) ...
> >> >
> >> > The full stack trace is following:
> >> >
> >> > FATAL: Could not create importer. DataImporter config invalid
> >> > org.apache.solr.common.SolrException: FATAL: Could not create
> >> > importer. DataImporter config invalid
> >> >   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
> >> >   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
> >> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> >> >   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> >> >   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >> >   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >> >   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >> >   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >> >   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >> >   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >> >   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >> >   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >> >   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
> >> >   at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
> >> >   at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
> >> >   at java.lang.Thread.run(Unknown Source)
> >> > Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> >> > Exception occurred while initializing context Processing Document #
> >> >   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
> >> >   at org.apache.solr.handler.dataimport.DataImport

Re: Using Solr with Hadoop ....

2008-11-29 Thread Jon Baer

HadoopEntityProcessor for the DIH?

I've wondered about this, as they make HadoopCluster LiveCDs and EC2 has
images, but the best way to make use of them is always a challenge.


- Jon

On Nov 29, 2008, at 3:34 AM, Erik Hatcher wrote:



On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:

Or, it would be relatively trivial to write a Lucene program
to merge the indexes.


FYI, such a tool exists in Lucene's API already:

 


Erik





Re: Spellcheck for phrase queries

2008-11-29 Thread Grant Ingersoll

Hi Kalyan,

Currently the spell checker does not support phrase based  
suggestions.  See http://lucene.markmail.org/message/wdr7wsenhtuecatb?q=spellcheck+list:org%2Eapache%2Elucene%2Esolr-user 
 and a variety of other links.


Now, I think there are a couple of things you could do.

1. Try n-grams (shingles) for a spelling field, such that phrases like
"Grand Hyatt" end up in the dictionary (see the sketch after this
list).  My guess is you probably would want 1-, 2- and 3-grams.


2.  Produce a patch to the component that takes the collation and uses  
a specified query parser to check to see if there are results for the  
collation.


3.  You could just do #2 programmatically in your application, which  
would result in a 2nd query, or as a separate query component that  
works off of the collation field.
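
A hedged schema.xml sketch for the shingled spelling field in #1
(this assumes solr.ShingleFilterFactory is available in your version):

    <fieldType name="spell_shingle" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
                outputUnigrams="true"/>
      </analyzer>
    </fieldType>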



HTH,
Grant

On Nov 25, 2008, at 3:17 PM, Manepalli, Kalyan wrote:


Hi,

I am trying to implement spell-check functionality on a particular
field. I need to do a complete phrase spell check when the user enters
multiple words.

For example, if the user enters "great Hyat" the current implementation
would suggest "great Hyatt", just correcting the word "Hyatt". But
there will not be any record for this suggestion.

How do I implement a complete phrase spell check, so that it suggests
"grand Hyatt" instead of "great Hyatt"?



Any suggestions in this regard will be helpful



Thanks,

Kalyan Manepalli







Re: Using Solr with Hadoop ....

2008-11-29 Thread Erik Hatcher


On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:

Or, it would be relatively trivial to write a Lucene program
to merge the indexes.


FYI, such a tool exists in Lucene's API already:
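
For illustration, a hedged sketch of merging indexes with Lucene's
IndexWriter.addIndexes (API as of Lucene 2.4; not necessarily the exact
tool referenced):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexes {
        // args[0] = destination index; args[1..n] = source indexes
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory(args[0]),
                new StandardAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
            Directory[] sources = new Directory[args.length - 1];
            for (int i = 1; i < args.length; i++) {
                sources[i - 1] = FSDirectory.getDirectory(args[i]);
            }
            writer.addIndexes(sources);  // merge all sources into the destination
            writer.optimize();
            writer.close();
        }
    }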

  


Erik