Re: Standard vs. DisMaxQueryHandler

2006-09-21 Thread Chris Hostetter

: Is the main difference between the StandardQueryHandler and
: DisMaxQueryHandler the supported query syntax (and different query
: parser used in each of them), and the fact that the latter creates
: DisjunctionMaxQueries, while the former just creates vanilla
: BooleanQueries?  Are there any other differences?

the main difference is the query string, yes: Standard expects to get
lucene QueryParser formatted queries, while DisMax expects to get raw
user input strings ... Standard builds queries (whether they be prefix or
boolean or wildcard) using the QueryParser as is, while DisMax does a
cross product of the user input across many different fields and builds
up a very specific query structure -- which can then be augmented with
additional query clauses like the bq boost query and the bf boost
function.

there's no reason the StandardRequestHandler can't construct DisMaxQueries
(once QueryParser has some syntax for them) and DisMaxRequestHandler does
(at the outermost level) generate a BooleanQuery (with a custom
minShouldMatch value set on it) but the main difference is really the
use case: if you want the client to specify the exact query structure that
they want, use StandardRequestHandler.  if you want the client to just
propagate the raw search string typed by the user, without any structure
or escaping, and get the nice complex DisMax style query across the
configured fields, the DisMax handler was written to fill that niche.
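
A toy sketch of the shape described above, in Python. Each word of the raw user input is expanded across every configured field into one disjunction, and the per-word disjunctions become clauses of an outer boolean query with a minimum-should-match count. The function name, the field names, the `|` notation and the `~N` suffix are illustrative assumptions; this is not the handler's code or its real toString output.

```python
def dismax_sketch(user_input, fields, min_should_match):
    """Return a readable string showing a DisMax-style query shape.

    Illustration only: this is not Solr code and it escapes nothing.
    """
    clauses = []
    for word in user_input.split():
        # one disjunction per word: the word is tried against every field
        per_field = " | ".join(f"{field}:{word}" for field in fields)
        clauses.append(f"({per_field})")
    # outer boolean query; ~N marks how many clauses must match
    return "(" + " ".join(clauses) + f")~{min_should_match}"


print(dismax_sketch("how now", ["title", "body"], 1))
# prints: ((title:how | body:how) (title:now | body:now))~1
```

With the standard handler the client would have to write that whole structure itself; with DisMax it only sends "how now".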

(load up the example configs, and take a look at the query toString from
this url to see what i mean about the complex structure...

http://localhost:8983/solr/select/?qt=dismax&q=how+now+brown+cow&debugQuery=1

)
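
If you would rather build that URL from code than type it, here is a minimal Python sketch. The helper name `solr_select_url` is made up for this example; the host, port and parameters are the ones from the URL above.

```python
from urllib.parse import urlencode


def solr_select_url(base, params):
    """Join a base select URL and a dict of request parameters."""
    # urlencode puts '&' between parameters and turns spaces into '+'
    return base + "?" + urlencode(params)


url = solr_select_url(
    "http://localhost:8983/solr/select/",
    {"qt": "dismax", "q": "how now brown cow", "debugQuery": "1"},
)
print(url)
# prints: http://localhost:8983/solr/select/?qt=dismax&q=how+now+brown+cow&debugQuery=1
```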




-Hoss



Default XML Output Schema

2006-09-21 Thread sangraal aiken

Perhaps a silly question, but I'm wondering if anyone can tell me why solr
outputs XML like this:

<doc>
<int name="id">201038</int>
<int name="siteId">31</int>
<date name="modified">2006-09-15T21:36:39.000Z</date>
</doc>

rather than like this:

<doc>
<id type="int">201038</id>
<siteId type="int">31</siteId>
<modified type="date">2006-09-15T21:36:39.000Z</modified>
</doc>

A front-end PHP developer I know is having trouble parsing the default Solr
output because of that format and mentioned it would be much easier in the
latter format... so I was curious if there was a reason it is the way it is.

-Sangraal


Re: Default XML Output Schema

2006-09-21 Thread Yonik Seeley

On 9/21/06, sangraal aiken [EMAIL PROTECTED] wrote:

Perhaps a silly question, but I'm wondering if anyone can tell me why solr
outputs XML like this:


During the initial development of Solr (2004), I remember throwing up
both options, and most developers preferred to have a limited number
of well defined tags.

It allows you to have rather arbitrary field names, which you couldn't
have if you used the field name as the tag.

It also allows consistency with custom data.  For example, here is the
representation of an array of integers:
<arr><int>1</int><int>2</int></arr>
If field names were used as tags, we would have to either make up a
dummy-name, or we wouldn't be able to use the same style.



<doc>
<int name="id">201038</int>
<int name="siteId">31</int>
<date name="modified">2006-09-15T21:36:39.000Z</date>
</doc>

rather than like this:

<doc>
<id type="int">201038</id>
<siteId type="int">31</siteId>
<modified type="date">2006-09-15T21:36:39.000Z</modified>
</doc>

A front-end PHP developer I know is having trouble parsing the default Solr
output because of that format and mentioned it would be much easier in the
latter format... so I was curious if there was a reason it is the way it is.


There are a number of options for you.
You could write your own QueryResponseWriter to output XML just as you
like it, or use an XSLT stylesheet in conjunction with
http://issues.apache.org/jira/browse/SOLR-49
or use another format such as JSON.

-Yonik
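
For what it's worth, the type-as-tag layout is not hard to consume on the client. Here is a minimal Python sketch (the same idea carries over to PHP) that turns one doc element into a plain dictionary keyed by field name. The sample document reuses the fields from this thread plus a made-up multi-valued "tags" field; the function names and the small set of tags handled are assumptions for illustration, not a complete client.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<doc>
  <int name="id">201038</int>
  <int name="siteId">31</int>
  <date name="modified">2006-09-15T21:36:39.000Z</date>
  <arr name="tags"><str>news</str><str>sports</str></arr>
</doc>"""


def convert(elem):
    """Convert one typed element to a Python value, using the tag as the type."""
    if elem.tag == "arr":
        # arrays hold unnamed typed children, so the same rule applies to each
        return [convert(child) for child in elem]
    if elem.tag in ("int", "long"):
        return int(elem.text)
    if elem.tag in ("float", "double"):
        return float(elem.text)
    if elem.tag == "bool":
        return elem.text == "true"
    return elem.text  # str, date and anything else stay as text


def doc_to_dict(doc):
    """Map each child's name attribute to its converted value."""
    return {child.get("name"): convert(child) for child in doc}


print(doc_to_dict(ET.fromstring(SAMPLE)))
```

Because the set of tags is small and fixed, new fields in the schema need no client changes, and field names that would not be legal XML tag names still work.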


Extending Solr's Admin functionality

2006-09-21 Thread Otis Gospodnetic
Hello,

I may need to add functionality to Solr's admin pages.  The functionality that 
I'm looking to add is the ability to trigger certain indexing functions and 
monitor their progress.  I'm wondering if people have thoughts about the best 
way to do this.  Here are my initial ideas:

1. Add additional admin screens/JSPs, make them call custom classes that 
trigger indexing (e.g. go to a DB, retrieve some data, index it, maybe optimize 
when done), have that execute in a separate thread, and have these classes call 
Solr via custom HTTP requests that report progress, so that this 
progress/status can be viewed through another admin page for monitoring of this 
stuff.

2. Forget about triggering things from the UI.  Write generic/command-line-type 
classes, have them invoked independently of Solr, but still have them call Solr 
via custom HTTP requests that report progress, so that this progress/status can 
be viewed through another admin page for monitoring of this stuff.

I like 1, because everything is contained in Solr, but I fear it may be hard to 
maintain this extended version with Solr, unless the stuff I write ends up 
being generic enough that I can contribute it back.  I guess 2 would have some 
of these problems because I'd still need an admin page for monitoring.

Any thoughts?
Has anyone already envisioned a good way to extend Solr's functionality with 
custom admin screens?

Thanks,
Otis






Re: relational design in solr?

2006-09-21 Thread Chris Hostetter

While it's certainly possible to join the results of multiple indexes, i
would do so only when absolutely necessary -- in my experience the only
time i've found that it makes sense, is when one aspect of the data
changes extremely rapidly compared to everything else, making complex
reindexing a pain, but reindexing just the changed data in its own index
is a lot more feasible.

As a rule of thumb, when building paginated style search applications, I
would advise people to try and flatten their index as much as possible, so
that the application can do one user query (based on the user's input)
to get a single page of results, and then use the uniqueKeys from that
page of results to lookup ancillary data from any other indexes (or
databases that you need) -- the key being that all the data you want to
search on, and all the data you need to sort on are in the index, but other
data you need to return to the user can come from other sources.

If you find yourself wanting to join two indexes for the purposes of
matching or sorting, the amount of work you wind up doing tends to be
prohibitive on really large indexes -- and if your indexes aren't that
large, it would probably just be easier to put everything in one index and
rebuild it frequently.
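
A rough Python sketch of that rule of thumb, with in-memory stand-ins for the index and the side database (all names and data here are made up for illustration). Everything needed to match and sort lives in the index; the side store is only consulted for the handful of uniqueKeys on the page being displayed.

```python
# stand-in for the flattened search index: searchable and sortable fields only
INDEX = [
    {"id": "a", "text": "solr faceting", "date": 3},
    {"id": "b", "text": "lucene joins", "date": 1},
    {"id": "c", "text": "solr caching", "date": 2},
]

# stand-in for another index or an RDBMS holding display-only data
SIDE_STORE = {"a": {"price": 10}, "c": {"price": 7}}


def search_page(index, word, start, rows):
    """One user query: match, sort, and return the uniqueKeys of one page."""
    matches = [doc for doc in index if word in doc["text"].split()]
    matches.sort(key=lambda doc: doc["date"], reverse=True)
    return [doc["id"] for doc in matches[start:start + rows]]


def enrich(page_ids, side_store):
    """Look up ancillary data only for the keys on the current page."""
    return [{"id": key, **side_store.get(key, {})} for key in page_ids]


page = search_page(INDEX, "solr", 0, 10)
print(enrich(page, SIDE_STORE))
# prints: [{'id': 'a', 'price': 10}, {'id': 'c', 'price': 7}]
```

The second step costs at most one lookup per row on the page, no matter how large either data source is.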

: I am trying to integrate solr search results with results from a rdbms
: query.  It's working ok, but fairly complicated  due to large size of
: the results from the database, and many different sort requirements.
:
: I know that solr/lucene was not designed to intelligently handle
: multiple document types in the same collection, i.e. provide join
: features, but I'm wondering if anyone on this list has any thoughts on
: how to do it in lucene, and how it might be integrated into a custom
: solr deployment.  I can't see going back to vanilla lucene after solr!
:
: My basic idea is to add an objType field that would be used to define a
: table.  There would be one main objType, any related objTypes would
: have a field pointing back to the main objs via id, like a foreign key.
:
: I'd run multiple parallel searches and merge the results based on
: foreign keys, either using a Filter or just using custom code.  I'm
: anticipating that iterating through the results to retrieve the foreign
: key values will be too slow.
:
: Our data is highly textual, temporal and spatial, which pretty much
: correspond to the 3 tables I would have.  I can de-normalize a lot of
: the data, but the combination of times, locations and textual
: representations would be way too large to fully flatten.
:
: I'm about to start experimenting with different strategies, and I would
: appreciate any insight anyone can provide.  Would the faceting code help
: here somehow?



-Hoss



http error

2006-09-21 Thread Jeff McCormick
I'm getting the following error when I try and hit the admin console:

HTTP ERROR: 500

dr01142: dr01142

RequestURI=/solr/admin/stats.jsp

Powered by Jetty://

has anyone seen this error before?  The queries to this server seem to work
just fine, only the admin console is not working.
-- 
Jeff McCormick
[EMAIL PROTECTED]


Re: wana use CJKAnalyzer

2006-09-21 Thread Yonik Seeley

On 9/21/06, Chris Hostetter [EMAIL PROTECTED] wrote:


: i just wanna say: no your help,maybe i will give up.thk u again.
:
: http://www.flickr.com/photos/[EMAIL PROTECTED]/248815068/

:  thk Hoss,Nick Snels,Koji,Mike and  everybody who helped me and wanna help
:  me..
: 
:  i can use solr with Chinese Word.

I'm sorry, i'm really confused now ... it seems like you got things
working, but you also say maybe i will give up ... ?


I read that as without your help, maybe I would have given up.

-Yonik


Re: http error

2006-09-21 Thread Jeff McCormick
This error was caused by my machine's
hostname being changed by DHCP and it not resolving to localhost!  Apparently
for JSPs, Jetty requires some kind of hostname resolution, so
if it won't resolve, you get a nice HTTP 500 error with this
rather vague error message.
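
A quick way to check for the same condition on another box, as a small Python sketch. The function name is made up, and the `resolver` argument exists only so the check can be exercised without touching real DNS.

```python
import socket


def hostname_resolves(name=None, resolver=socket.gethostbyname):
    """Return True if the given (or local) hostname resolves to an address."""
    if name is None:
        name = socket.gethostname()
    try:
        resolver(name)
    except OSError:
        # socket.gaierror (lookup failure) is a subclass of OSError
        return False
    return True
```

Calling `hostname_resolves()` with no arguments checks the machine's own hostname; a False result matches the situation described above.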

cheers


On Thursday 21 September 2006 2:16 pm, Yonik Seeley wrote:
 On 9/21/06, Jeff McCormick [EMAIL PROTECTED] wrote:
  I'm getting the following error when I try and hit the admin console:
 
  HTTP ERROR: 500
 
  dr01142: dr01142
 
  RequestURI=/solr/admin/stats.jsp
 
  Powered by Jetty://
 
  has anyone seen this error before?  The queries to this server seem to
  work just fine, only the admin console is not working.

 I haven't seen that problem.
 If you are using the bundled version of Jetty, try making sure that
 the JVM you are starting it with is from a JDK and not a JRE (javac is
 needed to compile the JSPs).

 -Yonik

-- 
Jeff McCormick
Rackspace
x4596


Reloading solrconfig.xml

2006-09-21 Thread Otis Gospodnetic
Hi,

What's the best way to dynamically change solrconfig.xml and have the changes 
take effect?
It looks like one could just regenerate the file and call 
SolrConfig.initConfig(String file).
Is that the proper/best way to do it?

Thanks,
Otis





Re: Reloading solrconfig.xml

2006-09-21 Thread Yonik Seeley

On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:

What's the best way to dynamically change solrconfig.xml and have the changes 
take effect?


Everything would need to be designed for that, and it's currently not.

You might be able to reload the config, but all the classes that
looked at the config and configured themselves would need to be

At CNET, we are always in a load-balanced environment for scalability
and HA.  In that environment, you simply change the config and bounce
the server, letting the remaining servers handle requests.

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

It turns out that journal_name has 17038 different tokens, which is
manageable, but first_author has  400 000. I don't think this will ever
yield good performance, so i might only do journal_name facets.


Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html

-Yonik


Re: Default XML Output Schema

2006-09-21 Thread sangraal aiken

Thanks for the great explanation Yonik, I passed it on to my colleagues for
reference... I knew there was a good reason.

-Sangraal




Re: Reloading solrconfig.xml

2006-09-21 Thread Otis Gospodnetic
Thanks, that's actually simpler and it will work for me.
Since I'm thinking of only changing mergeFactor and friends on the fly, I 
suppose I'd only need to modify Master's solrconfig.xml.

Otis






Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Btw, Any plans for a facets cache?


Maybe a partial one (like caching top terms to implement some other
optimizations).  My general philosophy on caching in Solr has been to
cache things the client can't: elemental things, or *parts* of
requests to make many different requests faster (most
bang-for-the-buck).

Caching complete requests/responses is generally less useful since it
requires even more memory, has a worse hit ratio, and can be done
anyway by the client or a separate process like squid.

-Yonik


Re: Reloading solrconfig.xml

2006-09-21 Thread Yonik Seeley

On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:

Thanks, that's actually simpler and it will work for me.
Since I'm thinking of only changing mergeFactor and friends on the fly, I 
suppose I'd only need to modify Master's solrconfig.xml.


Is this for testing or something?

I could think of use cases where it might make sense to somehow allow
changing mergeFactor in add requests (complete index builds vs
incremental adds, etc).


-Yonik


Fixed first hits - custom RequestHandler?

2006-09-21 Thread Otis Gospodnetic
Hello,

I have a situation where I want certain documents to appear at the top of the 
hit list for certain searches, regardless of their score.  One can think of it 
as the ads right on top of Google's search results (but I'm not dealing with 
ads).

Example:
If I'm searching books in a bookstore, and a person is searching for lucene, 
the owner of the bookstore may want to promote the recently published Lucene 
in Action instead of some other book about Lucene, so he wants any search for 
lucene or java search to put the link to Lucene in Action on top.

Is there a good way to accomplish this in Solr?
My initial thoughts are that it would be best to have an external store, maybe 
even a Lucene index.  This store would host the data to display on top of hits, 
as well as keywords/phrases that would have to match user's search terms.  A 
custom RequestHandler would then perform a regular search (a la any of the 
existing RequestHandlers), plus pull the data from this side store, and stick 
those in the response.

Is this a good candidate for a custom RequestHandler?

Thanks,
Otis





Re: Re: Default XML Output Schema

2006-09-21 Thread Tim Archambault

This structure was inhibiting to me at first too using ColdFusion.
However, I was able to create a function that dynamically creates a
query recordset for both facets and search results and will accommodate
new/additional fields at any time. If I can do it, any reasonable
programmer can handle it.






Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can checkout from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik


Re: Fixed first hits - custom RequestHandler?

2006-09-21 Thread Tim Archambault

Otis,

I'm curious as to what you find out here. I'm looking at setting up a
second Solr instance to handle keyword advertising and the first
instance to handle the site search for our newspaper website. Never
thought of your question.

Thanks,

Tim







Re: Fixed first hits - custom RequestHandler?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:

I have a situation where I want certain documents to appear at the top of the 
hit list for certain searches, regardless of their score.  One can think of it 
as the ads right on top of Google's search results (but I'm not dealing with 
ads).


You could make anything with an isSpecial boolean field appear first:
search_field:java; score desc, special desc

The special field could even be an int field so you could control
the order that the special docs appeared.

You could also do something with boosting:
+(search_terms:java) special:true^100

If you have special search terms you want to associate with a doc, you
can have another field for that and boost it highly... that would give
you a measure of relevancy among special documents:

normal_search_field:java  special_search_field:java^100


Is this a good candidate for a custom RequestHandler?


Hopefully all the tools are already there to do this w/o extra code.

-Yonik


Re: Fixed first hits - custom RequestHandler?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

You could make anything with an isSpecial boolean field appear first:
search_field:java; score desc, special desc


Oops, that should be
 search_field:java; special desc, score desc
score desc should be the secondary sort, or whatever you normally
want to sort by.

-Yonik
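
To see why the order of the two sort keys matters, here is a small Python sketch that applies the same ordering to a few made-up documents. The "special" value is treated as an int, as suggested above, with 0 meaning "not special"; the ids and scores are invented for the example.

```python
DOCS = [
    {"id": "a", "score": 2.5, "special": 0},
    {"id": "b", "score": 0.7, "special": 1},
    {"id": "c", "score": 1.9, "special": 0},
    {"id": "d", "score": 0.4, "special": 2},
]


def rank(docs):
    """Order by special desc first, then by score desc within each group."""
    return sorted(docs, key=lambda doc: (-doc["special"], -doc["score"]))


print([doc["id"] for doc in rank(DOCS)])
# prints: ['d', 'b', 'a', 'c']
```

With the keys the other way round, the low-scoring special documents "b" and "d" would stay at the bottom, since ties on score are rare.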


dismax and facets with constraints

2006-09-21 Thread Brian Lucas
I'm experimenting with dismax to do faceted browsing, and when I add a
constraint with dismax on that facet, I no longer get the entire
facet-count.

 

i.e.

q=blah&qt=dismax&fq=type_id:1&hl=true&hl.fl=title+summary&hl.snippets=3&facet=true&facet.limit=-1&facet.zeros=false&facet.field=type_id

 

<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="type_id">
int name=295/int
</lst>
</lst>
</lst>

 

 

I understand why this is happening, but is there another way to add a
constraint via querystring (instead of 'fq=type_id:1') and still get the
full facet-counts list?

 

I can do it with standardrequest, but then it doesn't appear like I can sort
the results.

 

 



RE: dismax and facets with constraints

2006-09-21 Thread Brian Lucas
Just to clarify on this point, I am using highlighting in standardquery.
When I add a constraint and sort by a field, the highlighting function no
longer works.  Possible bug or user error?

 

> I can do it with standardrequest, but then it doesn't appear like I can sort
> the results.

 

 



RE: dismax and facets with constraints

2006-09-21 Thread Chris Hostetter

: Just to clarify on this point, I am using highlighting in standardquery.
: When I add a constraint and sort by a field, the highlighting function no
: longer works.  Possible bug or user error?

: I can do it with standardrequest, but then it doesn't appear like I can sort
: the results.

this sounds like it's unrelated to the facet counts issue ... but i'm
having trouble following what you mean, can you give us an example URL and
the output that you are getting from it (preferably using the example
schema/docs -- but we might be able to help even if it's your own custom
schema/data)



-Hoss



Re: Fixed first hits - custom RequestHandler?

2006-09-21 Thread Chris Hostetter

: On 9/21/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
:  I have a situation where I want certain documents to appear at the top
: of the hit list for certain searches, regardless of their score.  One
: can think of it as the ads right on top of Google's search results (but
: I'm not dealing with ads).

the kind of approach Yonik described works well when you really want the
boosted documents (ie: ads, promoted products, etc) to be included in
the main paginated flow of results, regardless of how many there are.

if you want them to be broken out (like the ads google shows in the right
nav of their pages) so that they aren't affected by pagination or sorting
changes; or if you want a limited number to appear (ie: bubble the 3
highest scoring promoted products up to the top, but leave the rest of the
promoted products where they are in the normally sorted list) then i don't
know any way around this except executing two searches.

I've typically done it by making two Solr requests from the client, but
you could also do this with a custom request handler that included two
DocLists in the results.

(now that you can programmatically modify/override the params of a
SolrQueryRequest, it would be really easy to write a subclass of any
existing request handler that first did the promo search, and then
delegated to the super class with fq params telling it to ignore the
results you've already included)
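
A minimal client-side sketch of the two-search approach, in Python. It only builds the two parameter sets; sending them is left out. The `promoted` and `id` field names are hypothetical, and the purely negative fq is an assumption about what your Solr version accepts, so treat this as an outline rather than a recipe.

```python
def promo_params(user_query, max_promos):
    """First request: only promoted docs, capped at a small number."""
    return {"q": user_query, "fq": "promoted:true", "rows": str(max_promos)}


def main_params(user_query, promo_ids, start, rows):
    """Second request: the normal paginated search, minus the promos shown."""
    params = {"q": user_query, "start": str(start), "rows": str(rows)}
    if promo_ids:
        # negative filter (assumed supported): skip docs already in the promo block
        params["fq"] = "-id:(" + " ".join(promo_ids) + ")"
    return params


print(promo_params("lucene", 3))
print(main_params("lucene", ["12", "34"], 0, 10))
```

The ids passed to `main_params` would come from the response to the first request, so the promoted block stays fixed while the user pages or re-sorts the main list.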



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault
I upgraded to the most recent Solr build (9-22) and sadly it's still 
really slow: an 800-second query with a single facet on first_author, 15 
million documents total, and the query returns 180. Maybe I'm doing 
something wrong? Also, this is on my personal desktop, not on a server. 
Still, I'm getting 0.1-second queries without facets, so I don't think 
that's the cause. In the admin panel I can still see the filterCache 
doing millions of lookups (and tons of evictions once it hits the maxsize).


Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>

This is the query:
q=hiv red blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false


I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


