Re: Searching multiple indices (solr newbie)

2007-01-08 Thread Chris Hostetter

: > with a single schema -- but dynamicFields are used to store category
: > specific fields, so that if you are doing a category specific search,
: > category specific filters can be offered to you...
: >
: > http://shopper.cnet.com/4144-6501_9-0-1.html?query=canon
:
: Could you elaborate a bit more about how the front-end and back-end
: work to communicate the category specific filters?
:
: Showing *all* facets from the start can be unnecessarily overwhelming for
: a user to navigate, so I'm interested in how to develop a system that
: can adjust the facets shown based on context a bit more elegantly.

I'm not sure I understand the question.  Our facet lists (and
per-facet constraint lists) are specific to each category -- once a
specific category has been specified by the application server, a custom
request handler parses the list of facets/constraints from a metadata
document for that category, computes the intersection for each constraint
query, and returns *all* of the information to the application -- it
normally shows only the first 3 facets not yet constrained, and for each
of those facets it shows the labels for the "best" constraints, where the
definition of "best" depends on the data type of the facet -- by default
it's the ones with the largest counts, or "natural order" for
numeric ranges, but there are overrides for extremely popular constraints.

There are a lot of optimizations that could be done in the Solr plugin to
only compute the counts for facets/constraints we know we want to display
-- but I specifically made it compute everything so the frontend could make
whatever choices it wanted about displaying facets without needing to
change the Solr plugin.  (It's on the list as a potential optimization to
move some of that logic into the request handler, but there haven't been
any complaints about the performance.)

does that answer your question?


-Hoss



Re: Seeking FAQs

2007-01-08 Thread Thorsten Scherler
On Sat, 2007-01-06 at 10:25 -0500, David Halsted wrote:
> I wonder what would happen if we used a clustering engine like Carrot
> to categorize either the e-mails in the archive or the results of
> searches against them?  Perhaps we'd find some candidates for the FAQ
> that way.

Not sure about tools, but IMO this works fine when done by a user/committer.
I think the one who asked the question on the list is the most likely
candidate to add an entry in the FAQ.

The typical scenario should be:
user asks question -> user gets answers from community -> user adds FAQ
entry with the solution that worked for her

This way the one asking the question can give a little something back to
the community.

If you follow the lists a bit, one can identify some FAQs right away:
- Searching multiple indices
- Clustering Solr (custom scorer, highlighter, ...)
- ...


> 
> Dave
> 
> On 1/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > Hey everybody,
> >
> > I was looking at the FAQ today, and I realized it hasn't really changed
> > much in the past year ... in fact, only two people besides myself have
> > added questions (thanks Thorsten and Darren) in the entire time Solr
> > has been in incubation -- which is not to say that Erik and Respaldo's
> > efforts to fix my typos aren't equally helpful :)
> >
> > http://wiki.apache.org/solr/FAQ
> >
> > In my experience, FAQs are one of the few pieces of documentation that are
> > really hard for developers to write: because we are so used to dealing with
> > the systems we work on, we don't always notice when a question has been
> > asked more than once or twice (unless it gets asked over and over and
> > *over*).  The best FAQ updates tend to come from users who have
> > a question and either find the answer in the mailing list archives, or
> > notice the same question asked by someone else later.
> >

Yes, I totally agree. Sometimes the content for the solution can be
found in the wiki. One would just need to link to the wiki page from the
FAQ.

> > So if there are any "gotchas" you remember having when you first started
> > using Solr, or questions you've noticed asked more than once, please feel
> > free to add them to the wiki.  The convention is to only add a question if
> > you're also adding an answer, but even if you don't think a satisfactory
> > answer has ever been given, or you're not sure how to best summarize
> > multiple answers given in the past, just including links to
> > instances in the mailing list archives where the question was asked is
> > helpful -- both in the short term as pointers for people looking for help,
> > and in the long term as starter points for people who want to flesh out a
> > detailed answer.
> >

In the long run, wiki content that has proved to be a solution should
IMO go directly into the official documentation. 

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




newbie question on determining fieldtype

2007-01-08 Thread mike topper

Hi,

I have a question that I couldn't find the exact answer to. 

I have some fields that I want to add to my schema but will never be 
searched on.  They are only used as additional information about a 
document when retrieved.  They are integers, so should I just have the 
field be:
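(The field definition here was stripped by the list archive; judging from the fragment quoted in Thorsten's reply, it was presumably something along these lines, with a hypothetical field name:)

```xml
<!-- hypothetical reconstruction: not indexed, only stored for retrieval -->
<field name="myinfo" type="integer" indexed="false" stored="true"/>
```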




I'm pretty sure this is right, but I just wanted to check that I'm not missing 
any speedups from using a different field type
or adding some other parameters.

-Mike



Re: newbie question on determining fieldtype

2007-01-08 Thread Thorsten Scherler
On Mon, 2007-01-08 at 10:29 -0300, mike topper wrote:
> Hi,
> 
> I have a question that I couldn't find the exact answer to. 
> 
> I have some fields that I want to add to my schema but will never be 
> searched on.  They are only used as additional information about a 
> document when retrieved.  They are integers, so should i just have the 
> field be:
> 
>  stored="true"/>
> 
> I'm pretty sure this is right, but I just wanted to check that I'm not 
> missing any speedups from using a different field
> or adding some other parameters.
> 

Seems pretty right to me.

Did you read 
http://wiki.apache.org/solr/SchemaXml

and see the comment there on the field options?

HTH
salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Handling disparate data sources in Solr

2007-01-08 Thread Walter Underwood
On 1/7/07 7:24 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:

> Care has to be taken when passing a URL to Solr for it to go fetch,
> though.  There are a lot of complexities in fetching resources via
> HTTP, especially when handing something off to Solr which should be
> behind a firewall and may not be able to see the web as you would
> with your browser.

Cracking documents and spidering URLs are both big, big problems.
PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies -- all sorts of issues show up with fetching URLs,
along with a fun variety of misbehaving servers.

I remember crashing one server with 25 GET requests before we
implemented session cookies in our spider. That used up all the
DB connections and killed the server.

If you need to do a lot of spidering and parsing of lots of kinds of
documents, I don't know of an open source solution for that.
Products like Ultraseek and the Googlebox are about your only
choices.

wunder
-- 
Walter Underwood
Search Guru, Netflix
Former Architect for Ultraseek



Re: Multiple indexes

2007-01-08 Thread Jeff Rodenburg

This is good information, thanks Chris.  My preference was to keep things
separate, just needed some external info from others to back me up.

thanks,
jeff

On 1/7/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



I don't know if there really are any general-purpose best practices ... it
really depends on use cases -- the main motivation for allowing JNDI
context specification of the solr.home location, so that multiple instances
of Solr can run in a single instance of a servlet container, was so that if
you *wanted* to run multiple instances in a single JVM, they could share
one heap space and you wouldn't have to "guess" how much memory to
allocate to each instance -- but whether or not you *want* to have a
single instance or not is really up to you.

the plus side (as I mentioned) is that you can throw all of your available
memory at that single JVM instance, and not worry about how much RAM each
Solr instance really needs.

the down side is that if any one Solr instance really gets hammered to
hell by its users and rolls over and dies, it could bring down your other
Solr instances as well -- which may not be a big deal if in your use cases
all Solr instances get hit equally (via a meta searcher) but might be
quite a big problem if those separate instances are completely independent
(ie: each paid for by separate clients)

personally: if you've got the resources (money/boxes/RAM) I would
recommend keeping everything isolated.

(the nice thing about my job is that while I frequently walk out of
meetings with the directive to "make it faster", I've never been asked to
"make it use less RAM")


-Hoss




specifying dataDir on launch of jetty

2007-01-08 Thread Brian Whitman
I would like to specify the solr dataDir on launch of jetty via
java -jar start.jar instead of editing the solrconfig.xml before launching.


I've tried java -Dsolr.dataDir=/x/y/z -jar start.jar but it seems to  
have no effect -- it starts with the solrconfig.xml default.


Use case is that I would like to deploy solr to other users with the
data dir going in their home directories.  Setting the dataDir in
solrconfig.xml to ~/solrdata does not work (the ~ does not get
expanded on this OS X system).







Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison

Erik Hatcher wrote:

The idea of having Solr handle various document types is a good one, for 
sure.  I'm not sure what specifics would need to be implemented, but I 
at least wanted to reply and say it's a good idea!


Care has to be taken when passing a URL to Solr for it to go fetch, 
though.  There are a lot of complexities in fetching resources via HTTP, 
especially when handing something off to Solr which should be behind a 
firewall and may not be able to see the web as you would with your browser.


In that case the client should encode the content and send it as part of 
the index insert/update request - the aim is merely to prevent the bloat 
caused by encoding the document (e.g. as base64) when the indexer can 
access the source document directly.


--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison

Walter Underwood wrote:


Cracking documents and spidering URLs are both big, big problems.
PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies -- all sorts of issues show up with fetching URLs,
along with a fun variety of misbehaving servers.

I remember crashing one server with 25 GET requests before we
implemented session cookies in our spider. That used up all the
DB connections and killed the server.

If you need to do a lot of spidering and parsing of lots of kinds of
documents, I don't know of an open source solution for that.
Products like Ultraseek and the Googlebox are about your only
choices.


I'm not suggesting that Solr be extended to become a spider, I'm just 
suggesting we provide a mechanism for direct access to source documents 
if they are accessible.  For example, if the document being indexed was 
on the same machine as Solr, the href would usually start with "file://", 
not "http://".


BTW, this discussion is also occurring on solr-dev, it might be better 
to move all of it over there ;-)


--
Alan Burlison
--


Faceted Dates

2007-01-08 Thread Ryan McKinley

I would like to use faceted browsing to group documents by year,
month, and day.  I can think of a few ways to do this, but I'd like to
see what folks think before I start down the wrong track.

Option 1:
Add three fields, one for year, month, day.  Something like:
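(The field definitions were stripped by the list archive; a plausible reconstruction, with field names and types assumed:)

```xml
<!-- hypothetical reconstruction of the stripped schema.xml fragment -->
<field name="addedTime"      type="date"   indexed="true" stored="true"/>
<field name="addedTimeYEAR"  type="string" indexed="true" stored="false"/>
<field name="addedTimeMONTH" type="string" indexed="true" stored="false"/>
<field name="addedTimeDAY"   type="string" indexed="true" stored="false"/>
```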






then use copyField to generate the various versions:
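(The copyField lines were also stripped by the list archive; presumably something like this, with the same assumed field names:)

```xml
<!-- hypothetical reconstruction of the stripped copyField lines -->
<copyField source="addedTime" dest="addedTimeYEAR"/>
<copyField source="addedTime" dest="addedTimeMONTH"/>
<copyField source="addedTime" dest="addedTimeDAY"/>
```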




this would somehow convert the original date format for each copy:
addedTime  = "2007-01-08T21:36:15.635Z"
addedTimeYEAR  = "2007"
addedTimeMONTH = "2007-01"
addedTimeDAY   = "2007-01-08"

Perhaps this requires a custom FieldType for Y/M/D to convert the
larger string to the smaller one.

pros:
* Can use SimpleFacets directly
cons:
* seems messy, particularly since I have multiple fields I'd like to
have the same behavior for.


Option 2:
Add an analyzer to the date field that adds multiple tokens with
various resolutions, then write a custom faceter that knows a string of
length 4=year, 7=month, 10=day.  Or perhaps it could look at the
token type.

schema.xml:
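(The fieldType definition was stripped by the list archive; something along these lines, with hypothetical names:)

```xml
<!-- hypothetical reconstruction: a field type analyzed by the custom class -->
<fieldType name="facetdate" class="solr.TextField">
  <analyzer class="com.example.DateFacetAnalyzer"/>
</fieldType>
```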

 
   
 

DateFacetAnalyzer (sketch; note the term text must be the truncated
substring, not the full date, for the length check above to work):

// emit the full date plus truncated copies at the same position,
// using the Token type to mark the resolution
Token t = new Token( date, 0, date.length(), "original" );
t.setPositionIncrement( 0 );
tokens.add( t );

t = new Token( date.substring( 0, 4 ), 0, 4, "year" );
t.setPositionIncrement( 0 );
tokens.add( t );

t = new Token( date.substring( 0, 7 ), 0, 7, "month" );
t.setPositionIncrement( 0 );
tokens.add( t );
...
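The truncation logic above can be sketched without any Lucene dependency; this just shows the string slicing that would produce the per-resolution terms (class and method names are made up for illustration):

```java
// Derive multi-resolution facet terms from an ISO-8601 timestamp by
// prefix truncation: 4 chars = year, 7 = month, 10 = day.
public class DateResolutions {
    static String[] resolutions(String isoDate) {
        return new String[] {
            isoDate,                  // original
            isoDate.substring(0, 4),  // year,  e.g. "2007"
            isoDate.substring(0, 7),  // month, e.g. "2007-01"
            isoDate.substring(0, 10), // day,   e.g. "2007-01-08"
        };
    }

    public static void main(String[] args) {
        for (String s : resolutions("2007-01-08T21:36:15.635Z")) {
            System.out.println(s);
        }
    }
}
```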

pros:
* simple / reusable
cons:
* I don't fully understand how it would affect search & sorting

Any thoughts / pointers / advice?

thanks
ryan


Re: specifying dataDir on launch of jetty

2007-01-08 Thread Ryan McKinley

You can specify a configuration file when jetty starts.  Take a look at:
http://wiki.apache.org/solr/SolrJetty

then you can start jetty with:
java -jar start.jar yourjettyconfig.xml


On 1/8/07, Brian Whitman <[EMAIL PROTECTED]> wrote:

I would like to specify the solr dataDir on launch of jetty via
java -jar start.jar instead of editing the solrconfig.xml before launching.

I've tried java -Dsolr.dataDir=/x/y/z -jar start.jar but it seems to
have no effect -- it starts with the solrconfig.xml default.

Use case is that I would like to deploy solr to other users with the
data dir going in their home directories.  Setting the dataDir in
solrconfig.xml to ~/solrdata does not work (the ~ does not get
expanded on this OS X system).







Re: specifying dataDir on launch of jetty

2007-01-08 Thread Chris Hostetter

I don't think there is any way to do what you describe at the moment ...
SOLR-79 was an attempt at allowing variables representing system
properties to be used in the solrconfig.xml, but it hasn't been committed
because it's an incomplete solution.

What you could do is create a separate Solr "home" dir for each user,
each containing symlinks to your master files for the various subdirs
(conf, bin, lib) *except* data, which you leave as a real directory.
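The per-user home layout described above can be sketched like this (paths are illustrative; temp dirs stand in for the real master home and user homes):

```shell
# stand-in for the shared master solr home
MASTER=$(mktemp -d)
mkdir -p "$MASTER/conf" "$MASTER/bin" "$MASTER/lib"

# per-user solr home: symlink conf/bin/lib to the master files,
# but keep data as a real, user-owned directory
USERHOME=$(mktemp -d)/solr
mkdir -p "$USERHOME/data"
for d in conf bin lib; do
  ln -s "$MASTER/$d" "$USERHOME/$d"
done

ls -l "$USERHOME"
```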

: I would like to specify the solr dataDir on launch of jetty via java -
: jar start.jar instead of editing the solrconfig.xml before launching.
:
: I've tried java -Dsolr.dataDir=/x/y/z -jar start.jar but it seems to
: have no effect -- it starts with the solrconfig.xml default.
:
: Use case is that I would like to deploy solr to other users with the
: data dir going in their home directories.. setting the dataDir in
: solrconfig.xml to ~/solrdata does not work (the ~ does not get
: expanded on this OSX system.)



-Hoss