positionIncrementGap - what is its value meaning?

2008-03-26 Thread Vinci

Hi all,

While changing the default schema.xml, I found this attribute where the
analyzer is defined... it seems it will add some space when multiple fields
appear in a document, but what is its effect on queries, and what do the
values mean here?

Thank you,
Vinci
-- 
View this message in context: 
http://www.nabble.com/positionIncrementGap---what-is-its-value-meaning--tp16296677p16296677.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Highlighting Quoted Phrases

2008-03-26 Thread Vinci

Hi,

Would it be easier if you turned off highlighting while viewing the full
document (but kept summary highlighting available) and used JavaScript to
do the matching? (As long as we need highlighting only when looking at a
specific document at runtime.)

Thank you,
Vinci

Brian Whitman wrote:
> 
> 
> On Mar 25, 2008, at 6:31 PM, Chris Harris wrote:
> 
>> working pretty well, but my testers have
>> discovered something they find borderline unacceptable. If they search
>> for
>>
>>"stock market"
>>
>> (with quotes), then Solr correctly returns only documents where
>> "stock" and "market" appear as adjacent words. Two problems though:
>> First, Solr is willing to pick snippets where only one of the terms
>> appears, e.g.
>>
>>...and changes in the market regulation environment...
> 
> 
> I recently asked about the same thing. There's a patch in lucene (not  
> in trunk yet) to support this.
> 
> It would take some amount of work to get it in solr, but I haven't  
> investigated yet.
> 
> -b
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Highlighting-Quoted-Phrases-tp16290330p16297027.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Update a field without reindexing the entire document?

2008-03-26 Thread Ard Schrijvers
Hello Otis,

I have been looking for something similar for Jackrabbit's Lucene index,
but I still have some uncertainty about whether I understand correctly
what the patches in SOLR-139 supply:

Do they just retrieve the formerly stored fields of a Lucene Document,
change some field, and then analyze and tokenize the fetched fields
again? I am merely interested in avoiding the analyzing and tokenisation
of the entire Document when, for example, a single Field changes (think
of 100 MB PDFs in Jackrabbit whose content I do not want to extract
again when just a single small property changes). I got some pointers
from Karl Wettin (see [1]) that, using term vectors, I can re-assemble
the token stream without the expensive analyzing again.

Anyway, is this what is meant by modifying an existing Lucene document,
or is it done by retrieving stored fields and analyzing them again?
Thanks for any clarifications.

[1]
http://www.nabble.com/Reusing-indexed-and-analyzed-documents-tt1523.html#a1523
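
For illustration, a minimal sketch of the naive "retrieve stored fields,
change one, re-add" approach against the plain Lucene 2.x API (index
path, field names and values are made up); note that every field still
goes back through the analyzer on the way in, which is exactly the cost
being asked about:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class NaiveFieldUpdate {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Hits hits = searcher.search(new TermQuery(new Term("id", "42")));
            Document doc = hits.doc(0);  // only *stored* fields survive this round trip

            doc.removeField("title");    // swap out the single field we want to change
            doc.add(new Field("title", "new title",
                    Field.Store.YES, Field.Index.TOKENIZED));
            searcher.close();

            IndexWriter writer = new IndexWriter("/path/to/index",
                    new StandardAnalyzer(), false);
            // delete-by-term plus add: Lucene re-analyzes every field of the document
            writer.updateDocument(new Term("id", "42"), doc);
            writer.close();
        }
    }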

[EMAIL PROTECTED] - [EMAIL PROTECTED] - www.onehippo.com
-
Amsterdam - Hippo B.V. Oosteinde 11 1017 WT Amsterdam +31(0)20-5224466 
San Francisco - Hippo USA Inc. 101 H Street, suite Q Petaluma CA
94952-3329 +1 (707) 773-4646
-


> 
> Hi Galen,
> 
> See SOLR-139 (this is from memory) issue in JIRA.  Doable, 
> but not in Solr nightlies yet, I believe (also from memory), 
> and requires all your fields to be stored.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Galen Pahlke <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, March 25, 2008 4:21:45 PM
> Subject: Update a field without reindexing the entire document?
> 
> 
> Hi, I'm wondering if there's a way to change a single field of 
> a document without re-indexing every field.  I'd like to do 
> something like this:
> 
> <add><doc>
>   <field name="id">1</field>
>   <field name="field1">val1</field>
> </doc></add>
> 
> Then later:
> 
> <add><doc>
>   <field name="id">1</field>
>   <field name="field2">val2</field>
> </doc></add>
> 
> After the second statement, the document is overwritten, so the value of
> field1 is lost.  Is there a way I can do something like this 
> so that documents are only updated, as opposed to 
> overwritten? I've looked through the docs but couldn't find anything.
> 
> Thanks,
> - Galen Pahlke
> --
> View this message in context: 
> http://www.nabble.com/Update-a-field-without-reindexing-the-en
> tire-document--tp16287718p16287718.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 
> 


Re: Update a field without reindexing the entire document?

2008-03-26 Thread Vinci

Hi Otis,

One question: if the target field is a multi-value field, what will be the
consequence of the update with SOLR-139: overwriting or appending?

Thank you,
Vinci


Otis Gospodnetic wrote:
> 
> Hi Galen,
> 
> See SOLR-139 (this is from memory) issue in JIRA.  Doable, but not in Solr
> nightlies yet, I believe (also from memory), and requires all your fields
> to be stored.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Galen Pahlke <[EMAIL PROTECTED]>
> Subject: Update a field without reindexing the entire document?
> [...]

-- 
View this message in context: 
http://www.nabble.com/Update-a-field-without-reindexing-the-entire-document--tp16287718p16297582.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Update schema.xml without restarting Solr?

2008-03-26 Thread solr

Quoting Ryan McKinley <[EMAIL PROTECTED]>:


In general, you need to be very careful when you change the schema
without reindexing.  Many changes will break all search, some may be
just fine.

for example, if you change sint to slong anything already indexed as an
"sint" will be incompatible with the current settings.


This example, changing from sint to slong, was just an example of bad
design in my opinion, not an example of what we need to do without
reindexing.

Actually, as long as reindexing doesn't take too long I don't have any
problem with reindexing per se. But I would like it to happen without
having the search functionality disabled in the meantime. I guess the
MultiCore stuff is the option here.


Thanks for your input, Ryan.

/Jimi


Re: positionIncrementGap - what is its value meaning?

2008-03-26 Thread Erik Hatcher


On Mar 26, 2008, at 3:11 AM, Vinci wrote:

While changing the default schema.xml, I found this attribute where the
analyzer is defined... it seems it will add some space when multiple fields
appear in a document, but what is its effect on queries, and what do the
values mean here?


Suppose you add two tokenized "author" values for a document:

   author: Billy Bob
   author: Thorton Gospodnetic

Without a position gap, the phrase query "Bob Thorton" would match!
However, with a position increment gap defined you can avoid that
match.  The value you set the gap to depends on whether you'll be
using sloppy phrase queries, how sloppy they'll be, and whether you
desire matching across field instances.  Under the covers, remember
that a tokenized field is just a single stream of terms, even with
multiple field instances.
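
In schema.xml the gap is set on the field type; a minimal sketch (the
value 100 is just the example default):

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

With a gap of 100, "Thorton" is indexed roughly 100 positions after
"Bob", so the phrase query "Bob Thorton" only matches if you allow a
slop of about that size.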


Erik



Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Jeryl Cook
Top often-requested features:
1. Make it an option to hook the "RAMDirectory" into Terracotta
(billions of items in an index, anyone? It would be possible using
this.)
2. Make the "schema.xml" configurable at runtime. Not really sure of the
best way to address this, because changing the schema would require
"re-indexing" the documents.


Terracotta:
http://www.terracotta.org/

On Tue, Mar 25, 2008 at 11:27 AM,  <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  The wiki for Solr talks about the schema.xml, and it seems that
>  changes in this file require a restart of Solr before they take effect.
>
>  In the wiki it says:
>
>  
>  How can I rebuild my index from scratch if I change my schema?
>
>  The most efficient/complete way is to...
>
> 1. Stop your application server
> 2. Change your schema.xml file
> 3. Delete the index directory in your data directory
> 4. Start your application server (Solr will detect that there is
>  no existing index and make a new one)
> 5. Re-Index your data
>
>  If the permission scheme of your server does not allow you to manually
>  delete the index directory an alternate technique is...
>
> 1. Stop your application server
> 2. Change your schema.xml file
> 3. Start your application server
> 4. Use the "match all docs" query in a delete by query command:
>  <delete><query>*:*</query></delete>
> 5. Send an <optimize/> command.
> 6. Re-Index your data
>  
>
>  Is this really the case? I find it quite strange that you need to
>  restart Solr for a change in schema.xml. The way we plan to use
>  Solr together with a Content Management System is that the
>  authors/editors can create new article/document types when needed,
>  without any need to restart anything. The CMS itself has full support
>  for this. But we need Solr to also support this. Is that possible?
>  Like a simple command, maybe, that would trigger
>  Solr to re-read its schema.xml file.
>
>  If this is not possible to do, is it really necessary to restart the
>  entire application server for a change in schema.xml to have effect?
>  Or only the solr webapp?
>
>  Regards
>  /Jimi
>



-- 
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"..Act your age, and not your shoe size.." -Prince(1986)


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread solr

Quoting Jeryl Cook <[EMAIL PROTECTED]>:


2. Make the "schema.xml" configurable at runtime, not really sure the
best way to address this, because changing the schema would require
"re-indexing" the documents.


Isn't the best way to address this just to leave it to the people who
integrate Solr into their systems? I mean, if a change in the schema
affects only 1% of all documents, then it's a bad idea to reindex them
all (at least if the dataset is big).


/Jimi


document retrieval, nested field and HTMLStripStandardTokenizerFactory

2008-03-26 Thread Vinci

Hi all,

I am working on developing an interface for Solr with JSON, and have
some questions:
1. Can I limit the number of returned documents in the config file, to
avoid a misconfiguration pulling down the server?
2. How can I retrieve a document by its unique key for result-view
purposes? And how can I do an XSLT transformation on it?
3. Can I use a nested field in a document, like this?

   
   

4. Does HTMLStripStandardTokenizerFactory do the same thing as
solr.HTMLStripWhitespaceTokenizerFactory, differing only in the
underlying tokenizer? And can I use HTMLStripStandardTokenizerFactory
with a TokenizerFactory extended from BaseTokenizerFactory?
5. If I use HTMLStripStandardTokenizerFactory, do I need to escape the
HTML characters in the field element?
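
A sketch of what #1 and #2 might look like (handler name, unique key and
stylesheet name are assumptions): invariants in solrconfig.xml pin the
rows parameter so a request cannot override it, and the XSLT response
writer transforms the result of a unique-key query:

    <!-- solrconfig.xml: pin rows so no request can ask for more -->
    <requestHandler name="standard" class="solr.StandardRequestHandler">
      <lst name="invariants">
        <int name="rows">10</int>
      </lst>
    </requestHandler>

    <!-- fetch one document by unique key, transformed by conf/xslt/example.xsl -->
    http://localhost:8983/solr/select?q=id:mydoc&wt=xslt&tr=example.xsl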

Thank you,
Vinci
-- 
View this message in context: 
http://www.nabble.com/document-retrieval%2C-nested-field-and-HTMLStripStandardTokenizerFactory-tp16300794p16300794.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Daniel Papasian
[EMAIL PROTECTED] wrote:
> Quoting Jeryl Cook <[EMAIL PROTECTED]>:
> 
>> 2. Make the "schema.xml" configurable at runtime, not really sure the
>> best way to address this, because changing the schema would require
>> "re-indexing" the documents.
> 
> Isn't the best way to address this just to leave it to the people who
> integrate Solr into their systems? I mean, if a change in the schema
> affects only 1% of all documents, then it's a bad idea to reindex them
> all (at least if the dataset is big).

Or if you're adding a new field to the schema (perhaps the most common
need for editing schema.xml), you don't need to reindex any documents at
all, right?  Unless I'm missing something?

I suppose if you add a new dynamic field specification that conflicts
with existing fields, reindexing is probably a good idea, but if you're
doing that... well, I probably don't want to know.

Daniel


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread solr

Quoting Daniel Papasian <[EMAIL PROTECTED]>:


[EMAIL PROTECTED] wrote:

Quoting Jeryl Cook <[EMAIL PROTECTED]>:


2. Make the "schema.xml" configurable at runtime, not really sure the
best way to address this, because changing the schema would require
"re-indexing" the documents.


Isn't the best way to address this just to leave it to the people who
integrate Solr into their systems? I mean, if a change in the schema
affects only 1% of all documents, then it's a bad idea to reindex them
all (at least if the dataset is big).


Or if you're adding a new field to the schema (perhaps the most common
need for editing schema.xml), you don't need to reindex any documents at
all, right?  Unless I'm missing something?


Well, it all depends on whether that "field" (not a Solr/Lucene field)
exists in the already indexed material but was never indexed. Let's say
we have a bunch of articles with an "author" field that someone decided
didn't need to be in the index. But then later he changes his mind and
adds the author field to the schema. In this case all articles that have
a populated author field should now be reindexed.



I suppose if you add a new dynamic field specification that conflicts
with existing fields, reindexing is probably a good idea, but if you're
doing that... well, I probably don't want to know.


I must say that I'm a bit confused by these dynamic fields. Can someone
tell me if there is any reasonable use of dynamic fields without having
the "variable type" (for example i for int/sint) in the name?


/Jimi


Solr commits automatically on appserver shutdown

2008-03-26 Thread Noble Paul നോബിള്‍ नोब्ळ्
hi,
If my appserver fails during an update, or if I do a planned shutdown
without wanting to commit my changes, Solr does not allow it?
It commits whatever unfinished changes there are.
Is this by design?
Can I change this behavior?
--Noble


Index "corruption" makes it return a different result

2008-03-26 Thread Lucas F. A. Teixeira




Hello all!

I had a problem this week that I'd like to share with you all.
The weblogic server that generates my index writes its logs to a shared
storage. During my indexing process (Solr+Lucene), this shared storage
became 100% full and everything collapsed (all servers that use this
shared storage). But my index, which is generated on the local
filesystem, grabbed some logs of the server (whoever knows weblogic
knows the managed server access log, that's the guy) from the buffer (my
supposition) and put them inside my index files! Take a look at what
appeared between some binary parts of my "_al1.cfs" file:

2008-03-19 - 02:31:03 - [ip] - POST - 200 - /AcomProductSyncServiceWeb/AcomProductSyncService
2008-03-19 - 02:31:03 - [ip] - POST - 200 - /AcomProductSyncServiceWeb/AcomProductSyncService
2008-03-19 - 02:31:04 - [ip] - POST - 200 - /AcomProductSyncServiceWeb/AcomProductSyncService
2008-03-19 - 02:31:04 - [ip] - POST - 200 - /AcomProductSyncServiceWeb/AcomProductSyncService
2008-03-19 - 02:31:04 - [ip] - POST - 200 - /AcomProductSyncServiceWeb/AcomProductSyncService

The most incredible thing is that I can open the index normally, without
a CorruptIndexException. That's really bad for me, because the
application didn't warn about a corrupted index (as far as Lucene is
concerned, it is not one). I can open it with the Luke app, and with this
simple code snippet accessing the Lucene index directly, without Solr:

    import java.util.Iterator;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hit;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    // Open the index directly and look up one document by its itemId term.
    IndexReader indexReader =
        IndexReader.open(FSDirectory.getDirectory("C/index/index.2008-03-19"));
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    TermQuery termQuery = new TermQuery(new Term("itemId", "680804"));
    Hits hits = indexSearcher.search(termQuery);

    // Print the stored itemId of every hit -- it should match the term queried.
    Iterator itHits = hits.iterator();
    while (itHits.hasNext()) {
        Hit hit = (Hit) itHits.next();
        Document document = hit.getDocument();
        String itemId = document.getField("itemId").stringValue();
        System.out.println("itemId=" + itemId);
    }

    indexSearcher.close();
    indexReader.close();


Ok, ok. But if it opens, what's my real problem? Making the little
search above, the Document that I got was another one, with information
different from the original one I was looking for (the one with the
itemId field = 680804). The whole document was another document (but a
valid document, one that I had indexed before). The itemId value that I
got, the one printed by the application above, was 578340. Wow!!

I can reproduce this error any time with this code or with Luke on this
corrupted index, but it was terrible for me to find the exact point of
this fault.

I've reindexed everything, and that solved my problem. But I want to
know if someone has any idea why this happened...
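
As an aside, Lucene 2.3 and later ship a diagnostic tool,
org.apache.lucene.index.CheckIndex, that walks every segment and reports
per-segment problems; it may or may not have caught this particular
damage, but it is worth a try on a suspect index. Roughly, with the jar
name assumed:

    java -cp lucene-core-2.3.1.jar org.apache.lucene.index.CheckIndex C/index/index.2008-03-19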

Thanks people!

[]s,

Lucas Teixeira
[EMAIL PROTECTED]




Re: Update a field without reindexing the entire document?

2008-03-26 Thread Erik Hatcher


On Mar 26, 2008, at 4:28 AM, Vinci wrote:
One question: if the target field is a multi-value field, what will be
the consequence of the update with SOLR-139: overwriting or appending?


You can specify, when you update a field, how that works.

SOLR-139, though, seems a long way from being included in Solr; it
needs lots of work.  (But it is being used on Collex, a project I
worked on that allows documents in Solr to be tagged/annotated.)


Erik



Re: How to index multiple sites with option of combining results in search

2008-03-26 Thread Dietrich
I understand that, and that makes sense. But, coming back to the
original question:
>  >  When performing searches,
>  >  I need to be able to search against any combination of sites.
>  >  Does anybody have suggestions what the best practice for a scenario
>  >  like that would be, considering  both indexing and querying
>  >  performance? Put everything into one index and filter when performing
>  >  the queries, or creating a separate index for each one and combining
>  >  results when performing the query?

Are there any established best practices for that?

-ds

On Tue, Mar 25, 2008 at 11:25 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Dietrich,
>
>  I pointed to SOLR-303 because 275 * 200,000 looks like too big a number 
> for a single machine to handle.
>
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>  - Original Message 
>  From: Dietrich <[EMAIL PROTECTED]>
>  To: solr-user@lucene.apache.org
>
>
> Sent: Tuesday, March 25, 2008 7:00:17 PM
>  Subject: Re: How to index multiple sites with option of combining results in 
> search
>
>  On Tue, Mar 25, 2008 at 6:12 PM, Otis Gospodnetic
>  <[EMAIL PROTECTED]> wrote:
>  > Sounds like SOLR-303 is a must for you.
>  Why? I see the benefits of using a distributed architecture in
>  general, but why do you recommend it specifically for this scenario.
>  > Have you looked at Nutch?
>  I don't want to (or need to) use a crawler. I am using a crawler-based
>  system now, and it does not offer the flexibility I need when it comes
>  to custom schemas and faceting.
>  >
>  >  Otis
>  >  --
>  >  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  >
>  >
>  >
>  >  - Original Message 
>  >  From: Dietrich <[EMAIL PROTECTED]>
>  >  To: solr-user@lucene.apache.org
>  >  Sent: Tuesday, March 25, 2008 4:15:23 PM
>  >  Subject: How to index multiple sites with option of combining results in 
> search
>  >
>  >  I am planning to index 275+ different sites with Solr, each of which
>  >  might have anywhere up to 200 000 documents. When performing searches,
>  >  I need to be able to search against any combination of sites.
>  >  Does anybody have suggestions what the best practice for a scenario
>  >  like that would be, considering  both indexing and querying
>  >  performance? Put everything into one index and filter when performing
>  >  the queries, or creating a separate index for each one and combining
>  >  results when performing the query?
>  >
>  >
>  >
>  >
>
>
>
>


Term frequency

2008-03-26 Thread Tim Mahy
Hi All,

is there a way to get the term frequency per found result back from Solr ?

Greetings,
Tim




Info Support - http://www.infosupport.com



Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Ryan McKinley

Jeryl Cook wrote:

Top often-requested features:
1. Make it an option to hook the "RAMDirectory" into Terracotta
(billions of items in an index, anyone? It would be possible using
this.)


This is noted in: https://issues.apache.org/jira/browse/SOLR-465

Out of curiosity, any sense of performance with a Terracotta index?  It 
seems like it would have to be *substantially* slower.  Also, if it is a 
RAM directory, does it persist?


If you're looking to support billions of docs, perhaps consider:
http://wiki.apache.org/solr/DistributedSearch

ryan


Replication of Segmented indexes

2008-03-26 Thread oleg_gnatovskiy

Hello, this is actually a repost of a question posed by Swarag. I don't think
he made the question quite clear, so let me give it a shot. It is known that
Solr has support for index replication, and it has support for index
segmentation. The question is, how would you use the replication tools with
a segmented index?
-- 
View this message in context: 
http://www.nabble.com/Replication-of-Segmented-indexes-tp16303343p16303343.html
Sent from the Solr - User mailing list archive at Nabble.com.



Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs.  I was
wondering if anyone could suggest some other options to try short of a
custom handler or building our own queries (DisMax does such a fine
job generally!).

We are using DisMax, and indexing media titles (books, music).  We
want our queries to be sensitive to stop-words, but not so sensitive
that we fail to match on missing or incorrect stop-words.  For
example, here are a set of queries and desired behavior:

* it -> matches It by Stephen King (high relevance) and other titles
with it therein, e.g. Some Like It Hot (lower relevance)
* the the -> matches music by The The, other titles with the therein
at lower relevance are fine
* the sound of music -> matches The Sound of Music high relevance
* a sound of music -> still matches The Sound of Music, lower relevance is fine
* the doors -> matches music by The Doors, even though it is indexed
just as "Doors" (our data supplier drops the definite article)
* the life -> matches titles The Life with high relevance, matches
titles of just Life with lower relevance

Basically, we want direct matches (including stop-words) to be highly
relevant and we use the phrase query mechanism for that, but we also
want matches if the user mis-remembers the correct (stopped)
prepositions or inserts a few irrelevant stop-words (like articles).
We see this in the wild with non-trivial frequency -- the wrong choice
of preposition ("on mice and men") or an article used that our data
supplier didn't include in the original version ("doors").

One thing we tried is to include both a stopped version and a
non-stopped version of the title in the qf field, in the hopes that
this would retrieve all titles without stop-words and still allow us
to include pure stop-word queries ("it").  However, DisMax constructs
queries such that mixing stopped and non-stopped fields doesn't work
as one might hope, as described well here:

http://www.nabble.com/DisMax-request-handler-doesn%27t-work-with-stopwords--td11015905.html#a2461

Since qf controls the initial set of results retrieved for DisMax, and
we don't want to use a pure stopped set of fields there (because we
won't match on "it" as a query) nor a pure non-stopped set (won't get
results for "a sound of music"), we'd seem to be out of luck unless we
can figure out a way to augment the qf coverage.
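
For concreteness, the attempted configuration looked roughly like this
(field and handler names invented for illustration; title_stopped
applies stop-word removal at index and query time, title_full does not):

    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
      <lst name="defaults">
        <str name="qf">title_stopped^2.0 title_full^1.0</str>
        <str name="pf">title_full^5.0</str>
      </lst>
    </requestHandler>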

We've tried relaxing query term requirements to allow a missing word
or two in the query via mm, but recall is amped up too much since
non-stop-words tend to be dropped and you get a lot of results that
match primarily just across stop-words.

We've also considered creating a sort of equivalence class for all
stop-words (defining synonyms to map stops to some special token)
which would allow mis-remembered stop-words to be conflated, but then
something like "it" would match anything that contained any stop-word
-- again, too high on the recall.

What I think we want is something like an "optional stop-word DisMax"
that would mark stops as optional and construct queries such that
stop-words aren't passed into fields that apply stop-word removal in
query clauses (if that makes sense).  Has anyone done anything similar
or found a better way to handle stops that exhibits the desired
behavior?

Thanks in advance for any thoughts!  And, being new to Solr, apologies
if I'm confused in my reasoning somewhere.

Ron


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Daniel Papasian

[EMAIL PROTECTED] wrote:

Quoting Daniel Papasian <[EMAIL PROTECTED]>:

Or if you're adding a new field to the schema (perhaps the most common
need for editing schema.xml), you don't need to reindex any documents at
all, right?  Unless I'm missing something?


Well, it all depends on whether that "field" (not a Solr/Lucene field)
exists in the already indexed material but was never indexed. Let's say
we have a bunch of articles with an "author" field that someone decided
didn't need to be in the index. But then later he changes his mind and
adds the author field to the schema. In this case all articles that have
a populated author field should now be reindexed.


Yeah, I guess the use case I was thinking of was someone who had
multiple different types of content in their index (say, articles,
events, organizations). When they add a new content type (book review)
and find the need to add a new field for that content type (say,
publisher) that is only relevant for that type, then, since the field is
added before any data that would have it is indexed, I believe you'd be
fine making that schema change without reindexing anything.



I suppose if you add a new dynamic field specification that conflicts
with existing fields, reindexing is probably a good idea, but if you're
doing that... well, I probably don't want to know.


I must say that I'm a bit confused by these dynamic fields. Can someone
tell me if there is any reasonable use of dynamic fields without having
the "variable type" (for example i for int/sint) in the name?


Well, perhaps this is fulfilling your requirement on a technicality, but 
there's always higher-order types...  Offhand, I can think of cases 
where you might want to define a dynamic field like *_propername or 
*_cost, and then you'd be able to use fields like author_propername or 
editor_propername, or book_cost or volume_cost, or what have you.
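
A sketch of such declarations in schema.xml (types illustrative):

    <dynamicField name="*_i"          type="sint"   indexed="true" stored="true"/>
    <dynamicField name="*_propername" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_cost"       type="sfloat" indexed="true" stored="true"/>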


Daniel



Re: Solr commits automatically on appserver shutdown

2008-03-26 Thread Yonik Seeley
On Wed, Mar 26, 2008 at 10:18 AM, Noble Paul നോബിള്‍ नोब्ळ्
<[EMAIL PROTECTED]> wrote:
>  If my appserver fails during an update, or if I do a planned shutdown
>  without wanting to commit my changes, Solr does not allow it?
>  It commits whatever unfinished changes there are.
>  Is this by design?
>  Can I change this behavior?

You can't currently avoid it.
With the newer changes in Lucene though, it should be possible I think.

-Yonik


Re: Replication of Segmented indexes

2008-03-26 Thread Yonik Seeley
On Wed, Mar 26, 2008 at 11:34 AM, oleg_gnatovskiy
<[EMAIL PROTECTED]> wrote:
>  Hello, this is actually a repost of a question posed by Swarag. I don't think
>  he made the question quite clear, so let me give it a shot. It is known that
>  Solr has support for index replication, and it has support for index
>  segmentation. The question is, how would you use the replication tools with
>  a segmented index?

Have a master for each segment?

-Yonik


Search fail if copyField absent?(+ Jetty Question)

2008-03-26 Thread Vinci

Hi,

While I am testing the Solr schema (1.3 nightly) with the example mySolr
on Jetty, using the exampledocs and the default schema,
I see the declaration: 

it should be indexed, so I commented this out:


However, the search fails. After I clear the index, uncomment the
copyField, and commit the documents again, the search works again.

I find this very confusing, as the wiki and schema.xml say this is
optional... is this a bug, or is the wiki information wrong?

--
For mySolr on Jetty, some questions:
1. Do I only need to reload the server when I change the JSP and HTML,
schema, and config files, but not for index updates?
2. Can I get faster reloading if I extract the war file contents into
webapps and start the application from the directory instead of the war
file?
3. Every time I run start.jar I get this exception:
2008/3/27 AM 12:55:55 org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:136)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:118)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:953)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:968)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:50)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:797)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
although Solr runs fine, I still worry about the hidden exception: is
this exception harmful?

Thank you
Vinci
-- 
View this message in context: 
http://www.nabble.com/Search-fail-if-copyField-absent-%28%2B-Jetty-Question%29-tp16306854p16306854.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Making stop-words optional with DisMax?

2008-03-26 Thread Otis Gospodnetic
Hi Ron,

I skimmed your email.  You are indexing book and music titles.  Those tend to 
be short.  Do you really benefit from removing stop words in the first place?  
I'd try keeping all the stop words and seeing if that has any negative 
side-effects in your context.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Ronald K. Braun <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 11:41:46 AM
Subject: Making stop-words optional with DisMax?

I've followed the stop-word discussion with some interest, but I've
yet to find a solution that completely satisfies our needs. [...]

Ron





Re: How to index multiple sites with option of combining results in search

2008-03-26 Thread Otis Gospodnetic
Dietrich,

I don't think there are established practices in the open (yet).  You could 
design your application with a site(s)->shard mapping and then, knowing which 
sites are involved in the query, search only the relevant shards.  This will be 
efficient, but it would require careful management on your part.

Putting everything in a single index would just not work with "normal" 
machines, I think.
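
With the distributed search work in SOLR-303, that mapping would surface
at query time as a shards parameter, roughly like this (host names are
placeholders):

    http://search-front:8983/solr/select?q=foo&shards=shard1:8983/solr,shard2:8983/solr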

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Dietrich <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 10:47:55 AM
Subject: Re: How to index multiple sites with option of combining results in 
search

I understand that, and that makes sense. But, coming back to the
original question:
>  >  When performing searches,
>  >  I need to be able to search against any combination of sites.
>  >  Does anybody have suggestions what the best practice for a scenario
>  >  like that would be, considering  both indexing and querying
>  >  performance? Put everything into one index and filter when performing
>  >  the queries, or creating a separate index for each one and combining
>  >  results when performing the query?

Are there any established best practices for that?

-ds

[...]





Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Otis Gospodnetic
Hey Ryan, why do you say a Lucene/Solr index served via Terracotta would be 
substantially slower?
I often wanted to try Terracotta + Lucene, but... time.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Ryan McKinley <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 10:52:45 AM
Subject: Re: Update schema.xml without restarting Solr?

Jeryl Cook wrote:
> Top often-requested features:
> 1. Make it an option to hook the "RAMDirectory" into Terracotta
> (billions of items in an index, anyone? It would be possible using
> this.)

This is noted in: https://issues.apache.org/jira/browse/SOLR-465

Out of curiosity, any sense of performance with a Terracotta index?  It 
seems like it would have to be *substantially* slower.  Also, if it is a 
RAM directory, does it persist?

If you're looking to support billions of docs, perhaps consider:
http://wiki.apache.org/solr/DistributedSearch

ryan





Re: How to index multiple sites with option of combining results in search

2008-03-26 Thread Dietrich
Makes sense, but probably overkill for my requirements. I wasn't
really talking 275 * 200,000; more likely the total would be something
like four million documents. I was under the assumption that a single
machine, or a simple distributed index, should be able to handle that.
Is that wrong?

-ds

On Wed, Mar 26, 2008 at 2:05 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Dietrich,
>
>  I don't think there are established practices in the open (yet).  You could 
> design your application with a site(s)->shard mapping and then, knowing which 
> sites are involved in the query, search only the relevant shards.  This will 
> be efficient, but it would require careful management on your part.
>
>  Putting everything in a single index would just not work with "normal" 
> machines, I think.
>
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>  - Original Message 
>  From: Dietrich <[EMAIL PROTECTED]>
>  Subject: Re: How to index multiple sites with option of combining results in 
> search
>  [...]


Re: How to index multiple sites with option of combining results in search

2008-03-26 Thread Otis Gospodnetic
Ah, that's a very different number.  Yes, assuming your docs are web pages, a 
single reasonably equipped machine should be able to handle that and a few 
dozen QPS.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Dietrich <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 2:18:53 PM
Subject: Re: How to index multiple sites with option of combining results in 
search

Makes sense, but probably overkill for my requirements. I wasn't
really talking 275 * 200,000; more likely the total would be something
like four million documents. I was under the assumption that a single
machine, or a simple distributed index, should be able to handle that.
Is that wrong?

-ds

On Wed, Mar 26, 2008 at 2:05 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Dietrich,
>
>  I don't think there are established practices in the open (yet). [...]





Are 1.2 and 1.3/trunk indexes compatible?

2008-03-26 Thread Chris Harris
What are the odds that I can plop an index created in Solr 1.2 into a
Solr 1.3 and/or Solr trunk install and have things work correctly?
This would be more convenient than reindexing, but I'm wondering how
dangerous it is, and hence how much testing is required.


Facet searching and facet hierarchies.

2008-03-26 Thread A . Z
I have a couple of questions concerning facet searching.

As I understand it, after passing facets to Solr, one must
manually add facet results to the search to narrow it.
E.g. I search for "foo bar" and click some facet; must I now
search for 'foo bar facet:value'? Must I include + signs?
I'm using solrphpclient; maybe there's a PHP API that can parse all this?

Another question is facet hierarchies: what's the easiest way
to throw away facets when one is the parent of the other?
E.g. it makes no sense to display a 'country' facet when a
'city' facet has been clicked. Storing all facets hardcoded is not
possible; they may change. How is this usually done? Again, any API
that can do this? Thanks!
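
On the first question: narrowing on a clicked facet is normally done
with a separate fq (filter query) parameter rather than by rewriting q,
so no + signs are needed. A sketch (field names assumed):

    q=foo+bar&facet=true&facet.field=country&facet.field=city
    q=foo+bar&fq=city:Amsterdam&facet=true&facet.field=country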


Re: Are 1.2 and 1.3/trunk indexes compatible?

2008-03-26 Thread Yonik Seeley
On Wed, Mar 26, 2008 at 3:05 PM, Chris Harris <[EMAIL PROTECTED]> wrote:
> What are the odds that I can plop an index created in Solr 1.2 into a
>  Solr 1.3 and/or Solr trunk install and have things work correctly?

Should be relatively high.
I'd never do it on a live index, regardless of what is advertised,
without first trying it on a test copy.  If you are in a replicated
environment, try on a single slave.  If that works correctly, then
convert all other slaves before converting the master.

-Yonik


Re: Are 1.2 and 1.3/trunk indexes compatible?

2008-03-26 Thread Ryan McKinley

It *should* work as a drop in replacement.  Check:
http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt

So you should be good.  Note that trunk has a newer version of Lucene, 
so the index will be automatically upgraded and you can't go back from 
there.


so make sure to backup before trying, but it should go smoothly...

ryan


Chris Harris wrote:

What are the odds that I can plop an index created in Solr 1.2 into a
Solr 1.3 and/or Solr trunk install and have things work correctly?
This would be more convenient than reindexing, but I'm wondering how
dangerous it is, and hence how much testing is required.





Re: Are 1.2 and 1.3/trunk indexes compatible?

2008-03-26 Thread Chris Harris
Looks like that can't-go-back bit hasn't made it into CHANGES.txt yet.
Might want to eventually add that somewhere particularly obvious, to
help out people who assume they could downgrade. Maybe under
"Upgrading from Solr 1.2"?

On Wed, Mar 26, 2008 at 12:59 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> It *should* work as a drop in replacement.  Check:
>  http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt
>
>  So you should be good.  Note that trunk has a newer version of Lucene,
>  so the index will be automatically upgraded and you can't go back from
>  there.
>
>  so make sure to backup before trying, but it should go smoothly...
>
>  ryan
>
>
>  Chris Harris wrote:
>  > What are the odds that I can plop an index created in Solr 1.2 into a
>  > Solr 1.3 and/or Solr trunk install and have things work correctly?
>  > This would be more convenient than reindexing, but I'm wondering how
>  > dangerous it is, and hence how much testing is required.


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Ryan McKinley
just intuition - haven't tried it, so I'd love to be proved wrong. 
Instrumenting objects and magically passing them around seems like it 
would be slower than the tuned approach used in SOLR-303.

It looks like they have a Lucene example:
http://www.terracotta.org/confluence/display/integrations/Lucene

Also, I don't understand how Terracotta could get Lucene past the 
Integer.MAX_VALUE limit, because it does not change the API; it works 
within it.


ryan


Otis Gospodnetic wrote:

Hey Ryan, why do you say a Lucene/Solr index served via Terracotta would be 
substantially slower?
I often wanted to try Terracotta + Lucene, but... time.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Ryan McKinley <[EMAIL PROTECTED]>
Subject: Re: Update schema.xml without restarting Solr?
[...]








Re: Are 1.2 and 1.3/trunk indexes compatible?

2008-03-26 Thread Ryan McKinley

good point:
http://svn.apache.org/viewvc/lucene/solr/trunk/CHANGES.txt?r1=641573&r2=641572&pathrev=641573

ryan

Chris Harris wrote:

Looks like that can't-go-back bit hasn't made it into CHANGES.txt yet.
Might want to eventually add that somewhere particularly obvious, to
help out people who assume they could downgrade. Maybe under
"Upgrading from Solr 1.2"?

On Wed, Mar 26, 2008 at 12:59 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

It *should* work as a drop in replacement.  Check:
 http://svn.apache.org/repos/asf/lucene/solr/trunk/CHANGES.txt

 So you should be good.  Note that trunk has a newer version of Lucene,
 so the index will be automatically upgraded and you can't go back from
 there.

 so make sure to backup before trying, but it should go smoothly...

 ryan


 Chris Harris wrote:
 > What are the odds that I can plop an index created in Solr 1.2 into a
 > Solr 1.3 and/or Solr trunk install and have things work correctly?
 > This would be more convenient than reindexing, but I'm wondering how
 > dangerous it is, and hence how much testing is required.






Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Yonik Seeley
On Wed, Mar 26, 2008 at 4:41 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> just intuition - haven't tried it, so I'd love to be proved wrong.
>  Instrumenting objects and magically passing them around seems like it
>  would be slower than the tuned approach used in SOLR-303.

Yep, that's my sense too.  No magic solutions when it comes to scalability.

-Yonik


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Jeryl Cook
I wouldn't call the Terracotta approach magic (smile)... it's being used
quite a bit in many scalable, high-performing projects...

I personally used Terracotta and Lucene, and it worked, but I did not
try to "cluster" it with multiple Terracotta workers across nodes plus
the Terracotta master... just a single box with two Tomcat instances...

However, "talk is cheap": if I have the time over the next few weeks
I'll make a benchmark test based on Terracotta and Lucene, with
maybe 3 nodes and 1 million documents...
maybe some others can do the same :)...

FYI: 
http://www.terracotta.org/confluence/display/tcforge/Proposal+-+Terracotta+for+Lucene

Jeryl Cook

On Wed, Mar 26, 2008 at 5:16 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Wed, Mar 26, 2008 at 4:41 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>  > just intuition - haven't tried it, so I'd love to be proved wrong.
>  >  Instrumenting objects and magically passing them around seems like it
>  >  would be slower than the tuned approach used in SOLR-303.
>
>  Yep, that's my sense too.  No magic solutions when it comes to scalability.
>
>  -Yonik
>



-- 
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"..Act your age, and not your shoe size.." -Prince(1986)


Re: Search fail if copyField absent?(+ Jetty Question)

2008-03-26 Thread Chris Hostetter

: it should be indexed, so I commented out this copyField.
: 
: However, the search fails. After I clear out the index, uncomment the
: copyField, and commit the documents again, the search works again.
: 
: I find this very confusing, as the wiki and the schema.xml say this is
: optional... is this a bug, or is the wiki information wrong?

are you searching on the "text" field ... in the example schema it is the 
default search field, so unless you are explicitly putting data in that 
field for your docs, that copyField may be the only way info is getting 
into that field ... without it, it's not surprising that searches on the text 
field wouldn't work.

: 1. I only need to reload the server when I changed the JSP and html, schema
: and config file but no need for index update?

correct ... just send a <commit/> command.
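
For example, against the example server (a minimal sketch; the URL assumes 
the default port and webapp path):

   curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'

The same endpoint takes <add> and <delete> messages, so index changes 
become searchable after the commit without touching the servlet container.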

: 2. Can I gain faster reloading if I extract the war file content into
: webapps and then start the application from directory but not the war file?

I doubt it .. but that would really depend on how the application server 
works.

: 3. Every time I load the start.jar I will get this exception:
: 2008/3/27 AM 12:55:55 org.apache.solr.common.SolrException log
: SEVERE: java.lang.NullPointerException
...
: 
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:50)

...that looks like this known bug...

http://issues.apache.org/jira/browse/SOLR-509


-Hoss



Using Field Collapsing and Filter Query to implement JOIN

2008-03-26 Thread Lester Scofield
Hello solr people,
I'm very new to Solr, so please forgive any misunderstanding on my part.
I am hoping to do a JOIN across documents.

Let me start with the 4 documents:

<doc>
  <field name="type">part1</field>
  <field name="key">ABC</field>
  <field name="foo">this is a test</field>
</doc>
<doc>
  <field name="type">part2</field>
  <field name="key">ABC</field>
  <field name="bar">of a fake JOIN</field>
</doc>
<doc>
  <field name="type">part1</field>
  <field name="key">XYZ</field>
  <field name="foo">this is a test</field>
</doc>
<doc>
  <field name="type">part2</field>
  <field name="key">XYZ</field>
  <field name="bar">of a mismatch</field>
</doc>



If I wanted to form a fake query that was something like:
(+type:part1 AND +foo:test) AND (+type:part2 AND +bar:join) AND (JOIN-ON:key)

I could first change this into:
(+type:part1 AND +foo:test) OR (+type:part2 AND +bar:join)
and field collapse on field "key"

I would then use Filter Query to remove any hits that did not collapse
into a group of 2 documents, with something like:
(+collapse:2)
In this way I would exclude the match on the 3rd doc above.
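
As a rough sketch of the request, assuming the collapse.field parameter
from the SOLR-236 field-collapsing patch (the collapse-count filter is the
hypothetical part this mail is proposing):

   /select?q=(type:part1 AND foo:test) OR (type:part2 AND bar:join)&collapse.field=key&fq=collapse:2

Only the ABC pair would survive: XYZ collapses to a group of one, because
its part2 document doesn't match bar:join.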

I know there is a problem with being able to get at the number of
collapsed items in the result set like this, but maybe an updated
patch to "Field Collapse" could be made.
Something that lets you name a field that will house the number of
documents collapsed.

I am not sure how field collapse would work with Filter Query, is
there a problem here?

I'm also not sure if it would be best to use Filter Query or the other
filter that happens after the main query (I get the names confused).

So what do you all think?  I hope this could at least spark an idea on
how this could really happen.

Thank you all,
Lester


Re: Update schema.xml without restarting Solr?

2008-03-26 Thread Chris Hostetter

: > Top often-requested feature:
: > 1. Make an option for using the "RAMDirectory" to hook in Terracotta
: > (billions of items in an index, anyone? It would be possible using
: > this.)
: 
: This is noted in: https://issues.apache.org/jira/browse/SOLR-465

...and if people posted comments in the issue saying they tried the 
patch and it worked well for them (or didn't work well and described 
their use case and what they think a better API would be) that would help 
raise the visibility of the issue and increase the 
likelihood of getting it committed.

API changes, particularly new "plugin hooks", have to be made carefully 
because once they are released they have to be supported indefinitely.  so 
we need people to really help us test them out while they are still just 
patches.


-Hoss



Re: How to use Solr in java program

2008-03-26 Thread Chris Hostetter

: I am a new user of Solr and I want to know how I can use Solr in my own java

http://wiki.apache.org/solr/SolJava

: program, and what the different possibilities of using Solr are. Is a web
: servlet container necessary to run and use Solr? Is a servlet container such
: as Tomcat enough to use all the power offered by Solr? I heard about the
: possibility of using Solr as a web service; is that the best way, especially
: if a program is written in a language other than Java?

I recommend using Solr as a webservice, even if your client is Java.  But 
there are also options for embedding Solr directly into your applications 
using SolrJ.
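
To illustrate the webservice route from Java, a query with the trunk SolrJ
client might look roughly like this (a sketch; the URL and query string are
made up, and the SolrJ API is still settling on trunk):

   import org.apache.solr.client.solrj.SolrQuery;
   import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
   import org.apache.solr.client.solrj.response.QueryResponse;

   public class SolrJExample {
     public static void main(String[] args) throws Exception {
       // point at a running Solr instance (hypothetical local example server)
       CommonsHttpSolrServer server =
           new CommonsHttpSolrServer("http://localhost:8983/solr");
       SolrQuery query = new SolrQuery("solr");  // searches the default field
       query.setRows(10);
       QueryResponse rsp = server.query(query);
       System.out.println("hits: " + rsp.getResults().getNumFound());
     }
   }

The SolrServer interface also has an embedded implementation, so client
code written this way can switch between HTTP and in-process use.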





-Hoss



Re: Adding custom field for sorting?

2008-03-26 Thread Chris Hostetter
: 
: Inspired by the previous post, is it possible to add my own custom field and
: use it for sorting the search result?
: If it is possible, what are the steps? Do I need to modify the source
: code?

adding a custom field is easy, just add the <field/> to your schema.xml 
and put data in it.
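
For example, a sortable field might be declared like this in schema.xml 
(the field name is made up; "sint" is the sortable-int type from the 
example schema):

   <field name="popularity" type="sint" indexed="true" stored="true"/>

Search results can then be ordered with something like &sort=popularity desc.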

adding a custom *fieldtype* is trickier ... compile your new fieldtype 
against the solr code base, package it in a jar, and put that jar in the 
plugin lib dir...

http://wiki.apache.org/solr/SolrPlugins

...no need to modify any Solr code.




-Hoss



Re: Document Path issue and change the layout in the example

2008-03-26 Thread Chris Hostetter

: I started the indexing with jetty and then I came up with some questions...
: 1. If I use the example start.jar, what should be my document system layout?
: What is the essential folder?
: solr_jar
: |_start.jar
: |_solrhome
: |_etc
: |_lib
: |_logs

i'm not sure what "solr_jar" is ... but most of those directories there 
are jetty specific. "solrhome" (or just "solr" in the example as provided) 
is the only thing solr specific.  it can live anywhere you want as long as 
your servlet container knows how to find it.

: And where is the solr main library located? outside of the example?

it's in the solr.war.

: *if I need to change the jsp of the solr admin, do I need to change the war
: file content?

yes.

: 2. must the post.jar be placed in the document system root? Can the
: documents be placed in a folder?

post.jar is just an example client.  You don't have to use it; you can 
send your data to Solr any way you want (using any language you want) from 
any computer you want.

: 3. and what is the use of bin/ in solr home folder?

per the README file located in the example solr home dir...

   bin/
This directory is optional.  It is the default location used for
keeping the replication scripts.

http://svn.apache.org/viewvc/lucene/solr/trunk/example/solr/README.txt?view=markup



-Hoss



Re: Highlighting Quoted Phrases

2008-03-26 Thread Chris Harris
On Tue, Mar 25, 2008 at 4:25 PM, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
>  On Mar 25, 2008, at 6:31 PM, Chris Harris wrote:
>
>  > working pretty well, but my testers have
>  > discovered something they find borderline unacceptable. If they search
>  > for
>  >
>  >"stock market"
>  >
>  > (with quotes), then Solr correctly returns only documents where
>  > "stock" and "market" appear as adjacent words. Two problems though:
>  > First, Solr is willing to pick snippets where only one of the terms
>  > appears, e.g.
>  >
>  >...and changes in the market regulation environment...
>
>
>  I recently asked about the same thing. There's a patch in lucene (not
>  in trunk yet) to support this.

Oh dear, you did ask the same question very recently. Sorry to re-ask
the same thing, everybody.

For the record, that thread is called "highlighting pt2: returning
tokens out of order from PhraseQuery", and it's (currently anyway)
available at:

http://www.nabble.com/highlighting-pt2%3A-returning-tokens-out-of-order-from-PhraseQuery-to16156718.html


Re: Beginner questions: Jetty and solr with utf-8 + cached page + dedup

2008-03-26 Thread Thorsten Scherler
On Tue, 2008-03-25 at 10:56 -0700, Vinci wrote:
> Hi,
> 
> Thanks for your reply.
> Question about applying xslt: if I use saxon, where should saxon.jar be
> located if I'm using the example jetty server? lib/ inside example/, or
> outside example/?

http://wiki.apache.org/solr/mySolr
"...
Typically it's not recommended to have your front end users/clients
hitting Solr directly as part of an HTML form submit
..."

In the above page there you find answers to many of your questions.

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Making stop-words optional with DisMax?

2008-03-26 Thread Ronald K. Braun
Hi Otis,

> I skimmed your email.  You are indexing book and music titles.  Those tend to 
> be short.
> Do you really benefit from removing stop words in the first place?  I'd try 
> keeping all the stop
> words and seeing if that has any negative side-effects in your context.

Thanks for your skim and response!  We do keep all stop-words -- as
you say, makes sense since we aren't dealing with long free text
fields and because some titles are pure stops.

The negative side-effects lie in stop-words being treated with the
same importance as non-stop-words for matching purposes.  This
manifests in two ways:  1. Users occasionally get the stop-words wrong
-- say, wrong choice of preposition, which torpedoes the query since
some of the query terms aren't present in the target.  For example "on
mice and men" may return nothing (no match for "on") even though it is
equivalent to "of mice and men" in a stopped sense.  2. Our original
indexed data doesn't always have leading articles and such.  For
example, we index on "Doors" since that is our sourced data but
frequently get queried for "The Doors".  Articles and prepositions
(the stuff of good stop-lists) seem to me to be in a fuzzier class --
use 'em if you have 'em during matching, but don't kill your queries
because of them.  Hence some desire to make them in some way
"optional" during matching.

Ron


Re: Adding custom field for sorting?

2008-03-26 Thread Vinci

Hi hossman,

Thank you for your reply.
Some questions on sorting: 
1. Does Solr have a limit, e.g. a % or a number, to limit the number of
documents involved in sorting? Or does it just sort all documents?
2. Does the order in the 'sort' parameter refer to the sorting order? (sort by
the first argument first, then the second, and so on)

Thank you,
Vinci


hossman wrote:
> 
> : 
> : Inspired by the previous post, is it possible to add my own custom field and
> : use it for sorting the search result?
> : If it is possible, what are the steps? Do I need to modify the source
> : code?
> 
> adding a custom field is easy, just add the <field/> to your schema.xml 
> and put data in it.
> 
> adding a custom *fieldtype* is trickier ... compile your new fieldtype 
> against the solr code base, package it in a jar, and put that jar in the 
> plugin lib dir...
> 
>   http://wiki.apache.org/solr/SolrPlugins
> 
> ...no need to modify any Solr code.
> 
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Adding-custom-field-for-sorting--tp16269118p16320156.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Search fail if copyField absent?(+ Jetty Question)

2008-03-26 Thread Vinci

Hi hossman,

Thank you for your reply, it helps a lot... just a few more questions here:


hossman wrote:
> 
> 
> : it should be indexed, so I commented out this copyField.
> : 
> : However, the search fails. After I clear out the index, uncomment the
> : copyField, and commit the documents again, the search works again.
> : 
> : I find this very confusing, as the wiki and the schema.xml say this is
> : optional... is this a bug, or is the wiki information wrong?
> 
> are you searching on the "text" field ... in the example schema it is the 
> default search field, so unless you are explicitly putting data in that 
> field for your docs, that copyField may be the only way info is getting 
> into that field ... without it, it's not surprising that searches on the text 
> field wouldn't work.
> 
I didn't change anything else about the fields in the default schema, so I
think you are correct... One more question: can I use a parameter to change
the search field when a query comes in?



hossman wrote:
> 
> : 2. Can I gain faster reloading if I extract the war file content into
> : webapps and then start the application from directory but not the war
> file?
> 
> I doubt it .. but that would really depend on how the application server 
> works.
> 
I see jetty does some caching... so maybe the effect will appear when I
recompile a war file.
But at least I can change the jsp as well as the style sheet more easily :)


hossman wrote:
> 
> : 3. Every time I load the start.jar I will get this exception:
> : 2008/3/27 AM 12:55:55 org.apache.solr.common.SolrException log
> : SEVERE: java.lang.NullPointerException
>   ...
> :
> org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:50)
> 
> ...that looks like this known bug...
> 
> http://issues.apache.org/jira/browse/SOLR-509
> 
Seems they didn't fix it yet...but it seems harmless.

Thank you,
Vinci

-- 
View this message in context: 
http://www.nabble.com/Search-fail-if-copyField-absent-%28%2B-Jetty-Question%29-tp16306854p16320160.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: positionIncrementGap - what is its value meaning?

2008-03-26 Thread Vinci

Hi Erik,

Thank you for your help. This is useful.
Some follow-up questions:


Erik Hatcher wrote:
> 
> ..The value you set that gap to depends on whether you'll  
> be using sloppy phrase queries, and how sloppy they'll be and whether  
> you desire matching across field instances.  
> 

1. If I don't care about sloppy queries, can I just set any number larger than 0
and then it will work?
2. If the sloppy queries use a slop larger than the gap, what will happen?

Thank you,
Vinci

-- 
View this message in context: 
http://www.nabble.com/positionIncrementGap---what-is-its-value-meaning--tp16296677p16320265.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet searching and facet hierarchies.

2008-03-26 Thread Erik Hatcher


On Mar 26, 2008, at 3:34 PM, A.Z wrote:

As I understand, after passing facets to Solr, one must
manually add facet results to the search to narrow it.
E.g. I search for "foo bar" and click some facet. Must I now
search for 'foo bar facet:value'? Must I include + signs?
I'm using solrphpclient; maybe there's an API (PHP) that can parse
all this?


What you're after in narrowing results by facet constraints is the  
"fq" parameter:




Another question is facet hierarchies: what's the easiest way
to throw away facets when one is the parent of another?
E.g. it makes no sense to display a 'country' facet when a
'city' facet has been clicked. Hardcoding all the facets is not
possible; they may change. How is this usually done? Again, is there
any API that can do this?




They may change?  If that is the case then you'd reindex the  
documents that changed to keep them in sync.


As for whether to display a facet or not based on other facets being  
selected, this seems more like a Solr client / UI issue in showing  
the right thing in the right context.


Erik



Re: positionIncrementGap - what is its value meaning?

2008-03-26 Thread Erik Hatcher


On Mar 26, 2008, at 10:15 PM, Vinci wrote:

Erik Hatcher wrote:


..The value you set that gap to depends on whether you'll
be using sloppy phrase queries, and how sloppy they'll be and whether
you desire matching across field instances.



1. If I don't care about sloppy queries, can I just set any number larger
than 0

and then it will work?


Yes.

2. If the sloppy queries use a slop larger than the gap, what will  
happen?


Then you can match across field instance boundaries.
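
To make that concrete (the values and gap are illustrative): with a 
multiValued field whose type declares

   <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">

and a document holding the two values "stock market" and "crash report", 
the phrase query "market crash"~10 cannot match, since the gap puts roughly 
100 positions between the values, while "market crash"~100 can match across 
the boundary.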

Erik




RE: How to index multiple sites with option of combining results in search

2008-03-26 Thread Lance Norskog
In fact, 55m records works fine in Solr; assuming they are small records.
The problem is that the index files wind up in the tens of gigabytes. The
logistics of doing backups, snapping to query servers, etc. is what makes
this index unwieldy, and why multiple shards are useful.

Lance

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 26, 2008 11:22 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index multiple sites with option of combining results in
search

Ah, that's a very different number.  Yes, assuming your docs are web pages,
a single reasonably equipped machine should be able to handle that and a few
dozen QPS.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Dietrich <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 2:18:53 PM
Subject: Re: How to index multiple sites with option of combining results in
search

Makes sense, but probably overkill for my requirements. I wasn't really
talking about 275 * 200,000; more likely the total would be something like
four million documents. I was under the assumption that a single machine, or a
simple distributed index, should be able to handle that. Is that wrong?

-ds

On Wed, Mar 26, 2008 at 2:05 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> Dietrich,
>
>  I don't think there are established practices in the open (yet).  You
could design your application with a site(s)->shard mapping and then,
knowing which sites are involved in the query, search only the relevant
shards.  This will be efficient, but it would require careful management on
your part.
>
>  Putting everything in a single index would just not work with "normal"
machines, I think.
>
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>  - Original Message 
>  From: Dietrich <[EMAIL PROTECTED]>
>  To: solr-user@lucene.apache.org
>
>
> Sent: Wednesday, March 26, 2008 10:47:55 AM
>  Subject: Re: How to index multiple sites with option of combining results in search
>
>  I understand that, and that makes sense. But, coming back to the
>  original question:
>  >  >  When performing searches,
>  >  >  I need to be able to search against any combination of sites.
>  >  >  Does anybody have suggestions what the best practice for a scenario
>  >  >  like that would be, considering both indexing and querying
>  >  >  performance? Put everything into one index and filter when performing
>  >  >  the queries, or creating a separate index for each one and combining
>  >  >  results when performing the query?
>
>  Are there any established best practices for that?
>
>  -ds
>
>  On Tue, Mar 25, 2008 at 11:25 PM, Otis Gospodnetic
>  <[EMAIL PROTECTED]> wrote:
>  > Dietrich,
>  >
>  >  I pointed to SOLR-303 because 275 * 200,000 looks like a too big of a
>  >  number for a single machine to handle.
>  >
>  >  Otis
>  >  --
>  >  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  >
>  >  - Original Message 
>  >  From: Dietrich <[EMAIL PROTECTED]>
>  >  To: solr-user@lucene.apache.org
>  >  Sent: Tuesday, March 25, 2008 7:00:17 PM
>  >  Subject: Re: How to index multiple sites with option of combining results in search
>  >
>  >  On Tue, Mar 25, 2008 at 6:12 PM, Otis Gospodnetic
>  >  <[EMAIL PROTECTED]> wrote:
>  >  > Sounds like SOLR-303 is a must for you.
>  >  Why? I see the benefits of using a distributed architecture in
>  >  general, but why do you recommend it specifically for this scenario.
>  >  > Have you looked at Nutch?
>  >  I don't want to (or need to) use a crawler. I am using a crawler-based
>  >  system now, and it does not offer the flexibility I need when it comes
>  >  to custom schemas and faceting.
>  >  >
>  >  >  Otis
>  >  >  --
>  >  >  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  >  >
>  >  >  - Original Message 
>  >  >  From: Dietrich <[EMAIL PROTECTED]>
>  >  >  To: solr-user@lucene.apache.org
>  >  >  Sent: Tuesday, March 25, 2008 4:15:23 PM
>  >  >  Subject: How to index multiple sites with option of combining results in search
>  >  >
>  >  >  I am planning to index 275+ different sites with Solr, each of which
>  >  >  might have anywhere up to 200 000 documents. When performing searches,
>  >  >  I need to be able to search against any combination of sites.
>  >  >  Does anybody have suggestions what the best practice for a scenario
>  >  >  like that would be, considering both indexing and querying
>  >  >  performance? Put everything into one index and filter when performing
>  >  >  the queries, or creating a separate index for each one and combining
>  >  >  results when performing the query?
>  >  >
>  >
>


Re: Making stop-words optional with DisMax?

2008-03-26 Thread Walter Underwood
We use two fields, one with and one without stopwords. The exact
field has a higher boost than the other. That works pretty well.
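
A sketch of that setup (field and type names are made up): the exact field
keeps stopwords, the other strips them, e.g. in schema.xml

   <field name="title_stopped" type="text" indexed="true" stored="true"/>
   <field name="title_exact" type="text_nostop" indexed="true" stored="false"/>
   <copyField source="title_stopped" dest="title_exact"/>

where "text" includes solr.StopFilterFactory and the assumed "text_nostop"
type omits it; dismax then weights the fields with something like
qf="title_exact^2 title_stopped". A query whose stopwords line up with the
title picks up the boost from the exact field, while one with wrong or
missing stopwords still matches the stopped field.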

It helps to have an automated relevance test when tuning the boost
(and other things). I extracted queries and clicks from the logs
for a couple of months. Not perfect, but it is hard to argue with
32 million clicks.

wunder

On 3/26/08 6:05 PM, "Ronald K. Braun" <[EMAIL PROTECTED]> wrote:

> Hi Otis,
> 
>> I skimmed your email.  You are indexing book and music titles.  Those tend to
>> be short.
>> Do you really benefit from removing stop words in the first place?  I'd try
>> keeping all the stop
>> words and seeing if that has any negative side-effects in your context.
> 
> Thanks for your skim and response!  We do keep all stop-words -- as
> you say, makes sense since we aren't dealing with long free text
> fields and because some titles are pure stops.
> 
> The negative side-effects lie in stop-words being treated with the
> same importance as non-stop-words for matching purposes.  This
> manifests in two ways:  1. Users occasionally get the stop-words wrong
> -- say, wrong choice of preposition, which torpedoes the query since
> some of the query terms aren't present in the target.  For example "on
> mice and men" may return nothing (no match for "on") even though it is
> equivalent to "of mice and men" in a stopped sense.  2. Our original
> indexed data doesn't always have leading articles and such.  For
> example, we index on "Doors" since that is our sourced data but
> frequently get queried for "The Doors".  Articles and prepositions
> (the stuff of good stop-lists) seem to me to be in a fuzzier class --
> use 'em if you have 'em during matching, but don't kill your queries
> because of them.  Hence some desire to make them in some way
> "optional" during matching.
> 
> Ron



Re: Solr commits automatically on appserver shutdown

2008-03-26 Thread Noble Paul നോബിള്‍ नोब्ळ्
Can I make an API call to remove the stale indexsearcher so that the
documents do not get committed?

Basically what I need is a 'rollback'  feature
--Noble

On Wed, Mar 26, 2008 at 9:08 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On Wed, Mar 26, 2008 at 10:18 AM, Noble Paul നോബിള്‍ नोब्ळ्
>  <[EMAIL PROTECTED]> wrote:
>  >  If my appserver fails during an update, or if I do a planned shutdown
>  >  without wanting to commit my changes, Solr does not allow it?
>  >  It commits whatever unfinished changes there are.
>  >  Is it by design?
>  >  Can I change this behavior?
>
>  You can't currently avoid it.
>  With the newer changes in Lucene though, it should be possible I think.
>
>  -Yonik
>


Re: Solr commits automatically on appserver shutdown

2008-03-26 Thread Yonik Seeley
On Thu, Mar 27, 2008 at 12:11 AM, Noble Paul നോബിള്‍ नोब्ळ्
<[EMAIL PROTECTED]> wrote:
> Can I make an API call to remove the stale indexsearcher so that the
>  documents do not get committed?
>
>  Basically what I need is a 'rollback'  feature

This should be possible when Solr starts using Lucene's update,
delete, and deleteByQuery features on the IndexWriter.

-Yonik


Re: Highlight - get terms used by lucene

2008-03-26 Thread Chris Hostetter

: we use highlighting and snippets for our searches. Besides those two, I 
: would want to have a list of terms that lucene used for the 
: highlighting, so that I can pull out of a "Tim OR Antwerpen AND Ekeren" 
: the following terms : Antwerpen, Ekeren if let's say these are the only 
: terms that gave results ...

the closest you can get is the "explain" info in the debugging output.

currently that comes back as a big string you would need to parse, but 
since the topic of programmatically accessing that data seems to have come up 
quite a bit more than i ever really expected, i will point out that 
internally it's a fairly well structured class that could be output as a 
hierarchy of NamedLists (funny bit of trivia: i wrote that code once upon 
a time before Solr was an Apache project, but it wouldn't work because the 
XmlResponseWriter had a bug where it couldn't handle NamedLists more than 
3 levels deep)

a patch would be fairly simple if someone wanted to write one.



-Hoss



Re: synonyms

2008-03-26 Thread Chris Hostetter
: And if I search for "refrigerador", I'll have all results for "refrigerador",
: for "geladeira", and all results for the flexed words for what i've typed
: (refrigerador, refrigerado, refrigeração, etc). But I won't find the results
: for the flexed words of the synonym that i've defined (geladeira), for example
: "gelado, gelo, etc".

I'm not sure what "flexed" means ... it looks like you are referring to 
other words with a common stem.

if you use the SynonymFilter before you use your stemming filter, it 
should work fine.
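
For example, an analyzer ordered that way might look like this in 
schema.xml (the type name, synonyms file, and Portuguese stemmer are 
illustrative):

   <fieldtype name="text_pt" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
               ignoreCase="true" expand="true"/>
       <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
     </analyzer>
   </fieldtype>

With a synonyms.txt line like "geladeira, refrigerador", both words are 
injected before stemming, so each synonym gets reduced to its own stem 
family and matches its inflected forms.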



-Hoss


Re: Master Slave Replication

2008-03-26 Thread Chris Hostetter

: I want to know if we can use index replication when we have segmented indexes
: over multiple solr instances? 

I'm not sure i understand your question, but a slave just knows about its 
master and the data in it -- it doesn't care or need to know if the master 
index is really just a subset of a larger logical index distributed over 
many machines.

you just need to have a separate "pool" of slaves, slaving off of *each* 
of the "master" machines managing your whole index.


-Hoss



Re: document retrieval, nested field and HTMLStripStandardTokenizerFactory

2008-03-26 Thread Chris Hostetter

: 1. Can I limit the number of returned documents in the config file, to avoid
: a misconfiguration pulling down the server?

You can configure it with an invariant value in your requestHandler config 
... so it won't matter how many the client asks for, they'll get the 
number you pick (or fewer if there aren't that many) ... but there is no 
way to let them pick and merely cap the value.
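
A sketch of what that looks like in solrconfig.xml (the handler name and 
the cap of 50 are made up):

   <requestHandler name="/browse" class="solr.StandardRequestHandler">
     <lst name="invariants">
       <str name="rows">50</str>
     </lst>
   </requestHandler>

Because invariants override whatever the client sends, a request for 
rows=10000 against this handler still returns at most 50 documents.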

: 2. How can I retrieve the document by unique key, for result-view purposes?

make your uniqueKey field searchable.

: And how can I do the xslt transformation on it?

http://wiki.apache.org/solr/XsltResponseWriter   ?
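
Putting those two answers together, one request can fetch a document by its 
unique key and transform it (the id value and stylesheet name are 
hypothetical):

   /select?q=id:MYDOC01&wt=xslt&tr=example.xsl

wt=xslt selects the XSLT response writer and tr names a stylesheet under 
the conf/xslt directory.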

: 3. Can I use nested fields in a document like this?

nope.

: 4. Does HTMLStripStandardTokenizerFactory do the same thing as
: solr.HTMLStripWhitespaceTokenizerFactory, with only the underlying tokenizer
: being different?
: And can I use HTMLStripStandardTokenizerFactory with a TokenizerFactory which
: extends BaseTokenizerFactory?

the html stripping happens prior to true "tokenization" ... so the 
difference is one is based on the StandardTokenizer, one uses the 
WhitespaceTokenizer ... if you want to use a different tokenizer you just 
have to write a new factory for your Tokenizer that wraps the reader .. if 
you look at the source of the existing ones it's pretty straightforward.

(Hmm... maybe we should add a ReaderWrapperFactory that can be optionally 
specified using a config inside an analyzer config?)
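
For illustration, a new factory modeled on the existing HTMLStrip* ones 
might look roughly like this (a sketch; "MyTokenizer" is a placeholder for 
whatever Tokenizer you want to wrap):

   import java.io.Reader;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.solr.analysis.BaseTokenizerFactory;
   import org.apache.solr.analysis.HTMLStripReader;

   public class HTMLStripMyTokenizerFactory extends BaseTokenizerFactory {
     public TokenStream create(Reader input) {
       // strip HTML first, then hand the cleaned chars to the tokenizer
       return new MyTokenizer(new HTMLStripReader(input));
     }
   }

Package it in a jar, drop it in the plugin lib dir, and reference the class 
from the fieldtype's <tokenizer class="..."/> in schema.xml.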

: 5. If I use HTMLStripStandardTokenizerFactory, do I need to escape the html
: characters in the field element?

you mean when sending Solr your data using the XmlUpdateRequestHandler? 
... yes.  XML is the message format, your HTML is the data, the data has 
to be properly XML escaped no matter what it is.




-Hoss