Re: How to handle frequent updates.

2008-07-15 Thread Karl Wettin
On 13 Jul 2008, at 16:58, miztaken wrote: What sort of operations do you use the matrix for? How large can it grow? Can you give an example of what the matrix might contain? What was the reason to solve your problem using Lucene? Is there some specific feature that made something easier or faster

Re: matching sub phrases in user entered query...

2008-07-15 Thread Karl Wettin
Couldn't you create multiple "shingle phrase queries" from the user query and add them all to a BooleanQuery? "example input query"^10 OR "example input"^5 OR "input query"^5 SpanNear and PhraseQueries are rather expensive though. Not too long ago I replaced phrase queries with a shingles in
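
A minimal sketch of the shingle idea above, in plain Java with no Lucene on the classpath. It splits the user query on whitespace and emits the quoted shingles as one OR-ed query string. The class name ShingleQueryStrings, its method names, and the boost rule 5 * (size - 1) are assumptions made for illustration; the rule was picked only because it reproduces the ^10 and ^5 boosts in the example query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper (hypothetical name, not part of Lucene).
final class ShingleQueryStrings {

    // Returns every contiguous run of `size` terms, joined by single spaces.
    static List<String> shingles(String[] terms, int size) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= terms.length; start++) {
            out.add(String.join(" ", Arrays.copyOfRange(terms, start, start + size)));
        }
        return out;
    }

    // Builds "full phrase"^b OR "shorter shingle"^b ... down to minSize terms.
    static String build(String userQuery, int minSize) {
        String[] terms = userQuery.trim().split("\\s+");
        int smallest = Math.max(1, minSize); // never emit empty shingles
        List<String> clauses = new ArrayList<>();
        for (int size = terms.length; size >= smallest; size--) {
            int boost = 5 * (size - 1); // assumed boost rule, see lead-in
            for (String shingle : shingles(terms, size)) {
                clauses.add("\"" + shingle + "\"^" + boost);
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

Calling ShingleQueryStrings.build("example input query", 2) gives the query string quoted in the message. Turning that string into actual query objects (via a query parser, or by building one phrase query per shingle) is deliberately left out of the sketch.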

Improving GeoSort from Lucene in Action Book

2008-07-15 Thread Sascha Fahl
Hi, I read the chapter about custom sort methods and hacked around with the GeoSort. Are there ways to improve the algorithm? In particular, calculating the distance for ALL documents in the index is a bad idea, because only the distance for the hit documents is of interest. That could save lo
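
One way to picture the improvement being asked about is the sketch below, in plain Java without Lucene. It computes a distance only for the document ids that were actually hit and sorts just those. The class name HitDistanceSort, the per-document coordinate arrays, and the use of squared planar distance are all assumptions for illustration, not the book's code.

```java
import java.util.Arrays;

// Illustrative helper (hypothetical name, not the GeoSort code from the book).
final class HitDistanceSort {

    // Returns the hit doc ids ordered by distance from (originX, originY).
    // xByDoc[d] and yByDoc[d] hold the coordinates of document d.
    static int[] sortHitsByDistance(int[] hitDocIds, double[] xByDoc, double[] yByDoc,
                                    double originX, double originY) {
        final double[] dist = new double[hitDocIds.length];
        Integer[] order = new Integer[hitDocIds.length];
        for (int i = 0; i < hitDocIds.length; i++) {
            int doc = hitDocIds[i];
            double dx = xByDoc[doc] - originX;
            double dy = yByDoc[doc] - originY;
            // Squared distance preserves the ordering and avoids a sqrt per hit.
            dist[i] = dx * dx + dy * dy;
            order[i] = i;
        }
        // Sort positions in the hit list by their precomputed distance.
        Arrays.sort(order, (Integer a, Integer b) -> Double.compare(dist[a], dist[b]));
        int[] sorted = new int[hitDocIds.length];
        for (int i = 0; i < order.length; i++) {
            sorted[i] = hitDocIds[order[i]];
        }
        return sorted;
    }
}
```

The work here is proportional to the number of hits rather than to the size of the index, which is the saving the question is after. How the hit ids and coordinates are obtained from the index is not shown.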

Re: Cosine Similarity between two documents, using different zone weights

2008-07-15 Thread Karl Wettin
I'm not sure what it is you say you want to do. If what you want to do is to measure distance between two documents then the easiest way is to extract the feature vectors (document TermFreqVector) from those two documents and measure the distance using something like the Tanimoto coefficient
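
As a rough illustration of the Tanimoto coefficient mentioned above, the sketch below works on plain term-to-frequency maps, which stand in for the term vectors of the two documents. For frequency vectors a and b it computes a·b / (|a|² + |b|² − a·b). The class name TanimotoSimilarity and the choice to return 0 for two empty vectors are assumptions; extracting the vectors from an index is not shown.

```java
import java.util.Map;

// Illustrative helper (hypothetical name, not part of Lucene).
final class TanimotoSimilarity {

    // Tanimoto coefficient of two term-frequency vectors: 1.0 for identical
    // vectors, 0.0 when they share no terms.
    static double coefficient(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            double freqA = entry.getValue();
            normA += freqA * freqA;
            Integer freqB = b.get(entry.getKey());
            if (freqB != null) {
                dot += freqA * freqB; // only shared terms contribute
            }
        }
        for (int freqB : b.values()) {
            normB += (double) freqB * freqB;
        }
        double denominator = normA + normB - dot;
        // Convention chosen for this sketch: two empty vectors score 0.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }
}
```

A distance can then be taken as 1 minus the coefficient. For example, the vectors {lucene: 2, index: 1} and {lucene: 1, search: 1} have dot product 2 and squared norms 5 and 2, so the coefficient is 2 / (5 + 2 − 2) = 0.4.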

Re: Improving GeoSort from Lucene in Action Book

2008-07-15 Thread Karl Wettin
On 15 Jul 2008, at 09:50, Sascha Fahl wrote: I read the chapter about custom sort methods and hacked around with the GeoSort. Are there ways to improve the algorithm? In particular, calculating the distance for ALL documents in the index is a bad idea, because only the distance for the hit documents

Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Hi, I have a number of fields that are used to filter documents from a search. They should not contribute to the score of the document but merely decide which documents are valid. i.e. it doesn't matter how rare they are in the index. I also have a single "combined" field that is used for free

Re: Improving GeoSort from Lucene in Action Book

2008-07-15 Thread Sascha Fahl
There is a big difference between GeoSearch and GeoSort. GeoSearch means you are looking for data within a certain range. To implement this, index structures like R-Trees help, because they make it a lot easier to think in "boxes". GeoSort is just to sort the data in relation to a given poin

Re: matching sub phrases in user entered query...

2008-07-15 Thread Preetam Rao
That is very good performance. But if I take, on average, 6 terms per user query and look at shingles of size 2, I will have a boolean OR of 5 shingle phrase queries. How much better is this compared to a single sub phrase query, which would internally be just like another phrase query with som

MoreLikeThis from a field with a specific value

2008-07-15 Thread martinoleary
Hi there... I'm trying to get MoreLikeThis documents from my Lucene index given a sentence (just one line of text, let's say), but I also want to get the returned results only where a field has a specific value. So, for example, if I have my index and it contains a categoryId and content... I

Re: Indexing questions

2008-07-15 Thread Michael McCandless
Anshum wrote: But the downside to this would be, in case your daemon crashes in the meantime or you need to restart the daemon, the index would not be usable until you have completed your indexing process. This isn't quite true. If you open IndexWriter with autoCommit=false, then none

Fwd: Returned mail: see transcript for details

2008-07-15 Thread Preetam Rao
Hi, Every time I send a mail to this list, I get the error below. Any idea where the problem is? It also appears that my mails are actually reaching the list. Any help in rectifying this is appreciated. Thanks Preetam -- Forwarded message -- From: Mail Delivery Subsystem <[EMAIL

Print the text files before indexing them in lucene

2008-07-15 Thread starz10de
Hi All, It might be an easy question, but for a newcomer to Lucene like me it is not that easy. I want to print the text files before indexing them in Lucene. I did try to do it, but I could only print the index content, where we see the keywords, document number, and frequency. I need beside that to pr

Re: Indexing questions

2008-07-15 Thread spring
> How about just copying and performing your indexing (or index write > related) > operations on the copy and then performing a rename operation followed by > reopening of the index readers. This is how we did it until now. But the indexes become bigger and bigger (50 GB and more) and so we are

Re: Indexing questions

2008-07-15 Thread spring
> This isn't quite true. If you open IndexWriter with autoCommit=false, > then none of the changes you do with it will be visible to an > IndexReader, even one reopened while IndexWriter is doing its work, > until you close the IndexWriter. Where are the docs for this transaction buffered?

Re: Improving GeoSort from Lucene in Action Book

2008-07-15 Thread Karl Wettin
On 15 Jul 2008, at 10:20, Sascha Fahl wrote: There is a big difference between GeoSearch and GeoSort. GeoSearch means you are looking for data within a certain range. To implement this, index structures like R-Trees help, because they make it a lot easier to think in "boxes". GeoSort is just to

Re: Returned mail: see transcript for details

2008-07-15 Thread Karl Wettin
The list subscriber [EMAIL PROTECTED] is not a known email address and the MX server (spsoftindia.com) sends the bounce back to you. I'm not sure if this is because some header is missing or if spsoftindia.com does not follow protocol. My guess is the latter. A list moderator should remove

Re: Indexing questions

2008-07-15 Thread Michael McCandless
[EMAIL PROTECTED] wrote: This isn't quite true. If you open IndexWriter with autoCommit=false, then none of the changes you do with it will be visible to an IndexReader, even one reopened while IndexWriter is doing its work, until you close the IndexWriter. Where are the docs for this tran

RE: matching sub phrases in user entered query...

2008-07-15 Thread Preetham B.R
Hi Steve, It would be simpler if I have a query called SubPhraseQuery in which case I do not have to either generate extra terms during ingestion or generate extra queries during querying. As a user, the best I would hope for is, to ingest the data from some feed into different fields, run the use

Re: newbie question (for John Griffin) - fixed

2008-07-15 Thread Chris Bamford
Hi John Thanks for your continued interest in my travails! ==I'm not sure I understand. You want a phrase query so they should be ==passed as a phrase in quotes. Ok... well I must be missing something then :-( This fails to return any hits for me: PhraseQuery pq = new PhraseQuery();

ANN: A Lucene-OJVM native REST WS example

2008-07-15 Thread Marcelo Ochoa
Hi all: For people who are using Lucene Oracle integration project: http://marceloochoa.blogspot.com/2008/07/lucene-ojvm-native-rest-ws.html Best regards, Marcelo. -- Marcelo F. Ochoa http://marceloochoa.blogspot.com/ http://marcelo.ochoa.googlepages.com/home __ Do you Know DBPris

Re: Mixing non scored an scored queries

2008-07-15 Thread Erick Erickson
One way would be to create Filters and add them in with ConstantScoreRangeQuery Best Erick On Tue, Jul 15, 2008 at 4:07 AM, John Patterson <[EMAIL PROTECTED]> wrote: > > Hi, > > I have a number of fields that are used to filter documents from a search. > They should not contribute to the sco

Re: Print the text files before indexing them in lucene

2008-07-15 Thread Erick Erickson
I guess I don't understand this. Somewhere, you have to be opening the text file to feed its contents to Lucene. Why can't you just print things then? If you're using the demo, you need to look into the code and you'll see something like this. Best Erick On Tue, Jul 15, 2008 at 4:55 AM, starz10d

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Erick Erickson wrote: > > One way would be to create Filters and add them in with > ConstantScoreRangeQuery > Would that mean running the query twice? i.e. once to create the filter and once to rank the results? -- View this message in context: http://www.nabble.com/Mixing-non-scored-a

Re: Mixing non scored an scored queries

2008-07-15 Thread Erick Erickson
No, you create the filter via TermDocs/TermEnum. You can also cache them. Creating filters is *much* faster than you think. Alternatively, you could boost everything *else* by some large factor and then the unimportant fields would add relatively little to the final score. Best Erick On Tue, Ju

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Erick Erickson wrote: > > One way would be to create Filters and add them in with > I could possibly wrap the standard BooleanQuery in an adapter which also wraps its Weight and Scorer to return a constant value. But that seems like a hell of a lot of internal jiggery pokery for something th

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Erick Erickson wrote: > > No, you create the filter via TermDocs/TermEnum. You can also cache > them. Creating filters is *much* faster than you think. > But I can have many terms in the query. With over 10 million documents and many concurrent searches, creating a filter for every search w

Re: Mixing non scored an scored queries

2008-07-15 Thread Karl Wettin
On 15 Jul 2008, at 10:07, John Patterson wrote: I have a number of fields that are used to filter documents from a search. They should not contribute to the score of the document but merely decide which documents are valid. i.e. it doesn't matter how rare they are in the index. I also have a

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Karl Wettin wrote: > > I think all you need to do is to create a custom query (sounds like > you want a clone of TermQuery) that uses a Scorer that always returns 1f. > That sounds exactly like what is required. I imagine that would be quite useful to have in the core project? -- View this

Re: Mixing non scored an scored queries

2008-07-15 Thread eks dev
Do not forget that a Filter does not have to be loaded in memory, not any more since the LUCENE-584 commit! Now a skipping iterator is all you need. Translated, you could use: ConstantScoreQuery created with Filter made from TermDocs (you need to implement only DocIdSet / DocIdSetIterator, thi

Re: Stable score scaling; LSI again

2008-07-15 Thread Asad Sayeed
In other words, for my first question, what I want to know is how I might consistently and correctly get the same max score for any two pairs of identical documents without having to rewrite major parts of lucene. I could find ALL the scores and divide them by the max, but that seems somehow wron

Re: Returned mail: see transcript for details

2008-07-15 Thread Erik Hatcher
I've finally successfully removed the offending address from the list. I had tried earlier, but somehow it failed to take, but this time I think it has worked. Let me know off the list if you continue to get this bounce (something I've never seen personally, for the record). E

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
eks dev wrote: > > Do not forget that a Filter does not have to be loaded in memory, not any > more since the LUCENE-584 commit! Now a skipping iterator is all you > need. > > > Translated, you could use: > ConstantScoreQuery created with Filter made from TermDocs (you need to > implement on

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
John Patterson wrote: > > > I don't think filters are the way to go here because I need to use boolean > style logic e.g. > > Search for free text "open fire" restricted to "London" OR "Brighton" in > category "Pubs and bars" OR "Restaurants" > > which means I need to construct and run a Boo

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Karl Wettin wrote: > > > I think all you need to do is to create a custom query (sounds like > you want a clone of TermQuery) that uses a Scorer that always returns 1f. > > Actually, I just thought that it would probably be better to create an adapter Query that always returns a constant s

Re: Mixing non scored an scored queries

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 10:24 AM, Karl Wettin <[EMAIL PROTECTED]> wrote: >> I have a number of fields that are used to filter documents from a search. >> They should not contribute to the score of the document but merely decide >> which documents are valid. i.e. it doesn't matter how rare they are

Re: Mixing non scored an scored queries

2008-07-15 Thread Karl Wettin
On 15 Jul 2008, at 18:11, John Patterson wrote: So it seems that creating a constant scoring TermQuery is the best suggestion so far. Would be really great if I could call BooleanQuery.setConstantScore(1.0f) or something. You might be looking at implementing something like public class N

Re: Returned mail: see transcript for details

2008-07-15 Thread Erick Erickson
Erik: I'm having the same problem, I expect this e-mail to get a bounce-back. Could I ask you to take a glance at it? It's no big deal, I just have to delete the bounce-back. Thanks [EMAIL PROTECTED] On Tue, Jul 15, 2008 at 11:57 AM, Erik Hatcher <[EMAIL PROTECTED]> wrote: > I've finally succe

RE: newbie question (for John Griffin) - fixed

2008-07-15 Thread Steven A Rowe
Hi Chris, The PhraseQuery class does no parsing; tokenization is expected to happen before you feed anything to it. So unless you have an index-time analyzer that outputs terms that look like "aaa ddd" -- that is, terms with embedded spaces -- then attempting to use PhraseQuery or any other qu

Re: newbie question (for John Griffin) - fixed

2008-07-15 Thread Chris Bamford
Thanks Steve. Steven A Rowe wrote: Hi Chris, The PhraseQuery class does no parsing; tokenization is expected to happen before you feed anything to it. So unless you have an index-time analyzer that outputs terms that look like "aaa ddd" -- that is, terms with embedded spaces -- then attempti

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Karl Wettin wrote: > > > Feel free to post it as an issue in the Jira when it's implemented. > > Thanks a lot! Will do John -- View this message in context: http://www.nabble.com/Mixing-non-scored-an-scored-queries-tp18460018p18470916.html Sent from the Lucene - Java Users mailing list a

RE: Returned mail: see transcript for details

2008-07-15 Thread Steven A Rowe
Hi Erik, I'm seeing the same problem - here's an excerpt from the headers of a bounce I just got (note the address "[EMAIL PROTECTED]" in the last couple of "Received:" headers): Received: from spwiki.spsoftware.com (static61.17.14-87.vsnl.eth.net [61.17.14.87] (may be forged)) for <[E

Re: Boolean expression for no terms OR matching a wildcard

2008-07-15 Thread Chris Hostetter
Assuming i understand your question: the fact that your first clause is a wildcard query is irrelevant, to generalize your request you want a way to query for all docs which either match some sub query, or have no terms in the field at all. to find all docs with no terms for a given field, you

Re: Mixing non scored an scored queries

2008-07-15 Thread Karl Wettin
On 15 Jul 2008, at 18:44, Yonik Seeley wrote: On Tue, Jul 15, 2008 at 10:24 AM, Karl Wettin <[EMAIL PROTECTED]> wrote: I have a number of fields that are used to filter documents from a search. They should not contribute to the score of the document but merely decide which documents are valid.

Re: Boolean expression for no terms OR matching a wildcard

2008-07-15 Thread Ronald Rudy
Thanks Chris (or if you prefer, Hoss) - I will definitely try that for matching no docs, but one of the problems I'm having is that I'm indexing multiple terms for one field and I need ALL the terms to match it. Maybe this is easier ... suppose what I'm indexing is a phone number, and the

Lucene & XFile interface

2008-07-15 Thread Jamie
Hi there I am trying to use YANFS (see https://yanfs.dev.java.net/) to allow administrators to configure Lucene index that is accessible via NFS on a remote drive. Is there a way to easily modify lucene such that when it reads / writes from the Index it uses the XFile object instead of File?

Re: MoreLikeThis from a field with a specific value

2008-07-15 Thread Daniel Noll
martinoleary wrote: Hi there... I'm trying to get MoreLikeThis documents from my Lucene index given a sentence (just one line of text, let's say), but I also want to get the returned results only where a field has a specific value. So, for example, if I have my index and it contains a categ

Re: Mixing non scored an scored queries

2008-07-15 Thread John Patterson
Karl Wettin wrote: > > >> Or just set the boost to zero on the individual filter fields, or on >> the whole filter expression. >> >> +(my query) +(filter1 OR filter2 AND filter3)^0 > > That sounds perfect! I thought that boosts would be multiplied together to give 0 for the whole expressio