Re: Optional terms in BooleanQuery

2007-05-21 Thread Soeren Pekrul

Peter Bloem wrote:
[...]

"+(A B) C D E"

[...]
In other words, Lucene considers all documents that 
have both A and B, and ranks them higher if they also have C D or E.


Hello Peter,

To my understanding, "+(A B) C D E" means that at least one of the terms 
"A" or "B" must be present, while the terms "C", "D", and "E" are optional. 
The following documents d are hits:

d(A, B)
d(A)
d(B)
d(A, C)
...
Documents containing neither "A" nor "B" are not hits.

To require both terms "A" and "B" in a document, the query should be 
"(+A +B) C D E" or "+A +B C D E".
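In code, the two variants look roughly like this (a sketch against the Lucene 2.x BooleanQuery API; the field name "text" and the class name are made up for the example):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class RequiredClauseExample {
    public static void main(String[] args) {
        // "+(A B) C": the inner (A B) group is required as a whole,
        // so a hit needs A or B; C only boosts the score.
        BooleanQuery aOrB = new BooleanQuery();
        aOrB.add(new TermQuery(new Term("text", "a")), BooleanClause.Occur.SHOULD);
        aOrB.add(new TermQuery(new Term("text", "b")), BooleanClause.Occur.SHOULD);

        BooleanQuery atLeastOne = new BooleanQuery();
        atLeastOne.add(aOrB, BooleanClause.Occur.MUST);
        atLeastOne.add(new TermQuery(new Term("text", "c")), BooleanClause.Occur.SHOULD);

        // "+A +B C": each term is individually required.
        BooleanQuery both = new BooleanQuery();
        both.add(new TermQuery(new Term("text", "a")), BooleanClause.Occur.MUST);
        both.add(new TermQuery(new Term("text", "b")), BooleanClause.Occur.MUST);
        both.add(new TermQuery(new Term("text", "c")), BooleanClause.Occur.SHOULD);

        System.out.println(atLeastOne.toString("text"));
        System.out.println(both.toString("text"));
    }
}
```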


Sören



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search result problem

2007-05-21 Thread Stefan Colella

Hello,
Thanks for your reply. I used the explain() method, and I now understand 
why some documents are returned.


I am using the same Analyzer for indexing and searching.

I tried adding only the content of the page where that expression occurs 
(instead of the whole document), and then the search works.


Do I have to split my PDF text into more fields? Or what else could be the 
problem?



Grant Ingersoll wrote:
Try using the explain() method to see why the documents that were 
returned scored the way they did.


If I am understanding correctly, you are saying that Luke shows that 
those words aren't actually in your index?  Can you elaborate on what 
your analysis process is?  Are you searching using the same Analyzer 
as you are indexing with?  I would try to isolate the problem down to 
some unit tests, if possible.


Cheers,
Grant

On May 18, 2007, at 8:12 AM, Stefan Colella wrote:


Hello,

My application works with PDF files, so I use Lucene with PDFBox 
to build a little search engine. I am new to Lucene.


All seemed to work fine, but after some tests I saw that some 
expressions like "stock option" were never found (or returned the 
wrong documents) even though they exist in my PDF files. I searched the 
mail archive and found that I have to use the "French Analyzer", but 
that didn't work either.


I found that there is a tool named Luke for inspecting a Lucene index. I 
could see that the original text contains those words, but they do not 
show up among the tokens.


Can anybody help, or explain where I should start looking?

thanks



--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at 
http://wiki.apache.org/jakarta-lucene/LuceneFAQ







Very odd behaviour of FrenchAnalyzer with strings in capital letters

2007-05-21 Thread Jolinar13

Hello,

I tried org.apache.lucene.analysis.fr.FrenchAnalyzer and I got strange
search results on strings in uppercase (example: VEHICLE).
When I search for the string in lower case, I get no results. I do get
results if I use "vehicle*", "vehiclE", "vehicLe", etc.

What is odd is that this affects only some of the strings, not all of them.
Has anyone experienced this problem?

Thanks,
Florian 
-- 
View this message in context: 
http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10715673
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Upgrade 2.0 -> 2.1

2007-05-21 Thread Svend Ole Nielsen
Hi
I have tried to upgrade from 2.0 to 2.1 to overcome some NFS issues. It
compiles just fine, but when I run the application and try to add a
document, it throws a NoSuchMethod exception. This happens when
I try to add an object of type Field to a newly created empty Document.

I have erased all dependencies in my project as well as on the server, so it
should be as clean as a whistle, but no luck. I'm running it on BEA
8.1 SP6 with the old Java 1.4.

Anyone know where to look?

Best regards,
Svend Ole


Re: Very odd behaviour of FrenchAnalyzer with strings in capital letters

2007-05-21 Thread Erick Erickson

First have you gotten a copy of Luke to examine your index to see
what's actually indexed?

The default behavior is usually to lowercase everything, but I'm not
entirely sure whether the French analyzer does this. I suspect it does.

Searches are case sensitive. To get caseless searching, you need
to put everything in the same case. This is usually done for you with
any of the standard analyzers, but check specifically.

Are you using the same analyzer at index AND search time?
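One quick way to check is to print what the analyzer actually emits; a sketch using the Lucene 2.x token API (shown here with StandardAnalyzer, but the same loop works for FrenchAnalyzer):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerCheck {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer()
                .tokenStream("f", new StringReader("VEHICLE Vehicle vehicle"));
        // StandardAnalyzer runs a LowerCaseFilter, so all three inputs
        // should come out as "vehicle". If index-time and search-time
        // output differ, searches will silently miss.
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}
```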

Best
Erick

On 5/21/07, Jolinar13 <[EMAIL PROTECTED]> wrote:



Hello,

I tried org.apache.lucene.analysis.fr.FrenchAnalyzer and I got strange
search results on strings in uppercase. (example : VEHICLE)
When I search the string (in lower case), I get no result. I get results
if
I use "vehicle*" or "vehiclE", or "vehicLe" etc.

What is odd is that it affects only some of the strings, not all of them.
Anyone who has ever experienced this problem?

Thanks,
Florian
--
View this message in context:
http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10715673
Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Re: Upgrade 2.0 -> 2.1

2007-05-21 Thread Ian Lea

Hi


I saw this, or something similar, going from 2.0 to 2.1 when I hadn't
recompiled all my Lucene-related code.  It went away when everything
was recompiled, so I'd guess you've got an old class file lurking
somewhere.


--
Ian.


On 5/21/07, Svend Ole Nielsen <[EMAIL PROTECTED]> wrote:

Hi
I have tried to upgrade from 2.0 -> 2.1 to overcome some NFS-issues. It
compiles just fine, but when I run the application and try to add a
document if throws an exception stating NoSuchMethod. This happens when
I try to add an object of type Field to a newly created empty Document.

I have erased all dependencies in my project aswell on the server. So it
should be as clean as a whistle, but no luck. I'm running it on a Bea
8.1 SP6 with the old 1.4 Java

Anyone knows where to look ??

Best regards,
Svend Ole






Re: documents with large numbers of fields

2007-05-21 Thread Steven Rowe
Mike Klaas wrote:
> On 18-May-07, at 1:01 PM, charlie w wrote:
>> Is there an upper limit on the number of fields comprising a document,
>> and if so what is it?
> 
> There is not.  They are relatively costless if omitNorms=False

Mike, I think you meant "relatively costless if omitNorms=True".

Steve




Re: Very odd behaviour of FrenchAnalyzer with strings in capital letters

2007-05-21 Thread Jolinar13

Hello,
Thank you for your quick answer.
I use Luke to examine the index, but since I switched to the FrenchAnalyzer,
it says 'Not a Lucene index'.
If I open the index files in a text viewer, the strings are in UPPER case.
I do use the same analyzer for indexing and searching.
So, do I have to tell the FrenchAnalyzer not to be case sensitive? How do I
do that?
Thanks a lot
Florian


Erick Erickson wrote:
> 
> First have you gotten a copy of Luke to examine your index to see
> what's actually indexed?
> 
> The default behavior is usually to lowercase everything, but I'm not
> entirely sure if the French analyzer does this. But I suspect so.
> 
> Searches are case sensitive. To get caseless searching, you need
> to put everything in the same case. This is usually done for you with
> any of the standard analyzers, but check specifically.
> 
> Are you using the same analyzer at index AND search time?
> 
> Best
> Erick
> 
> On 5/21/07, Jolinar13 <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hello,
>>
>> I tried org.apache.lucene.analysis.fr.FrenchAnalyzer and I got strange
>> search results on strings in uppercase. (example : VEHICLE)
>> When I search the string (in lower case), I get no result. I get results
>> if
>> I use "vehicle*" or "vehiclE", or "vehicLe" etc.
>>
>> What is odd is that it affects only some of the strings, not all of them.
>> Anyone who has ever experienced this problem?
>>
>> Thanks,
>> Florian
>> --
>> View this message in context:
>> http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10715673
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Very-odd-behaviour-of-FrenchAnalyzer-with-strings-in-capital-letters-tf3789153.html#a10719413
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Upgrade 2.0 -> 2.1

2007-05-21 Thread Svend Ole Nielsen
Hi Ian

Well, it worked. Thanks :)

I wasn't aware that that could fix it, but after your suggestion it
seemed like the most logical solution.

/Svend

man, 21 05 2007 kl. 14:30 +0100, skrev Ian Lea:

> Hi
> 
> 
> I saw this or something similar going from 2.0 to 2.1 when hadn't
> recompiled all my lucene related code.  It went away when everything
> was recompiled, so I'd guess you've got an old class file lurking
> somewhere.
> 
> 
> --
> Ian.
> 
> 
> On 5/21/07, Svend Ole Nielsen <[EMAIL PROTECTED]> wrote:
> > Hi
> > I have tried to upgrade from 2.0 -> 2.1 to overcome some NFS-issues. It
> > compiles just fine, but when I run the application and try to add a
> > document if throws an exception stating NoSuchMethod. This happens when
> > I try to add an object of type Field to a newly created empty Document.
> >
> > I have erased all dependencies in my project aswell on the server. So it
> > should be as clean as a whistle, but no luck. I'm running it on a Bea
> > 8.1 SP6 with the old 1.4 Java
> >
> > Anyone knows where to look ??
> >
> > Best regards,
> > Svend Ole
> >
> 
> 


Re: How to Update the Index once it is created

2007-05-21 Thread bhecht

If you are using Oracle and Lucene, check out 
http://www.hibernate.org/410.html "Hibernate Search"; it will
automatically update your Lucene index on any change to your database tables.

Erick Erickson wrote:
> 
> You have to delete the old document and add a new one.
> 
> See IndexModifier class.
> 
> There is no ability to modify a document in place.
> 
> Best
> Erick
> 
> On 5/14/07, Krishna Prasad Mekala <[EMAIL PROTECTED]> wrote:
>>
>> Hi All,
>>
>>
>>
>> Thanks for your response. I have one more doubt. How can I update a
>> index once created from Oracle, instead of recreating the whole.
>> Whenever there is a change in the oracle table
>> (insertion/updation/deletion of a row) my application should update the
>> index.
>>
>>
>>
>> Thanks in advance.
>>
>>
>>
>>
>>
>> Krishna Prasad M
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-Update-the-Index-once-it-is-created-tf3752208.html#a10724708
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Implement a tokenizer

2007-05-21 Thread bhecht

Hi there,

I am interested in changing the StandardTokenizer so it will not remove the
"+" (plus) sign from my stream.
Looking at the code and documentation, it reads: 

"If this tokenizer does not suit your application, please consider copying
this source code
directory to your project and maintaining your own grammar-based tokenizer."

I can't tell from this code where I should jump in and make my change.
Can someone point me to where I should look in order to perform my
change?

Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/Implement-a-tokenizer-tf3792172.html#a10724827
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





stop words, synonyms... what's in it for me?

2007-05-21 Thread bhecht

Hi there,

I started using Lucene not long ago, with plans to replace the SQL
queries in my application with it.
As I wasn't aware of Lucene before, I had already implemented some tools
(filters) similar to those Lucene includes.

For example, I implemented a "stop word" tool.
In my case I have many more configuration options than Lucene, including the
option to remove substrings in addition to complete tokens.
I can configure the desired location of the substring within the token,
or even the location of the token within the phrase.

I have implemented a synonym mechanism (a substitution mechanism) that can
also be configured according to location within a phrase. It can also be
configured to find synonyms while taking spelling mistakes into account,
although it doesn't expand but only transforms to one specific replacement.
It can find replacements for substrings as well, so I can use it to separate
words. For example, in German I have "strasse" => " strasse" (with a space in
front), so a word like "mainstrasse" will be split into "main" and "strasse".
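That substitution step might look something like the following minimal sketch (the rule table and class name are illustrative, not the actual tool):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Standardizer {
    // Illustrative rule: "strasse" -> " strasse", so a compound like
    // "mainstrasse" becomes two tokens, "main" and "strasse".
    private static final Map<String, String> RULES = new LinkedHashMap<String, String>();
    static {
        RULES.put("strasse", " strasse");
    }

    public static String standardize(String text) {
        String result = text;
        for (Map.Entry<String, String> e : RULES.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        // collapse any doubled whitespace introduced by the rules
        return result.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(standardize("schöne mainstrasse")); // schöne main strasse
    }
}
```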

I am wondering whether I can run my "standardization" tools before calling
the Lucene indexing, without implementing any custom analyzers, and achieve
more or less the same results.

What do I "lose" if I go this way? The stemming filters are really the one
thing I didn't have and will use.
Is there any point in creating custom analyzers with filters for stop words
and synonyms, and in implementing my own "substring" filter for separating
tokens into "sub words" (like "mainstrasse" => "main", "strasse")?

Thanks in advance

-- 
View this message in context: 
http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10725950
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread Daniel Naber
On Monday 21 May 2007 22:05, bhecht wrote:

> Is there any point for me to start creating custom analyzers with filter
> for stop words, synonyms, and implementing my own "sub string" filter,
> for separating tokens into "sub words" (like "mainstrasse"=> "main",
> "strasse")

Yes: I assume your document should be found both with "strasse" and with 
"mainstrasse". You will then need to put main, strasse, and mainstrasse at 
the same position (setPositionIncrement(0)). If you don't do that, phrase 
queries will no longer work as expected. Thus you need an analyzer; 
modifying the string before it is put into Lucene is not enough.
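In analyzer terms that could be a TokenFilter along these lines (a hypothetical sketch against the Lucene 2.x token API, with the splitting rule hard-coded for illustration):

```java
import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// For a token like "mainstrasse", also emit "main" and "strasse" at the
// SAME position (positionIncrement == 0), so phrase queries keep working.
public class SubwordFilter extends TokenFilter {
    private final LinkedList pending = new LinkedList();

    public SubwordFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        if (!pending.isEmpty()) return (Token) pending.removeFirst();
        Token t = input.next();
        if (t == null) return null;
        String text = t.termText();
        int i = text.indexOf("strasse");
        if (i > 0) {  // e.g. "mainstrasse": queue "main" and "strasse"
            Token sub1 = new Token(text.substring(0, i), t.startOffset(), t.startOffset() + i);
            sub1.setPositionIncrement(0);
            Token sub2 = new Token("strasse", t.startOffset() + i, t.endOffset());
            sub2.setPositionIncrement(0);
            pending.add(sub1);
            pending.add(sub2);
        }
        return t;  // the original compound stays searchable too
    }
}
```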

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread bhecht

Thanks Daniel,

But when searching, I will run my "standardization" tools again before
querying Lucene, so what you mentioned will not be a problem.
If someone searches for "mainstrasse", my tools will split it again into
"main" and "strasse", and then Lucene will be able to find it.


Daniel Naber-5 wrote:
> 
> On Monday 21 May 2007 22:05, bhecht wrote:
> 
>> Is there any point for me to start creating custom analyzers with filter
>> for stop words, synonyms, and implementing my own "sub string" filter,
>> for separating tokens into "sub words" (like "mainstrasse"=> "main",
>> "strasse")
> 
> Yes: I assume your document should be found both with "strasse" and with 
> "mainstrasse". You will then need to put main, strasse, and mainstrasse at 
> the same position (setPositionIncrement(0)). If you don't do that, phrase 
> queries will not work anymore as expected. Thus you need an analyzer, 
> modifying the string before they are put in Lucene is not enough.
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10726812
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread Daniel Naber
On Monday 21 May 2007 22:53, bhecht wrote:

> If someone searches for mainstrasse, my tools will split it again to
> main and strasse, and then lucene will be able to find it.

"strasse" will match "mainstrasse", but the phrase query "schöne strasse" 
will not match "schöne mainstrasse". However, this could be considered a 
feature. Anyway, it will be difficult to use features that rely on the 
term list, e.g. the spellchecker: it will not be able to suggest 
"mainstrasse", as that term is not in the index.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: How to Update the Index once it is created

2007-05-21 Thread Chris Lu

Does it require you to pass data through Hibernate? This seems very
similar to Compass' approach.

I believe a more generic approach is to compare what's already indexed
with what has changed or been deleted, so you can use any framework with
Lucene. Simply selecting the data and creating the index avoids
framework-specific limitations and is easier to scale. Re-indexing will
also be easier.

DBSight tracks changes through simple SQL, handles hard-deleted or
soft-deleted content, and does so very efficiently even for large
numbers of documents.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote:


If you are using Oracle and Lucene, check out
http://www.hibernate.org/410.html "Hibernate Search"; it will
automatically update your Lucene index on any change to your database tables

Erick Erickson wrote:
>
> You have to delete the old document and add it a new one.
>
> See IndexModifier class.
>
> There is no ability to modify a document in place.
>
> Best
> Erick
>
> On 5/14/07, Krishna Prasad Mekala <[EMAIL PROTECTED]> wrote:
>>
>> Hi All,
>>
>>
>>
>> Thanks for your response. I have one more doubt. How can I update a
>> index once created from Oracle, instead of recreating the whole.
>> Whenever there is a change in the oracle table
>> (insertion/updation/deletion of a row) my application should update the
>> index.
>>
>>
>>
>> Thanks in advance.
>>
>>
>>
>>
>>
>> Krishna Prasad M
>>
>>
>>
>>
>
>

--
View this message in context: 
http://www.nabble.com/How-to-Update-the-Index-once-it-is-created-tf3752208.html#a10724708
Sent from the Lucene - Java Users mailing list archive at Nabble.com.









RE: Implement a tokenizer

2007-05-21 Thread Mordo, Aviran (EXP N-NANNATEK)
What you need to do is create your own tokenizer. Just copy the code
from StandardTokenizer into your XYZTokenizer and make your changes.
Then create your own Analyzer class (again, copy the code
from StandardAnalyzer) and use your XYZTokenizer in the new
XYZAnalyzer you created.
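The XYZAnalyzer shell then looks roughly like StandardAnalyzer's; a sketch, with StandardTokenizer standing in for the modified XYZTokenizer you would copy into place:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class XYZAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // swap in your XYZTokenizer here once you have copied the grammar
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS);
        return result;
    }
}
```

Remember to use the same analyzer at both index and search time.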

HTH

Aviran
http://www.aviransplace.com
http://shaveh.co.il 

-Original Message-
From: bhecht [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 21, 2007 2:59 PM
To: java-user@lucene.apache.org
Subject: Implement a tokenizer


Hi there,

I was interested in changing the StandardTokenzier so it will not remove
the "+" (plus) sign from my stream.
Looking in the code and documentation, it reads: 

"If this tokenizer does not suit your application, please consider
copying this source code directory to your project and maintaining your
own grammar-based tokenizer."

I can't understand from this code where I should jump in, and do my
change.
Can someone point me out to where I should look at in order perform my
change?

Thanks in advanced
--
View this message in context:
http://www.nabble.com/Implement-a-tokenizer-tf3792172.html#a10724827
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Implement a tokenizer

2007-05-21 Thread Chris Lu

Actually, before you jump in, be warned that the "+" (plus) sign is also
part of the query parser syntax.
You cannot easily pass a query containing a "+" sign through the
query parser in order to get a match.

--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote:


Hi there,

I was interested in changing the StandardTokenzier so it will not remove the
"+" (plus) sign from my stream.
Looking in the code and documentation, it reads:

"If this tokenizer does not suit your application, please consider copying
this source code
directory to your project and maintaining your own grammar-based tokenizer."

I can't understand from this code where I should jump in, and do my change.
Can someone point me out to where I should look at in order perform my
change?

Thanks in advanced
--
View this message in context: 
http://www.nabble.com/Implement-a-tokenizer-tf3792172.html#a10724827
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread bhecht

I will never have "mainstrasse" in my Lucene index, since "strasse" is always
replaced with " strasse", causing "mainstrasse" to be split into "main
strasse".
So in the example you gave:
"schöne strasse" will match "schöne mainstrasse", since in the Lucene index
I have "schöne main strasse".


Daniel Naber-5 wrote:
> 
> On Monday 21 May 2007 22:53, bhecht wrote:
> 
>> If someone searches for mainstrasse, my tools will split it again to
>> main and strasse, and then lucene will be able to find it.
> 
> "strasse" will match "mainstrasse" but the phrase query "schöne strasse" 
> will not match "schöne mainstrasse". However, this could be considered a 
> feature. Aynway, it will be difficult to use features that rely on the 
> term list, e.g. the spellchecker. It will not be able to suggest 
> "mainstrasse", as that's not in the index.
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10727213
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread Erick Erickson

No, a phrase search will NOT match. Phrase semantics
require that the query's tokens be adjacent (slop of 0). So, since
"mainstrasse" was split into two tokens at index time, the test for
"is schöne right next to strasse" will fail because of the intervening
(introduced) term "main". Whether this is desired behavior or not is
another question.

You're right that a non-phrase search *will* work,
though.

Best
Erick

On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote:



I will never have "mainstrasse" in my lucene index, since strasse is
always
replaced with " strasse" causing "mainstrasse" to be split to "main
strasse".
So the example you gave:
"schöne strasse" will match "schöne mainstrasse", since in the lucene
index
I have "schöne main strasse".


Daniel Naber-5 wrote:
>
> On Monday 21 May 2007 22:53, bhecht wrote:
>
>> If someone searches for mainstrasse, my tools will split it again to
>> main and strasse, and then lucene will be able to find it.
>
> "strasse" will match "mainstrasse" but the phrase query "schöne strasse"
> will not match "schöne mainstrasse". However, this could be considered a
> feature. Aynway, it will be difficult to use features that rely on the
> term list, e.g. the spellchecker. It will not be able to suggest
> "mainstrasse", as that's not in the index.
>
> Regards
>  Daniel
>
> --
> http://www.danielnaber.de
>
>
>
>

--
View this message in context:
http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10727213
Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Re: search result problem

2007-05-21 Thread Doron Cohen
Stefan Colella wrote:

> I tried to only add the content of the page where that expression can be
> found (instead of the whole document) and then  the search works.
>
> Do i have to split my pdf text into more field? Or what could be the
> problem?

Perhaps IndexWriter's setMaxFieldLength() is relevant here.
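By default IndexWriter indexes only the first 10,000 terms of each field and silently drops the rest, which would explain text deep inside a large PDF being unsearchable. A sketch of raising the limit (the helper class is illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class RaiseFieldLimit {
    public static IndexWriter open(Directory dir) throws Exception {
        // create == true starts a fresh index in dir
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        // default is IndexWriter.DEFAULT_MAX_FIELD_LENGTH (10,000 terms per field)
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        return writer;
    }
}
```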





getTermFreqVector atomicity

2007-05-21 Thread Walter Ferrara
I'm interested in getting the term vector of a Lucene doc. The point is,
it seems I have to give IndexReader.getTermFreqVector() a doc ID, while
I'd like to know whether there is a way to get the term vector by a
document identifier of my own (a field of mine, not the Lucene doc ID).
I know how to get the Lucene doc ID for the doc I'm interested in, but
my concern is the non-atomicity of getting an ID and passing it to
another function. This is because I reload the index from time to time,
and I'm worried about a loss of consistency if the new IndexReader
remaps doc IDs (after deletions, for example) and I end up accessing a
different doc, just because between "get the ID" and "get the term
vector for that ID" the reader could have been reloaded (and doc IDs
changed).

Best,
Walter






In memory MultiSearcher

2007-05-21 Thread Peter W.

Hello,

I have been using a large, in-memory MultiSearcher that
is reaching the limits of my hardware's RAM with this code:

try {
    IndexSearcher[] searcher_a = {
        new IndexSearcher(new RAMDirectory(index_one_path)),
        new IndexSearcher(new RAMDirectory(index_two_path)),
        new IndexSearcher(new RAMDirectory(index_three_path)),
        new IndexSearcher(new RAMDirectory(index_four_path)),
        new IndexSearcher(new RAMDirectory(index_n_path))
    };

    MultiSearcher searcher_ms = new MultiSearcher(searcher_a);
    ...
} catch (Exception e) {
    System.out.println(e);
}

For example, one of several indexes is 768MB. Is there possibly a  
better way to do this?


Regards,

Peter W.






Re: getTermFreqVector atomicity

2007-05-21 Thread Erick Erickson

An IndexReader doesn't see changes in the index unless you close
and reopen it, but if there is significant time between when you
fetch your doc ID and when you read its vector, that could be a problem.

You can always use TermEnum/TermDocs to find the doc ID
associated with a particular field value you have, but I suspect that
suffers from the same problem. In fact, *anything* you do
between fetching the doc ID and getting its term vector
has this problem, and there's no way I know of to get
term vectors by your own ID.

What might work is a "sanity check" sort of algorithm. That is,
fetch the doc ID, then fetch its term vector, then look at your
custom field for that doc ID and see whether it matches the
original. If not, do it all over again.

But that all seems too complicated to me. Why not just ensure
that you use the *same* IndexReader both when you get the
original doc ID and when you get its term vector? Even a temporary
reference should hold things open long enough to ensure that
atomicity.
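That last suggestion might look like this sketch, where one reader reference serves both the ID lookup and the vector fetch (field names are made up for the example):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermFreqVector;

public class AtomicVectorLookup {
    // Both steps use the SAME reader, so the doc ID and the term vector
    // come from the same point-in-time view of the index.
    public static TermFreqVector vectorForId(IndexReader reader, String myId) throws Exception {
        TermDocs td = reader.termDocs(new Term("myIdField", myId));
        try {
            if (!td.next()) return null;           // no doc with that identifier
            int docId = td.doc();                  // Lucene doc ID, valid for THIS reader
            return reader.getTermFreqVector(docId, "contents");
        } finally {
            td.close();
        }
    }
}
```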

Best
Erick

On 5/21/07, Walter Ferrara <[EMAIL PROTECTED]> wrote:


I'm interested in getting the term vector of a lucene doc. The point is,
it seems I have to give to the IndexReader.getTermFreqVector a doc ID,
while I would know if there is a way to get the termvector by a doc
identifier (not lucene doc id, but a my own field). I know how to get
the lucene docid for the doc I'm interested, but my concern is about the
non-atomicity of getting a id and pass it to another function.
This because I reload index time by time, and I'm worried about a loss
of consistency if the new indexreader remap docids (after deletion for
example), and I end up accessing a different doc, just because between
"get the id" and "get the termvector for that id", the reader could have
been reloaded (and doc-ids changed).

Best,
Walter







Re: In memory MultiSearcher

2007-05-21 Thread Erick Erickson

Why are you doing this in the first place? Do you actually have
evidence that the default Lucene behavior (caching, etc.) is inadequate
for your needs?

I'd *strongly* recommend, if you haven't tried it, just using regular
FSDirectories rather than RAMDirectories, and only getting
complex if that's too slow...

I ask because I am searching FS-based indexes that are 4GB with
no problem. The index *was* 8GB and had only a 10% performance hit.
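For comparison, the disk-backed version of the same setup might look like this sketch; IndexSearcher(String) opens an FSDirectory under the hood, and the OS file cache keeps the hot parts of the index in memory without consuming Java heap:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;

public class DiskBackedSearch {
    public static MultiSearcher open(String[] indexPaths) throws Exception {
        IndexSearcher[] searchers = new IndexSearcher[indexPaths.length];
        for (int i = 0; i < indexPaths.length; i++) {
            // opens an FSDirectory on the path; nothing is preloaded into heap
            searchers[i] = new IndexSearcher(indexPaths[i]);
        }
        return new MultiSearcher(searchers);
    }
}
```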

Best
Erick

On 5/21/07, Peter W. <[EMAIL PROTECTED]> wrote:


Hello,

I have been using a large, in memory MultiSearcher that
is reaching the limits of my hardware RAM with this code:

   try
  {
  IndexSearcher[] searcher_a=
 {
 new IndexSearcher(new RAMDirectory(index_one_path)),
 new IndexSearcher(new RAMDirectory(index_two_path)),
 new IndexSearcher(new RAMDirectory(index_thee_path)),
 new IndexSearcher(new RAMDirectory(index_four_path)),
 new IndexSearcher(new RAMDirectory(index_n_path))
 };

 MultiSearcher searcher_ms=new MultiSearcher(searcher_a);
...
  }
   catch(Exception e)
  {
  System.out.println(e);
  }

For example, one of several indexes is 768MB. Is there possibly a
better way to do this?

Regards,

Peter W.







Re: stop words, synonyms... what's in it for me?

2007-05-21 Thread bhecht

Thanks Erick, that's what I thought.
In my case no phrase queries are done, so it seems I am good to go.
Any additional thoughts on the issue are welcome.
Thanks



Erick Erickson wrote:
> 
> No, a phrase search it will NOT match. Phrase semantics
> requires that split tokens be adjacent (slop of 0). So, since
> "mainstrasse" was split into two tokens at index time, the test for
> "is schöne right next to strasse" will fail because of the intervening
> (introduced) term "main". Whether this is desired behavior or not is
> another question.
> 
> You're right that asking for a non-phrase search *will* work
> though.
> 
> Best
> Erick
> 
> On 5/21/07, bhecht <[EMAIL PROTECTED]> wrote:
>>
>>
>> I will never have "mainstrasse" in my lucene index, since strasse is
>> always
>> replaced with " strasse" causing "mainstrasse" to be split to "main
>> strasse".
>> So the example you gave:
>> "schöne strasse" will match "schöne mainstrasse", since in the lucene
>> index
>> I have "schöne main strasse".
>>
>>
>> Daniel Naber-5 wrote:
>> >
>> > On Monday 21 May 2007 22:53, bhecht wrote:
>> >
>> >> If someone searches for mainstrasse, my tools will split it again to
>> >> main and strasse, and then lucene will be able to find it.
>> >
>> > "strasse" will match "mainstrasse" but the phrase query "schöne
>> strasse"
>> > will not match "schöne mainstrasse". However, this could be considered
>> a
>> > feature. Aynway, it will be difficult to use features that rely on the
>> > term list, e.g. the spellchecker. It will not be able to suggest
>> > "mainstrasse", as that's not in the index.
>> >
>> > Regards
>> >  Daniel
>> >
>> > --
>> > http://www.danielnaber.de
>> >
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10727213
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/stop-words%2C-synonyms...-what%27s-in-it-for-me--tf3792510.html#a10731811
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

