Re: Phrase indexing and searching

2013-12-23 Thread Steve Rowe
Hi Manjula,

Sounds like ShingleFilter will do what you want: 
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html


Steve
www.lucidworks.com
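
For readers who want to see what two-word shingles look like before wiring up Lucene, here is a minimal plain-Java sketch. It does not call ShingleFilter; the class name, the method name and the tiny stop-word list are assumptions made up for this illustration. The real filter has more options (for example, it can also emit the single words) and may treat removed stop words differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TwoWordShingles {
    // Hypothetical stop-word list, for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "a", "the"));

    // Lower-cases the text, drops stop words, and joins each pair of
    // neighbouring words that remain into one two-word phrase.
    public static List<String> shingles(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [blue house, house very, very beautiful]
        System.out.println(shingles("blue house is very beautiful"));
    }
}
```

Each phrase produced this way could then be indexed and counted like a single term, which is the idea behind matching on phrase frequencies.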



Re: Phrase indexing and searching

2013-12-23 Thread Manjula Wijewickrema
Hi Steve,

Thanks for the reply. Could you please show me how to use ShingleFilter
in the code for both indexing and searching? Different people have
suggested different snippets, and none of them did the job.

Thanks,

Manjula.




Phrase indexing and searching

2013-12-22 Thread Manjula Wijewickrema
Dear All,

My Lucene programme is able to index single words and search a corpus for
the documents that best match the input document (based on term
frequencies).
Now I want to index two-word phrases and search the corpus for the
documents that best match the input document (based on phrase frequencies).

ex:-
input document:
blue house is very beautiful

split it into phrases (say two term phrases) like:
blue house
house very
very beautiful
etc.

 Is it possible to do this with Lucene? If so how can I do it?

Thanks,

Manjula.


Phrase indexing and searching

2013-12-18 Thread Manjula Wijewickrema
Dear list,

My Lucene programme is able to index single words and search a corpus for
the documents that best match the input document (based on term
frequencies).
Now I want to index two-word phrases and search the corpus for the
documents that best match the input document (based on phrase frequencies).

ex:-
input document:
blue house is very beautiful

split it into phrases (say two term phrases) like:
blue house
house very
very beautiful
etc.

 Is it possible to do this with Lucene? If so how can I do it?

Thanks,

Manjula.


Re: Phrase indexing and searching with Lucene

2009-02-23 Thread Chris Hostetter

: Subject: Phrase indexing and searching with Lucene
: References: 499be497.2060...@mapmyindia.com
: 18248.30864...@web26005.mail.ukl.yahoo.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking





-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Phrase indexing and searching with Lucene

2009-02-19 Thread Nada Mimouni

Hello,

Thank you Erick for this detailed answer, that makes things clearer in my mind.

> I'm still not clear why the built-in phrase query syntax won't work.

I have written a set of Java classes (using Lucene classes) to index and
search a collection of documents for a set of queries.
To test my system, I use a corpus which consists of a collection of queries (n
queries) and documents (m documents).
I started by creating one index for all queries and another one for all
documents. Then I run the search to match the query index against the
document index.
I use a TREC evaluation tool to generate a file that lists all hits (matches)
between queryID and documentID with their scores.

In this first step, I index only terms, so the search process (as I have
it now) looks only for term matches between the query terms and the document
terms.
Now I want to get better results (better matching) by adding phrases to terms. 

I don't know exactly whether it makes a difference if I index phrases and terms 
(erick, erickson, thinks, small, thoughts, erick erickson, erickson thinks, 
small thoughts, erickson thoughts) and then search for both, or just keep the 
indexing process as it is (erick, erickson, thinks, small, thoughts) and then 
make a search for phrases (PhraseQuery : erick erickson, erickson thinks, small 
thoughts, erickson thoughts) and terms. 
Any idea?


> Some examples of what you put in your index and what searches
> you expect to return results for your example AND searches you do
> NOT want to hit that document would be a great help.

input: 

*Query*  
898  Why is the sun bright?

*Documents* 
7568  Star, large celestial body composed of gravitationally contained hot 
gases emitting electromagnetic radiation, especially light, as a result of 
nuclear reactions inside the star. The sun is a star.
7567  The sun has a magnitude of -26.7, inasmuch as it is about 10 billion 
times as bright as Sirius in the earth's sky. 

output: 

qID dID score 
898 7568 0,13 (not relevant)
898 7567 1 (relevant)


In this example, Lucene ranks document 7567 as relevant to the query (since
it contains all query terms); however, "bright" here is relative to Sirius
(what we need is a match on "sun bright").
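
One way to make the "sun bright" requirement concrete is to check how far apart the two words are in each text. The plain-Java sketch below is only an illustration under stated assumptions: the class, the method names and the crude tokenizer are hypothetical, and the distance is a simple difference of token positions. Lucene's PhraseQuery slop is defined in terms of position moves, which this sketch does not reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ProximityCheck {
    // Lower-cases the text and splits on anything that is not a letter or digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Smallest difference in token position between any occurrence of a
    // and any occurrence of b; returns -1 if either word is missing.
    public static int minDistance(List<String> tokens, String a, String b) {
        int lastA = -1;
        int lastB = -1;
        int best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.equals(a)) lastA = i;
            if (t.equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0) {
                int d = Math.abs(lastA - lastB);
                if (best < 0 || d < best) best = d;
            }
        }
        return best;
    }

    // True if both words occur and lie at most maxDistance positions apart.
    public static boolean within(List<String> tokens, String a, String b, int maxDistance) {
        int d = minDistance(tokens, a, b);
        return d >= 0 && d <= maxDistance;
    }

    public static void main(String[] args) {
        List<String> doc7567 = tokenize("The sun has a magnitude of -26.7, inasmuch as it "
                + "is about 10 billion times as bright as Sirius in the earth's sky.");
        System.out.println(minDistance(doc7567, "sun", "bright"));
        System.out.println(within(doc7567, "sun", "bright", 3));
    }
}
```

With this tokenizer, "sun" and "bright" are 16 positions apart in document 7567, so a small window would reject it even though a plain term match accepts it.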




Best 
Nada



Re: Phrase indexing and searching with Lucene

2009-02-19 Thread Erick Erickson
It looks to me like what you're trying to do is akin to document similarity,
which I haven't had to delve into. But it's been discussed on the user list
a few times, so perhaps your best bet would be to search the mail archives
for that topic.

Best
Erick


Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni

Hello everybody,

In my research work, I use Lucene to index and search into text documents.
At present, I just index and search for single words. I want to extend this to 
phrases (or nGrams).

Could anyone please give me more details on how to do it and also point me to 
some useful references on this?

Thank you very much in advance for your help.

Best regards,
Nada Mimouni






Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni


Hello everybody,

I use Lucene to index and search into text documents.
At present, I just index and search for single words. I want to extend this to 
phrases (or nGrams).

Could anyone please give me details on how to index phrases and then make a 
phrase search? 

Thank you very much in advance for your help.

Nada Mimouni


Re: Phrase indexing and searching with Lucene

2009-02-18 Thread Erick Erickson
Have you tried the built-in phrase processing with double quotes? e.g.
"this is a phrase"?

See the Term section at
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

Best
Erick




RE: Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni


Thank you Erick.

I need first to index phrases, the built-in phrase processing (with double 
quotes) comes in the search step.  
Is there any difference between : 
1) start by indexing phrases and then make a phrase search 
2) index terms and then search for phrases


To make things clearer:

What I am doing now: 
 - In the indexing step:  I am using IndexTermGenerator to generate term 
based indexes, one index for all queries I have and another one for documents 
(term means single word). 
 - In the search step : Lucene matches terms in queries index with terms in 
documents index.

What I need to do:
 - Index phrases (multi words) in addition to terms (single words)
 - Search for both : phrases and terms


Is there any idea on how to proceed?

Regards
Nada



Re: Phrase indexing and searching with Lucene

2009-02-18 Thread Erick Erickson
I'm still not clear why the built-in phrase query syntax won't work. If I
index the following terms (erick, erickson, thinks, small, thoughts)
in a single field, then searching for "erick erickson" (as a phrase query,
i.e. with double quotes when sent through a query parser or constructing
a PhraseQuery yourself) will generate a hit but "erick thinks" won't
generate a hit (unless you specify slop).

"thinks small thoughts" would also generate a hit.
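
To illustrate why the first and third searches hit and the second does not, here is a small plain-Java sketch of exact phrase matching over term positions. It is not Lucene's PhraseQuery implementation; the class and method names are made up for this example, and slop is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;

public class PhraseMatch {
    // True if the phrase terms occur at consecutive positions, in order,
    // somewhere in the indexed tokens of the field.
    public static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> field = Arrays.asList("erick", "erickson", "thinks", "small", "thoughts");
        // prints true: the two terms are adjacent
        System.out.println(containsPhrase(field, Arrays.asList("erick", "erickson")));
        // prints false: "erickson" sits between the two terms
        System.out.println(containsPhrase(field, Arrays.asList("erick", "thinks")));
        // prints true: three adjacent terms
        System.out.println(containsPhrase(field, Arrays.asList("thinks", "small", "thoughts")));
    }
}
```

The point of the sketch is that single terms plus their positions are enough to answer phrase searches, so the phrases themselves do not have to be indexed as separate units for this to work.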

If you're saying that you only want to match on *all* the tokens, i.e.
the only way to get a hit on the above would be to search for
"erick erickson thinks small thoughts", then you can create a
field that's NOT_ANALYZED. If you do this, though, beware
that you have to do things like lower-casing terms yourself when
indexing.

I have no idea what IndexTermGenerator is or what it does, but I'm
assuming that it just generates single words.

Some examples of what you put in your index and what searches
you expect to return results for your example AND searches you do
NOT want to hit that document would be a great help.

As far as searching for both, constructing a BooleanQuery with regular
TermQuerys and PhraseQuerys would work if you're constructing
your queries programmatically, or just using a Lucene query
like +termfield:word +phrasefield:"erick erickson thinks" would
work. Or, if you just require that the phrase exists you could do
it all in one field like
+field:word +field:"erick erickson thinks"
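
If the query text is assembled in code and then handed to a query parser, a small helper can keep the quoting consistent. The sketch below is plain Java string handling only; the class and method names are hypothetical, and it assumes the field names and the single term contain no characters that are special to the query syntax. It builds a string with a required term clause and a required phrase clause, with the phrase in double quotes.

```java
public class QueryStrings {
    // Wraps a phrase in double quotes, backslash-escaping any backslash
    // or double quote that appears inside it.
    public static String quotePhrase(String phrase) {
        return "\"" + phrase.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds a query string with two required clauses ('+' prefix):
    // one plain term and one quoted phrase.
    public static String requiredTermAndPhrase(String termField, String term,
                                               String phraseField, String phrase) {
        return "+" + termField + ":" + term
                + " +" + phraseField + ":" + quotePhrase(phrase);
    }

    public static void main(String[] args) {
        // prints +termfield:word +phrasefield:"erick erickson thinks"
        System.out.println(requiredTermAndPhrase(
                "termfield", "word", "phrasefield", "erick erickson thinks"));
    }
}
```

Passing the same field name for both clauses gives the single-field form of the query.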



Best
Erick

