Re: Missing fields used for a sort

2006-07-12 Thread Chris Hostetter

: > I can't thank you enough, Yonik :-)
: >
:
: send money .

Bah! ... there's lots of money in the world, they print more and more of
it every day.

Quality Patches ... now there's something I bet Yonik would *really*
appreciate!  :)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: RangeQuery question?

2006-07-12 Thread Chris Hostetter

1) RangeQuery is the devil, don't use it.  If I weren't so lazy I would
change the javadocs for RangeQuery so that sentence was the class summary.
Take a look at RangeFilter or ConstantScoreRangeQuery instead.

2) it's not clear what exactly you want your example to mean ... perhaps
you mean you want to match all docs with a "startDate" field
greater than "20060710" and an "endDate" field less than "20060711", in
which case what you want to do is make a BooleanQuery containing two
ConstantScoreRangeQueries -- one on the startDate and one on the endDate.
... If that's not what you mean, then I don't understand your question.
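
For reference, the combination described above can be sketched like this against the Lucene 2.0-era API (a sketch, not a definitive implementation; it assumes both fields are indexed as yyyymmdd strings, and a null bound means open-ended):

```java
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreRangeQuery;

// startDate > 20060710 AND endDate < 20060711 (both bounds exclusive)
BooleanQuery q = new BooleanQuery();
q.add(new ConstantScoreRangeQuery("startDate", "20060710", null, false, false),
      Occur.MUST);
q.add(new ConstantScoreRangeQuery("endDate", null, "20060711", false, false),
      Occur.MUST);
```

Flip the booleans to true if you want the range endpoints included.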

: Is there a RangeQuery equivalent that can query date range on two
: different fields?
:
:
:
: Term startTerm = new Term("startDate", "20060710");
:
: Term endTerm = new Term("endDate", "20060711");
:
:
:
: RangeQuery q = new RangeQuery(startTerm, endTerm, true);



-Hoss





Re: Searching for a phrase which spans on 2 pages

2006-07-12 Thread Mile Rosu

Hello Erick,

I have been trying some scenarios on Google Books and apparently found a 
Google bug ...

It looks like they use approach number 2, as this query illustrates:

http://books.google.com/books?vid=ISBN1564968316&id=14Xx2T8tmMYC&pg=PA8&lpg=PA8&dq=%2B%22the+site+is+unburdened%22&sig=QRJSkKNLm0JlbkcWe2m1-y8YYz0

The phrase returns 2 hits, but if you look at the documents, the phrase 
is visible only in the first one.


Anyway, it makes it possible to find something like:

http://books.google.com/books?q=%22sense+of+dissatisfaction+with+existing+elements%22&btnG=Search+Books
The returned page is the first of the pages the phrase spans (but with 
no highlighting).


It seems we are really close to a good solution; now we are looking for 
a way to implement it in terms of index structure.


Thanks again,
Mile Rosu


Erick Erickson wrote:
I can think of several approaches, but the experts will no doubt show me up..

1> index the entire book as a single document. Also, index the beginning and
ending offset of each page in separate "documents". Assuming you can find
the offset in the big doc of each matching phrase, you can also find out
what pages each match starts on and ends on, and if they are different you'd
know to display two pages. Not sure what this does to relevancy...

2> Index, say, the 10 words on the previous page and 10 words on the next
page with the current page. You'd have to make sure your match wasn't
entirely within the 10 words you prepended or appended to the "match" page
(again by match position) when you returned data.

3> Have a series of "joiner" "documents". One for the 9 words of page n, and
9 words of page n + 1 (along with the page number). Another set for 8
before and 8 after. etc. down to 1. If your phrase was 10 words, you'd
search your normal pages, and the 9 word "joiner" pages. Any match in the
joiners would be a page spanner. Again, what does that do to relevancy?


Note that there is no requirement that every document have the same fields,
so your searches can be disjoint. Also, I'm assuming that you can reasonably
decide that, say, 10 word phrases are the max you'll respect, which may not
be true.

I have no idea whether these are reasonable approaches given your problem
domain

Best
Erick







query for search through lucene for BLOB

2006-07-12 Thread sudarshan angirash

hi all,

I have some PDF files stored in Oracle 9i as BLOBs.
Now I want to search for a string in those PDF files using Lucene, and then
show the selected PDF files which contain the string.

If you can give me any pointers about how to do it, it would be a great
help for me.

regards
sudarshan


Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread Grant Ingersoll

Hi Amit,

This is definitely something you can do.   What are your goals for  
it?  Do you want to search by word and POS or do you just want POS  
available for post processing?


You could just append the POS tag onto the end of your token as it  
gets indexed, something like foo_NN or foo_ADJ.  This approach may  
mean you have to use a prefix query when you want to search against  
just "foo".  You could also have a parallel field to your main  
field that stores the POS.  Then you could access it via the term  
vectors array.


Also, we have been discussing on the developers list how to add  
payloads to a posting (i.e. store related information at a position  
in the index), similar to what Google discusses in their original  
paper.  Unfortunately, this isn't implemented yet, but if you feel  
like helping out, check out the discussion on the developer's list  
(see Flexible Indexing).


-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:


Hi,

A new project that I am investigating lucene for needs the  Parts  
of speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing  
the documents but I would like to  store that

information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k e  
r s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to  
store the offset information along with parts of speech

outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it was  
a great help to see the CategorizerTest.java.

I am ordering my copy of the book tomorrow :)

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-


--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886







RE: Searching for a phrase which spans on 2 pages

2006-07-12 Thread Mike Streeton
The simplest solution is always the best - when storing the page, do not
break up sentences. So a page will be all the sentences that occur on
it. If a sentence starts on one page and finishes on the next it will be
included in both pages in the index.
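
Mike's scheme can be sketched without any Lucene code at all: given the full text and the character offset at which each page starts, the text indexed for a page is every sentence that overlaps the page's range, so a boundary-crossing sentence lands in both pages. A minimal sketch (the class name, the naive sentence splitter, and the offset convention are all assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class PageText {
    /**
     * Given the full book text and the character offset where each page
     * starts, returns the text to index for each page: every sentence
     * that overlaps the page's character range. A sentence that crosses
     * a page boundary therefore appears in both pages' indexed text.
     */
    public static List<String> pagesForIndexing(String text, int[] pageStarts) {
        // naive splitter: break after sentence-ending punctuation
        String[] sentences = text.split("(?<=[.!?])\\s+");
        // record each sentence's start offset in the original text
        int[] sentStart = new int[sentences.length];
        int pos = 0;
        for (int i = 0; i < sentences.length; i++) {
            pos = text.indexOf(sentences[i], pos);
            sentStart[i] = pos;
            pos += sentences[i].length();
        }
        List<String> pages = new ArrayList<String>();
        for (int p = 0; p < pageStarts.length; p++) {
            int lo = pageStarts[p];
            int hi = (p + 1 < pageStarts.length) ? pageStarts[p + 1] : text.length();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < sentences.length; i++) {
                int sLo = sentStart[i];
                int sHi = sLo + sentences[i].length();
                if (sLo < hi && sHi > lo) { // sentence overlaps this page
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(sentences[i]);
                }
            }
            pages.add(sb.toString());
        }
        return pages;
    }
}
```

Each returned string would then be indexed as one Lucene document with a page-number field, so a phrase crossing the boundary matches both pages.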

Hope this helps

Mike

www.ardentia.com the home of NetSearch
-Original Message-
From: Mile Rosu [mailto:[EMAIL PROTECTED] 
Sent: 11 July 2006 15:55
To: java-user@lucene.apache.org
Subject: Searching for a phrase which spans on 2 pages

Hello,

I am working on an application similar to Google Books which allows 
searching over documents, each representing a scanned page. Of course, one 
might search for a phrase starting at the end of one page and ending at 
the beginning of the next one, and I do not know how to handle this 
case: both pages should be returned as hit results.
Do you have any idea how this situation might be handled?

Thank you,
Mile Rosu




Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread mark harwood
Could you not use a custom analyzer to inject "metadata" tokens into the index 
at the same position as the source tokens?

For example, given the text:
The cat jumped over the dog
your analyzer could emit tokens:
[the] [cat,_posNoun] [jumped,_posVerb] [over] [the] 
[dog,_posNoun]

where the "_pos" tokens have a zero position increment to effectively 
associate them with the term to which they relate (this is how the example 
SynonymTokenizer in the highlighter package works). The "_pos" prefix is used 
as a uniquefier for metadata tokens to avoid any name-clashes with any real 
content tokens.

Theoretically, you could then construct queries that mix both your data and 
your part-of-speech metadata, e.g. you could use position-based queries 
to find out what things normally have a particular verb applied to them:
 "jumped  _posNoun"~3
or what verbs are commonly associated with a dog (caution advised here):
"_posVerb the dog"~3
or to use an ambiguous word in a particular context/sense:
"_posVerb track"~1
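
A minimal sketch of such a filter against the Lucene 2.0 TokenStream API; tagFor() is a stand-in for a real POS tagger (GATE etc.), not an actual API, and the "_pos" tag names follow the convention above:

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class PosInjectingFilter extends TokenFilter {
    private Token pending; // POS token to emit at the same position

    public PosInjectingFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pending != null) {          // emit the queued POS token
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String tag = tagFor(t.termText()); // e.g. "_posNoun", or null
        if (tag != null) {
            pending = new Token(tag, t.startOffset(), t.endOffset());
            pending.setPositionIncrement(0); // same position as the source token
        }
        return t;                        // source token passes through untouched
    }

    // Placeholder: plug in a real part-of-speech tagger here.
    private String tagFor(String term) {
        return null;
    }
}
```

Because the POS token's position increment is 0, phrase and span queries see it at the same slot as the word it annotates.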
 

Cheers,
Mark

- Original Message 
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Cc: Amit Kumar <[EMAIL PROTECTED]>
Sent: Wednesday, 12 July, 2006 6:36:24 AM
Subject: Storing Part of Speech information in Lucene Indices

Hi,

A new project that I am investigating lucene for needs the  Parts of  
speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing the  
documents but I would like to  store that
information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k e r  
s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to store  
the offset information along with parts of speech
outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it was a  
great help to see the CategorizerTest.java.
I am ordering my copy of the book tomorrow :)

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-





PhraseQuery - retrieving the fieldname

2006-07-12 Thread Mile Rosu

Hello,

A small problem this time: I would like to retrieve the field name of a 
PhraseQuery.

Could you tell me please which is the best way for this ?

Thank you,
Mile Rosu





Re: modify existing non-indexed field

2006-07-12 Thread dan2000

I did clean everything but still getting the same problem. I'm using lucene
2.0. Do you get the same problem on your machine?
-- 
View this message in context: 
http://www.nabble.com/modify-existing-non-indexed-field-tf1905726.html#a5288759
Sent from the Lucene - Java Users forum at Nabble.com.





Re: query for search through lucene for BLOB

2006-07-12 Thread Steven Rowe

Hi Sudarshan,

When your question is Java usage related, you will almost certainly get 
better responses by asking just on the Java User list.  Oddly enough, 
hitting all of the mailing lists for the project at once with the same 
question is likely to *reduce* your chances of getting polite/on-topic 
responses.


To index PDF files, you need to first extract their content, and Lucene 
does not do this for you.  Here is the Lucene FAQ entry on the topic:




Chapter 7 of the excellent book Lucene in Action also covers this topic.


Once you have extracted the text for a document, you'll want to store a 
key for the document in a separate field in your Lucene index.  Then 
when you have hits from a search, you'll be able to use the DB key to 
retrieve the PDF blob from Oracle, then maybe save it to a temp file 
and start Adobe Reader to display the doc(s).  Or something like that.
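
A rough sketch of that flow against the Lucene 2.0-era API. The table and column names are made up for illustration, and extractText() is a placeholder for whatever PDF extractor you choose (e.g. PDFBox), not a real API:

```java
import java.io.InputStream;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BlobIndexer {
    // Placeholder: plug in a real PDF text extractor (PDFBox etc.)
    static String extractText(InputStream pdf) {
        return "";
    }

    public static void index(Connection conn, String indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, pdf_blob FROM docs");
        while (rs.next()) {
            Document doc = new Document();
            // store the DB key so a search hit can lead back to the blob
            doc.add(new Field("dbKey", rs.getString("id"),
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents",
                              extractText(rs.getBlob("pdf_blob").getBinaryStream()),
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.close();
    }
}
```

At search time, read the "dbKey" field from each hit and fetch the blob from Oracle with a normal keyed SELECT.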


Steve

sudarshan angirash wrote:

hi all

i have some PDF files stored in Oracle 9i as BLOB.
now i want to search for a string in those pdf files using Lucene. then i
want to show the selected PDF files which contains The String.

if you can give me any pointers about how to do it, then it will be a gr8
help for me.

regards
sudarshan







Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread Amit Kumar
We need to be able to search by word and POS and also have POS  
available for each occurrence.  Appending POS to the terms will  
create a post-processing nightmare to retrieve term frequencies, 
right? (I would have to add up all the foo_NN and foo_ADJ etc.).


I can store the POS in a parallel field and access it via term  
vectors, but that wouldn't allow any kind of search on POS-related  
fields, right?  For example, if I wanted to search for any adjective 
within 3 words of, say, a term, or if I wanted to get all the 
patterns that follow the sequence ADJ NN ADJ.


Let me look in the developer archives for the payload discussions,  
perhaps implementing that might satisfy my use cases.


Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:


Hi Amit,

This is definitely something you can do.   What are your goals for  
it?  Do you want to search by word and POS or do you just want POS  
available for post processing?


You could just append the POS tag onto the end of your token as it  
gets indexed, something like foo_NN or foo_ADJ.  This approach may  
mean you have to use prefix query when you want to search against  
just "foo".You could also have a parallel field to your main  
field that stores the POS.  Then you could access it via the term  
vectors array.


Also, we have been discussing on the developers list on how to add  
payloads to a posting (i.e. store related information at a position  
in the index) similar to what Google discusses in their original  
paper.  Unfortunately, this isn't implemented yet, but if you feel  
like helping out, check out the discussion on the developer's list  
(see Flexible Indexing).


-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:


Hi,

A new project that I am investigating lucene for needs the  Parts  
of speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing  
the documents but I would like to  store that

information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k  
e r s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to  
store the offset information along with parts of speech

outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it was  
a great help to see the CategorizerTest.java.

I am ordering my copy of the book tomorrow :)

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-


--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886







-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-






Re: combined filesystem and web search

2006-07-12 Thread Erick Erickson

I haven't used the multisearcher personally, so I'll let others chime in.
And I know nothing about the IndexMergeTool; I've only seen the interface in
the Lucene Javadoc. And I must say the documentation isn't really helpful :(.

To add to an existing index, just instantiate the IndexWriter with the
boolean create parameter set to 'false'. There's no contention problem with
one (or more) searchers searching an index while it's being modified.
HOWEVER, your searchers won't see the data you're adding UNTIL you close and
re-open the readers/searchers.
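
Concretely, a sketch of both halves against the Lucene 2.0-era API (the index path is an example, and `doc` and `searcher` are assumed to already exist):

```java
// create=false appends to the existing index instead of overwriting it
IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.addDocument(doc);
writer.close();

// An already-open searcher still sees the pre-update snapshot;
// close and re-open it to pick up the new documents.
searcher.close();
searcher = new IndexSearcher("/path/to/index");
```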

Best
Erick


Re: Searching for a phrase which spans on 2 pages

2006-07-12 Thread Erick Erickson

Sweet!


Lucene index database

2006-07-12 Thread mcarcelen
Hi,
Can Lucene index a database? PostgreSQL, Mysql, Access ?
Thanks
Cheers
Teresa






Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread mark harwood
>>Appending POS to the terms will create post processing nightmare

I think you may have missed the subtle distinction between Grant's suggestion 
and mine.

His suggestion was to append your POS info to the source token - creating a 
single token which combined both the original content and your POS info.
My suggestion was to generate *two* tokens -- one is the original source token, 
untouched; the other is the POS token, but the trick is to set the "position 
increment" to 0 for the POS token. This ensures that the POS token is 
considered to occur at exactly the same location as the source token. The 
original source token is untampered with and has all the usual stats in the 
index. The POS token and/or source token can be used interchangeably in queries 
on the same field, so you can use position info, e.g. phrase or span queries.

Cheers
Mark



- Original Message 
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 July, 2006 4:15:34 PM
Subject: Re: Storing Part of Speech information in Lucene Indices

We need to be able to search by word and POS and also have POS  
available for each occurrence.  Appending POS to the terms will  
create post processing nightmare to retrieve
term frequencies right? (I would have to add all the foo_NN and  
foo_ADJ etc.).

I can store the POS in a parallel field and access it via term  
vectors, but that wouldn't allow any kind of search on POS related  
fields right?  For example if I wanted to search for any
adjective with in 3 words of say a term or say If I wanted to get all  
the patterns that follow the sequence ADJ NN ADJ.

Let me look in the developer archives for the payload discussions,  
perhaps implementing that might satisfy my use cases.

Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:

> Hi Amit,
>
> This is definitely something you can do.   What are your goals for  
> it?  Do you want to search by word and POS or do you just want POS  
> available for post processing?
>
> You could just append the POS tag onto the end of your token as it  
> gets indexed, something like foo_NN or foo_ADJ.  This approach may  
> mean you have to use prefix query when you want to search against  
> just "foo".You could also have a parallel field to your main  
> field that stores the POS.  Then you could access it via the term  
> vectors array.
>
> Also, we have been discussing on the developers list on how to add  
> payloads to a posting (i.e. store related information at a position  
> in the index) similar to what Google discusses in their original  
> paper.  Unfortunately, this isn't implemented yet, but if you feel  
> like helping out, check out the discussion on the developer's list  
> (see Flexible Indexing).
>
> -Grant
>
> On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:
>
>> Hi,
>>
>> A new project that I am investigating lucene for needs the  Parts  
>> of speech information for the tokens. I can  get that
>> information using NLP techniques  (GATE etc.), by pre processing  
>> the documents but I would like to  store that
>> information in the Indices. Something along the lines of
>>
>> TermVectorOffsetInfo[?].getPartofSpeech();
>>
>> I am writing to ask for your advice, you can tell me I am b o n k  
>> e r s  or let me know where I should start digging :).
>> Is that a good idea? Or would it be just less trouble for me to  
>> store the offset information along with parts of speech
>> outside Lucene.
>>
>> Has anyone else done that?
>>
>> Best,
>> Amit
>>
>>
>> ps: Thank you for putting the LuceneInAction source online, it was  
>> a great help to see the CategorizerTest.java.
>> I am ordering my copy of the book tomorrow :)
>>
>> -
>> Amit Kumar
>> Research Programmer
>> The Graduate School of Library and Information Science
>> University of Illinois, Urbana Champaign IL, 61820
>> phone: 217-333-4118 fax: 217-244-3302
>> -
>>
>>
>>
>>
>>
>>
>
>
>
> --
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> 335 Hinds Hall
> Syracuse, NY 13244
> http://www.cnlp.org
>
> Voice: 315-443-5484
> Fax: 315-443-6886
>
>
>
>

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-











Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread Amit Kumar
You are right. I saw your email after pressing  send. Let me  
experiment. Thanks for the tip.


Best,
Amit

On Jul 12, 2006, at 10:55 AM, mark harwood wrote:


Appending POS to the terms will create post processing nightmare


I think you may have missed the subtle distinction between Grant's  
suggestion and mine.


His suggestion was to append your POS info to the source token -  
creating a single token which combined both the original content  
and your POS info.
My suggestion was to generate *two* tokens - one is the original  
source token, untouched, the other is the POS token but the trick  
is to set the "position increment" to 0 for the POS token. This  
ensures that the POS token is considered to occur at exactly the  
same location as the source token. The original source token is  
untampered with and has all the usual stats in the index.The POS  
token and/or source token can be used interchangeably in queries on  
the same field so you can use position info eg phrase or span queries.


Cheers
Mark



- Original Message 
From: Amit Kumar <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 12 July, 2006 4:15:34 PM
Subject: Re: Storing Part of Speech information in Lucene Indices

We need to be able to search by word and POS and also have POS
available for each occurrence.  Appending POS to the terms will
create post processing nightmare to retrieve
term frequencies right? (I would have to add all the foo_NN and
foo_ADJ etc.).

I can store the POS in a parallel field and access it via term
vectors, but that wouldn't allow any kind of search on POS related
fields right?  For example if I wanted to search for any
adjective with in 3 words of say a term or say If I wanted to get all
the patterns that follow the sequence ADJ NN ADJ.

Let me look in the developer archives for the payload discussions,
perhaps implementing that might satisfy my use cases.

Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:


Hi Amit,

This is definitely something you can do.   What are your goals for
it?  Do you want to search by word and POS or do you just want POS
available for post processing?

You could just append the POS tag onto the end of your token as it
gets indexed, something like foo_NN or foo_ADJ.  This approach may
mean you have to use prefix query when you want to search against
just "foo".You could also have a parallel field to your main
field that stores the POS.  Then you could access it via the term
vectors array.

Also, we have been discussing on the developers list on how to add
payloads to a posting (i.e. store related information at a position
in the index) similar to what Google discusses in their original
paper.  Unfortunately, this isn't implemented yet, but if you feel
like helping out, check out the discussion on the developer's list
(see Flexible Indexing).

-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:


Hi,

A new project that I am investigating lucene for needs the  Parts
of speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing
the documents but I would like to  store that
information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k
e r s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to
store the offset information along with parts of speech
outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it was
a great help to see the CategorizerTest.java.
I am ordering my copy of the book tomorrow :)

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-


--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886







-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-




-
Amit Kumar
Research Programmer
The Graduate School

Sort Cache

2006-07-12 Thread Mark Miller

I am going to be working with a medium index of 200k to 1m documents.
Occasionally, there will be single-document corrections applied to this
index. I am worried about this action clearing my sort buffers. I saw the
method of priming another searcher, but if you have a bunch of fields that
may be sorted, that could take longer than I want (I am trying to get as
close to realtime as possible). Would it make any sense to try to save the
sort cache, insert the new doc into it (whatever that entails, I don't
know), and then pass the sort cache to a new searcher? Or something along
those lines...?

Does that sound completely crazy, or do you have a much better way for me to
try? Or maybe I am worrying about nothing at all...

Thanks for any comments,

Mark


Reuse of IndexReader

2006-07-12 Thread Dominik Bruhn
Hi,
I've got the following situation:
A servlet running in Tomcat 5. When the servlet starts up, it automatically 
creates an IndexReader and stores it in a static variable, which is used 
for searching. When adding a document to the index, I create an 
IndexWriter, write the Document, and close the IndexWriter again.
This leads to a problem: the IndexReader only searches documents which 
were in the index when the Reader was created. Every Document added 
afterwards is not searched. When I restart the servlet, I can search 
those documents.

Is this normal behaviour, and how can I avoid it?

Thanks
-- 
Dominik Bruhn
mailto: [EMAIL PROTECTED]
http://www.dbruhn.de




Re: Storing Part of Speech information in Lucene Indices

2006-07-12 Thread Grant Ingersoll
I think Mark's idea is better for this.  Although I seem to recall  
there being some caveats w/ multiple tokens at the same position,  
I don't remember the details.  I _think_ term vectors don't like it,  
so if you need them, you might have trouble.  Perhaps a search of  
the mailing lists and JIRA might turn up something, or maybe someone  
else remembers.  At any rate, it may not affect you, so I would try  
Mark's suggestion and see if it works.


-Grant

On Jul 12, 2006, at 11:15 AM, Amit Kumar wrote:

We need to be able to search by word and POS and also have POS  
available for each occurrence.  Appending POS to the terms will  
create post processing nightmare to retrieve
term frequencies right? (I would have to add all the foo_NN and  
foo_ADJ etc.).


I can store the POS in a parallel field and access it via term  
vectors, but that wouldn't allow any kind of search on POS related  
fields right?  For example if I wanted to search for any
adjective with in 3 words of say a term or say If I wanted to get  
all the patterns that follow the sequence ADJ NN ADJ.


Let me look in the developer archives for the payload discussions,  
perhaps implementing that might satisfy my use cases.


Comments?

-Thanks
Amit



On Jul 12, 2006, at 6:39 AM, Grant Ingersoll wrote:


Hi Amit,

This is definitely something you can do.   What are your goals for  
it?  Do you want to search by word and POS or do you just want POS  
available for post processing?


You could just append the POS tag onto the end of your token as it  
gets indexed, something like foo_NN or foo_ADJ.  This approach may  
mean you have to use prefix query when you want to search against  
just "foo".You could also have a parallel field to your main  
field that stores the POS.  Then you could access it via the term  
vectors array.


Also, we have been discussing on the developers list on how to add  
payloads to a posting (i.e. store related information at a  
position in the index) similar to what Google discusses in their  
original paper.  Unfortunately, this isn't implemented yet, but if  
you feel like helping out, check out the discussion on the  
developer's list (see Flexible Indexing).


-Grant

On Jul 12, 2006, at 1:36 AM, Amit Kumar wrote:


Hi,

A new project that I am investigating lucene for needs the  Parts  
of speech information for the tokens. I can  get that
information using NLP techniques  (GATE etc.), by pre processing  
the documents but I would like to  store that

information in the Indices. Something along the lines of

TermVectorOffsetInfo[?].getPartofSpeech();

I am writing to ask for your advice, you can tell me I am b o n k  
e r s  or let me know where I should start digging :).
Is that a good idea? Or would it be just less trouble for me to  
store the offset information along with parts of speech

outside Lucene.

Has anyone else done that?

Best,
Amit


ps: Thank you for putting the LuceneInAction source online, it  
was a great help to see the CategorizerTest.java.

I am ordering my copy of the book tomorrow :)

-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-



--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Fax: 315-443-6886







-
Amit Kumar
Research Programmer
The Graduate School of Library and Information Science
University of Illinois, Urbana Champaign IL, 61820
phone: 217-333-4118 fax: 217-244-3302
-






--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org

Voice: 315-443-5484
Skype: grant_ingersoll
Fax: 315-443-6886







Re: Lucene index database

2006-07-12 Thread Erick Erickson

This has been extensively discussed in the mail archive, I think a search of
the archive would help you a lot.

The short form is no. There's nothing built into Lucene to help you index a
database. How would you define that anyway? 

That said, you can write a program to extract data from the database and
index that data. Depending on what you need to do, you can either store
enough data in the index to satisfy searches, or store data in each
"document" you index that allows you to "do the right thing" as far as the
database is concerned to satisfy searches.

Best
Erick
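To make the "extract data and index it" approach concrete, here is a minimal sketch of pulling rows over JDBC and feeding them to an IndexWriter. The connection string, table, and column names are made up; the Lucene calls are the 2.0-era API:

```java
import java.sql.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DbIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and schema -- adjust to your database.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");
        IndexWriter writer = new IndexWriter(
                "/tmp/db-index", new StandardAnalyzer(), true);
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
        while (rs.next()) {
            Document doc = new Document();
            // Store the primary key so a hit can be traced back to its row.
            doc.add(new Field("id", rs.getString("id"),
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("title", rs.getString("title"),
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("body", rs.getString("body"),
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        conn.close();
    }
}
```

At search time the stored "id" field is all you need to go back to the database for the full record, which is the "do the right thing" part described above.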


Re: Lucene index database

2006-07-12 Thread Michael J. Prichard

Hey there Teresa.

Short answer: Not directly.

Long answer:  Lucene is a set of libraries built for indexing text and 
then searching those indexes.  Not sure what you mean by indexing a 
database per se.  You could write some code to get the records you want 
from the database and then index those.  For example, if you have a ton 
of articles stored in a database you could grab those articles, pull the 
text, index and then use the lucene index for searches.  Once you found 
something you wanted you could go back to the database for additional 
information.


Hope this helps.
-Michael

mcarcelen wrote:


Hi,
Can Lucene index a database? PostgreSQL, Mysql, Access ?
Thanks
Cheers
Teresa



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

 




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reuse of IndexReader

2006-07-12 Thread Erik Hatcher


On Jul 12, 2006, at 12:48 PM, Dominik Bruhn wrote:


Hy,
I got the following situation:
A Servlet running in Tomcat5. When starting the servlet up it
automatically
creates a IndexReader and stores it in a static variable. For  
searching this

variable is used. When adding a document to the index, I create a
IndexWriter, write the Document, and close the IndexWriter again.
This leads into a problem: The IndexReader only searches in  
documents which

were in the index when the Reader was created. Every Document added
afterwards is not searched in. When I restart the Servlet I can  
search in

those documents.

Is this a normal behaviour and how can I avoid this?


This is normal behavior.  You'll need to develop some scheme for  
opening a new IndexReader when it is appropriate.  Lots of caveats  
apply.  See Solr for a solid implementation of how this can be done.


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene index database

2006-07-12 Thread Michael J. Prichard

Ha Erick,

we must have sent our responses at the same time :)

What Erick said :)

Erick Erickson wrote:

This has been extensively discussed in the mail archive, I think a 
search of

the archive would help you a lot.

The short form is no. There's nothing built into Lucene to help you 
index a

database. How would you define that anyway? 

That said, you can write a program to extract data from the database and
index that data. Depending on what you need to do, you can either store
enough data in the index to satisfy searches, or store data in each
"document" you index that allows you to "do the right thing" as far as 
the

database is concerned to satisfy searches.

Best
Erick




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reuse of IndexReader

2006-07-12 Thread Erick Erickson

This is normal behavior. When you open a reader, it takes a snapshot of the
index and uses that snapshot until it is closed, and any updates to the
index in the meantime are invisible to that reader.

You could periodically close and reopen the reader to get the latest data,
it's not necessary to stop the program. But as you describe things, you'd
have to take care that you don't close the reader if one of your (I presume)
threads was actually using it.

Best
Erick
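One simple scheme for the "close and reopen when appropriate" part might look like the sketch below (this is not how Solr does it, and it ignores the caveat above about a thread still searching on the old reader, which a real implementation must handle, e.g. with reference counting):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class ReaderHolder {
    private static IndexReader reader;

    // Return the shared reader, re-opening it only if the index has
    // changed since the current reader's snapshot was taken.
    public static synchronized IndexReader getReader(String indexDir)
            throws IOException {
        if (reader == null
                || reader.getVersion() != IndexReader.getCurrentVersion(indexDir)) {
            IndexReader old = reader;
            reader = IndexReader.open(indexDir);
            if (old != null) {
                old.close(); // unsafe if another thread is mid-search on it
            }
        }
        return reader;
    }
}
```

Comparing reader.getVersion() with IndexReader.getCurrentVersion() is a cheap check, so it can be done on every request rather than on a timer.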


Re: Lucene index database

2006-07-12 Thread Erick Erickson

What Michael said :).


Re: Reuse of IndexReader

2006-07-12 Thread Dominik Bruhn
Hy,
thanks for your answers.
Upon creation of the Reader, does Lucene copy the whole index into RAM? Or is 
this cache filled while searching? How can I find out how long it takes to 
create the IndexReader? Is it just the time of the create call?

Thanks
-- 
Dominik Bruhn
mailto: [EMAIL PROTECTED]
http://www.dbruhn.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reuse of IndexReader

2006-07-12 Thread Mark Miller

http://www.nabble.com/Fwd%3A-Contribution%3A-LuceneIndexAccessor-t17416.html#a47049

A good implementation for what you need.

- Mark

On 7/12/06, Dominik Bruhn <[EMAIL PROTECTED]> wrote:


Hy,
thanks for your answers.
Uppon creation of the Reader, does Lucene copy the whole Index into RAM?
Or is
this cache filled while searching? How can I find out how long it takes to
create the IndexReader? Just time to Create-Call?

Thanks
--
Dominik Bruhn
mailto: [EMAIL PROTECTED]
http://www.dbruhn.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Sort Cache

2006-07-12 Thread Chris Hostetter

: close to realtime as possible). Would it make any sense in trying to save
: the sort cache, insert the new doc in that (whatever that entails, I don't
: know), and then pass the sort cache to a new searcher? Or something along
: those lines...?

as crazy as this sounds -- it's even harder than you think.
deleting and re-adding a single document to your index can cause the
docIds of other documents to change as a result of segment merging -- so
reusing your FieldCache and just making small modifications to it really
isn't feasible.

: try? Or mabye I am worrying about nothing at all...

Unless you have existing code and hard numbers demonstrating a
performance problem -- I wouldn't worry about it.  In my experience with
Solr, warming FieldCaches generally only takes a few seconds. (It's
when I warm up 16,000 Filters that it takes a minute or two)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: modify existing non-indexed field

2006-07-12 Thread Doron Cohen
> I did clean everything but still getting the same problem. I'm using
lucene
> 2.0. Do you get the same problem on your machine?

Please try with this code - http://cdoronc.20m.com/tmp/indexingThreads.zip

Regards,
Doron


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Reuse of IndexReader

2006-07-12 Thread Mark Miller

I have not seen an expert's comment on the previous code I linked to. It
seems (to my young inexperienced eyes) to do an optimal job of providing
realtime access to an index. Anyone else have some experience with this
code?

On 7/12/06, Mark Miller <[EMAIL PROTECTED]> wrote:



http://www.nabble.com/Fwd%3A-Contribution%3A-LuceneIndexAccessor-t17416.html#a47049

A good implementation for what you need.

- Mark


On 7/12/06, Dominik Bruhn <[EMAIL PROTECTED]> wrote:
>
> Hy,
> thanks for your answers.
> Uppon creation of the Reader, does Lucene copy the whole Index into RAM?
> Or is
> this cache filled while searching? How can I find out how long it takes
> to
> create the IndexReader? Just time to Create-Call?
>
> Thanks
> --
> Dominik Bruhn
> mailto: [EMAIL PROTECTED]
> http://www.dbruhn.de
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



Re: Lucene index database

2006-07-12 Thread Chris Lu

What Erick and Michael said are both correct -- and the same. :)

What Lucene can do is search data that is stored in Document objects.
Lucene is said to be able to search HTML, PDF, etc., but that's because
those formats are relatively fixed: you can easily tell the title,
content, etc.

A database, however, has more complicated and flexible structures. You
need to write code to select data, retrieve content, save it into a
Lucene index, parse the queries, render the results, and also keep the
index in sync with the database.

You are welcome to try DBSight to save most of these repeated efforts.

Chris Lu
-
Lucene Search On Any Databases/Applications
http://www.dbsight.net

On 7/12/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

This has been extensively discussed in the mail archive, I think a search of
the archive would help you a lot.

The short form is no. There's nothing built into Lucene to help you index a
database. How would you define that anyway? 

That said, you can write a program to extract data from the database and
index that data. Depending on what you need to do, you can either store
enough data in the index to satisfy searches, or store data in each
"document" you index that allows you to "do the right thing" as far as the
database is concerned to satisfy searches.

Best
Erick




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: RangeQuery question?

2006-07-12 Thread Van Nguyen
Exactly what I was looking for.

Thanks!

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 12, 2006 12:47 AM
To: java-user@lucene.apache.org
Subject: Re: RangeQuery question?


1) RangeQuery is the devil, don't use it.  If I weren't so lazy I would
change the javadocs for RangeQuery so that sentence was the class
summary.
Take a look at RangeFilter or ConstantScoreRangeQuery.

2) it's not clear what exactly you want your example to mean ... perhaps
you mean you want to match all docs with a field of "startDate"
greater than "20060710" and a field of "endDate" less than "20060711",
in which case what you want to do is make a BooleanQuery containing two
ConstantScoreRangeQueries -- one on the startDate and one on the
endDate.
... If that's not what you mean, then I don't understand your question.

: Is there a RangeQuery equivalent that can query date range on two
: different fields?
:
:
:
: Term startTerm = new Term("startDate", "20060710");
:
: Term endTerm = new Term("endDate", "20060711");
:
:
:
: RangeQuery q = new RangeQuery(startTerm, endTerm, true);



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
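In code, the suggestion in the quoted reply -- a BooleanQuery wrapping two ConstantScoreRangeQuerys, one per field -- might look like this sketch against the Lucene 2.0 API, reusing the field names and dates from the question (a null bound means the range is open on that end):

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreRangeQuery;

public class TwoFieldRange {
    public static BooleanQuery build() {
        // Match docs whose startDate is on/after 20060710 AND whose
        // endDate is on/before 20060711.
        BooleanQuery q = new BooleanQuery();
        q.add(new ConstantScoreRangeQuery("startDate", "20060710", null, true, false),
              BooleanClause.Occur.MUST);   // startDate >= 20060710
        q.add(new ConstantScoreRangeQuery("endDate", null, "20060711", false, true),
              BooleanClause.Occur.MUST);   // endDate <= 20060711
        return q;
    }
}
```

Because both clauses are MUST, a document has to satisfy both field ranges to match, which is what the original two-term RangeQuery attempt could not express.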



how to find out if two fields are identical?

2006-07-12 Thread Van Nguyen
Is there a way to compare the values of two fields to see if they are
the same?

 

Let's say we have an index with these fields:

ID:       2
childID:  7
parentID: 0

ID:       3
childID:  6
parentID: 5

ID:       4
childID:  2
parentID: 2

Based on these three documents, I want the query to return the third
document where childID=parentID.



kbforge 2.10 released

2006-07-12 Thread kbforge1

kbforge.com is pleased to announce Release 2.10 of kbforge, a desktop search
application of particular interest to people on the move, including software
developers.
kbforge is different from other desktop search applications because it
creates a database you can carry with you practically anywhere there is a
USB port to connect to, with complete interoperability between Windows and
Linux. A flash memory stick, a portable hard drive or even an MP3 player
have all been used. kbforge is also able to categorize the information
before it is indexed. Using databases and collections of databases, kbforge 
categorizes information into meaningful topics which can be searched
individually or in groups. A host of other features and options make kbforge
the ultimate tool for finding that elusive program fragment, or example or
article or tutorial. Databases can be refreshed automatically, on a
schedule.
kbforge can index inside .ZIP and .JAR files, remove HTML tags before
indexing, etc.
It uses Lucene for indexing and searching, and as the default database
manager. When used in conjunction with Lucene, kbforge is entirely self
contained and can run anywhere.

Written entirely in Java it is the first of its kind to run under Linux as
well as Windows.

kbforge Release 2.10 costs less than a slice of pizza and is just as
addictive, but unlike a slice of pizza it lasts forever and, in the unlikely
event that you don't like it, you will get all your money back.  

Go to http://www.kbforge.com/topics/kbfeatures.html 

Victor Negrin

[EMAIL PROTECTED]
-- 
View this message in context: 
http://www.nabble.com/kbforge-2.10-released-tf1934597.html#a5300730
Sent from the Lucene - Java Users forum at Nabble.com.


Re: how to find out if two fields are identical?

2006-07-12 Thread Chris Hostetter

: Based on these three documents, I want the query to return the third
: document where childID=parentID.

To the best of my knowledge there is no easy way to do this using the
existing lucene query types -- but it would be fairly easy to implement.

Since there are no "scoring" issues involved, I would suggest implementing
it as a new Filter ... if you can allow yourself the constraint that no
document will have more than one value for either field, you can use the
FieldCache to do this very easily -- just iterate over the FieldCache
arrays in unison, and record a Bit for each document where the value is
the same.

Solving the more general case where a document matches if any Term it has
for Field X matches any Term it has for Field Y would certainly be
trickier.


-Hoss
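A sketch of that Filter, under the single-value-per-field constraint, using the field names from the question and Lucene 2.0's Filter API (where bits() returns a BitSet):

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;

public class FieldsEqualFilter extends Filter {
    public BitSet bits(IndexReader reader) throws IOException {
        // One value per document, indexed by docId; null if the doc
        // has no value for the field.
        String[] child  = FieldCache.DEFAULT.getStrings(reader, "childID");
        String[] parent = FieldCache.DEFAULT.getStrings(reader, "parentID");
        BitSet result = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (child[i] != null && child[i].equals(parent[i])) {
                result.set(i);   // childID == parentID for this doc
            }
        }
        return result;
    }
}
```

Pass the filter to IndexSearcher.search() alongside any query (e.g. a MatchAllDocsQuery) to restrict results to documents where the two fields agree; in the three-document example above, only the document with ID 4 would match.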


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]