Re: Problem with search an exact word and stemming

2008-06-27 Thread renou oki
 Thanks for the reply.

I will try to add another data field.
I thought about this solution but was not very sure; I had hoped there was an
easier way to do it...


best regards
Renou

2008/6/26 Matthew Hall <[EMAIL PROTECTED]>:

> You could also add another data field to the index, with an untokenized
> version of your data, and then use a multifield query to go against both the
> stemmed and exact match parts of your search at the same time.
>
> This is a technique I've used quite often on my project with various
> different requirements for the second field.  Mind you it makes the indexes
> bigger, but unless your dataset is large it's not really a huge problem.
>
> Matt
>
> Erick Erickson wrote:
>
>> The way I've solved this is to index the stemmed token *and* a special
>> token at the same position (see the Synonym Analyzer). From your
>> example, say you're indexing "progresser". You'd go ahead and index the
>> stemmed version, "progress", AND you'd also index "progresser$"
>> at the same offset. Now, when you want exact matches, search for
>> the token with the $ at the end.
>>
>> This does make your index a bit larger, but not as much as you'd expect.
>>
>> Best
>> Erick
>>
>> On Wed, Jun 25, 2008 at 4:21 AM, renou oki <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>>> Hello,
>>>
>>> I have a stemmed index, but I want to search for the exact form of a word.
>>> I use the French Analyzer, so for instance "progression" and "progresser" are
>>> indexed under the linguistic root "progress".
>>> But if I search for the word "progress" (and only this word), I get too
>>> many hits (because of "progression", "progresser"...).
>>> The field is indexed and tokenized, but not stored...
>>>
>>> Is there a way to do this, I mean to search for an exact word in a stemmed
>>> index?
>>> I suppose that I have to use the same analyzer for indexing and
>>> searching.
>>>
>>>
>>> I tried with a PhraseQuery, with quotes...
>>>
>>> Ps : I use lucene 1.9.1
>>>
>>> Thanks
>>> Renald
>>>
>>>
>>>
>>
>>
>>
>
> --
> Matthew Hall
> Software Engineer
> Mouse Genome Informatics
> [EMAIL PROTECTED]
> (207) 288-6012
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
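A toy sketch of the dual-token trick Erick describes above (plain Java; the two-rule suffix stripper is a hypothetical stand-in for the real French stemmer, and a real implementation would be a custom TokenFilter that emits the "$"-marked token with a position increment of 0):

```java
import java.util.ArrayList;
import java.util.List;

public class ExactMatchTokens {
    // Hypothetical stand-in for the French stemmer: strips a couple of
    // common suffixes so "progresser"/"progression" both become "progress".
    static String stem(String word) {
        for (String suffix : new String[] {"ion", "er"}) {
            if (word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // For each input word, emit the stemmed token AND the original word
    // with a "$" marker appended.  In a real analyzer both tokens would
    // share the same position (position increment 0 on the second one).
    static List<String> tokensFor(String word) {
        List<String> tokens = new ArrayList<>();
        tokens.add(stem(word));
        tokens.add(word + "$");
        return tokens;
    }

    public static void main(String[] args) {
        // "progresser" is indexed as both "progress" (stemmed) and
        // "progresser$" (exact-match marker).
        System.out.println(tokensFor("progresser"));
    }
}
```

At query time, an "exact" search for progresser is rewritten to the marked form progresser$, which matches only documents containing that literal word.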


Re: Can we know "number-of-documents-that-will-be-flushed"?

2008-06-27 Thread Michael McCandless
IndexWriter.numRamDocs() should give you that.

Mike

java_is_everything <[EMAIL PROTECTED]> wrote:
>
> Hi all.
>
> Is there a way to know "number-of-documents-that-will-be-flushed", just
> before giving a call to flush() method?
> I am currently using Lucene 2.2.0 API.
>
> Looking forward to replies.
>
> Ajay Garg
> --
> View this message in context: 
> http://www.nabble.com/Can-we-know-%22number-of-documents-that-will-be-flushed%22--tp18147958p18147958.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>




Re: Preventing index corruption

2008-06-27 Thread John Byrne

Hi,

Rather than disabling the merging, have you considered putting the 
documents in a separate index, possibly in memory, and then deciding 
yourself when to merge them with the main index?


That way, you can change your mind and simply not merge the new 
documents if you want.


To do this, you can create a new RAMDirectory and add your documents to 
that; then, when you want to merge with the main index, open an 
IndexWriter on the main index and call 
IndexWriter.addIndexes(Directory[]). Of course, you don't have to use a 
RAMDirectory, but it makes sense if its only purpose is to 
temporarily hold the documents until you decide to commit them.
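The all-or-nothing behavior of this staging approach can be sketched independently of the Lucene API. The toy class below (plain Java, all names hypothetical) buffers a batch in memory and only makes it visible in the "main index" when you explicitly merge, mirroring the RAMDirectory-then-addIndexes flow:

```java
import java.util.ArrayList;
import java.util.List;

public class StagedBatch {
    private final List<String> mainIndex = new ArrayList<>();
    private final List<String> staged = new ArrayList<>();

    // Like adding a document to the temporary RAMDirectory.
    void stage(String doc) {
        staged.add(doc);
    }

    // Changed your mind: drop the staged batch, main index untouched.
    void discard() {
        staged.clear();
    }

    // Like IndexWriter.addIndexes(...): the whole batch becomes visible.
    void mergeIntoMain() {
        mainIndex.addAll(staged);
        staged.clear();
    }

    int mainSize() { return mainIndex.size(); }

    public static void main(String[] args) {
        StagedBatch b = new StagedBatch();
        b.stage("doc1");
        b.stage("doc2");
        b.mergeIntoMain();      // batch committed as a unit
        b.stage("doc3");
        b.discard();            // batch abandoned; main index unaffected
        System.out.println(b.mainSize());
    }
}
```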


I don't know what will happen if the computer crashes during the merge, 
but see http://lucene.apache.org/java/2_3_2/api/index.html


This is from the "IndexWriter.addIndexes(Directory[])" documentation:

"This method is transactional in how Exceptions are handled: it does not 
commit a new segments_N file until all indexes are added. This means if 
an Exception occurs (for example disk full), then either no indexes will 
have been added or they all will have been."


I hope that helps!

Regards,
-JB

Eran Sevi wrote:

Thanks Erick.
You might be joking, but one of our clients indeed had all his servers
destroyed in a flood. Of course in this rare case, a solution would be to
keep the backup on another site.

However I'm still confused about normal scenarios:

Let's say that in the middle of the batch I get an exception and want to
roll back. Can I do this?
I want to make sure that after a batch finishes (and only then), it is
written to disk (and not find out after a while, during a commit, that
something went wrong). Do I have to close the writer, or is flush enough? I
thought about raising mergeFactor and other parameters to high values (or
disabling them) so an automatic merge/commit will not happen, and then I can
manually decide when to commit the changes - the size of the batches is not
constant, so I can't determine it in advance.
I don't mind hurting index performance a bit by doing this manually, but
I can't afford to let the client think that the information is secure in
the index and then find out that it's not.

My index contains a few million docs and its size can reach about 30 GB
(we're saving a lot of fields and information for each document). Having a
backup index is an option I considered, but I wanted to avoid the overhead of
keeping the two synchronized (they might not be on the same server, which
exposes a lot of new problems, like network issues).

Thanks.

On Thu, Jun 26, 2008 at 5:42 PM, Erick Erickson <[EMAIL PROTECTED]>
wrote:

  

How big is your index? The simpleminded way would be to copy things around
as your batches come in and only switch to the *real* one after the
additions were verified.

You could also just maintain two indexes but only update one at a time. In
the 99.99% case where things went well, it would just be a matter of
continuing on.
Whenever "something bad happened", you could copy the good index over the
bad one and go at it again.

But to ask that the index be OK no matter what is asking a lot... There
are fires and floods and earthquakes to consider...

Best
Erick

On Thu, Jun 26, 2008 at 10:28 AM, Eran Sevi <[EMAIL PROTECTED]> wrote:



Hi,

I'm looking for the correct way to create an index given the following
restrictions:

1. The documents are received in batches of variable sizes (not more than
100 docs in a batch).
2. The batch insertion must be transactional: either the whole batch is
added to the index (exists physically on the disk), or the whole batch is
canceled/aborted and the index remains as before.
3. The index must remain valid at all times and shouldn't be corrupted even
if a power interruption occurs - *most important*
4. Index speed is less important than search speed.

How should I use a writer with all these restrictions? Can I do it without
having to close the writer after each batch (maybe flush is enough)?

Should I change the IndexWriter parameters such as mergeFactor,
RAMBufferSize, etc.?
I want to make sure that partial batches are not written to the disk (if
the computer crashes in the middle of the batch, I want to be able to work
with the index as it was before the crash).

If I'm working with a single writer, is it guaranteed that no matter what
happens the index can be opened and used (I don't mind losing docs, just
that the index won't be ruined).

Thanks and sorry about the long list of questions,
Eran.

  


  



  






Re: Preventing index corruption

2008-06-27 Thread Michael McCandless
If you open your IndexWriter with autoCommit=false, then no changes
will be visible in the index until you call commit() or close().
Added documents can still be flushed to disk as new segments when the
RAM buffer is full, but these segments are not referenced (by a new
segments_N file) until commit() or close() is called.  commit() is
only available in the trunk (to be released as 2.4 at some point)
version of Lucene.

Re safety on sudden power loss or machine crash: on the trunk only,
the index will not become corrupt due to such events as long as the
underlying IO system correctly implements fsync().  But on all current
releases of Lucene a sudden power loss or machine crash could in fact
corrupt the index.  See details here:

https://issues.apache.org/jira/browse/LUCENE-1044
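The flushed-but-not-yet-referenced behavior described above amounts to an atomic pointer swap, which can be sketched in plain Java (all names hypothetical; this is an analogy of the mechanism, not the Lucene internals):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CommitPoint {
    // Segments get flushed to disk as the RAM buffer fills up...
    private final List<String> flushedSegments = new ArrayList<>();
    // ...but readers only see the segments referenced by the last
    // committed "segments_N" snapshot.
    private List<String> segmentsN = Collections.emptyList();

    void flushSegment(String name) {
        flushedSegments.add(name);
    }

    // Atomically publish a snapshot referencing every flushed segment.
    // A crash before this point leaves the old snapshot intact and valid.
    void commit() {
        segmentsN = new ArrayList<>(flushedSegments);
    }

    // Readers open the last committed snapshot only.
    List<String> openReader() {
        return segmentsN;
    }

    public static void main(String[] args) {
        CommitPoint idx = new CommitPoint();
        idx.flushSegment("_0");
        idx.flushSegment("_1");
        System.out.println(idx.openReader());  // empty: nothing committed yet
        idx.commit();
        System.out.println(idx.openReader());  // both segments now visible
    }
}
```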

Mike

John Byrne <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Rather than disabling the merging, have you considered putting the documents
> in a separate index, possibly in memory, and then deciding when to merge
> them with the main index yourself?
>
> That way, you can change your mind and simply not merge the new documents if
> you want.
>
> To do this, you can create a new RAMDirectory, and add your documents to
> that, then when you want to merge with the main index, open an IndexWriter
> on the main index, and call IndexWriter.addIndexes(Directory[]). Of course,
> you don't have to use a RAMDirectory, but it would make sense, if its only
> purpose is to temporarily hold the documents until you decide to commit
> them.
>
> I don't know what will happen if the computer crashes during the merge, but
> see http://lucene.apache.org/java/2_3_2/api/index.html
>
> This is from the "IndexWriter.addIndexes(Directory[])" documentation:
>
> "This method is transactional in how Exceptions are handled: it does not
> commit a new segments_N file until all indexes are added. This means if an
> Exception occurs (for example disk full), then either no indexes will have
> been added or they all will have been."
>
> I hope that helps!
>
> Regards,
> -JB
>
> Eran Sevi wrote:
>>
>> Thanks Erick.
>> You might be joking, but one of our clients indeed had all his servers
>> destroyed in a flood. Of course in this rare case, a solution would be to
>> keep the backup on another site.
>>
>> However I'm still confused about normal scenarios:
>>
>> Let's say that in the middle of the batch I get an exception and want to
>> roll back. Can I do this?
>> I want to make sure that after a batch finishes (and only then), it is
>> written to disk (and not find out after a while during a commit that
>> something went wrong). Do I have to close the writer or is flush enough? I
>> thought about raising mergeFactor and other parameters to high values (or
>> disabling them) so an automatic merge/commit will not happen, and then I
>> can
>> manually decide when to commit the changes - the size of the batches is
>> not
>> constant so I can't determine in advance.
>> I don't mind hurting the index performance a bit by doing this manually,
>> but
>> I can't afford to let the client think that the information is secure in
>> the index and then find out that it's not.
>>
>> My index size contains a few million docs and it's size can reach about
>> 30G
>> (we're saving a lot of fields and information for each document). Having a
>> backup index is an option I considered but I wanted to avoid the overhead
>> of
>> keeping them synchronized (they might not be on the same server which
>> exposes a lot of new problems like network issues).
>>
>> Thanks.
>>
>> On Thu, Jun 26, 2008 at 5:42 PM, Erick Erickson <[EMAIL PROTECTED]>
>> wrote:
>>
>>
>>>
>>> How big is your index? The simpleminded way would be to copy things
>>> around
>>> as your batches come in and only switch to the *real* one after the
>>> additions
>>> were verified.
>>>
>>> You could also just maintain two indexes but only update one at a time.
>>> In
>>> the
>>> 99.99% case where things went well, it would just be a matter of
>>> continuing
>>> on.
>>> Whenever "something bad happened", you could copy the good index over the
>>> bad one and go at it again.
>>>
>>> But to ask that no matter what, the index is OK is asking a lot There
>>> are fires and floods and earthquakes to consider 
>>>
>>> Best
>>> Erick
>>>
>>> On Thu, Jun 26, 2008 at 10:28 AM, Eran Sevi <[EMAIL PROTECTED]> wrote:
>>>
>>>

 Hi,

 I'm looking for the correct way to create an index given the following
 restrictions:

 1. The documents are received in batches of variable sizes (not more than
 100 docs in a batch).
 2. The batch insertion must be transactional - either the whole batch is
 added to the index (exists physically on the disk), or the whole batch
 is
 canceled/aborted and the index remains as before.
 3. The index must remain valid at all times and shouldn't be corrupted

>>>
>>> even
>>>

 if a power interruption occurs - *most important*
 4. Index speed 

Lucene CFS naming significance

2008-06-27 Thread mick l

Folks,
Could anyone tell me the significance of the naming of the cfs files in the
Lucene index, e.g. _1pp.cfs, _2kk.cfs, etc.?
I have observed many differently named files being created temporarily while
the index is being built, but the same set of named files are in place once
the index has finished building.
Can I rely on the index files always having the same names once index
building is complete? This would greatly simplify my auto-FTP'ing of the
files up to a web server via an SSIS package.
Thanks
M
-- 
View this message in context: 
http://www.nabble.com/Lucene-CFS-naming-significance-tp18151693p18151693.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Can we know "number-of-documents-that-will-be-flushed"?

2008-06-27 Thread java_is_everything

Hi Mike. Thanks for the reply.

Just one doubt: will it work if the IndexWriter's directory is *not* a
RAMDirectory?

Looking forward to a reply.

Ajay Garg



Michael McCandless-2 wrote:
> 
> IndexWriter.numRamDocs() should give you that.
> 
> Mike
> 
> java_is_everything <[EMAIL PROTECTED]> wrote:
>>
>> Hi all.
>>
>> Is there a way to know "number-of-documents-that-will-be-flushed", just
>> before giving a call to flush() method?
>> I am currently using Lucene 2.2.0 API.
>>
>> Looking forward to replies.
>>
>> Ajay Garg
>> --
>> View this message in context:
>> http://www.nabble.com/Can-we-know-%22number-of-documents-that-will-be-flushed%22--tp18147958p18147958.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Can-we-know-%22number-of-documents-that-will-be-flushed%22--tp18147958p18152451.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Does Lucene Java 2.3.2 support parsing of Microsoft Office 2007 documents...

2008-06-27 Thread Kumar Gaurav
Dear all,

 

Currently I am using the Lucene Java 2.3.2 demo to parse Microsoft 2003 and 2007
docs and PDF files.

It is able to parse files with *.pdf, *.doc, *.xls, etc.

But it does not search in Microsoft 2007 files,

even though it shows *.docx and other Microsoft 2007 files being indexed.

Does Lucene Java support parsing of the extensions *.docx, *.pptx, *.mpp, i.e.
Microsoft Office 2007 documents?

If it does, what should be done in the Lucene 2.3.2 demo to run search queries
on files with the above-mentioned extensions?

 

Thanks

Kumar



Re: Lucene CFS naming significance

2008-06-27 Thread Lucas F. A. Teixeira

Folks,
Could anyone tell me the significance of the naming of the cfs files in the
Lucene index, e.g. _1pp.cfs, _2kk.cfs, etc.?

> Just names that won't repeat in the same folder.

I have observed many differently named files being created temporarily while
the index is being built, but the same set of named files are in place once
the index has finished building.

> That's not true, unless you always have the same data every time you 
build the index, and you build it every time from the beginning (not 
rewriting the docs).


Can I rely on the index files always having the same names once index
building is complete? 


> No. Those names will keep rotating as you index.

This would greatly simplify my auto-FTP'ing of the
files up to a web server via an SSIS package.

> Why don't you just FTP the whole index dir? The directory as a whole is 
actually the "index", not any individual file in it.
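For what it's worth, the names themselves carry no meaning: as far as I know, Lucene derives each segment name from an ever-increasing internal counter rendered in base 36 and prefixed with an underscore, so the names never repeat within one index. A quick illustration, assuming that scheme:

```java
public class SegmentNames {
    // Lucene segment file names look like "_" + counter in base 36
    // (e.g. counter 2221 renders as "1pp", hence files like _1pp.cfs).
    static String segmentName(int counter) {
        return "_" + Integer.toString(counter, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentName(2221));  // _1pp
        System.out.println(segmentName(2222));  // _1pq
    }
}
```

Because the counter only ever increases, rebuilding the index from scratch in a fresh directory is the only way to get the same names twice.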


[]s,


Lucas Frare A. Teixeira
[EMAIL PROTECTED] 
Tel: +55 11 3660.1622 - R3018



mick l escreveu:

Folks,
Could anyone tell me the significance of the naming of the cfs files in the
luceneindex e.g.  _1pp.cfs, _2kk.cfs etc.
I have observed many differently named files being created temporarily while
the index is being built, but the same set of named files are in place once
the index has finished building.
Can I rely on the index files always having the same names once index
buildng is complete? This would greatly simplify my auto ftp'ing of the
files up to a web server via an SSIS package.
Thanks
M
  


Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Grant Ingersoll


On Jun 27, 2008, at 12:01 AM, <[EMAIL PROTECTED]> wrote:



Hello Lucene Gurus,



I'm new to Lucene, so sorry if this question is basic or naïve.



I have a Document to which I want to add a Field named, say, "foo"  
that is tokenized, indexed and unstored.  I am using the  
"Field(String name, TokenStream tokenStream)" constructor to create  
it.  The TokenStream may take a fairly long time to return all its  
tokens.




Can you share some code here?  What's the reasoning behind using it  
(not saying it's wrong, just wondering what led you to it)?  Are you  
just loading it up from a file, string or something or do you have  
another reason?






Now for querying reasons I want to add another Field named, say,  
"bar", that is tokenized and indexed in exactly the same way as  
"foo".  I could just pass it the same TokenStream that I used to  
create "foo" but since it takes so long to return all its tokens, I  
was wondering if there is a way to say, create "bar" as a copy of  
"foo".  I looked thru the javadoc but didn't see anything.





By exactly the same, do you really mean exactly the same?  What's the  
point of that?  What are the "querying reasons"?


You may want to look at the TeeTokenFilter and the SinkTokenizer, but  
I guess I'd like to know more about what's going on before fully  
recommending anything.





Is this possible in Lucene or do I just have to bite the bullet  
build the new Field using the same TokenStream again?


--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194  
Oak Valley Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED] 

www.sungard.com/energy





--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Doubt on IndexWriter.close()

2008-06-27 Thread java_is_everything

Hi all.

The IndexWriter.close() API states:

"Flushes all changes to an index and closes all associated files."

What does "closes all associated files" mean, given that we are apparently
still able to call addDocument() even after calling IndexWriter.close()?


Looking forward to a reply.

Ajay garg
-- 
View this message in context: 
http://www.nabble.com/Doubt-on-IndexWriter.close%28%29-tp18153935p18153935.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Does Lucene Java 2.3.2 support parsing of Microsoft Office 2007 documents...

2008-06-27 Thread Erick Erickson
Lucene doesn't actually support any of the document types. What happens
is that some program is used to parse the files into an indexable stream
and that stream is indexed. That used to be POI in the old days.

I confess I haven't used the latest demo, but I assume that under the
covers there's some program installed that Microsoft documents are
pushed through to get indexable tokens. So the real question is
whether that program handles the documents you're interested in.

I know this isn't very helpful, but you'll have to dig into this in some
detail if you really want to index Microsoft documents. If you don't
need to, then you don't need to waste time on this issue.

Best
Erick

On Fri, Jun 27, 2008 at 7:08 AM, Kumar Gaurav <[EMAIL PROTECTED]>
wrote:

> Dear all,
>
>
>
> Currently I am using the Lucene Java 2.3.2 demo to parse Microsoft 2003 and
> 2007
> docs and PDF files.
>
> It is able to parse files with *.pdf, *.doc, *.xls etc.
>
> But it does not search in files of Microsoft 2007 docs.
>
> It shows indexing *.docx and other Microsoft 2007 doc files.
>
>
>
> Does Lucene Java support parsing of the extensions *.docx, *.pptx, *.mpp, i.e.
> Microsoft Office 2007 documents?
>
> If it supports, what should be done in Lucene demo 2.3.2 to search queries
> on file with above mentioned extensions?
>
>
>
> Thanks
>
> Kumar
>
>


Re: Lucene CFS naming significance

2008-06-27 Thread mick l

 > That's not true, unless you always have the same data every time you 
build the index, and you build it every time from the beginning (not 
rewriting the docs)

>> Lucas, the same tables are being converted to an index each time, but
there will just be extra rows.
I do rebuild the index each time from the beginning.

> Why don't you just FTP the whole index dir? The directory as a whole is 
actually the "index", not any individual file in it.

>> I am just having problems getting SSIS to loop and FTP the files, and I
>> need this out by Monday.

Thanks for the info anyway, mate. I knew this was going too easy...




Lucas F. A. Teixeira wrote:
> 
> Folks,
> Could anyone tell me the significance of the naming of the cfs files in
> the
> luceneindex e.g.  _1pp.cfs, _2kk.cfs etc.
> 
>  > Just names that won`t repeat in the same folder.
> 
> I have observed many differently named files being created temporarily
> while
> the index is being built, but the same set of named files are in place
> once
> the index has finished building.
> 
>  > That not true. Unless you always have the same "data" every time you 
> build the index, and if you build it every time from the beggining (not 
> rewriting the docs)
> 
> Can I rely on the index files always having the same names once index
> buildng is complete? 
> 
>  > No. That names will be rotating through your indexing.
> 
> This would greatly simplify my auto ftp'ing of the
> files up to a web server via an SSIS package.
> 
>  > Why don`t you just ftp the hole index dir? The directory is actually 
> the "index" by itself, and not its content.
> 
> []s,
> 
> 
> Lucas Frare A. Teixeira
> [EMAIL PROTECTED] 
> Tel: +55 11 3660.1622 - R3018
> 
> 
> 
> mick l escreveu:
>> Folks,
>> Could anyone tell me the significance of the naming of the cfs files in
>> the
>> luceneindex e.g.  _1pp.cfs, _2kk.cfs etc.
>> I have observed many differently named files being created temporarily
>> while
>> the index is being built, but the same set of named files are in place
>> once
>> the index has finished building.
>> Can I rely on the index files always having the same names once index
>> buildng is complete? This would greatly simplify my auto ftp'ing of the
>> files up to a web server via an SSIS package.
>> Thanks
>> M
>>   
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Lucene-CFS-naming-significance-tp18151693p18155342.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Problem with search an exact word and stemming

2008-06-27 Thread Matthew Hall
Also, please note that I thought about it and realized that I misspoke 
when I sent out my original suggestion.  You don't want an untokenized 
field in your case; you want an unstemmed one instead.


This will allow you to get the functionality you are looking for... at 
least I believe so ^^


Anyhow, best of luck!

Matt

renou oki wrote:

 Thanks for the reply.

I will try to add another data field.
I thought about this solution but was not very sure; I had hoped there was an
easier way to do it...


best regards
Renou

2008/6/26 Matthew Hall <[EMAIL PROTECTED]>:

  

You could also add another data field to the index, with an untokenized
version of your data, and then use a multifield query to go against both the
stemmed and exact match parts of your search at the same time.

This is a technique I've used quite often on my project with various
different requirements for the second field.  Mind you it makes the indexes
bigger, but unless your dataset is large it's not really a huge problem.

Matt

Erick Erickson wrote:



The way I've solved this is to index the stemmed *and* a special
token at the same position (see Synonym Analyzer). From your
example, say you're indexing progresser. You'd go ahead and index the
stemmed version , "progress", AND you'd also index "progresser$"
at the same offset. Now, when you want exact matches, search for
the token with the $ at the end.

This does make your index a bit larger, but not as much as you'd expect.

Best
Erick

On Wed, Jun 25, 2008 at 4:21 AM, renou oki <[EMAIL PROTECTED]> wrote:



  

Hello,

I have a stemmed index, but I want to search for the exact form of a word.
I use the French Analyzer, so for instance "progression" and "progresser" are
indexed under the linguistic root "progress".
But if I search for the word "progress" (and only this word), I get too
many hits (because of "progression", "progresser"...).
The field is indexed and tokenized, but not stored...

Is there a way to do this, I mean to search for an exact word in a stemmed
index?
I suppose that I have to use the same analyzer for indexing and
searching.


I tried with a PhraseQuery, with quotes...

Ps : I use lucene 1.9.1

Thanks
Renald






  

--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012








  


--
Matthew Hall
Software Engineer
Mouse Genome Informatics
[EMAIL PROTECTED]
(207) 288-6012






Re: Can we know "number-of-documents-that-will-be-flushed"?

2008-06-27 Thread Michael McCandless
Yes, it will.  The javadoc for that method is rather confusing; I'll
correct it.

Mike

On Fri, Jun 27, 2008 at 6:44 AM, java_is_everything
<[EMAIL PROTECTED]> wrote:
>
> Hi Mike. Thanks for the reply.
>
> Just one doubt. Will it work if the indexwriter directory is "not" a
> RAMDirectory?
>
> Looking forward to a reply.
>
> Ajay Garg
>
>
>
> Michael McCandless-2 wrote:
>>
>> IndexWriter.numRamDocs() should give you that.
>>
>> Mike
>>
>> java_is_everything <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi all.
>>>
>>> Is there a way to know "number-of-documents-that-will-be-flushed", just
>>> before giving a call to flush() method?
>>> I am currently using Lucene 2.2.0 API.
>>>
>>> Looking forward to replies.
>>>
>>> Ajay Garg
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Can-we-know-%22number-of-documents-that-will-be-flushed%22--tp18147958p18147958.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Can-we-know-%22number-of-documents-that-will-be-flushed%22--tp18147958p18152451.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>
>




RE: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Bill.Chesky
Grant,

Thanks for the reply.  What we're trying to do is kind of esoteric and hard to 
explain without going into a lot of gory details so I was trying to keep it 
simple.  But I'll try to summarize.

We're trying to index entities in a relational database.  One of the entities 
we're trying to index is something called a Property.  Think of a Property kind 
of like the java.util.Properties class, i.e. a name/value pair. So some 
examples of Properties might be:

State=California
City=Sacramento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Etc., etc.

(Note: this isn't the type of data we're storing... just trying to keep it 
simple.)

Imagine that the above list represents the set of Properties that specify 
the address for a single person, Joe Smith.  Each Property in the set will be 
indexed by the values on the right-hand side of all the other name/value pairs 
in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and 
Smith.

There are two types of queries that we want to do.  
1) retrieve every Property matching the specified search terms, regardless of 
its left-hand side.  For this we want to create a field in EVERY Document 
called "keywords" and index it by the right-hand side values as described above.
2) retrieve every Property with a given left-hand side that matches the 
specified search terms.  For example, find all the 'City' Properties that match 
the term 'South'.  For this we want to create a field with the name of the 
left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents 
that correspond to a Property with that left-hand side.  Again this field will 
be indexed by the right-hand side values as described above.

So a couple of examples from the above list might look something like:

Document: State=California
  Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
  Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

Document: City=Sacramento
  Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
  Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

Now if I'm interested in all the Properties that match the word "South", I 
search the index on the "keywords" field for the term "South".  This will 
return both documents above.  

But if I'm only interested in any 'City' Properties that match the term 'South', 
I search the index on the "City" field for the term "South".  This will only 
return the 'City=Sacramento' document above because it's the only Document of 
the two that even has a 'City' field in it.

But in any case, the 'State' field and the 'City' field are indexed exactly the 
same way as the 'keywords' field, which is why I was wondering if there was a 
way to just create these fields as copies of the 'keywords' field.

Here is a code sample where I'm creating the index.  We're using Hibernate 
search to search the indexes, thus the "id" and "_hibernate_class" fields.

Query q = em.createQuery("select p from Property p");

List<Property> properties = q.getResultList();

for (Property p : properties)
{
    // Indexing property.
    Document doc = new Document();
    doc.add(new Field("id",
                      Integer.toString(p.getId()),
                      Field.Store.YES,
                      Field.Index.UN_TOKENIZED));
    doc.add(new Field("_hibernate_class",
                      Property.class.getCanonicalName(),
                      Field.Store.YES,
                      Field.Index.UN_TOKENIZED));
    TokenStream keywordStream = new PropertyTokenStream(p);
    doc.add(new Field("keywords", keywordStream));
    // Here is where I would like to add the second field as a copy of
    // the "keywords" field just created above, instead of building a
    // second PropertyTokenStream.  Note: the call
    // p.getCharacteristic().getName() gets the name of the left-hand
    // side of the Property as described above.
    TokenStream characteristicStream = new PropertyTokenStream(p);
    doc.add(new Field(p.getCharacteristic().getName(), characteristicStream));
    propertyIndexWriter.addDocument(doc);
    keywordStream.close();
    characteristicStream.close();
}

Hope that clears it up.  

BTW, in case this seems like a strange way to index things, I will also add 
that we are doing it this way in order to impose a hierarchical structure on 
Properties.  So my example above should really look like this:

State=California
City=Sacremento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Use your imagination to visualize what the tree might look like with millions 
of people's addresses.  Now imagine trying to tokenize the Document 
corresponding to "State=California".  Each path through the tree from root (State) 
to leaf (Name) represents a set of Properties that is used to index the 
"keywords" field in the "State=California" document.  In other words, it takes a 
long time

Sorting issues

2008-06-27 Thread Robert . Hastings
I just implemented a sorting feature in our application where the user can 
change the sort on a query and re-execute the search.  It works fine on 
text fields where most of the documents have different field values. 
However, on category fields (that is, fields with only four distinct 
values, where every document falls into exactly one category), the results 
do not come back in sorted order.

All of the sort fields are untokenized and stored, and each document has 
only one value for the sorted field.

Are there any known issues (Lucene 2.3.0)?  How can I go about debugging 
this? I have tried Luke, also I have the Lucene source.

Bob Hastings
Ancept Inc.

Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Matthew Hall
I'm not sure if this is helpful, but I do something VERY similar to this 
in my project.


So, for the example you are citing I would design my index as follows:

db_key, data, data_type

Where the data_type is some sort of value representing the thing that's 
on the left hand side of your property relationship there.


So, then in order to satisfy your search, the queries become quite simple:

The search for everything simply searches against the data field in this 
index, whereas the search for a specific data_type + searchterm becomes a 
simple boolean query that has a MUST clause for the data_type value.
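A minimal sketch of what those two searches could look like in Lucene's query-parser syntax, assuming Matthew's `data`/`data_type` field names and a lowercasing analyzer (the helper class and method names here are made up for illustration):

```java
public class TypedQueries {
    // "Search everything": one clause against the shared data field.
    static String anyType(String term) {
        return "data:" + term.toLowerCase();
    }

    // "Search one type": two MUST clauses ('+' in query-parser syntax),
    // one pinning data_type and one carrying the search term.
    static String oneType(String dataType, String term) {
        return "+data_type:" + dataType.toLowerCase()
             + " +data:" + term.toLowerCase();
    }
}
```

For example, `oneType("City", "South")` yields `+data_type:city +data:south`, which the query parser turns into the boolean query described above.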


As an even BETTER bonus, this will then mean that all of your searchable 
values will now have relevance to each other at scoring time, which is 
quite useful in the long run.


Hope this helps you out,

Matt

[EMAIL PROTECTED] wrote:

Grant,

Thanks for the reply.  What we're trying to do is kind of esoteric and hard to 
explain without going into a lot of gory details so I was trying to keep it 
simple.  But I'll try to summarize.

We're trying to index entities in a relational database.  One of the entities 
we're trying to index is something called a Property.  Think of a Property kind 
of like the java.util.Properties class, i.e. a name/value pair. So some 
examples of Properties might be:

State=California
City=Sacremento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Etc., etc.

(Note: this isn't the type of data we're storing... just trying to keep it 
simple.)

Imagine that the above list represents the set of Properties that specify 
the address for a single person, Joe Smith.  Each Property in the set will be 
indexed by the values on the right-hand side of all the other name/value pairs 
in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe and 
Smith.

There are two types of queries that we want to do.  
1) retrieve every Property matching the specified search terms, regardless of its left-hand side.  For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.

2) retrieve every Property with a given left-hand side that matches the 
specified search terms.  For example, find all the 'City' Properties that match 
the term 'South'.  For this we want to create a field with the name of the 
left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents 
that correspond to a Property with that left-hand side.  Again this field will 
be indexed by the right-hand side values as described above.

So a couple of examples from the above list might look something like:

Document: State=California
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

Document: City=Sacremento
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South".  This will return both documents above.  


But if I'm only interested in any 'City' Properties that match the term 'South' I search the index 
on the "City" field for the term "South".  This will only return the 
'City=Sacremento' document above because it's the only Document of the two that even has a 'City' 
field in it.

But in any case, the 'State' field and the 'City' field are indexed exactly the 
same way as the 'keywords' field.  Which is why I was wondering if there was a 
way to just create these fields as copies of the 'keywords' field.

Here is a code sample where I'm creating the index.  We're using Hibernate search to search the 
indexes, thus the "id" and "_hibernate_class" fields.

Query q = em.createQuery("select p from Property p");

List properties = q.getResultList();

for (Property p : properties)

{
// Indexing property.
Document doc = new Document();
doc.add(new Field("id", 
   Integer.toString(p.getId()), 
   Field.Store.YES, 
   Field.Index.UN_TOKENIZED));
doc.add(new Field("_hibernate_class", 
  Property.class.getCanonicalName(), 
  Field.Store.YES, 
  Field.Index.UN_TOKENIZED));

TokenStream tokenStream = new PropertyTokenStream(p);
doc.add(new Field("keywords", tokenStream));
propertyIndexWriter.addDocument(doc);
tokenStream.close();
// Here is where I would like to add the second field that is a copy

// of the "keywords" field just created above.  Note: the call
// p.getCharacteristic().getName() is getting the name of the 
// left-hand side of the Property as described above.

TokenStream tokenStream = new PropertyTokenStream(p);
doc.add(new Field(p.getCharacteristic().getName(), tokenStream));
propertyInd

Re: Searching any part of a string

2008-06-27 Thread Mark Ferguson
Hi Erick,

Thanks for the suggestions. I've used indexed n-grams before to implement
spell-checking; I think in this case I may take a look at WildcardTermEnum
and RegexTermEnum. It seems like a good solution because I am doing my own
results ordering so Lucene's scoring is irrelevant in this case. I wasn't
aware of these classes so thanks for mentioning them!

Best,

Mark
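
For reference, the suffix-indexing idea from the original post below (index 'mark' as mark, ark, rk, and k, so a PrefixFilter can match any substring) can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

public class Suffixes {
    // Every suffix of the username becomes its own indexed term, so a
    // prefix search on any suffix hits an arbitrary substring of the name.
    static List<String> of(String username) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < username.length(); i++) {
            out.add(username.substring(i));
        }
        return out;
    }
}
```

Each returned string would be indexed as a separate term in the username field; the index grows roughly with the square of the name length, which stays small for short usernames.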


On Wed, Jun 25, 2008 at 12:25 PM, Erick Erickson <[EMAIL PROTECTED]>
wrote:

> Warning: I don't understand ngrams at all, so you should
> read this as a plea for those who do to tell me I'm off base .
>
>
> But I wonder if indexing as n-grams would be a way to
> cope with this issue that lots of people have. *assuming*
> you are thinking about single terms, then it seems that
> "smith" would be tokenized as sm, mi, it, th. Then
> a wildcard search for "mi it" would hit (as a phrase
> query or a SpanQuery with slop of 0). It seems like there
> are several issues to work out here, especially including
> multiple terms, matching mixtures of wildcards and
> non-wildcards, etc.
>
> But it seems do-able
>
>
> Another approach is to use WildcardTermEnum and/or
> RegexTermEnum to build up a filter and use the filter as
> part of the query. What you lose with this approach is
> that the filter (and wildcards) then don't contribute to
> scoring. But this isn't a huge price to pay...
>
> Best
> Erick
>
> On Wed, Jun 25, 2008 at 1:47 PM, Mark Ferguson <[EMAIL PROTECTED]>
> wrote:
>
> > Hello,
> >
> > I am currently keeping an index of all our client's usernames. The search
> > functionality is implemented using a PrefixFilter. However, we would like
> > to
> > expand the functionality to be able to search any part of a user's name,
> > rather than requiring that it begin with the query string. So for
> example,
> > the search term 'mit' would return the username 'smith'.
> >
> > I am hesitant to use a WildcardQuery starting with an asterisk because
> I've
> > read about why this is a bad idea. I am looking for suggestions on the
> best
> > way to implement this.
> >
> > The idea I've come up with is to index each part of the username; so for
> > example, if the username is 'mark', you would index mark, ark, rk, and k.
> > Then you could still use the PrefixFilter. I'm not overly concerned about
> > how this would enlarge the index because usernames tend to be fairly
> short.
> >
> > I am very much open to other suggestions however. Does anyone have any
> > opinions or ideas that they can share?
> >
> > Thanks very much.
> >
> > Mark
> >
>


Re: Doubt on IndexWriter.close()

2008-06-27 Thread Michael McCandless
Which version of Lucene are you using?  Recent versions do not allow
addDocument to be called after close.

Mike

java_is_everything <[EMAIL PROTECTED]> wrote:
>
> Hi all.
>
> IndexWriter.close() API states that ::
>
> "Flushes all changes to an index and closes all associated files.".
>
> What does "closes all associated files" mean, since we are apparently able
> to still addDocument() even after calling IndexWriter.close() ?
>
>
> Looking forward to a reply.
>
> Ajay garg
> --
> View this message in context: 
> http://www.nabble.com/Doubt-on-IndexWriter.close%28%29-tp18153935p18153935.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




Re: Sorting issues

2008-06-27 Thread Erick Erickson
That's surprising. Could you post a brief example of your
index and search code?

It sounds like you're saying
docs 1, 2, 3 all have category aaa
docs 4, 5, 6 all have category bbb
docs 7, 8, 9 all have category ccc

But if you search for category:bbb
you don't get docs 4, 5, and 6

Is this a fair statement of the issue?

Best
Erick


On Fri, Jun 27, 2008 at 11:44 AM, <[EMAIL PROTECTED]> wrote:

> I just implemented a sorting feature on our application where the user can
> change the sort on a query and reexecute the search.  It works fine on
> text fields where most of the documents have different field values.
> However, on fields that are categories, that is, there are only four
> distinct values for the category field and
> all of the documents fit into a distinct category, the results are not in
> the sorted values.
>
> All of the sort fields are untokenized and stored. and each document has
> only one value for the sorted field.
>
> Are there any known issues (Lucene 2.3.0)?  How can I go about debugging
> this? I have tried Luke, also I have the Lucene source.
>
> Bob Hastings
> Ancept Inc.


Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Erick Erickson
How sure are you that the TokenStream is that expensive? But
assuming you are AND that the values for these properties
aren't that big, the simple-minded approach that comes to my
simple mind is to just iterate through the stream yourself, assemble
a string from the returned tokens and pass the string to the two add
calls.

This might be worth it if your tokenizer is going to the DB or something
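
A rough sketch of the caching half of that idea (the Lucene-specific stream iteration is omitted, and `join` is a made-up helper): collect the expensive tokenizer's terms once, then pass the same joined text to both `doc.add(new Field(...))` calls. One caveat: whitespace-joining assumes the field's analyzer will re-split on whitespace, and token positions may not be preserved exactly.

```java
import java.util.List;

public class TokenCache {
    // Space-join the cached terms so one expensive tokenization pass
    // can feed two Field constructors with identical text.
    static String join(List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String t : terms) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```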

Best
Erick


On Fri, Jun 27, 2008 at 10:56 AM, <[EMAIL PROTECTED]> wrote:

> Grant,
>
> Thanks for the reply.  What we're trying to do is kind of esoteric and hard
> to explain without going into a lot of gory details so I was trying to keep
> it simple.  But I'll try to summarize.
>
> We're trying to index entities in a relational database.  One of the
> entities we're trying to index is something called a Property.  Think of a
> Property kind of like the java.util.Properties class, i.e. a name/value
> pair. So some examples of Properties might be:
>
> State=California
> City=Sacremento
> ZipCode=94203
> StreetName=South Main
> StreetNumber=1234
> Name=Joe Smith
>
> Etc., etc.
>
> (Note: this isn't the type of data we're storing... just trying to keep it
> simple.)
>
> Imagine that the above list represents the set of Properties that
> specify the address for a single person, Joe Smith.  Each Property in the
> set will be indexed by the values on the right-hand side of all the other
> name/value pairs in the set, i.e.: California, Sacremento, 94203, South,
> Main, 1234, Joe and Smith.
>
> There are two types of queries that we want to do.
> 1) retrieve every Property matching the specified search terms, regardless
> of its left-hand side.  For this we want to create a field in EVERY Document
> called "keywords" and index it by the right-hand side values as described
> above.
> 2) retrieve every Property with a given left-hand side that matches the
> specified search terms.  For example, find all the 'City' Properties that
> match the term 'South'.  For this we want to create a field with the name of
> the left-hand side (e.g. State, City, ZipCode, etc.) but only in those
> Documents that correspond to a Property with that left-hand side.  Again
> this field will be indexed by the right-hand side values as described above.
>
> So a couple of examples from the above list might look something like:
>
> Document: State=California
>  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>  Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.
>
> Document: City=Sacremento
>  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>  Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.
>
> Now if I'm interested in all the Properties that match the word "South", I
> search the index on the "keywords" field for the term "South".  This will
> return both documents above.
>
> But if I'm only interested in any 'City' Properties that match the term
> 'South' I search the index on the "City" field for the term "South".  This
> will only return the 'City=Sacremento' document above because it's the only
> Document of the two that even has a 'City' field in it.
>
> But in any case, the 'State' field and the 'City' field are indexed exactly
> the same way as the 'keywords' field.  Which is why I was wondering if there
> was a way to just create these fields as copies of the 'keywords' field.
>
> Here is a code sample where I'm creating the index.  We're using Hibernate
> search to search the indexes, thus the "id" and "_hibernate_class" fields.
>
> Query q = em.createQuery("select p from Property p");
>
> List properties = q.getResultList();
>
> for (Property p : properties)
> {
>// Indexing property.
>Document doc = new Document();
>doc.add(new Field("id",
>   Integer.toString(p.getId()),
>   Field.Store.YES,
>   Field.Index.UN_TOKENIZED));
>doc.add(new Field("_hibernate_class",
>  Property.class.getCanonicalName(),
>  Field.Store.YES,
>  Field.Index.UN_TOKENIZED));
>TokenStream tokenStream = new PropertyTokenStream(p);
>doc.add(new Field("keywords", tokenStream));
>propertyIndexWriter.addDocument(doc);
>tokenStream.close();
>// Here is where I would like to add the second field that is a copy
>// of the "keywords" field just created above.  Note: the call
>// p.getCharacteristic().getName() is getting the name of the
>// left-hand side of the Property as described above.
>TokenStream tokenStream = new PropertyTokenStream(p);
>doc.add(new Field(p.getCharacteristic().getName(), tokenStream));
>propertyIndexWriter.addDocument(doc);
>tokenStream.close();
> }
>
> Hope that clears it up.
>
> BTW, in case this seems like a strange way to index things, I will also add
> that we are doing it this way in order to impose a hierarchical structure on
> Properties.  So my example above should r

Read index into RAM?

2008-06-27 Thread Darren Govoni
Hi,
   Is it possible to read a disk-based index into RAM (entirely) and
have all searches operate on it there? I saw some RAMDirectory examples,
but it didn't look like it will transfer a disk index into RAM.

thanks
D





Re: Read index into RAM?

2008-06-27 Thread Erick Erickson
I posted this reply to this question the last time you posted it.


From the docs...

RAMDirectory

public *RAMDirectory*(Directory dir)
 throws IOException


Creates a new RAMDirectory instance from a different Directory 
implementation.  This can be used to load a disk-based index 
into memory.

Seems like exactly what you're asking for...
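
A minimal sketch of that constructor in use (Lucene 2.x-era API; the index path is a placeholder):

```java
// Copy an existing on-disk index entirely into memory, then search there.
Directory disk = FSDirectory.getDirectory("/path/to/index");  // placeholder path
Directory ram  = new RAMDirectory(disk);                      // loads index into RAM
IndexSearcher searcher = new IndexSearcher(ram);
// ... run searches against 'searcher'; the on-disk copy is left untouched.
```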

Best
Erick

On Fri, Jun 27, 2008 at 1:52 PM, Darren Govoni <[EMAIL PROTECTED]> wrote:

> Hi,
>   Is it possible to read a disk-based index into RAM (entirely) and
> have all searches operate on it there? I saw some RAMDirectory examples,
> but it didn't look like it will transfer a disk index into RAM.
>
> thanks
> D
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Read index into RAM?

2008-06-27 Thread Anshum
Hi Darren,
Assuming that you use a *nix/Linux machine, the best way to work that out 
would be to have your index moved to a tmpfs. Steps to have that done:
1. Mount a tmpfs  (It uses RAM by default)
2. Copy your index to your new mount point
3. Open your index readers pointing to the new directory location (on tmpfs)

This does the job pretty neat.

--
Anshum Gupta
Naukri Labs




On Fri, Jun 27, 2008 at 11:22 PM, Darren Govoni <[EMAIL PROTECTED]> wrote:

> Hi,
>   Is it possible to read a disk-based index into RAM (entirely) and
> have all searches operate on it there? I saw some RAMDirectory examples,
> but it didn't look like it will transfer a disk index into RAM.
>
> thanks
> D
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
--
The facts expressed here belong to everybody, the opinions to me.
The distinction is yours to draw


RE: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Bill.Chesky
Matthew,

Thanks for the reply.  This looks very interesting.  If I'm understanding 
correctly, your db_key, data, and data_type are Fields within the Document, 
correct?  So is this how you envision it?

Document: State=California
   Field: 'db_key'='1395' (primary key into relational table, correct?)
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
   Field: 'data_type' indexed by 'State'

Document: City=Sacremento
   Field: 'db_key'='2405' 
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
   Field: 'data_type' indexed by 'City'

Then my query for all Properties would be:

+data:South

My query for only 'City' Properties would be:

+data:South +data_type:City

Is that right?

I think that would work.  Very nice.  Thank you very much.
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 


-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 11:49 AM
To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

I'm not sure if this is helpful, but I do something VERY similar to this 
in my project.

So, for the example you are citing I would design my index as follows:

db_key, data, data_type

Where the data_type is some sort of value representing the thing that's 
on the left hand side of your property relationship there.

So, then in order to satisfy your search, the queries become quite simple:

The search for everything simply searches against the data field in this 
index, whereas the search for a specific data_type + searchterm becomes a 
simple boolean query that has a MUST clause for the data_type value.

As an even BETTER bonus, this will then mean that all of your searchable 
values will now have relevance to each other at scoring time, which is 
quite useful in the long run.

Hope this helps you out,

Matt

[EMAIL PROTECTED] wrote:
> Grant,
>
> Thanks for the reply.  What we're trying to do is kind of esoteric and hard 
> to explain without going into a lot of gory details so I was trying to keep 
> it simple.  But I'll try to summarize.
>
> We're trying to index entities in a relational database.  One of the entities 
> we're trying to index is something called a Property.  Think of a Property 
> kind of like the java.util.Properties class, i.e. a name/value pair. So some 
> examples of Properties might be:
>
> State=California
> City=Sacremento
> ZipCode=94203
> StreetName=South Main
> StreetNumber=1234
> Name=Joe Smith
>
> Etc., etc.
>
> (Note: this isn't the type of data we're storing... just trying to keep it 
> simple.)
>
> Imagine that the above list represents the the set of Properties that specify 
> the address for a single person, Joe Smith.  Each Property in the set will be 
> indexed by the values on the right-hand side of all the other name/value 
> pairs in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe 
> and Smith.
>
> There are two types of queries that we want to do.  
> 1) retrieve every Property matching the specified search terms, regardless of 
> its left-hand side.  For this we want to create a field in EVERY Document 
> called "keywords" and index it by the right-hand side values as described 
> above.
> 2) retrieve every Property with a given left-hand side that matches the 
> specified search terms.  For example, find all the 'City' Properties that 
> match the term 'South'.  For this we want to create a field with the name of 
> the left-hand side (e.g. State, City, ZipCode, etc.) but only in those 
> Documents that correspond to a Property with that left-hand side.  Again this 
> field will be indexed by the right-hand side values as described above.
>
> So a couple of examples from the above list might look something like:
>
> Document: State=California
>   Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>   Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.
>
> Document: City=Sacremento
>   Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>   Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.
>
> Now if I'm interested in all the Properties that match the word "South", I 
> search the index on the "keywords" field for the term "South".  This will 
> return both documents above.  
>
> But if I'm only interested in any 'City' Properties that match the term 
> 'South' I search the index on the "City" field for the term "South".  This 
> will only return the 'City=Sacremento' document above because it's the only 
> Document of the two that even has a 'City' field in it.
>
> But in any case, the 'State' field and the 'City' field are indexed exactly 
> the same way as the 'keywords' field.  Which is why I was wondering if there 
> was a way to just create these fields as copies of the 'keywo

RE: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Bill.Chesky
Erick,

Thanks for the response.  I'm very sure the TokenStream is expensive.  Not 
always, but in some cases, yes, it can take a long time to complete.  However, I 
do like your approach.  I'm going to try a different approach suggested by 
another poster first, but this is very interesting.

Thank you!
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 


-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 1:37 PM
To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

How sure are you that the TokenStream is that expensive? But
assuming you are AND that the values for these properties
aren't that big, the simple-minded approach that comes to my
simple mind is to just iterate through the stream yourself, assemble
a string from the returned tokens and pass the string to the two add
calls.

This might be worth it if your tokenizer is going to the DB or something

Best
Erick


On Fri, Jun 27, 2008 at 10:56 AM, <[EMAIL PROTECTED]> wrote:

> Grant,
>
> Thanks for the reply.  What we're trying to do is kind of esoteric and hard
> to explain without going into a lot of gory details so I was trying to keep
> it simple.  But I'll try to summarize.
>
> We're trying to index entities in a relational database.  One of the
> entities we're trying to index is something called a Property.  Think of a
> Property kind of like the java.util.Properties class, i.e. a name/value
> pair. So some examples of Properties might be:
>
> State=California
> City=Sacremento
> ZipCode=94203
> StreetName=South Main
> StreetNumber=1234
> Name=Joe Smith
>
> Etc., etc.
>
> (Note: this isn't the type of data we're storing... just trying to keep it
> simple.)
>
> Imagine that the above list represents the the set of Properties that
> specify the address for a single person, Joe Smith.  Each Property in the
> set will be indexed by the values on the right-hand side of all the other
> name/value pairs in the set, i.e.: California, Sacremento, 94203, South,
> Main, 1234, Joe and Smith.
>
> There are two types of queries that we want to do.
> 1) retrieve every Property matching the specified search terms, regardless
> of its left-hand side.  For this we want to create a field in EVERY Document
> called "keywords" and index it by the right-hand side values as described
> above.
> 2) retrieve every Property with a given left-hand side that matches the
> specified search terms.  For example, find all the 'City' Properties that
> match the term 'South'.  For this we want to create a field with the name of
> the left-hand side (e.g. State, City, ZipCode, etc.) but only in those
> Documents that correspond to a Property with that left-hand side.  Again
> this field will be indexed by the right-hand side values as described above.
>
> So a couple of examples from the above list might look something like:
>
> Document: State=California
>  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>  Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.
>
> Document: City=Sacremento
>  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
>  Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.
>
> Now if I'm interested in all the Properties that match the word "South", I
> search the index on the "keywords" field for the term "South".  This will
> return both documents above.
>
> But if I'm only interested in any 'City' Properties that match the term
> 'South' I search the index on the "City" field for the term "South".  This
> will only return the 'City=Sacremento' document above because it's the only
> Document of the two that even has a 'City' field in it.
>
> But in any case, the 'State' field and the 'City' field are indexed exactly
> the same way as the 'keywords' field.  Which is why I was wondering if there
> was a way to just create these fields as copies of the 'keywords' field.
>
> Here is a code sample where I'm creating the index.  We're using Hibernate
> search to search the indexes, thus the "id" and "_hibernate_class" fields.
>
> Query q = em.createQuery("select p from Property p");
>
> List properties = q.getResultList();
>
> for (Property p : properties)
> {
>// Indexing property.
>Document doc = new Document();
>doc.add(new Field("id",
>   Integer.toString(p.getId()),
>   Field.Store.YES,
>   Field.Index.UN_TOKENIZED));
>doc.add(new Field("_hibernate_class",
>  Property.class.getCanonicalName(),
>  Field.Store.YES,
>  Field.Index.UN_TOKENIZED));
>TokenStream tokenStream = new PropertyTokenStream(p);
>doc.add(new Field("keywords", tokenStream));
>propertyIndexWriter.addDocument(doc);
>tokenStream

Re: Sorting issues

2008-06-27 Thread Robert . Hastings
Actually, I do a global search and the order comes out: 1, 2, 8, 3, 5, 6, 
7, 8, 4, 9.  I'm having trouble finding in the code where the sort actually 
gets applied.  Can you help me out there?

Bob




"Erick Erickson" <[EMAIL PROTECTED]> 
06/27/2008 12:19 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Sorting issues






That's surprising. Could you post a brief example of your
index and search code?

It sounds like you're saying
docs 1, 2, 3 all have category aaa
docs 4, 5, 6 all have category bbb
docs 7, 8, 9 all have category ccc

But if you search for category:bbb
you don't get docs 4, 5, and 6

Is this a fair statement of the issue?

Best
Erick


On Fri, Jun 27, 2008 at 11:44 AM, <[EMAIL PROTECTED]> wrote:

> I just implemented a sorting feature on our application where the user 
can
> change the sort on a query and reexecute the search.  It works fine on
> text fields where most of the documents have different field values.
> However, on fields that are categories, that is, there are only four
> distinct values for the category field and
> all of the documents fit into a distinct category, the results are not 
in
> the sorted values.
>
> All of the sort fields are untokenized and stored. and each document has
> only one value for the sorted field.
>
> Are there any known issues (Lucene 2.3.0)?  How can I go about debugging
> this? I have tried Luke, also I have the Lucene source.
>
> Bob Hastings
> Ancept Inc.



Re: Sorting issues

2008-06-27 Thread Erick Erickson
I can't really help since I've never had to go into the guts of Lucene to 
see where sorting is applied, so I don't know where to point you.

But the sorting has always worked for me, and I don't remember anyone 
else posting a similar issue in the last year or so, which means that 
the first thing I'd suspect is something in your code.

Which is why it would be helpful if you'd post some code snippets
showing
1> how you index your field
2> how you construct your sort object
3> how you search using your sort object
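
For comparison, the usual Lucene 2.3 pattern for those three pieces looks roughly like this (a sketch with placeholder field and variable names; the sort field must be indexed untokenized so it has exactly one term per document):

```java
// 1> index the field untokenized
doc.add(new Field("category", categoryValue,
                  Field.Store.YES, Field.Index.UN_TOKENIZED));

// 2> construct the sort object over that field
Sort sort = new Sort(new SortField("category", SortField.STRING));

// 3> search with the sort applied
Hits hits = searcher.search(query, sort);
```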

Best
Erick

On Fri, Jun 27, 2008 at 2:46 PM, <[EMAIL PROTECTED]> wrote:

> Actually, I do a global search and the order comes out: 1, 2, 8, 3, 5, 6,
> 7,8, 4, 9.  I'm having trouble finding in the code where the sort actually
> gets applied.  Can you help me out there?
>
> Bob
>
>
>
>
> "Erick Erickson" <[EMAIL PROTECTED]>
> 06/27/2008 12:19 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Sorting issues
>
>
>
>
>
>
> That's surprising. Could you post a brief example of your
> index and search code?
>
> It sounds like you're saying
> docs 1, 2, 3 all have category aaa
> docs 4, 5, 6 all have category bbb
> docs 7, 8, 9 all have category ccc
>
> But if you search for category:bbb
> you don't get docs 4, 5, and 6
>
> Is this a fair statement of the issue?
>
> Best
> Erick
>
>
> On Fri, Jun 27, 2008 at 11:44 AM, <[EMAIL PROTECTED]> wrote:
>
> > I just implemented a sorting feature on our application where the user
> can
> > change the sort on a query and reexecute the search.  It works fine on
> > text fields where most of the documents have different field values.
> > However, on fields that are categories, that is, there are only four
> > distinct values for the category field and
> > all of the documents fit into a distinct category, the results do not
> > come back in sorted order.
> >
> > All of the sort fields are untokenized and stored, and each document has
> > only one value for the sorted field.
> >
> > Are there any known issues (Lucene 2.3.0)?  How can I go about debugging
> > this? I have tried Luke, also I have the Lucene source.
> >
> > Bob Hastings
> > Ancept Inc.
>
>


Re: Sorting issues

2008-06-27 Thread Robert . Hastings
Thanks Erick,

I did find the problem using Luke, I see that all of the documents have 
the same category field, so I must not be adding the field correctly when 
I index them.

Bob




"Erick Erickson" <[EMAIL PROTECTED]> 
06/27/2008 01:58 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Sorting issues






I can't really help since I've never had to go into the guts of Lucene and
see where sorting is applied, so I don't know where to point you.

But the sorting has always worked for me, and I don't remember anyone
else posting a similar issue in the last year or so. Which means that
the first thing I'd suspect is that it's something in your code.

Which is why it would be helpful if you'd post some code snippets
showing
1> how you index your field
2> how you construct your sort object
3> how you search using your sort object

Best
Erick

On Fri, Jun 27, 2008 at 2:46 PM, <[EMAIL PROTECTED]> wrote:

> Actually, I do a global search and the order comes out: 1, 2, 8, 3, 5, 6,
> 7, 8, 4, 9.  I'm having trouble finding in the code where the sort
> actually gets applied.  Can you help me out there?
>
> Bob
>
>
>
>
> "Erick Erickson" <[EMAIL PROTECTED]>
> 06/27/2008 12:19 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Sorting issues
>
>
>
>
>
>
> That's surprising. Could you post a brief example of your
> index and search code?
>
> It sounds like you're saying
> docs 1, 2, 3 all have category aaa
> docs 4, 5, 6 all have category bbb
> docs 7, 8, 9 all have category ccc
>
> But if you search for category:bbb
> you don't get docs 4, 5, and 6
>
> Is this a fair statement of the issue?
>
> Best
> Erick
>
>
> On Fri, Jun 27, 2008 at 11:44 AM, <[EMAIL PROTECTED]> wrote:
>
> > I just implemented a sorting feature on our application where the user
> can
> > change the sort on a query and reexecute the search.  It works fine on
> > text fields where most of the documents have different field values.
> > However, on fields that are categories, that is, there are only four
> > distinct values for the category field and
> > all of the documents fit into a distinct category, the results do not
> > come back in sorted order.
> >
> > All of the sort fields are untokenized and stored, and each document
> > has only one value for the sorted field.
> >
> > Are there any known issues (Lucene 2.3.0)?  How can I go about 
debugging
> > this? I have tried Luke, also I have the Lucene source.
> >
> > Bob Hastings
> > Ancept Inc.
>
>



Re: Sorting issues

2008-06-27 Thread Erick Erickson
I can't count how many times I've said "It must be a bug
in the compiler", but I *can* count how rarely I've been
right.

Glad you're on a path to resolution.

Erick

On Fri, Jun 27, 2008 at 3:09 PM, <[EMAIL PROTECTED]> wrote:

> Thanks Erick,
>
> I did find the problem using Luke, I see that all of the documents have
> the same category field, so I must not be adding the field correctly when
> I index them.
>
> Bob
>
>
>
>
> "Erick Erickson" <[EMAIL PROTECTED]>
> 06/27/2008 01:58 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Sorting issues
>
>
>
>
>
>
> I can't really help since I've never had to go into the guts of Lucene and
> see where sorting is applied, so I don't know where to point you.
>
> But the sorting has always worked for me, and I don't remember anyone
> else posting a similar issue in the last year or so. Which means that
> the first thing I'd suspect is that it's something in your code.
>
> Which is why it would be helpful if you'd post some code snippets
> showing
> 1> how you index your field
> 2> how you construct your sort object
> 3> how you search using your sort object
>
> Best
> Erick
>
> On Fri, Jun 27, 2008 at 2:46 PM, <[EMAIL PROTECTED]> wrote:
>
> > Actually, I do a global search and the order comes out: 1, 2, 8, 3, 5, 6,
> > 7, 8, 4, 9.  I'm having trouble finding in the code where the sort
> > actually gets applied.  Can you help me out there?
> >
> > Bob
> >
> >
> >
> >
> > "Erick Erickson" <[EMAIL PROTECTED]>
> > 06/27/2008 12:19 PM
> > Please respond to
> > java-user@lucene.apache.org
> >
> >
> > To
> > java-user@lucene.apache.org
> > cc
> >
> > Subject
> > Re: Sorting issues
> >
> >
> >
> >
> >
> >
> > That's surprising. Could you post a brief example of your
> > index and search code?
> >
> > It sounds like you're saying
> > docs 1, 2, 3 all have category aaa
> > docs 4, 5, 6 all have category bbb
> > docs 7, 8, 9 all have category ccc
> >
> > But if you search for category:bbb
> > you don't get docs 4, 5, and 6
> >
> > Is this a fair statement of the issue?
> >
> > Best
> > Erick
> >
> >
> > On Fri, Jun 27, 2008 at 11:44 AM, <[EMAIL PROTECTED]> wrote:
> >
> > > I just implemented a sorting feature on our application where the user
> > can
> > > change the sort on a query and reexecute the search.  It works fine on
> > > text fields where most of the documents have different field values.
> > > However, on fields that are categories, that is, there are only four
> > > distinct values for the category field and
> > > all of the documents fit into a distinct category, the results do not
> > > come back in sorted order.
> > >
> > > All of the sort fields are untokenized and stored, and each document
> > > has only one value for the sorted field.
> > >
> > > Are there any known issues (Lucene 2.3.0)?  How can I go about
> debugging
> > > this? I have tried Luke, also I have the Lucene source.
> > >
> > > Bob Hastings
> > > Ancept Inc.
> >
> >
>
>


Re: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Matthew Hall

Yup, you're pretty much there.

The only part I'm a bit confused about is what you've said in your data 
field there,


I'm thinking you mean that for the data_type: "State", you would have 
the data entry of "California", right?


If so, then yup, you are spot on ^^

We use this technique all the time on our side, and it's helped 
considerably.  We then use the db_key to reference into a display time 
cache that holds all of the display information for the underlying 
object that we would ever want to present to the user.  This allows our 
search time index to be very concise, and as a result nearly every 
search we hit it with is subsecond, which is a nice place to be ^^


Matt

[EMAIL PROTECTED] wrote:

Matthew,

Thanks for the reply.  This looks very interesting.  If I'm understanding 
correctly your db_key, data and data_type are Fields within the Document, 
correct?  So is this how you envision it?

Document: State=California
   Field: 'db_key'='1395' (primary key into relational table, correct?)
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
   Field: 'data_type' indexed by 'State'

Document: City=Sacremento
   Field: 'db_key'='2405' 
   Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.

   Field: 'data_type' indexed by 'City'

Then my query for all Properties would be:

+data:South

My query for only 'City' Properties would be:

+data:South +data_type:City

Is that right?

I think that would work.  Very nice.  Thank you very much
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 



-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 11:49 AM

To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

I'm not sure if this is helpful, but I do something VERY similar to this 
in my project.


So, for the example you are citing I would design my index as follows:

db_key, data, data_type

Where the data_type is some sort of value representing the thing that's 
on the left hand side of your property relationship there.


So, then in order to satisfy your search, the queries become quite simple:

The search for everything simply searches against the data field in this 
index, whereas the search for a specific data_type + searchterm becomes a 
simple boolean query, that has a MUST clause for the data_type value.


As an even BETTER bonus, this will then mean that all of your searchable 
values will now have relevance to each other at scoring time, which is 
quite useful in the long run.


Hope this helps you out,

Matt

[EMAIL PROTECTED] wrote:
  

Grant,

Thanks for the reply.  What we're trying to do is kind of esoteric and hard to 
explain without going into a lot of gory details so I was trying to keep it 
simple.  But I'll try to summarize.

We're trying to index entities in a relational database.  One of the entities 
we're trying to index is something called a Property.  Think of a Property kind 
of like the java.util.Properties class, i.e. a name/value pair. So some 
examples of Properties might be:

State=California
City=Sacremento
ZipCode=94203
StreetName=South Main
StreetNumber=1234
Name=Joe Smith

Etc., etc.

(Note: this isn't the type of data we're storing... just trying to keep it 
simple.)

Imagine that the above list represents the set of Properties that specify 
the address for a single person, Joe Smith.  Each Property in the set will be 
indexed by the values on the right-hand side of all the other name/value pairs 
in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe and 
Smith.

There are two types of queries that we want to do.  
1) retrieve every Property matching the specified search terms, regardless of its left-hand side.  For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.

2) retrieve every Property with a given left-hand side that matches the 
specified search terms.  For example, find all the 'City' Properties that match 
the term 'South'.  For this we want to create a field with the name of the 
left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents 
that correspond to a Property with that left-hand side.  Again this field will 
be indexed by the right-hand side values as described above.

So a couple of examples from the above list might look something like:

Document: State=California
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

Document: City=Sacremento
  Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
  Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

Now if I'm interested in all the Properties that match the word "South

Re: Does Lucene Java 2.3.2 supports parsing of Microsoft office 2007 documents...

2008-06-27 Thread Hasan Diwan
Kumar:
Assuming you want to index a pre-parsed document...

2008/6/27 Erick Erickson <[EMAIL PROTECTED]>:
>> If it supports, what should be done in Lucene demo 2.3.2 to search queries
>> on file with above mentioned extensions?
The new OOXML-based Office 2007 formats (.docx, .xlsx, .pptx) are not
supported by POI. However, you could write a JNI wrapper around
OpenOffice, which does have this support.
-- 
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Can you create a Field that is a copy of another Field?

2008-06-27 Thread Bill.Chesky
Hmmm, I think maybe I am missing something.  In your design is the 'data' field 
indexed, i.e. searchable?  Or is it an unindexed, stored field?  

I was thinking that both 'data' and 'data_type' were indexed and searchable.  

Maybe the confusion stems from the fact that for the Document corresponding to 
"State=California", we're not just indexing on the token 'California'.  We're 
indexing on all the tokens from all the Properties in the set of Properties 
corresponding to a person's address.  In my original example this would be: 
California, Sacremento, 94203, South, Main, 1234, Joe and Smith.

For the 'data_type' field I was thinking you were saying we'd index on a single 
token, namely 'State' (or whatever the left-hand side is).

Does that make sense?
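[A minimal sketch of the design being discussed, against the Lucene 2.3 API. The db_key value, the concatenated data string, and the helper names are illustrative assumptions; the lowercased query term assumes an analyzer like StandardAnalyzer on the 'data' field:]

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PropertyIndexSketch {
    // One document per Property: db_key is stored but not indexed, data is
    // tokenized with every right-hand-side value in the set, and data_type
    // holds the single untokenized left-hand-side token.
    static Document cityDoc() {
        Document doc = new Document();
        doc.add(new Field("db_key", "2405", Field.Store.YES, Field.Index.NO));
        doc.add(new Field("data",
                "California Sacremento 94203 South Main 1234 Joe Smith",
                Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("data_type", "City",
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // +data:south +data_type:City -- note the tokenized field's terms are
    // lowercased by a typical analyzer, while the untokenized data_type
    // value is indexed exactly as written.
    static BooleanQuery cityQuery() {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("data", "south")),
              BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("data_type", "City")),
              BooleanClause.Occur.MUST);
        return q;
    }
}
```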
--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley 
Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
 www.sungard.com/energy 


-Original Message-
From: Matthew Hall [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 27, 2008 3:33 PM
To: java-user@lucene.apache.org
Subject: Re: Can you create a Field that is a copy of another Field?

Yup, you're pretty much there.

The only part I'm a bit confused about is what you've said in your data 
field there,

I'm thinking you mean that for the data_type: "State", you would have 
the data entry of "California", right?

If so, then yup, you are spot on ^^

We use this technique all the time on our side, and it's helped 
considerably.  We then use the db_key to reference into a display time 
cache that holds all of the display information for the underlying 
object that we would ever want to present to the user.  This allows our 
search time index to be very concise, and as a result nearly every 
search we hit it with is subsecond, which is a nice place to be ^^

Matt

[EMAIL PROTECTED] wrote:
> Matthew,
>
> Thanks for the reply.  This looks very interesting.  If I'm understanding 
> correctly your db_key, data and data_type are Fields within the Document, 
> correct?  So is this how you envision it?
>
> Document: State=California
>Field: 'db_key'='1395' (primary key into relational table, correct?)
>Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
>Field: 'data_type' indexed by 'State'
>
> Document: City=Sacremento
>Field: 'db_key'='2405' 
>Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
>Field: 'data_type' indexed by 'City'
>
> Then my query for all Properties would be:
>
>   +data:South
>
> My query for only 'City' Properties would be:
>
>   +data:South +data_type:City
>
> Is that right?
>
> I think that would work.  Very nice.  Thank you very much
> --
> Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak 
> Valley Drive * Ann Arbor, MI 48103
> Tel 734-332-4405 * Fax 734-332-4440 * [EMAIL PROTECTED]
>  www.sungard.com/energy 
>
>
> -Original Message-
> From: Matthew Hall [mailto:[EMAIL PROTECTED] 
> Sent: Friday, June 27, 2008 11:49 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can you create a Field that is a copy of another Field?
>
> I'm not sure if this is helpful, but I do something VERY similar to this 
> in my project.
>
> So, for the example you are citing I would design my index as follows:
>
> db_key, data, data_type
>
> Where the data_type is some sort of value representing the thing that's 
> on the left hand side of your property relationship there.
>
> So, then in order to satisfy your search, the queries become quite simple:
>
> The search for everything simply searches against the data field in this 
> index, whereas the search for a specific data_type + searchterm becomes a 
> simple boolean query, that has a MUST clause for the data_type value.
>
> As an even BETTER bonus, this will then mean that all of your searchable 
> values will now have relevance to each other at scoring time, which is 
> quite useful in the long run.
>
> Hope this helps you out,
>
> Matt
>
> [EMAIL PROTECTED] wrote:
>   
>> Grant,
>>
>> Thanks for the reply.  What we're trying to do is kind of esoteric and hard 
>> to explain without going into a lot of gory details so I was trying to keep 
>> it simple.  But I'll try to summarize.
>>
>> We're trying to index entities in a relational database.  One of the 
>> entities we're trying to index is something called a Property.  Think of a 
>> Property kind of like the java.util.Properties class, i.e. a name/value 
>> pair. So some examples of Properties might be:
>>
>> State=California
>> City=Sacremento
>> ZipCode=94203
>> StreetName=South Main
>> StreetNumber=1234
>> Name=Joe Smith
>>
>> Etc., etc.
>>
>> (Note: this isn't the type of data we're storing... just trying to keep it 
>> simple.)
>>
>> Imagine that the above list represents the set of Properties that 
>> specify the address for a single person, Joe Smith.  Each Property in the 
>> set will be indexed by the values on the right-hand side

QueryWrapperFilter performance

2008-06-27 Thread Jordon Saardchit
Hello All,
 
Sort of new to Lucene, but I have a general question regarding
performance.  I've got a single index of rather large size (about 7
million docs).  I've run a couple of different queries against it, which
are described below.
 
 * WildcardQuery: (*term*) Which returns roughly 12000 hits in around
7000ms
 * RangeQuery: (term TO term) Which returns roughly 6000 hits in around
200ms
 
Now for performance reasons, I am attempting to run the WildcardQuery
against only the RangeQuery hits (6000 as opposed to 7 million) by
using a QueryWrapperFilter (constructed with my RangeQuery).  However,
the average response times are still around 7000ms, the same as the
wildcard search over the entire index.  Is it possible the performance
of the WildcardQuery is not affected by the filter? Or have I gone
about implementing my intentions incorrectly?
 
Thanks in advance for the help,
Jordon
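[One plausible explanation, sketched here as an editorial note: a filter only restricts which documents are collected, but the expensive part of a leading-wildcard query is enumerating matching terms when the query rewrites, and that happens regardless of the filter. The field and term names below are made up; the Searcher and RangeQuery are assumed to come from the poster's code:]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.WildcardQuery;

public class FilteredWildcardSketch {
    static Hits search(Searcher searcher, Query rangeQuery) throws Exception {
        // The filter is consulted per document at collection time; it does
        // NOT shrink the term enumeration WildcardQuery performs when it
        // rewrites itself, so a leading wildcard still scans the whole term
        // dictionary. Caching at least avoids recomputing the filter bits
        // on every search.
        Filter filter =
                new CachingWrapperFilter(new QueryWrapperFilter(rangeQuery));
        return searcher.search(new WildcardQuery(new Term("field", "*term*")),
                               filter);
    }
}
```

So the filter helps with collection cost, not with the wildcard's term-dictionary scan, which would explain the unchanged 7000ms.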


Question: Can lucene do parallel indexing?

2008-06-27 Thread David Lee
If I'm using a computer that has multiple cores, or if I want to use several
computers to speed up the indexing process, how should I do that? Is there
some kind of support for that in the API?

David Lee


Re: Question: Can lucene do parallel indexing?

2008-06-27 Thread Phil Myers
> If I'm using a computer that has multiple cores, or if I
> want to use several
> computers to speed up the indexing process, how should I do
> that? Is there
> some kind of support for that in the API?

Yes. There are some comments on this near the end of this page:

http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
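The single-machine advice on that page boils down to one thread-safe IndexWriter shared by several worker threads. A sketch of that pattern (the document texts and field name are stand-ins):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class ParallelIndexing {
    public static void main(String[] args) throws Exception {
        // IndexWriter is thread-safe: one shared writer, many threads.
        final IndexWriter writer =
                new IndexWriter(new RAMDirectory(), new StandardAnalyzer(), true);

        String[] texts = { "first doc", "second doc", "third doc" }; // stand-in data
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String text : texts) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        Document doc = new Document();
                        doc.add(new Field("body", text,
                                Field.Store.NO, Field.Index.TOKENIZED));
                        writer.addDocument(doc);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        writer.close();
    }
}
```

For multiple machines, each one can build its own index independently and the results can then be combined with IndexWriter.addIndexes().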

-Phil



--- On Fri, 6/27/08, David Lee <[EMAIL PROTECTED]> wrote:

> From: David Lee <[EMAIL PROTECTED]>
> Subject: Question: Can lucene do parallel indexing?
> To: java-user@lucene.apache.org
> Date: Friday, June 27, 2008, 5:57 PM
> If I'm using a computer that has multiple cores, or if I
> want to use several
> computers to speed up the indexing process, how should I do
> that? Is there
> some kind of support for that in the API?
> 
> David Lee




Re: Searching any part of a string

2008-06-27 Thread Chris Hostetter

: Thanks for the suggestions. I've used indexed n-grams before to implement
: spell-checking; I think in this case I may take a look at WildcardTermEnum
: and RegexTermEnum. It seems like a good solution because I am doing my own
: results ordering so Lucene's scoring is irrelevant in this case. I wasn't
: aware of these classes so thanks for mentioning them!

using the Enums directly will help you avoid potential "TooManyClauses" 
exceptions that you would get with a straight WildcardQuery, but it should 
be more efficient to index ngrams and then do a Prefix style search 
because then you can skipTo(yourTerm) and iterate from there.

With WildcardTermEnum, if you have a leading wildcard the TermEnum has to
"next()" over every term in the field.



-Hoss
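To make the ngram suggestion concrete, here is one way to generate the grams that would be indexed (a sketch; the gram size and class name are arbitrary). With these indexed as terms, a "contains" search becomes an exact or prefix term lookup instead of a leading-wildcard scan over every term:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Produce every gram of length n from the input; indexing these as
    // terms lets a substring search skipTo() the first matching gram and
    // iterate, rather than next()-ing over the whole term dictionary.
    public static List<String> ngrams(String s, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("progress", 3));
        // [pro, rog, ogr, gre, res, ess]
    }
}
```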

