Re: Query across multiple fields scenario not handled by "MultiFieldQueryParser"

2004-07-20 Thread Thomas Plümpe
Daniel,

> > Does anybody here know which changes I
> > would have to make to QueryParser.jj to get the functionality described?
> 
> I haven't tried it but I guess you need to change the getXXXQuery() methods so 
> they return a BooleanQuery. For example, getFieldQuery currently might return 
> a TermQuery; you'll need to change that so it returns a BooleanQuery with two 
> TermQuerys. These two queries would have the same term text, but a different 
> field.
> 
> Another approach is to leave QueryParser alone and modify the query after it 
> has been parsed, by recursively iterating over the parsed query and replacing 
> e.g. TermQuerys with BooleanQuerys (just as described above).
Many thanks for your advice. Although I was hoping not to have to
implement the change myself (since it has apparently been done before), I guess
this is enough to get me going.

Thomas
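
A minimal sketch of the expansion Daniel describes above, assuming the Lucene
1.4-era BooleanQuery.add(Query, required, prohibited) signature; the helper
name and the field list are only placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TwoFieldExpander {
    // Turn a single-field TermQuery into an OR across several fields.
    public static Query expand(TermQuery original, String[] fields) {
        BooleanQuery expanded = new BooleanQuery();
        for (int i = 0; i < fields.length; i++) {
            Term t = new Term(fields[i], original.getTerm().text());
            // optional clause: neither required nor prohibited, i.e. OR semantics
            expanded.add(new TermQuery(t), false, false);
        }
        return expanded;
    }
}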






Re: Post-sorted inverted index?

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 1:27 AM, Aphinyanaphongs, Yindalon wrote:
I gather from reading the documentation that the scores for each 
document hit are computed at query time.  I have an application that, 
due to the complexity of the function, cannot compute scores at query 
time.  Would it be possible for me to store the documents in 
pre-sorted order in the inverted index? (i.e. after the initial index 
is created, to have a post processing step to sort and reindex the 
final documents).

For example:
Document A - score 0.2
Document B - score 0.4
Document C - score 0.6
Thus for the word 'the', the stored order in the index would be C,B,A.
Lucene 1.4 includes a Sort facility - look at the additional 
IndexSearcher.search() methods for details.  By default, if the scores 
computed are identical, the results are then ordered by document id, 
which is the insertion order.

I hope this helps.
Erik
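
For reference, a short sketch of the Sort usage Erik mentions, against the
Lucene 1.4 API; the index path, the query field, and the "precomputedScore"
field (an indexed, un-tokenized field holding the offline score) are
assumptions for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortedSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
        // sort by the precomputed score field, highest value first
        Sort sort = new Sort(new SortField("precomputedScore", SortField.FLOAT, true));
        Hits hits = searcher.search(query, sort);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("precomputedScore"));
        }
        searcher.close();
    }
}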


The indexer

2004-07-20 Thread Ian McDonnell
Can Lucene's indexer be used to store info in fields in a mysql db?

If so, can anybody point me to an example or some documentation relating to it?

Ian




Re: The indexer

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 8:44 AM, Ian McDonnell wrote:
Can Lucene's indexer be used to store info in fields in a mysql db?
I'm not quite clear on your question.  You want to store a Lucene index 
(aka Directory) within mysql?

Or, you want to index data from your existing mysql database into a 
Lucene index?

A Directory implementation for Berkeley DB was created by the Chandler 
project and contributed to the Lucene sandbox (see Lucene's website for 
details on the sandbox and how to get to it).  There have been some 
efforts to put a Lucene index into SQL Server, I believe, but I haven't 
seen mention of that in a while.  It *can* be done, but I'm skeptical 
of the performance hit of adding in a relational database layer - and 
to do it well would certainly be non-trivial.

As for indexing data from mysql - there have been lots of discussions 
of that recently, so check the archives.  Basically you read the data, 
and index it with Lucene's API.  And you are responsible for keeping it 
in sync.

Erik


Re: The indexer

2004-07-20 Thread Ian McDonnell
Basically I add details about a movie clip as various fields in an SQL db using a JSP 
form. When the form submits, I want to add the details into the db and also want the 
fields to be stored in a searchable Lucene index on the server.

Is this possible?

Ian



Re: The indexer

2004-07-20 Thread Ian McDonnell
Yeah, that last part of your reply seems to be what I'm trying to do (you're going to 
have to excuse me, as I'm a total newbie to Lucene and am only finding my feet with 
it). I searched the archives and went back through them manually just now, but didn't 
find any relevant posts.

>As for indexing data from mysql - there have been lots of discussions 
>of that recently, so check the archives.  Basically you read the data, 
>and index it with Lucene's API.  And you are responsible for keeping it in sync.

The problem I am having is reading the data from the SQL tables and then using the 
indexer to store it. Has anybody indexed from a mysql table before? If so, do I need 
to create some kind of JDBC query that selects all the field values from the table and 
indexes them in a Lucene document that is stored on the server? If I do this, how can 
this process be automated rather than manually running the program every time a new 
profile is added via the JSP form?

Erik, I'm not sure what you mean about keeping the db in sync. Are you talking about 
stale or updated db entries?

Ian




Re: The indexer

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 9:29 AM, Ian McDonnell wrote:
Basically I add details about a movie clip as various fields in an SQL 
db using a JSP form. When the form submits, I want to add the details 
into the db and also want the fields to be stored in a searchable 
Lucene index on the server.

Is this possible?
Of course.  But you'll have to code it.  It's only a few lines of code 
to index a "document" into a Lucene index, but it is up to you to code 
those into the appropriate spot in your system (most likely right where 
you insert into mysql).

Erik
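
Those "few lines" look roughly like this (a sketch against the Lucene 1.4 API;
the index path and field names are only examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ClipIndexer {
    // Call this right after the row has been inserted into mysql.
    public static void indexClip(String id, String title, String description) throws Exception {
        // false = append to an existing index; pass true the first time to create it
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Keyword("id", id));                     // stored, not tokenized
        doc.add(Field.Text("title", title));                  // stored and tokenized
        doc.add(Field.UnStored("description", description));  // tokenized, not stored
        writer.addDocument(doc);
        writer.close();
    }
}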


Re: The indexer

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 10:07 AM, Ian McDonnell wrote:
As for indexing data from mysql - there have been lots of discussions
of that recently, so check the archives.  Basically you read the data,
and index it with Lucene's API.  And you are responsible for keeping 
it in sync.
The problem I am having is reading the data from the SQL tables and 
then using the indexer to store it. Has anybody indexed from a mysql 
table before? If so, do I need to create some kind of JDBC query that 
selects all the field values from the table and indexes them in a 
Lucene document that is stored on the server? If I do this, how can 
this process be automated rather than manually running the program 
every time a new profile is added via the JSP form?
How you get the data from your database is really up to you.  Some 
folks here may be able to offer some advice, but ultimately it is 
specific to your application and business process.

Once you have the data, via some query (again, this is up to you how 
you do it) you use Lucene's IndexWriter, create new Document's, add 
Field's to them, add the document to the writer, then close the writer. 
 That's all there is to indexing a document with Lucene.

As for automation - again this is up to your application but certainly 
you can interact with a Lucene index from your application so that it 
is not a manual separate indexing step.

Erik, I'm not sure what you mean about keeping the db in sync. Are you 
talking about stale or updated db entries?
You need to ensure that when data changes, the index is updated to 
reflect those changes.

Erik
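
A sketch of those steps driven by a JDBC query, under assumed table and column
names (a "clips" table with id, title and description), an assumed index path,
and the MySQL Connector/J driver; error handling and connection pooling are
left out:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MysqlIndexer {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM clips");

        // true = create a new index; use false when adding to an existing one
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));
            doc.add(Field.Text("title", rs.getString("title")));
            doc.add(Field.UnStored("description", rs.getString("description")));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();

        rs.close();
        stmt.close();
        conn.close();
    }
}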


lucene customized indexing

2004-07-20 Thread John Wang
Hi:
   I am trying to store some database-like field values into Lucene.
I have my own way of storing field values in a customized format.

   I guess my question is whether we can make the Reader/Writer
classes, e.g. the FieldReader, FieldWriter, and DocumentReader/Writer
classes, non-final?

   I have asked to make the Lucene API less restrictive many many many
times but got no replies. Is this request feasible?

Thanks

-John




No change in the indexing time after increasing the merge factor

2004-07-20 Thread Praveen Peddi
I performed lucene indexing with 25,000 documents.
We feel that indexing is slow, so I am trying to tune it.
My configuration is as follows:
Machine: Windows XP, 1GB RAM, 3GHz
# of documents: 25,000
App Server: Weblogic 7.0
lucene version: lucene 1.4 final

I ran the indexer with merge factor of 10 and 50. Both times, the total indexing time 
(lucene time only) is almost the same (27.92 mins for mergefactor=10 and 28.11 mins 
for mergefactor=50).

From the Lucene mails and Lucene-related articles I read, I thought increasing the 
merge factor would improve the performance of indexing. Am I wrong?


Praveen


** 
Praveen Peddi
Sr Software Engg, Context Media, Inc. 
email:[EMAIL PROTECTED] 
Tel:  401.854.3475 
Fax:  401.861.3596 
web: http://www.contextmedia.com 
** 
Context Media- "The Leader in Enterprise Content Integration" 


Re: lucene customized indexing

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 17:28, John Wang wrote:

>I have asked to make the Lucene API less restrictive many many many
> times but got no replies.

I suggest you just change it in your source and see if it works. Then you can 
still explain what exactly you did and why it's useful. From the developers' 
point of view, having things non-final means more stuff is exposed and making 
changes is more difficult (unless one accepts that derived classes may break 
with the next update).

Regards
 Daniel





Re: Post-sorted inverted index?

2004-07-20 Thread Doug Cutting
You can define a subclass of FilterIndexReader that re-sorts documents 
in TermPositions(Term) and document(int), then use 
IndexWriter.addIndexes() to write this in Lucene's standard format.  I 
have done this in Nutch, with the (as yet unused) IndexOptimizer.

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/indexer/IndexOptimizer.java?view=markup
Doug
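
A sketch of the second half of that recipe, assuming Lucene 1.4's
IndexWriter.addIndexes(IndexReader[]); the re-sorting FilterIndexReader
subclass is the non-trivial part and is not shown here (the plain reader below
is only a stand-in):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class RewriteSorted {
    public static void main(String[] args) throws Exception {
        // Replace this with your FilterIndexReader subclass that re-sorts
        // documents in TermPositions(Term) and document(int).
        IndexReader sortedView = IndexReader.open("/path/to/original-index");

        // addIndexes() walks the (re-sorted) reader and writes a standard index
        IndexWriter writer = new IndexWriter("/path/to/sorted-index", new StandardAnalyzer(), true);
        writer.addIndexes(new IndexReader[] { sortedView });
        writer.close();
        sortedView.close();
    }
}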


Re: lucene customized indexing

2004-07-20 Thread John Wang
Hi Daniel:

 There are a few things I want to do to be able to customize Lucene:

1) to be able to plug in a different similarity model (e.g. bayesian,
vector space etc.)

2) to be able to store certain fields in their own format and provide
corresponding readers. I may not want to store every field in the
lexicon/inverted-index structure. I may have fields for which it doesn't
make sense to store position or frequency information.

3) to be able to customize analyzers to add more information to the
Token while doing tokenization.

Oleg mentioned the Haystack project. In the Haystack source
code, they had to modify many Lucene classes to make them non-final in
order to customize them. They make sure during deployment that their "versions"
get loaded before the same classes in the Lucene .jar. It is
cumbersome, but it is a Lucene restriction they had to live with.

I believe there are many other users who feel the same way. 

If I write some classes that derive from the Lucene API and they
break, then it is my responsibility to fix them. I don't understand why
it would add a burden to the Lucene developers.

Thanks

-John




Re: No change in the indexing time after increasing the merge factor

2004-07-20 Thread Otis Gospodnetic
All Lucene articles that I know of were written before
IndexWriter.minMergeDocs was added.  Check IndexWriter javadoc for more
info, but this is another field you can tune.

Otis 
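
For reference, both knobs are plain public fields on IndexWriter in 1.4; the
values below are only illustrative starting points, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.mergeFactor = 50;     // how many segments get merged at a time
        writer.minMergeDocs = 1000;  // how many documents are buffered in RAM before a segment is written
        // ... writer.addDocument(...) calls go here ...
        writer.optimize();
        writer.close();
    }
}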




join two indexes

2004-07-20 Thread Sergio
Hi,
I want to join two Lucene indexes but I don't know how to do that.

For example, I have a student index and a school index.
In the school index I have the studentId field.

How can I do that?
Any idea will be welcomed.
Thx, Sergio.



Syntax of Query

2004-07-20 Thread Hetan Shah
Hey guys,
Need some help with creating a query. Here is the scenario:
Field 1: 
Field 2: 
Field 3: 
MultiSelect 1 :
   
   
   
MultiSelect 2 :
   
   
   
What would the query look like if the condition is that at any time there 
will be one entry from field 1, 2, or 3, a few entries from 
MultiSelect1, and a few entries from MultiSelect2?

Would it look something like
+field1 +(val11 OR val12 OR val14) +(val21 OR val23 OR val24)
Thanks for all your support, guys.
-H


Here is how to search multiple indexes

2004-07-20 Thread Don Vaillancourt
Here is the code that I use to do multi-index searches:

// create a multi-index searcher
// (n is the number of indexes to search)
IndexSearcher[] indexes = new IndexSearcher[n];

for (int i = 0; i < n; i++)
{
    // use whichever IndexSearcher constructor you want;
    // "blah" is the appropriate index path or Directory to pass
    indexes[i] = new IndexSearcher(blah);
}

// this is the part which allows you to search multiple indexes
Searcher searcher = new MultiSearcher(indexes);

// do the search
Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse(expression, colSearch, analyzer);
Hits hits = searcher.search(query);
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com



Re: lucene customized indexing

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 12:12 PM, John Wang wrote:
 There are few things I want to do to be able to customize lucene:
[...]
3) to be able to customize analyzers to add more information to the
Token while doing tokenization.
I have already provided my opinion on this one - I think it would be 
fine to allow Token to be public.  I'll let others respond to the 
additional requests you've made.

Oleg mentioned the Haystack project. In the Haystack source
code, they had to modify many Lucene classes to make them non-final in
order to customize them. They make sure during deployment that their "versions"
get loaded before the same classes in the Lucene .jar. It is
cumbersome, but it is a Lucene restriction they had to live with.
Wow - I didn't realize that they've made local changes.  Did they post 
with requests for opening things up as you have?  Did they submit 
patches with their local changes?

I believe there are many other users who feel the same way.
Then they should speak up :)
If I write some classes that derive from the Lucene API and they
break, then it is my responsibility to fix them. I don't understand why
it would add a burden to the Lucene developers.
Making things extensible for no good reason is asking for maintenance 
troubles later when you need more control internally.  Lucene has been 
well designed from the start with extensibility only where it was 
needed in mind.  It has evolved to be more open in very specific areas 
after the performance impact has been carefully weighed. 
 "Breaking" is not really the concern with extensibility, I don't 
think.  Real-world use cases are needed to show that changes need to be 
made.

Erik


Very slow IndexReader.open() performance

2004-07-20 Thread Mark Florence
Hi -- We have a large index (~4m documents, ~14gb) that we haven't been
able to optimize for some time, because the JVM throws OutOfMemory, after
climbing to the maximum we can throw at it, 2gb. 

In fact, the OutOfMemory condition occurred most recently during a segment 
merge operation. maxMergeDocs was set to the default, and we seem to have
gotten around this problem by setting it to some lower value, currently
100,000. The index is highly interactive so I took the hint from earlier
posts to set it to this value.

Good news! No more OutOfMemory conditions.

Bad news: now, calling IndexReader.open() is taking 20+ seconds, and it 
is killing performance.

I followed the design pattern in another earlier post from Doug. I take a
batch of deletes, open an IndexReader, perform the deletes, then close it.
Then I take a batch of adds, open an IndexWriter, perform the adds, then
close it. Then I get a new IndexSearcher for searching.

But because the index is so interactive, this sequence repeats itself all
the time. 

My question is, is there a better way? Performance was fine when I could
optimize. Can I hold onto a singleton IndexReader/IndexWriter/IndexSearcher
to avoid the overhead of the open?

Any help would be most gratefully received.

Mark Florence, CTO, AIRS
[EMAIL PROTECTED]
800-897-7714x1703
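
For context, a bare-bones sketch of that delete/add/reopen cycle, assuming each
document carries a unique "id" keyword field (the field name and paths are
illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class BatchUpdater {
    public static IndexSearcher applyBatch(String indexDir, String[] deletedIds,
                                           Document[] added) throws Exception {
        // 1) batch of deletes through an IndexReader
        IndexReader reader = IndexReader.open(indexDir);
        for (int i = 0; i < deletedIds.length; i++) {
            reader.delete(new Term("id", deletedIds[i]));
        }
        reader.close();

        // 2) batch of adds through an IndexWriter (false = don't recreate the index)
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        for (int i = 0; i < added.length; i++) {
            writer.addDocument(added[i]);
        }
        writer.close();

        // 3) a fresh searcher is needed to see the changes
        return new IndexSearcher(indexDir);
    }
}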





Re: lucene customized indexing

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 18:12, John Wang wrote:

> They make sure during deployment their "versions"
> gets loaded before the same classes in the lucene .jar.

I don't see why people cannot just make their own lucene.jar. Just remove 
the "final" and recompile. After all, Lucene is Open Source.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Very slow IndexReader.open() performance

2004-07-20 Thread Doug Cutting
Optimization should not require huge amounts of memory.  Can you tell a 
bit more about your configuration:  What JVM?  What OS?  How many 
fields?  What mergeFactor have you used?

Also, please attach the output of 'ls -l' of your index directory, as 
well as the stack trace you see when OutOfMemory is thrown.

Thanks,
Doug


Re: join two indexes

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 19:19, Sergio wrote:

> I want to join two Lucene indexes but I don't know how to do that.

There are two "addIndexes" methods in IndexWriter which you can use to 
write your own small merge tool (a ready-to-use tool for index merging 
doesn't exist AFAIK).

Regards
 Daniel

-- 
http://www.danielnaber.de
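
A minimal merge tool along those lines might look like this (Lucene 1.4 API;
the directory paths are placeholders). Note that addIndexes() only
concatenates the indexes into one; it does not give you a relational join on
studentId:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        Directory[] sources = new Directory[] {
            FSDirectory.getDirectory("/path/to/studentIndex", false),
            FSDirectory.getDirectory("/path/to/schoolIndex", false)
        };
        IndexWriter writer = new IndexWriter("/path/to/mergedIndex", new StandardAnalyzer(), true);
        writer.addIndexes(sources);  // copies and merges the source segments
        writer.optimize();
        writer.close();
    }
}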




Re: lucene customized indexing

2004-07-20 Thread John Wang
On Tue, 20 Jul 2004 13:40:28 -0400, Erik Hatcher
<[EMAIL PROTECTED]> wrote:
> On Jul 20, 2004, at 12:12 PM, John Wang wrote:
> >  There are few things I want to do to be able to customize lucene:
> >
> [...]
> >
> > 3) to be able to customize analyzers to add more information to the
> > Token while doing tokenization.
> 
> I have already provided my opinion on this one - I think it would be
> fine to allow Token to be public.  I'll let others respond to the
> additional requests you've made.

Great, what processes need to be in place before this gets in the code base? 
> 
> > Oleg mentioned about the HayStack project. In the HayStack source
> > code, they had to modifiy many lucene class to make them non-final in
> > order to customzie. They make sure during deployment their "versions"
> > gets loaded before the same classes in the lucene .jar. It is
> > cumbersome, but it is a Lucene restriction they had to live with.
> 
> Wow - I didn't realize that they've made local changes.  Did they post
> with requests for opening things up as you have?  Did they submit
> patches with their local changes?
> 
> > I believe there are many other users feel the same way.
> 
> Then they should speak up :)

Well, I AM speaking up. So have some other people in earlier emails.
But like me, they are getting ignored. The Haystack changes were needed
specifically because many classes are declared final and are not
extensible.

> 
> > If I write some classes that derives from the lucene API and it
> > breaks, then it is my responsibility to fix it. I don't understand why
> > it would add burden to the Lucene developers.
> 
> Making things extensible for no good reason is asking for maintenance
> troubles later when you need more control internally.  Lucene has been
> well designed from the start with extensibility only where it was
> needed in mind.  It has evolved to be more open in very specific areas
> after careful consideration of the performance impact has been weighed.
>  "Breaking" is not really the concern with extensibility, I don't
> think.  Real-world use cases are needed to show that changes need to be
> made.

I thought I gave many "real-world use cases" in the previous email,
and evidently the same applies to the Haystack project. What other
information do we need to provide?

I don't want to diverge from the Lucene codebase like Haystack has
done. But I may not have a choice.

Thanks

-John

> 
>Erik



Re: lucene customized indexing

2004-07-20 Thread John Wang
That is exactly what they did, and that's probably what I will have to do.
But that means we are diverging from the Lucene code base, and future
fixes and enhancements will need to be synchronized, which may be a pain.

-John




Re: lucene customized indexing

2004-07-20 Thread Erik Hatcher
On Jul 20, 2004, at 2:10 PM, John Wang wrote:
I have already provided my opinion on this one - I think it would be
fine to allow Token to be public.  I'll let others respond to the
additional requests you've made.
Great, what processes need to be in place before this gets in the code 
base?
You're doing the right thing.  Although codebase details are most 
appropriate for the lucene-dev list.  And filing issues in Bugzilla 
ensures your requests do not get lost in e-mail inboxes.

At this point, Lucene 1.4 has been released and Doug has put forth a 
proposal for Lucene 2.0 (with a migration path of a version 1.9 
intermediate release).  I'm not sure when the best time is to make this 
change.  We should put API changes to a VOTE on the lucene-dev list 
though.  In fact, I'll post a VOTE for Token now! :)


Then they should speak up :)
Well, I AM speaking up. So have some other people in earlier emails.
But like me, they are getting ignored.
You are not being ignored - not at all.  Look at the replies you've 
gotten already.

 The HayStack changes were needed
specifically due to the fact that many classes are declared to be
final and not extensible.
Did they post their changes back?  Did they discuss them here?  I do 
not recall such discussions (although see above about being lost in 
e-mail inboxes - mine is swamped beyond belief).  Are there Bugzilla 
issues with their patches?

Making things extensible for no good reason is asking for maintenance
troubles later when you need more control internally.  Lucene has been
well designed from the start with extensibility only where it was
needed in mind.  It has evolved to be more open in very specific areas
after careful consideration of the performance impact has been 
weighed.
 "Breaking" is not really the concern with extensibility, I don't
think.  Real-world use cases are needed to show that changes need to 
be
made.
I thought I gave many "real-world use cases" in the previous email.
And evidently also applies to the Haystack project. What other
information do we need to provide?
I was not referring to your requests in my comment, but rather a 
general comment regarding requests to make things "public" when quite 
sufficient alternatives exist.

Erik


Tokenizers and java.text.BreakIterator

2004-07-20 Thread Grant Ingersoll
Hi,

Was wondering if anyone uses java.text.BreakIterator#getWordInstance(Locale) as a 
tokenizer for various languages?  Does it do a good job?  It seems like it does, at 
least for languages where words are separated by spaces or punctuation, but I have 
only done simple tests.

Anyone have any thoughts on this?  What am I missing?  Does this seem like a valid 
approach?

Thanks,
Grant
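
For what it's worth, the basic usage looks like this (plain java.text, no
Lucene involved; the sample sentence and locale are arbitrary):

import java.text.BreakIterator;
import java.util.Locale;

public class WordBreakDemo {
    public static void main(String[] args) {
        String text = "Lucene makes full-text search easy, doesn't it?";
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String chunk = text.substring(start, end);
            // boundaries are also returned around spaces and punctuation,
            // so keep only chunks that start with a letter or digit
            if (Character.isLetterOrDigit(chunk.charAt(0))) {
                System.out.println(chunk);
            }
        }
    }
}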





Lucene vs. MySQL Full-Text

2004-07-20 Thread Tim Brennan
Someone came into my office today and asked me about the project I am
trying to use Lucene for -- "why aren't you just using a MySQL full-text
index to do that" -- after thinking about it for a few minutes, I
realized I don't have a great answer.
 
MySQL builds inverted indexes for (in theory) doing the same type of
lookup that Lucene does.  You'd maybe have to build some kind of a layer
on the front to mimic Lucene's analyzers, but that wouldn't be too
hard.
 
My only experience with MySQL full-text is trivial test apps -- but the
MySQL world does have some significant advantages (it's a known quantity
from an operations perspective, etc).  Does anyone out there have
anything more concrete they can add?
 
--tim
 


Re: lucene customized indexing

2004-07-20 Thread Grant Ingersoll
It seems to me the answer to this is not necessarily to open up the API, but to 
provide a mechanism for adding Writers and Readers to the indexing/searching process 
at the application level.  These readers and writers could be passed to Lucene and 
used to read and write to separate files (thus, not harming the index file format).  
They could be used to read/write an arbitrary amount of metadata at the term, document 
and/or index level w/o affecting the core Lucene index.  Furthermore, previous 
versions could still work b/c they would just ignore the new files and the indexes 
could be used by other applications as well.

This is just a thought in the infancy stage, but it seems like it would solve the 
problem.  Of course, the trick is figuring out how it fits into the API (or maybe it 
becomes a part of 2.0).  Not sure if it is even feasible, but it seems like you could 
define interfaces for Readers and Writers that met the requirements to do this.

This may be better discussed on the dev list.




Limiting Term Queries

2004-07-20 Thread Shawn Konopinsky
Is it possible to limit a term query?

For example: I am indexing documents with (amongst other things)  a
string in one field and with a number in another field. All combinations
of strings and numbers are allowed and neither field is unique. I would
like a way to query Lucene to pull out all unique numbers for a specific
letter.

If I had: (a, 123) (b, 123) (a, 123) (b, 23) (a, 45)
I would want a way to pull all unique numbers such that the string is
'a':
- (a, 123) (a, 45)

Right now I am determining the unique numbers by performing a term
query: TermEnum enumerator = reader.terms(new Term(number_field, ""));

where reader is an IndexReader and number_field is the field containing
the number. This gives me a list of all unique numbers, but counts those
documents that might have different letters (i.e. not just 'a').

Any thoughts on this?

Shawn.
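
One possible way to restrict the enumeration, sketched against the 1.4 API
(field names follow the example above, and this is untested): first mark the
documents whose letter field is "a", then keep only the number terms that
occur in at least one of them.

import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class UniqueNumbersForLetter {
    public static void print(IndexReader reader, String letterField, String letter,
                             String numberField) throws Exception {
        // mark every document whose letter field matches
        BitSet marked = new BitSet(reader.maxDoc());
        TermDocs letterDocs = reader.termDocs(new Term(letterField, letter));
        while (letterDocs.next()) {
            marked.set(letterDocs.doc());
        }
        letterDocs.close();

        // walk the number terms and keep those hitting at least one marked document
        TermEnum numbers = reader.terms(new Term(numberField, ""));
        while (numbers.term() != null && numbers.term().field().equals(numberField)) {
            TermDocs td = reader.termDocs(numbers.term());
            while (td.next()) {
                if (marked.get(td.doc())) {
                    System.out.println(numbers.term().text());
                    break;
                }
            }
            td.close();
            if (!numbers.next()) {
                break;
            }
        }
        numbers.close();
    }
}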





Re: Tokenizers and java.text.BreakIterator

2004-07-20 Thread Grant Ingersoll
Answering my own question: I think it is b/c Tokenizers work with a Reader, and you 
would have to read in the whole document in order to use the BreakIterator, which 
operates on a String...




Re: Lucene vs. MySQL Full-Text

2004-07-20 Thread Daniel Naber
On Tuesday 20 July 2004 21:29, Tim Brennan wrote:

> Does anyone out there have
> anything more concrete they can add?

Stemming is still on the MySQL TODO list: 
http://dev.mysql.com/doc/mysql/en/Fulltext_TODO.html

Also, for most people it's easier to extend Lucene than MySQL (as MySQL is 
written in C(++?)) and there are more powerful queries in Lucene, e.g. 
fuzzy phrase search.

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Lucene vs. MySQL Full-Text

2004-07-20 Thread Florian Sauvin

I'd say that MySQL full-text is much slower if you have a lot of
data... that is one of the reasons we started using Lucene (we had a
mysql db doing the search); it's way faster!
--
Florian


Sorting on tokenized fields

2004-07-20 Thread Florian Sauvin
I see in the Javadoc that it is only possible to sort on fields that 
are not tokenized, I have two questions about that:

1) What happens if the field is tokenized, is sorting done anyway, 
using the first term only?

2) Is there a way to do some sorting anyway, by concatenating all the 
tokens into one string?

--
Florian


Token or not Token, PerFieldAnalyzer

2004-07-20 Thread Florian Sauvin
I still don't understand something: my analyzer contains a tokenizer, 
turning "hello world" into [hello] [world].

Is this analyzer applied to a non-tokenized field? What exactly is done 
to a field when the boolean token is set to true?

--
Florian


speeding up lucene search

2004-07-20 Thread Anson Lau
Hello guys,

What are some general techniques to make lucene search faster?

I'm thinking about splitting up the index.  My current index has approx 1.8
million documents (small documents) and index size is about 550MB.  Am I
likely to get much gain out of splitting it up and use a
multiparallelsearcher?

Most of my search queries search on 5-10 fields.

Are there other things I should look at?

Thanks to all,
Anson
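
For reference, splitting the index and searching the pieces in parallel is
wired up like this (Lucene 1.4; the paths are placeholders, and whether it
helps depends heavily on having multiple disks or CPUs, so measure first):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

public class SplitSearch {
    public static ParallelMultiSearcher open(String[] indexDirs) throws Exception {
        Searchable[] shards = new Searchable[indexDirs.length];
        for (int i = 0; i < indexDirs.length; i++) {
            shards[i] = new IndexSearcher(indexDirs[i]);
        }
        // each sub-index is searched in its own thread
        return new ParallelMultiSearcher(shards);
    }
}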

