Interest in Lucene specialists..

2001-10-11 Thread Sunny Kapoor (SunKap)


  ----- Original Message -----
  From: William Wong 
  To: Lucene-user 
  Sent: Friday, October 05, 2001 5:12 PM
  Subject: RE: Lucene has moved to Jakarta


  How about adding filters for different file types such as
  -HTML (there is one in the demo already)
  -XML
  -PDF
  -MsWord/RTF
  -other common file formats
  Thanks.

  -william 

  -----Original Message-----
  From: Doug Cutting [mailto:[EMAIL PROTECTED]]
  Sent: Friday, October 05, 2001 11:42 AM
  To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
  Subject: RE: Lucene has moved to Jakarta


   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
   
   Congratulations on the move! 

  Thanks!

   As near as I can see, the two major changes for 1.2-rc1 are:
    - switch to org.apache.lucene package names.
    - Apache license instead of LGPL.

  Yes.  Thanks for pointing these out.  These are big incompatible changes
  that I forgot to mention.  

  Other changes since 1.01b include:
- ant-only build -- no more makefiles
- addition of lock files--now fully thread & process safe
- addition of German stemmer
- MultiSearcher now supports low-level search API
- added RangeQuery, for term-range searching
- Analyzers can choose tokenizer based on field name
- misc bug fixes.

  I need to work up detailed release notes for the final 1.2 release.

   Sometime when someone has a chance, I'd love to hear a bit about what
   plans there are for Lucene development. 

  Let's see, some short term tasks for the 1.2 release:
- get source code back into releases
- clean up example code
- write release notes

  Some mid-term tasks:
- add contributed Chinese analyzers
- add Hits.SetOrdering() support
- add some term highlighting support

  Longer term tasks:
- add JDBC-based Directory
- optimize simple conjunctive queries
- optionally store document vectors in index

  Have I missed your favorite?

  Doug



Re: Interest in Lucene specialists..

2001-10-11 Thread Sunny Kapoor (SunKap)

I'm interested in exploring in more depth whether Lucene is suitable for one of our 
projects and would like to leverage specialist expertise. Is there any forum where 
specialist Lucene expertise (evaluation and dev support) can be sought on commercial terms?

Not sure if this is the right forum to post this question; if not, my apologies in 
advance, with a request to be redirected to the appropriate forum.

Please advise.

Sun



File Handles issue

2001-10-11 Thread Scott Ganyo

We're having a heck of a time with too many file handles around here.  When
we create large indexes, we often get thousands of temporary files in a
given index!  Even worse, we just plain run out of file handles--even on
boxes where we've upped the limits as much as we think we can!  We've played
around with various settings for the mergeFactor and maxMergeDocs, but these
seem to have at best an indirect effect on the number of temporary files
created.

I'm not very familiar with the Lucene file system yet, so can someone
briefly explain how Lucene works on creating an index?  How does it
determine when to create a new temporary file in the index and when does it
decide to compress the index?  Also, is there any way we could limit the
number of file handles used by Lucene?

This is becoming a huge problem for us, so any insight would be appreciated.

Thanks,
Scott



RE: File Handles issue

2001-10-11 Thread Doug Cutting

 From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
 
 We're having a heck of a time with too many file handles 
 around here.  When
 we create large indexes, we often get thousands of temporary 
 files in a given index!

Thousands, eh?  That seems high.

The maximum number of segments should be f*log_f(N), where f is the
IndexWriter.mergeFactor and N is the number of documents.  The default merge
factor is ten.  There are seven files per segment, plus one per field.  If
we assume that you have three fields per document, then it's ten files per
segment.  So to get 1000 files in an index with three fields and a
mergeFactor of ten, you'd need 10 billion documents, which I doubt you have.
(Lucene can't handle more than 2 billion anyway...)
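
As a rough back-of-the-envelope sketch of that estimate in Java (the numbers
here are purely illustrative; plug in your own):

    // Rough worst-case file count, using the estimate above:
    //   segments <= mergeFactor * log_mergeFactor(numDocs)
    //   files per segment ~= 7 + number of fields
    public class FileCountEstimate {
        public static void main(String[] args) {
            int mergeFactor = 10;     // IndexWriter.mergeFactor (the default)
            long numDocs = 1000000L;  // documents in the index
            int numFields = 3;        // indexed fields per document

            double maxSegments =
                mergeFactor * (Math.log(numDocs) / Math.log(mergeFactor));
            long maxFiles = Math.round(maxSegments * (7 + numFields));
            System.out.println("worst case: about " + maxFiles + " files");
            // for these numbers: about 600 files, nowhere near thousands
        }
    }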

How many fields do you have?  (How many different .f files are there per
segment?)

Have you lowered IndexWriter.maxMergeDocs?  If, for example, you lowered this to
10,000, then with a million documents you'd have 100 segments, which would
give you 1000 files.  So, to minimize the number of files, keep maxMergeDocs
at Integer.MAX_VALUE, its default.

Another possibility is that you're running on Win32 and obsolete files are
being kept open by IndexReaders and cannot be deleted.  Could that be the
case?

 Even worse, we just plain run out of file 
 handles--even on
 boxes where we've upped the limits as much as we think we 
 can!

You should endeavour to keep just one IndexReader at a time for an index.
When it is out of date, don't close it, as this could break queries running
in other threads; just let it get garbage collected.  The finalizers will
close things and free the file handles.
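
A minimal sketch of that pattern, assuming the 1.2-era IndexSearcher(String)
constructor (the SharedSearcher class itself is illustrative, not part of Lucene):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    // Hand out one shared searcher per index; when the index changes, swap in
    // a new one WITHOUT closing the old one, so in-flight queries keep working.
    // The abandoned reader is closed by its finalizer after garbage collection.
    public class SharedSearcher {
        private final String indexPath;
        private volatile IndexSearcher current;

        public SharedSearcher(String indexPath) throws IOException {
            this.indexPath = indexPath;
            this.current = new IndexSearcher(indexPath);
        }

        public IndexSearcher get() {
            return current;            // all threads search via the same reader
        }

        public void reopen() throws IOException {
            current = new IndexSearcher(indexPath);  // don't close the old one
        }
    }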

 I'm not very familiar with the Lucene file system yet, so can someone
 briefly explain how Lucene works on creating an index?  How does it
 determine when to create a new temporary file in the index 
 and when does it
 decide to compress the index?

Assume mergeFactor is ten, the default.  A new segment is created on disk
for every ten documents added, or sooner if IndexWriter.close() is called
before ten have been added.  When the tenth segment of size ten is added,
all ten are merged into a single segment of size 100.  When ten such
segments of size 100 have been added, these are merged into a single segment
containing 1000 documents, and so on.  So at any time there can be no more
than nine segments in each power-of-ten index size.  When optimize() is
called all segments are merged into a single segment.

The exception is that no segments will be created larger than
IndexWriter.maxMergeDocs.  So if this were set to 1000, then when you add
the 10,000th document, instead of merging things into a single segment of
10,000, it would add a tenth segment of size 1000, and keep adding segments
of size 1000 for every 1000 documents added.
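
To make those knobs concrete, a small sketch using the public mergeFactor and
maxMergeDocs fields on IndexWriter (the path, analyzer, and field name below
are placeholders, not from this thread):

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class MergeTuning {
        public static void main(String[] args) throws Exception {
            // true = create a new index at this path
            IndexWriter writer =
                new IndexWriter("/tmp/index", new SimpleAnalyzer(), true);

            writer.mergeFactor = 10;                  // merge segments in batches of ten (the default)
            writer.maxMergeDocs = Integer.MAX_VALUE;  // keep the default to minimize segment count

            Document doc = new Document();
            doc.add(Field.Text("contents", "some example text"));
            writer.addDocument(doc);

            writer.optimize();  // merge everything down to a single segment
            writer.close();
        }
    }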

 Also, is there any way we 
 could limit the
 number of file handles used by Lucene?

An IndexReader keeps all files in all segments open while it is open.  So to
minimize the number of file handles you should minimize the number of
segments, minimize the number of fields, and minimize the number of
IndexReaders open at once.

An IndexWriter also has all files in all segments open at once.  So updating
in a separate process would also buy you more file handles.

Doug



Index Optimization: Which is Better?

2001-10-11 Thread W. Eliot Kimber

We are experimenting with XML-aware indexing. The approach we're trying
is to index every element in a given XML document as a separate Lucene
document along with another Lucene document that captures just the
concatenated text content of the document (to handle searching for
phrases across element boundaries), what we're calling the all-content
Lucene document.

We are using a node type field to distinguish the different types of
XML document constructs we are indexing (elements, comments, PIs, etc.)
and also thought we would use node type to distinguish the all-content
document. When we get a hit list, we can then use the node type to
figure out which XML constructs contained the target text and reduce the
per-element Lucene documents to single XML documents for the final query
result. We can also use node type to limit the query (you might want to
search just in PIs or just in comments, for example).
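
A minimal sketch of that scheme (field names mirror the query examples later in
this message; the default content field is assumed to be "contents", and how the
text is pulled out of the XML is omitted):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class XmlNodeIndexer {
        // one Lucene document per XML element (or comment, PI, ...)
        static void indexNode(IndexWriter writer, String nodeType, String text)
                throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("nodtype", nodeType));  // e.g. element name, "comment", "pi"
            doc.add(Field.Text("contents", text));        // text of just this node
            writer.addDocument(doc);
        }

        // plus one "all-content" document per XML file, for cross-element phrases
        static void indexAllContent(IndexWriter writer, String concatenatedText)
                throws IOException {
            Document doc = new Document();
            doc.add(Field.Keyword("nodtype", "ALL_CONTENT"));
            doc.add(Field.Text("contents", concatenatedText));
            writer.addDocument(doc);
        }
    }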

Our question is this: given that for the all-content document we could
either use the default content field for the text and the node type
field to label the document as the all-content node or simply use a
different field name for the content (e.g., alltext or something),
which of the following queries would tend to perform better? This:

"some text" AND nodtype:ALL_CONTENT

or:
  
alltext:"some text"

Or is there any practical difference?
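
Built programmatically, the two formulations look roughly like this (only an
illustration; field and term names follow the examples above):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryShapes {
        // Option 1: phrase on the default content field, ANDed with a node-type term.
        static Query phrasePlusNodeType() {
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("contents", "some"));
            phrase.add(new Term("contents", "text"));

            BooleanQuery q = new BooleanQuery();
            q.add(phrase, true, false);                                            // required
            q.add(new TermQuery(new Term("nodtype", "ALL_CONTENT")), true, false); // required
            return q;
        }

        // Option 2: phrase directly against a dedicated all-text field.
        static Query phraseOnAllText() {
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("alltext", "some"));
            phrase.add(new Term("alltext", "text"));
            return phrase;
        }
    }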

Which way we construct the Lucene document will affect how our front-end
and/or users have to construct queries. It would be slightly more
convenient for front-ends to get the all-content doc by default (using
the content field for the text), but we thought the AND query needed
to limit searches to just the text (thus ignoring element-specific
searching) might incur a performance penalty.

In a related question, is there anything we can or need to do to
optimize Lucene to handle lots of little Lucene documents? 

Thanks,

Eliot

-- 
. . . . . . . . . . . . . . . . . . . . . . . .

W. Eliot Kimber | Lead Brain

1016 La Posada Dr. | Suite 240 | Austin TX  78752
T 512.656.4139 |  F 512.419.1860 | [EMAIL PROTECTED]

w w w . d a t a c h a n n e l . c o m



Re: Index Optimization: Which is Better?

2001-10-11 Thread Steven J. Owens

Doug wrote:

> I'm having trouble getting a clear picture of your indexing scheme.

 I've been doing a lot of thinking about this same problem, so I
may be a little more in tune with what Eliot's saying.  By the way,
Eliot, I'm very interested in your results.  I considered the basic
approach you're using, but I thought it was a bit extreme in terms of
having zillions of tiny Lucene Documents.  I'm working on a quick
kludge that may serve my immediate purposes (if it does, I'm planning
to post the details here).
 
> Could you provide some simple examples, e.g., for the xml:
>   <tag1>this is some text
>     <tag2>and some other text</tag2>
>   </tag1>
> would you have something like the following?
>   doc1
>     node_type: tag1
>     contents: this is some text
>   doc2
>     node_type: tag2
>     contents: and some other text
>   doc3
>     node_type: all_contents
>     contents: this is some text and some other text

 I think that's exactly what Eliot is intending.
 

> My first instinct would be to have something like:
>   doc1
>     tag1: this is some text
>     tag2: and some other text
>     all-tags: this is some text and some other text
> What do you need that that does not achieve?

 Name collision - you can have multiple elements at different
levels, and you may have attributes and tags sharing the same name.
Obviously one way around this is "don't do that," but that could get
really tiresome, quickly.

 If you just conflate the elements and attributes under the same
name (i.e. field blah contains a concatenated set of values from all
occurrences of both elements and attributes) then your searches become
much more limited in what you can specify.  This is, by the way, the
approach I'm trying out, with a second stage to refine the results and
drop out false positives.  But I'll have to wait on saying any more
about that.
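
One possible reading of that "conflate" approach, purely to illustrate the
trade-off (not the actual code being worked on):

    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class ConflatedDocument {
        // values maps an element/attribute name to the concatenated text of
        // all its occurrences in one XML file; name collisions simply merge.
        static Document build(Map values) {
            Document doc = new Document();
            for (Iterator it = values.keySet().iterator(); it.hasNext(); ) {
                String name = (String) it.next();
                doc.add(Field.Text(name, (String) values.get(name)));
            }
            return doc;
        }
    }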

 All of this, of course, is in the context of having arbitrary XML
documents.  If you have predefined XML schemas then you can hand-code
the mappings from elements to lucene document fields.  But then you
trade a heck of a lot of flexibility for a lot of maintenance.

Steven J. Owens
[EMAIL PROTECTED]