Re: Getting an exact field match

2003-07-03 Thread Ian Lea
Add Location as a Keyword field and delete with something like this:

 Term t = new Term(LOCATION, location);
 indexReader.delete(t);
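
For reference, a fuller sketch of that suggestion (field name and variables
are illustrative, not Reece's exact code):

 // index time: Keyword stores the value as a single untokenized term
 doc.add(Field.Keyword("location", "/Foo/Bar.html"));

 // update time: delete the old document by exact term, then re-add it
 IndexReader indexReader = IndexReader.open(indexDir);
 indexReader.delete(new Term("location", "/Foo/Bar.html"));
 indexReader.close();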



--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Wilton, Reece) wrote 

 Hi,
 
 I am indexing XML files.  The XML files have a Location element.  For
 example, the Location is /Foo/Bar.html in one of the files.
 
 When I update the index, I want to remove the existing document.  I
 search for the Location and delete the existing document like this:
 
 Query query = QueryParser.parse(location, LOCATION, new
 StandardAnalyzer());
 Hits hits = searcher.search(query);
 for (int i = 0; i < hits.length(); i++) {
   indexReader.delete(hits.id(i));
 }
 
 But I never get anything returned from the searcher.  I'm passing in the
 exact value that is in the field.  How do I get an exact match of the
 field?  Should I be adding Location as Text or Keyword?  I've tried both
 but can't get it to return what I want.
 
 Is the problem because I have slashes (/) in the field?  Does the
 StandardAnalyzer filter those out or something?
 
 Any help is appreciated!
 Reece
 

RE: can't delete from an index using IndexReader.delete()

2003-06-24 Thread Ian Lea
You should use Field.Keyword rather than Field.Text for the identifier
because you do not want it tokenized.

  doc.add(Field.Keyword("id", whatever));

In 2 places in your example code.
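
A minimal sketch of why it matters (assuming StandardAnalyzer): Field.Text
tokenizes and lowercases, so "B724547" is indexed as the term b724547 and a
delete by the literal term matches nothing.  With Keyword the literal term
survives:

 doc.add(Field.Keyword("id", "B724547"));
 // ...
 int n = reader.delete(new Term("id", "B724547"));  // n == 1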



--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Robert Koberg) wrote 

 Here is a simple class that can reproduce the problem (happens with the last
 stable release too). Let me know if you would prefer this as an attachment.
 
 Call like this:
 java TestReaderDelete existing_id new_label
 - or -
 
 Try:
 java TestReaderDelete B724547 ppp
 
 and then try:
 java TestReaderDelete a266122794 ppp
 
 If an index has not been created it will create one. Keep running one of
 the above example commands (with and without deleting the index directory)
 and watch what happens to the System.out.println's.
 
 
 
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.DateField;
 
 import org.xml.sax.*;
 import org.xml.sax.helpers.*;
 import org.xml.sax.Attributes;
 import javax.xml.parsers.*;
 
 import java.io.*;
 import java.util.*;
 
 
 class TestReaderDelete {
 
   
 
   public static void main(String[] args) 
 throws IOException
   {
 File index = new File("./testindex");
 if (!index.exists()) {
   HashMap test_map = new HashMap();
    test_map.put("preamble_content", "Preamble content bbb");
    test_map.put("art_01_section_01", "Article 1, Section 1");
    test_map.put("toc_tester", "Test TOC XML bbb");
    test_map.put("B724547", "bio example");
    test_map.put("a266122794", "tester");
   indexFiles(index, test_map);
 } 
 String identifier = args[0];
 String new_label = args[1];
 testDeleteAndAdd(index, identifier, new_label);
   }
   
 
   public static void indexFiles(File index, HashMap test_map) 
   {
 try {
   IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(),
 true);
   for (Iterator i=test_map.entrySet().iterator(); i.hasNext(); ) {
 Map.Entry e = (Map.Entry) i.next();
  System.out.println("Adding: " + e.getKey() + " = " + e.getValue());
  Document doc = new Document();
  doc.add(Field.Text("id", (String)e.getKey()));
  doc.add(Field.Text("label", (String)e.getValue()));
 writer.addDocument(doc);
   }
   writer.optimize();
   writer.close();
 } catch (Exception e) {
    System.out.println("caught a " + e.getClass() +
   "\n with message: " + e.getMessage());
 }
   }
   
   
   public static void testDeleteAndAdd(File index, String identifier, String
 new_label) 
 throws IOException
   {
 IndexReader reader = IndexReader.open(index);
  System.out.println("!!! reader.numDocs() : " + reader.numDocs());
  System.out.println("reader.indexExists(): " + reader.indexExists(index));
  
  System.out.println("term field: " + new Term("id", identifier).field());
  System.out.println("term text: " + new Term("id", identifier).text());
  System.out.println("reader.docFreq: " + reader.docFreq(new Term("id",
  identifier)));
  System.out.println("deleting target now...");
  int deleted_num = reader.delete(new Term("id", identifier));
  System.out.println("*** deleted_num: " + deleted_num);
 reader.close();
 try {
   IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(),
 false);
   String ident = identifier;
   Document doc = new Document();
    doc.add(Field.Text("id", identifier));
    doc.add(Field.Text("label", new_label));
   writer.addDocument(doc);
   writer.optimize();
   writer.close();
 } catch (Exception e) {
    System.out.println("caught a " + e.getClass() +
   "\n with message: " + e.getMessage());
 }
 
  System.out.println("!!! reader.numDocs() after deleting and adding : " +
  reader.numDocs());
   } 
   
 }
 
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
  Sent: Sunday, June 22, 2003 9:42 PM
  To: Lucene Users List
  
  The code looks fine.  Unfortunately, the provided code is not a full,
  self-sufficient class that I can run on my machine to verify the
  behaviour that you are describing.
  
  Otis
 
 
 

Re: about increment update

2003-04-03 Thread Ian Lea
Try this:

1.  Open reader.
2.  removeModifiedFiles(reader)
3.  reader.close()
4.  Open writer.
5.  updateIndexDocs()
6.  writer.close();

i.e. don't have both reader and writer open at the same time.
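
A minimal sketch of that sequence (reusing the method names from your code;
an outline only, not tested):

 IndexReader reader = IndexReader.open("index");
 removeModifiedFiles(reader);
 reader.close();                    // releases the write lock

 IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
 updateIndexDocs(writer, new File(base));
 writer.optimize();
 writer.close();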

btw I suspect you might be removing index entries only for files that
have been modified, but adding all files. Another "index keeps
growing" problem!  Could be wrong.


--
Ian.

 [EMAIL PROTECTED] (kerr) wrote 

 Thank you Otis,
 Yes, reader should be closed. But it isn't the reason for this exception:
 the errors happen before the file is deleted.
Kerr.
 close()
 Closes files associated with this index. Also saves any new deletions to disk. No 
 other methods should be called after this has been called.
 
 - Original Message - 
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Thursday, April 03, 2003 12:14 PM
 Subject: Re: about increment update
 
 
  Maybe this is missing?
  http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#close()
  
  Otis
  
  --- kerr [EMAIL PROTECTED] wrote:
   Hello everyone,
 Here I try to update the index incrementally, following the idea of
 deleting each modified file first and re-adding it. Here is the source.
   But when I execute it, the index directory create a file(write.lock)
   when execute the line
   reader.delete(i);, 
   and caught a class java.io.IOException   with message: Index locked
   for write.
   After that, when I execute the line
   IndexWriter writer = new IndexWriter(index, new
   StandardAnalyzer(), false);
   caught a class java.io.IOException   with message: Index locked for
   write
   if I delete the file(write.lock), the error will re-happen.
   anyone can help and thanks.
  Kerr.
   
   
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.Term;
   
   import java.io.File;
   import java.util.Date;
   
   
   public class UpdateIndexFiles {
 public static void main(String[] args) {
   try {
 Date start = new Date();
   
  Directory directory = FSDirectory.getDirectory("index", false);
  IndexReader reader = IndexReader.open(directory);
  System.out.println(reader.isLocked(directory));
  //reader.unlock(directory);
  IndexWriter writer = new IndexWriter("index", new
    StandardAnalyzer(), false);
    
  String base = "";
  if (args.length == 0){
    base = "D:\\Tomcat\\webapps\\ROOT\\test";
 } else {
   base = args[0];
 }
 removeModifiedFiles(reader);
 updateIndexDocs(reader, writer, new File(base));
   
 writer.optimize();
 writer.close();
   
 Date end = new Date();
   
 System.out.print(end.getTime() - start.getTime());
 System.out.println( total milliseconds);
   
   } catch (Exception e) {
 System.out.println( caught a  + e.getClass() +
  \n with message:  + e.getMessage());
 e.printStackTrace();
   }
 }
   
 public static void removeModifiedFiles(IndexReader reader) throws
   Exception {
   Document adoc;
   String path;
   File aFile;
    for (int i=0; i<reader.numDocs(); i++){
  adoc = reader.document(i);
  path = adoc.get("path");
  aFile = new File(path);
  if (reader.lastModified(path) < aFile.lastModified()){
   System.out.println(reader.isLocked(path));
   reader.delete(i);
 }
   }
 }
   
 public static void updateIndexDocs(IndexReader reader, IndexWriter
   writer, File file)
  throws Exception {
   
   if (file.isDirectory()) {
 String[] files = file.list();
  for (int i = 0; i < files.length; i++)
  updateIndexDocs(reader, writer, new File(file, files[i]));
    } else {
  if (!reader.indexExists(file)){
    System.out.println("adding " + file);
   writer.addDocument(FileDocument.Document(file));
 } else {}
   }
 }
   }


RE: Indexing Growth

2003-04-02 Thread Ian Lea
What does the index directory look like before and after running
queries?  Are files growing or being added?  Which files? How many
documents are there in the index before and after? Are you absolutely
100% positive there is no way that your application is adding entries
to the index?  That still has to be the most likely explanation, I think.
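
One quick hedged check (path illustrative): print the document count before
and after a pure-query session; if it changes, something is writing.

 IndexReader r = IndexReader.open("C:\\index");
 System.out.println("numDocs: " + r.numDocs());
 r.close();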



--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Rob Outar) wrote 

 Hi all,
 
   This is too odd and I do not even know where to start.  We built a Windows
 Explorer type tool that indexes all files in a sandboxed file system.
 Each Lucene document contains stuff like path, parent directory, last
 modified date, file_lock etc..  When we display the files in a given
 directory through the tool we query the index about 5 times for each file in
 the repository, this is done so we can display all attributes in the index
 about that file.  So for example if there are 5 files in the directory, each
 file has 6 attributes, that means about 30 term queries are executed.  The
 initial index when built is about 10.4 megs; after accessing about 3 or 4
 directories the index size increased to over 100 megs, and we did not add
 anything!!  All we are doing is querying!!  Yesterday, after querying became
 ungodly slow, we looked at the index size: it had grown from 10 megs to 1.5GB
 (granted we tested the tool all morning).  But I have no idea why the index
 is growing like this.  ANY help would be greatly appreciated.
 
 
 Thanks,
 
 Rob
 
 
 -Original Message-
 From: Rob Outar [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, April 01, 2003 3:32 PM
 To: Lucene Users List; [EMAIL PROTECTED]
 Subject: RE: Indexing Growth
 
 
 I reuse the same searcher, analyzer and Query object I don't think that
 should cause the problem.
 
 Thanks,
 
 Rob
 
 
 -Original Message-
 From: Alex Murzaku [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, April 01, 2003 3:22 PM
 To: 'Lucene Users List'
 Subject: RE: Indexing Growth
 
 
 I don't know if I remember this correctly: I think for every query
 (term) is created a file but the file should disappear after the query
 is completed.
 
 -Original Message-
 From: Rob Outar [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, April 01, 2003 3:13 PM
 To: Lucene Users List
 Subject: RE: Indexing Growth
 
 
 Dang I must be doing something crazy cause all my client app does is
 search and the index size increases.  I do not add anything.
 
 Thanks,
 
 Rob
 
 
 -Original Message-
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, April 01, 2003 3:07 PM
 To: Lucene Users List
 Subject: Re: Indexing Growth
 
 
 Only when you add new documents to it.
 
 Otis
 
 --- Rob Outar [EMAIL PROTECTED] wrote:
  Hi all,
 
  Will the index grow based on queries alone?  I build my index,
 then
  run several queries against it and afterwards I check the size of the
  index and
  in some cases it has grown quite a bit although I did not add
  anything???
 
  Anyhow please let me know the cases when the index will grow.
 
  Thanks,
 
  Rob
 
 

RE: Indexing Growth

2003-04-02 Thread Ian Lea
They look like the type of file name that would be created
when documents were added to the index.  So I still think
something is adding stuff to your index.  Could it be an
external process as someone suggested?  Does the index
grow even if you don't search?  In the code you posted,
what does checkForIndexChange() do?  Yes, I can guess what
it is supposed to do, but is it perhaps doing something else
as well or instead, directly or indirectly?


--
Ian.

 [EMAIL PROTECTED] (Rob Outar) wrote 

 After building the index for the first time:
 
 _l1d.f1  _l1d.f3  _l1d.f5  _l1d.f7  _l1d.f9   _l1d.fdx  _l1d.frq  _l1d.tii
 deletable
 _l1d.f2  _l1d.f4  _l1d.f6  _l1d.f8  _l1d.fdt  _l1d.fnm  _l1d.prx  _l1d.tis
 segments
 
 After running the first query to get all attributes from all files in the given
 directory, there were 17 files, each file has 5 attributes, so 85 queries
 were run:
 
 _l1j.f1   _l1p.f9   _l21.f3   _l27.fdx  _l2j.f5   _l2p.prx  _l31.f7
 _l3j.f1   _l3p.f9   _l41.f3   _l44.fdx
 _l1j.f2   _l1p.fdt  _l21.f4   _l27.frq  _l2j.f6   _l2p.tis  _l31.f8
 _l3j.f2   _l3p.fdt  _l41.f4   _l44.frq
 ...


Re: indexing large documents problem

2002-12-11 Thread Ian Lea
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#maxFieldLength
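
The default is 10000 terms per field; anything beyond that is silently
dropped.  In this era's API maxFieldLength is a public field on IndexWriter
(per the javadoc anchor above), so a hedged sketch:

 IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
 writer.maxFieldLength = 1000000;  // index up to 1M terms per field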


--
Ian.
[EMAIL PROTECTED]

 [EMAIL PROTECTED] (Andrey Grishin) wrote 

 Hello, All!
 When I index a large document (11 symbols)
 and then try to search using words that are at the very end of that document, I can't 
find anything... :((
 
 Is this a feature of Lucene or I am doing something wrong?
 Any help will be appreciated.
 
 Regards, Andrey Grishin


Re: searching on for null/blank field val

2002-11-14 Thread Ian Lea
I think you will have to go the pseudo null/blank placeholder route.
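
A minimal sketch of the placeholder idea (the "_null_" token is illustrative):

 String v = (fieldX == null || fieldX.length() == 0) ? "_null_" : fieldX;
 doc.add(Field.Keyword("fieldX", v));

 // later, to find documents where fieldX was blank:
 Hits hits = searcher.search(new TermQuery(new Term("fieldX", "_null_")));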


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (aaz) wrote 

 Hi,
 We have a document with 2 Fields.
 
 a) title = X
 b) fieldX = ""
 
 How can I do a search to only get documents where fieldX = ""? When I construct a 
TermQuery against fieldX with "" as the value I get no results. What is the best way 
to do searches for such values, or should we create some pseudo null/blank placeholder 
to store in such fields with blank values?
 
 thanks
 
 


Re: Can any one help me?

2002-11-14 Thread Ian Lea
Uma


If you know servlets and JSP you should be able to figure
out how to integrate lucene.  Presumably you have already read
the Getting Started guide?

Suggestions:

1.  Create a lucene index of whatever it is you want to
 search across.  As a standalone program.  No servlets
 or JSP or database required.  See demo programs and
 instructions.

2.  Create a standalone program to search that index.

3.  Take whichever bits of that functionality that you want
 to be accessible as a servlet, and call or cut/paste/refactor/
 whatever to get what you want.
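
A bare-bones sketch of steps 1 and 2 (paths, field names and text are
illustrative):

 IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
 Document doc = new Document();
 doc.add(Field.Text("contents", "some text to search"));
 writer.addDocument(doc);
 writer.close();

 Searcher searcher = new IndexSearcher("index");
 Hits hits = searcher.search(
 QueryParser.parse("text", "contents", new StandardAnalyzer()));
 System.out.println(hits.length() + " hit(s)");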


Good luck.


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Uma Maheswar) wrote 

 Otis,
 Yes, I know Servlets and JSP. I am the only developer working on
 http://www.javagalaxy.com. All the contents in the site are developed by me.
 But I am not sure of working with Lucene. Can you help me?
 
 Uma



Re: searching only the text of a XML-File???

2002-10-15 Thread Ian Lea

Certainly.  Just extract the content from the XML and index
it along with the file name.  Scan the archives of this
list or search Google for something like Lucene XML if
you want more.
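
One hedged sketch of the extraction step, using SAX to collect character
data (class and field names are illustrative):

 class TextGrabber extends org.xml.sax.helpers.DefaultHandler {
 StringBuffer buf = new StringBuffer();
 public void characters(char[] ch, int start, int length) {
 buf.append(ch, start, length);  // keep content, skip tags
 }
 }

 TextGrabber grabber = new TextGrabber();
 javax.xml.parsers.SAXParserFactory.newInstance()
 .newSAXParser().parse(xmlFile, grabber);
 Document doc = new Document();
 doc.add(Field.Keyword("filename", xmlFile.getName())); // returned in hits
 doc.add(Field.Text("contents", grabber.buf.toString()));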



--
Ian.
[EMAIL PROTECTED]

 [EMAIL PROTECTED] (Richly, Gerhard) wrote 

 Hello,
 
 is it possible to index and search only the content, not the tags, of an
 XML file?
 The result should be the name of the XML file. 
 
 Is that possible with Lucene???
 
 Has anyone experience with indexing and searching XML files???
 
 Thank you
 
 Gerhard



Re: Full List of Stop Words for Standard Analyzer.

2002-08-02 Thread Ian Lea

In org/apache/lucene/analysis/standard/StandardAnalyzer.java.
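
If it helps, the list is also exposed as a public constant (assuming the
STOP_WORDS array of that era's StandardAnalyzer):

 for (int i = 0; i < StandardAnalyzer.STOP_WORDS.length; i++)
 System.out.println(StandardAnalyzer.STOP_WORDS[i]);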


--
Ian.
[EMAIL PROTECTED]

 [EMAIL PROTECTED] (Suneetha Rao) wrote 

 Hi,
 I would like to include in my documentation all the stop words.
 Can somebody tell me where to find the list for the Standard Analyzer ?
 
 Thanks in Advance,
 Suneetha



Re: Deleting Problem

2002-08-01 Thread Ian Lea

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html
says, for delete:

Deletes the document numbered docNum. Once a document is deleted it will not appear 
in TermDocs or TermPostitions enumerations. Attempts to read its field with the 
document(int) method will result in an error. The presence of this document may still 
be reflected in the docFreq(org.apache.lucene.index.Term) statistic, though this will 
be corrected eventually as the index is further modified.

This is from the delete(int) method rather than delete(Term) but I would
expect that it still holds true.

If you want the deleted documents to really disappear for good, now, optimize
the index.
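
A hedged sketch, reusing the variable names from the snippets below:

 IndexReader reader = IndexReader.open(index_dir);
 reader.delete(new Term("pub_date", pub_date));
 reader.close();

 IndexWriter writer = new IndexWriter(index_dir, new StandardAnalyzer(), false);
 writer.optimize();   // merges segments; deleted docs really go away
 writer.close();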


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Terry Steichen) wrote 

 I'm having difficulty deleting documents from my index.
 
 Here's code snippet 1:
 
 IndexReader reader = IndexReader.open(index_dir);
 Term dterm = new Term("pub_date", pub_date);
 int docs = reader.docFreq(dterm);
 reader.close();
 System.out.println("Found " + docs + " docs matching term pub_date = " + pub_date);
 
 It reports back that I have 48 matching documents.  Then I run code snippet 2:
 
 IndexReader reader = IndexReader.open(index_dir);
 Term dterm = new Term("pub_date", pub_date);
 int docs = reader.delete(dterm);
 reader.close();
 System.out.println("Deleted " + docs + " docs matching term pub_date = " + pub_date);
 
 It reports back that I deleted 48 documents.  
 
 But when I run snippet 1 once again, it reports 48 matching documents still exist. 
 
 If I run snippet 2 again, it reports that it (this time) deleted 0 docs.
 
 Obviously I'm overlooking something (probably obvious and simple), but I can't seem 
to delete the selected documents.  Ideas/help would be welcome.
 
 Regards,
 
 Terry



RE: contains

2002-07-16 Thread Ian Lea

I also think it might work.  Just for fun I tried a variation of it on
a copy of the unix file /usr/dict/words.

Indexing program reads each word and splits it up into substrings and
stores the original and the substrings e.g. "beautiful" stored as
Field.UnIndexed, substrings "beautiful eautiful autiful utiful tiful
iful ful ul l" stored as Field.UnStored. One Document per word.

Search program uses StandardAnalyzer and QueryParser.  Pass it "ful*"
and it returns 259 hits, including beautiful and beautifully.  Pass it
"eaut" and get 13 hits including beauteous and beauty.  "ifu*" gives 20
hits including beautiful, bifurcate and centrifuge.

There are 45,407 words in the file and the index, unoptimized, takes
up 3.6Mb of disk space.  Indexing the 45,407 words by themselves, one
Document per word with the word as Field.Text takes up 1.7Mb disk
space, unoptimized.

Didn't add inverse substrings since don't see why they are needed.
My sample program seems to work without them.  I expect I've missed
something or perhaps it works because the sample data is so simple.
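
A hedged sketch of the suffix generation at index time (field names are
illustrative):

 String word = "beautiful";
 StringBuffer subs = new StringBuffer();
 for (int i = 0; i < word.length(); i++)
 subs.append(word.substring(i)).append(' ');  // beautiful eautiful ... l
 Document doc = new Document();
 doc.add(Field.UnIndexed("word", word));            // stored for display
 doc.add(Field.UnStored("subs", subs.toString()));  // indexed for matching
 writer.addDocument(doc);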



--
Ian.
[EMAIL PROTECTED]

 [EMAIL PROTECTED] (Lothar Simon) wrote 

 Just to correct a few points:
 - The factor would be 2 * (average no of chars per word)/2 = (average no of
 chars per word).
 - One would probably create a set of 2 * (maximum number of chars per word)
 as Fields for a document. If this could work was actually my question...
 - Most important: my proposal is exactly (and almost only) designed to solve
 the substring ("*uti*") problem !!! One field in the first group of fields
 in my example contains "utiful" and would be found by "uti*", a field in the
 other group of fields contains "itueb" and would be found by "itu*". Voila!
 
 I still think my idea would work (given you spend the space for the index).
 
 Lothar
 
 
 -Original Message-
 From: Joshua O'Madadhain [mailto:[EMAIL PROTECTED]]
 Sent: Friday, July 12, 2002 6:45 PM
 To: Lucene Users List
 Subject: RE: contains
 
 
 On Fri, 12 Jul 2002, Lothar Simon wrote:
 
 [in response to Peter Carlson pointing out that searching for *xyz* is a
 difficult problem]
  Of course you are right. And I am surely more the last then the first
  one to try to come up with THE solution for this. But still... Could
  the following work?
 
  If space (ok, a lot) is available you could store "beutiful",
  "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS its
  inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
  "eb", "b") in the index. Space needed would be something like (average
  no of chars per word) as much as in a normal index.
 
 Actually it would be twice that, because you're storing backward and
 forward versions.  I'd hazard a guess that this factor alone would mean
 something like a 10- or 12-fold increase in index size (the average length
 of a word is less than 5 or 6 letters, but by throwing out stop words you
 throw out a lot of the words that drag the average down).
 
 Another problem with this is that in order to be able to get from "ful" to
 "beautiful", you have to store, in the index entry for "ful", (pointers
 to) every single complete word in your document set that contains "ful" as
 a substring.  Just _creating_ such an index would be extremely
 time-consuming even with clever data structures, and consider how much
 extra storage for pointers would be necessary for entries like "e" or "n".
 
 Finally, you're not including all substrings: your scheme doesn't allow me
 to search for "*uti*" and find "beautiful".  If you did, the number of
 entries would then be multiplied by a factor of the _square_ of the
 average number of characters per word.  (You might be able to avoid this
 by doing prefix and suffix searches--which are difficult but less so--on
 the strings you specify, though.)
 
 There might be some clever way to get around these problems, but I suspect
 that developing one would be a dissertation topic.  :)
 
 Regards,
 
 Joshua O'Madadhain
 
  [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
  It's that moment of dawning comprehension that I live for--Bill Watterson
 My opinions are too rational and insightful to be those of any organization.



Re: Problem in unicode field value retrival

2002-06-10 Thread Ian Lea

I don't think you can retrieve the contents of Fields that have
been loaded by a Reader.  From the javadoc for Field:

Text(String name, Reader value)

   Constructs a Reader-valued Field that is tokenized and indexed, but is
   not stored in the index verbatim.
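
A hedged workaround sketch: read the file into a String yourself (reusing
the isr reader from the quoted code below) and use the String-valued
Field.Text, which is stored and therefore retrievable:

 StringBuffer sb = new StringBuffer();
 char[] buf = new char[4096];
 for (int n; (n = isr.read(buf)) != -1; )
 sb.append(buf, 0, n);
 doc.add(Field.Text("abc", sb.toString()));  // stored, so doc.get("abc") works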


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Harpreet S Walia) wrote 

 Hi
 
 I am trying to index and search unicode (utf - 8) . the code i am using to index the 
documents is as follows :
 
 
/**/
 IndexWriter iw = new IndexWriter("d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\index", 
new SimpleAnalyzer(), true); 
 String dirBase = "d:\\jakarta-tomcat3.2.3\\webapps\\lucene\\docs";
 File docDir = new File(dirBase);
 String[] docFiles  = docDir.list();
 InputStreamReader isr;
 InputStream is;
 Document doc;
 for(int i=0; i<docFiles.length; i++)
{ 
   File tempFile = new File(dirBase + "\\" + docFiles[i]);
   if(tempFile.isFile()==true)
 {
 System.out.println("Indexing File : " + docFiles[i]);
 is = new FileInputStream(tempFile);
 isr = new InputStreamReader(is, "utf-8");
doc = new Document();
doc.add(Field.UnIndexed("path", tempFile.toString()));
doc.add(Field.Text("abc", (Reader)isr));
doc.add(Field.Text("all", "sansui"));
iw.addDocument(doc);
is.close();
isr.close();
   doc=null;
   }
 }
  iw.close();
  is=null;
  isr=null;
  iw=null;
  docDir=null;
  
  System.out.println("Indexing Complete");
 
 
/**/
 
 Now when i try to search the contents and get the field called "abc" by using the 
method doc.get("abc"), i get null as the output.
 
 Can anyone please tell me where i am going wrong .
 
 Thanks And Regards
 Harpreet
 


Re: segment count

2002-05-31 Thread Ian Lea

 In order to make a search, the mergeSegments() function must be called 
 right? Otherwise IndexSearcher won't have the most updated index files to 
 work with to do a search. I guess my point is that do I have to 
 intermittently call Optimize or Close (to call mergeSegments()) or make 
 maybeMergeSegments to find a merge to do before using IndexSearcher? Btw, I 
 am running IndexFiles and SearchFiles at the same time.

I don't know if you have to call close() to make all modifications visible
or not.  Sounds likely.  You do not have to call optimize.  Having
one writer and one or more readers concurrently is fine.  You can
(should) call IndexReader.lastModified() to find out if the index
has been modified since the IndexReader was opened.
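
A hedged sketch of the staleness check (directory path illustrative):

 long opened = IndexReader.lastModified("index");
 IndexReader reader = IndexReader.open("index");
 // ... searches ...
 if (IndexReader.lastModified("index") > opened) {
 reader.close();                  // index changed; reopen
 reader = IndexReader.open("index");
 }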
 
 Also, when IndexWriter.addDocument is called per file, the function calls 
 newSegmentName() to create its corresponding segement name. That segment 
 name is used to create a SegmentInfo, which gets added to the SegmentInfos 
 vector. Am I missing something?

No idea.  I'm just a lucene user and have never needed to know about
that sort of stuff.



--
Ian.
[EMAIL PROTECTED]



Re: MS Word Search ??

2002-05-30 Thread Ian Lea

 ...
 openoffice - www.openoffice.org knows how to parse all of the microsoft
 ...
  #2 - if open office is programmatically drivable (which I don't know 
  if it
  is), fire up a copy of open office and use it to convert the files as
  necessary.

See an earlier post to this list:
http://marc.theaimsgroup.com/?l=lucene-userm=101920039808700w=2

It is often worth searching or browsing the archive!


--
Ian.
[EMAIL PROTECTED]



Re: segment count

2002-05-30 Thread Ian Lea

Lucene doesn't store one document per segment.  See
http://marc.theaimsgroup.com/?l=lucene-userm=102079295608850w=2
for detail on the files created.

On "is this the right way ...?", here is an extract from the javadoc

   1. Create Document's by adding Field's.
   2. Create an IndexWriter and add documents to it with addDocument();
   3. Call QueryParser.parse() to build a query from a string; and
   4. Create an IndexSearcher and pass the query to its search() method.


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Hyong Ko) wrote 

 I added a segment using IndexWriter.addDocument. Then I called 
 IndexWriter.optimize (IndexWriter.close works too) to generate index files 
 to do a search. Then I added another segment using IndexWriter.addDocument. 
 The total segment count should be 2, but instead it's 3. Any ideas? Is this 
 the right way to index and search concurrently? Thanks.



Re: Wilcard Search Issues

2002-05-28 Thread Ian Lea

How about if you search for "resloc:ccsa*", i.e. all lower case? 

If using QueryParser.parse() with a standard analyzer the search term
does not get converted to lower case if it contains a trailing
wildcard.

Running code like this

   Analyzer analyzer = new StandardAnalyzer(); 
   Query query = QueryParser.parse(s, KEYFIELD, analyzer);
   System.out.println(s + " (" + query.toString(KEYFIELD) + ")");

with s set to various values gives something like this

   CCsa  (ccsa) // OK
   CCsa* (CCsa*)// Suspect
   ccsa* (ccsa*)// OK


Tested against rc5 and latest from CVS.



--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Nader S. Henein) wrote 

 I'm using the new Lucene 1.5 release and I remember a message
 in the lucene-user mailing list that talked about a wildcard issue that
 if you search something like this:
 
   <resloc>CCsa</resloc>
 
 using the following query string: resloc:CCsa*
 it will yield no results, and then there was a reply saying that the issue
 has
 been resolved in the nightly builds; this was about two weeks before rc1.5
 (which I'm using)
 and according to the rc1.5 mailer that went out, wildcard issues were hammered
 out. But I still
 have this problem: if I search using resloc:CCsa I get 5 results but when I
 add the star sign to
 the right-hand side of the query string like so resloc:CCsa* I get no
 results.
 
 Anyone care to shed some light on this issue ?
 
 Nader S. Henein
 Bayt.com , Dubai Internet City
 Tel. +9714 3911900
 Fax. +9714 3911915
 GSM. +9715 05659557
 www.bayt.com



Re: QueryParser.parse

2002-05-27 Thread Ian Lea

Why do you not want to tokenize AUTEUR and TITRE? Your example will work
if you do.  Using QueryParser.parse with a standard analyzer your
query will be tokenized and will match the tokens created when you
added the documents to the index.
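
A hedged one-line tweak of the code below (the last constructor flag is the
tokenize switch):

 fd = new Field("AUTEUR", "jean auteur", false, true, true);  // tokenized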


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Arpad KATONA) wrote 

 Hello,
 
 under http://jakarta.apache.org/lucene/docs/queryparsersyntax.html is to be
 read :
 
 +++
 As an example, let's assume a Lucene index contains two fields, title and text,
 and text is the default field. If you want to find the document entitled "The
 Right Way" which contains the text "don't go this way", you can enter: 
 
 title:"The Right Way" AND text:go  
 +++   
 
 So i tried, see the code below :
 1) first, i create a document with 3 fields: KEY, AUTEUR and TITRE; the fields
 are NOT to be tokenized.
 2) second, i read the index to see what it contains (see output below).
 3) third, i would like to find a record with "jean auteur" as AUTEUR or "opus
 primum" as TITRE; thus, the query string is:
 AUTEUR:"jean auteur" OR TITRE:"opus primum". But, hélas!, there is no answer.
 Could you tell me, please, where the error is?
 
 Thanks
 
 Arpad KATONA
 [EMAIL PROTECTED]
 
 --
 
 +++
 try {
   Analyzer analyzer = new StandardAnalyzer();
   String sIxPath = "D:\\tempo\\lucene\\ix";
   //
   // indexing
   //
   IndexWriter writer = new IndexWriter(sIxPath, analyzer, true);
   org.apache.lucene.document.Document doc =
 new org.apache.lucene.document.Document();
   Field fd = new Field("KEY", "1", true, true, false);
   doc.add(fd); fd = null;
   fd = new Field("AUTEUR", "jean auteur", false, true, false);
   doc.add(fd); fd = null;
   fd = new Field("TITRE", "opus primum", false, true, false);
   doc.add(fd); fd = null;
   writer.addDocument(doc); doc = null;
   writer.close(); writer = null;
   //
   // checking
   //
   IndexReader reader = IndexReader.open(sIxPath);
   TermEnum tenum = reader.terms();
   for(; tenum.next(); ) {
 Term term = tenum.term();
 String sField = term.field();
 String sText = term.text();
 System.out.println("sField=\"" + sField + "\" sText=\"" + sText + "\"");
   }
   tenum.close(); tenum = null;
   reader.close(); reader=null;
   //
   // searching
   //
   String sQuery = "TITRE:\"opus primum\" OR AUTEUR:\"jean auteur\"";
   Query query = QueryParser.parse(sQuery, "", analyzer);
   System.out.println("query = \"" + query.toString() + "\"");
   Searcher searcher = new IndexSearcher(sIxPath);
   Hits hits = searcher.search(query);
   if(null==hits) {
 System.out.println("No answer.");
   } else {
 int nbHits = hits.length();
 System.out.println("nbHits = " + Integer.toString(nbHits));
   }
   hits = null;
   searcher.close(); searcher=null;
   query = null;
 } catch (Exception ex) {
   ex.printStackTrace();
 }
 +++
 
 output:
 
 +++
 sField="AUTEUR" sText="jean auteur"
 
 sField="KEY" sText="1"
 
 sField="TITRE" sText="opus primum"
 
 query = "TITRE:"opus primum" AUTEUR:"jean auteur""
 
 nbHits = 0
 
 +++



Re: AW: WildcardQuery

2002-05-24 Thread Ian Lea

Left wildcards seem to work if you explicitly use a
WildcardQuery e.g. 

Term t = new Term("id", "*ucene");
Query query = new WildcardQuery(t);


but if use QueryParser with an analyzer e.g.

Analyzer analyzer = new StandardAnalyzer();
Query query = QueryParser.parse("*ucene", "id", analyzer);

get an exception:

org.apache.lucene.queryParser.ParseException: 
Lexical error at line 1, column 1.  Encountered: "*" (42), after : ""
  at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)


Tested on RC5.  Haven't tried other ways of building a query.  In my
simple tests terms with left and right wildcards like "*lucene*"
worked too, even if the whole word was included.



--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (Christian Schrader) wrote 

 It works with the nightly builds and probably with 1.2-RC5 :-)
 
 Christian
  -----Original Message-----
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
  Sent: 07 May 2002 17:31
  To: Lucene Users List
  Subject: Re: WildcardQuery
  
  
  Yes, me too.  I just tried it on some Lucene index (the search at
  blink.com) and it doesn't seem to work (try searching for "travel" and
  then "*vel").
  I'm assuming the original poster confused something...
  
  Otis
  
  --- Joel Bernstein [EMAIL PROTECTED] wrote:
   I thought Lucene didn't support left wildcards like the following:
   
   *ucene
   
   - Original Message -
   From: Christian Schrader [EMAIL PROTECTED]
   To: Lucene Users List [EMAIL PROTECTED]
   Sent: Monday, May 06, 2002 7:14 PM
   Subject: WildcardQuery
   
   
 I am pretty happy with the results of WildcardQueries like "*ucen*" that
 matches lucene, but "*lucene*" doesn't match lucene. Is there a
 reason for this? And what would be the patch?
It should be in WildcardTermEnum. I am wondering if somebody
   already
   patched
it?
   
Thanks, Chris



RE: AW: WildcardQuery

2002-05-24 Thread Ian Lea

 Hi... I am both a newbie to Lucene and to using this list, so please
 forgive me if I make some mistakes. I am trailing onto this post because I
 cannot seem to get the wildcard function to work at all, while all of the
 other features seem to work just fine.  I am using a very standard
 application (actually, it is just the demo version slightly modified) with
 the StandardAnalyzer and the QueryParser.  But the wildcard feature (using
 either "?" or "*") just doesn't work. I must be missing something very
 basic.  I would appreciate any ideas. Thanks!

Basic wildcard support (i.e. ignoring things like left wildcards)
comes pretty much out of the box.  Attached is a copy of the
program I was playing with before sending the earlier message.
It uses StandardAnalyzer and the static QueryParser.parse()
method so doesn't work with left wildcards.  I haven't tried
"?" rather than "*".


Hope this helps.



--
Ian.
[EMAIL PROTECTED]



RE: AW: WildcardQuery

2002-05-24 Thread Ian Lea

Sorry - I said I was going to send some code ...


--
Ian.

import org.apache.lucene.queryParser.*; 
import org.apache.lucene.search.*; 
import org.apache.lucene.index.*; 
import org.apache.lucene.analysis.*; 
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*; 
import org.apache.lucene.store.*;


public class LuceneTest { 

RAMDirectory ramdir;
Analyzer analyzer;
IndexWriter writer;
IndexReader reader;
Searcher searcher;

public LuceneTest() {
analyzer = new StandardAnalyzer(); 
ramdir = new RAMDirectory();
}


public static void main(String args[]) throws Exception { 
LuceneTest ld = new LuceneTest();
ld.load();
ld.search();
}


void load() throws Exception {
writer = new IndexWriter(ramdir, analyzer, true); 
add("january");
add("february");
add("june");
add("july");
writer.close();
}   



void add(String s) throws Exception {
Document d = new Document();
d.add(Field.Keyword("id", s));
System.out.println("Adding " + s);
writer.addDocument(d);
}



void search() throws Exception {
reader = IndexReader.open(ramdir);
searcher = new IndexSearcher(reader);
search("jan*");
search("jan*y");
search("j*y");
search("j*");
search("*y");
}


void search(String s) throws Exception {
Query query = QueryParser.parse(s, "id", analyzer);
Hits hits = searcher.search(query);
System.out.println(s + " matched " + hits.length());
for (int i = 0; i < hits.length(); i++) {
System.out.println("  " + 
   hits.doc(i).get("id"));
}
}
}




Re: Using the UnStored Field() type

2002-05-13 Thread Ian Lea

 If the field is tokenized and indexed, can I still search that field?

Yes.
 
 My code looks like this:
 
 theDocument = new Document();
 if ( 0 != textString.length() ) {
 textField = Field.UnStored( FIELD_TEXT, textString );
 theDocument.add( textField );
 }
 
 then I search it like this:
 
 indexReader = IndexReader.open( "C:\\temp\\index_store" );
 Term searchTermTiny = new Term( DocumentVisitor.FIELD_TEXT, "Syndeo" ); 
 FuzzyQuery query = new FuzzyQuery( searchTermTiny );
 IndexSearcher search = new IndexSearcher( indexReader );
 Hits foundDocs = search.search( query );
 
 I never get any results for the documents. "Syndeo" occurs often.
 Any ideas?

You don't show us code to save the document in the index
but I assume you are doing that!  Is FIELD_TEXT the
same as DocumentVisitor.FIELD_TEXT?

A short but complete program makes it easier to spot
problems rather than making wild guesses.



--
Ian.
[EMAIL PROTECTED]



Re: rc4 and FileNotFoundException: an update

2002-04-26 Thread Ian Lea

Have you posted code that demonstrates this problem?
If so I missed it.  If you send, to this list, the
shortest program you can come up with that demonstrates
the problem there is a fair chance that someone may
spot something.  I, and many others, use that release
of Lucene to index far more than 16 objects so I think
that at this stage the assumption has to be that the
problem lies with your code.


--
Ian.
[EMAIL PROTECTED]


 [EMAIL PROTECTED] (petite_abeille) wrote 

 Hello again,
 
 I guess it's really not my day...
 
 Just to make sure I'm not hallucinating too much, I downloaded the latest 
 and greatest: rc4. Changed all the package names to org.apache. Updated 
 a method here and there to reflect the API changes. And ran my little 
 app. I would like to emphasize that except updating to the latest Lucene 
 release, nothing else has changed.
 
 Well, it's pretty ugly. Whatever I'm doing with Lucene in the previous 
 package (com.lucene) is magnified many fold in rc4. After processing a 
 paltry 16 objects I got:
 
 SZFinder.findObjectsWithSpecificationInStore: 
 java.io.FileNotFoundException: _2.f14 (Too many open files)
 
 At least in the previous version, I will see that after a couple of 
 thousand of objects...
 
 So, it seems, that there is something really rotten in the kingdom of 
 Denmark...
 
 Any help much appreciated.
 
 Thanks.



Re: Incremental vs Full indexing

2002-04-11 Thread Ian Lea

You can just use optimize().
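
A hedged sketch (path illustrative): open the existing index, add the new or
changed documents, then optimize.

 IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
 // ... addDocument() calls for the incremental updates ...
 writer.optimize();  // merges segments into one; no full rebuild needed
 writer.close();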


--
Ian.
[EMAIL PROTECTED]

 [EMAIL PROTECTED] (Andrew Smith) wrote 

 I wish to use incremental indexing in an application based on Lucene. Do
 I need to periodically perform a full re-build of the index to keep the
 index in an efficient state or can I simply use the IndexWriter's
 optimize() function instead?
 
 TIA
 
 Andy
 
 --



Re: Lucene with Number+Text

2002-03-25 Thread Ian Lea

 I've problem searching for number in Lucene.
 I'm using StandardAnalyzer for Index/Search.

 In my document, I have a field that contains the text
 "this is a test for lucene with number 1727a and 1992 and 3562"

 -  I was able to search for 1992 or 3562.
 -  However, the search returns empty when I try to search for 1727 or 1727a.  It 
 seems like it didn't index number and text when it's one word.  Please help

Are you sure this is using StandardAnalyzer on the latest
release (1.2 rc4)?  If I index that string and search
for 1727a I get a hit.



--
Ian.



Re: Lucene with Number+Text

2002-03-25 Thread Ian Lea

Good thinking.  In my test, using a Text field, searches
for 1727a and 1727* both return a hit, but if I switch to
Keyword they don't.


--
Ian.

 [EMAIL PROTECTED] (Shannon Booher) wrote 

 I think I have seen a similar problem.
 
 Are you guys using Keyword or Text fields?



Re: indexing secure sites

2002-02-28 Thread Ian Lea

 Is it possible to index secure sites with Lucene?

Lucene will index whatever you can feed it.

 When I tried I get an error message ::javax.net.ssl.SSLException:
 untrusted server cert chain, is it a problem with my certificate?

Could well be.  It is definitely nothing to do with lucene or this list.



--
Ian.





Re: Build index using RAMDirectory out of memory errors

2002-02-25 Thread Ian Lea

Have you tried different values for IndexWriter.mergeFactor?
Setting it to 1000 gave me a 10* speed improvement on some
large index some time ago. Not RAMDirectory though.
Your mileage may vary.
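
A hedged sketch (mergeFactor was a public field on IndexWriter in this era;
higher values buffer more in memory and hold more files open):

 IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
 writer.mergeFactor = 1000;
 // ... addDocument() loop ...
 writer.optimize();
 writer.close();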


--
Ian.


Kurt Vaag wrote:
 
 I have been using Lucene for 3 weeks and it rules.
 
 The indexing process can be slow. So I searched the mailgroup archives
 and found example code using RAMDirectory to improve indexing speed.
 The example code I found was indexing 100,000 files at a time to the
 RAMDirectory before writing to disk.
 
 I tried indexing 10,000 files at a time to the RAMDirectory before writing
 to disk. This drastically improved indexing times but sometimes I get
 out of memory errors. I am indexing text files and adding 9 fields from
 an Oracle database.
 
 Environment:
 Solaris 2.8 with 1G of ram and 2G of swap
 Java 1.3.1
 Lucene 1.2-rc4
 
 Any ideas for eliminating the out of memory errors ?





Re: Phrase Query

2002-02-18 Thread Ian Lea

 Hello All,
 Question on phrase queries-
 I have a medical reports document that has "Anesth, Knee" in it.
 If I use a phrase query, it works but so does "Anesth Knee" (notice that the
 comma is missing.)
 
 Does Lucene remove special characters before indexing the documents?

Depends on the Analyzer you use.  Some certainly do, including
StandardAnalyzer.  You can build your own analyzer if you have
special requirements.



--
Ian.





Lucene Build Instructions

2002-02-08 Thread Ian Lea

Today I built lucene-1.2-rc3 from the source distribution
for the first time.  On the whole it was easy enough but
a couple of points:


BUILD.txt says "Install JDK 1.3".  Might it be better to say
"Install JDK 1.3 or later"?  Lots of people are probably
running 1.4 already.

BUILD.txt makes no mention of javaCC.  You find out soon
enough when you run ant, but it would be better to have it
mentioned up front.  Also I couldn't get the .ant.properties
method mentioned in build.xml to work (probably finger
trouble) but copying javaCC.zip to the lucene-1.2-rc3-src/lib/
directory worked fine.



Also, by default the code as compiled from source and as
distributed in the binary download doesn't have line numbers
in stack trace dumps.  This may be deliberate in which case
fine, but the line numbers do help in tracking down
problems.



--
Ian.
[EMAIL PROTECTED]





Re: Web demo example: Errors from Tomcat startup

2002-02-08 Thread Ian Lea

Looks like it does.


--
Ian.

Andrew C. Oliver wrote:
 
 Does tomcat 4 understand the 3.2.x header?  If so, lets just use the
 3.2.x header.





Re: Web demo example: Errors from Tomcat startup

2002-02-07 Thread Ian Lea

The UnknownHostException is probably because the parser
trying to read WEB-INF/web.xml wants to lookup the
dtd and is failing because it can't locate java.sun.com.
Perhaps surprising given your email address!  Perhaps
you need to fix your local DNS setup so it can find
java.sun.com or use a different parser.  I can't remember
all the ins and outs of this but I run tomcat off line,
without access to java.sun.com, using xerces.

Can't comment on the other stuff about the demo.



--
Ian.
[EMAIL PROTECTED]


Don Gilchrest - Sun Microsystems wrote:
 
 Hi,
 
 I'm going through the Lucene 'Getting Started Guide' and am up to
 creating the index for the Web app.  After copying
 lucene-1.2-rc3/luceneweb.war to $TOMCAT_HOME/webapps, the following
 errors are reported when I startup Tomcat (v3.2.1):
 
 XmlMapper: Can't find resource for entity: -//Sun Microsystems, Inc.//DTD Web
 Application 2.3//EN -- http://java.sun.com/dtd/web-app_2_3.dtd null
 ERROR reading /opt/jakarta-tomcat-3.2.1/webapps/luceneweb/WEB-INF/web.xml
 At External entity not found: "http://java.sun.com/dtd/web-app_2_3.dtd".
 
 ERROR reading /opt/jakarta-tomcat-3.2.1/webapps/luceneweb/WEB-INF/web.xml
 java.net.UnknownHostException: java.sun.com
 at java.net.InetAddress.getAllByName0(InetAddress.java:571)
 at java.net.InetAddress.getAllByName0(InetAddress.java:540)
 at java.net.InetAddress.getByName(InetAddress.java:449)
 at java.net.Socket.<init>(Socket.java:100)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:50)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:335)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:521)
 at sun.net.www.http.HttpClient.<init>(HttpClient.java:271)
 at sun.net.www.http.HttpClient.<init>(HttpClient.java:281)
 at sun.net.www.http.HttpClient.New(HttpClient.java:293)
 at
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:404)
 at
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.jav
 a:497)
 at
 java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:230)
 at com.sun.xml.parser.Resolver.createInputSource(Resolver.java:248)
 at
 com.sun.xml.parser.ExternalEntity.getInputSource(ExternalEntity.java:49)
 at com.sun.xml.parser.Parser.pushReader(Parser.java:2768)
 at com.sun.xml.parser.Parser.externalParameterEntity(Parser.java:2504)
 at com.sun.xml.parser.Parser.maybeDoctypeDecl(Parser.java:1137)
 at com.sun.xml.parser.Parser.parseInternal(Parser.java:481)
 at com.sun.xml.parser.Parser.parse(Parser.java:284)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:155)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:126)
 at org.apache.tomcat.util.xml.XmlMapper.readXml(XmlMapper.java:214)
 at
 org.apache.tomcat.context.WebXmlReader.processWebXmlFile(WebXmlReader.java:202)
 at
 org.apache.tomcat.context.WebXmlReader.contextInit(WebXmlReader.java:109)
 at
 org.apache.tomcat.core.ContextManager.initContext(ContextManager.java:491)
 at org.apache.tomcat.core.ContextManager.init(ContextManager.java:453)
 at org.apache.tomcat.startup.Tomcat.execute(Tomcat.java:195)
 at org.apache.tomcat.startup.Tomcat.main(Tomcat.java:235)
 2002-02-06 04:51:53 - PoolTcpConnector: Starting HttpConnectionHandler on 8080
 2002-02-06 04:51:53 - PoolTcpConnector: Starting Ajp12ConnectionHandler on 8007
 
 Also, the instructions (in demo3.html) related to this step are a bit
 confusing to me and seem like they may be out of order.  For example,
 in Indexing files section, it states to execute the command to create
 the index within  your {tomcat}/webapps/luceneweb directory; but that
 directory doesn't exist until I do the next step -- copying
 luceneweb.war to {tomcat}/webapps and re-start Tomcat, which is when I
 get the errors shown above.
 
 What am I missing here?
 
 Thanks in advance for any help with this.
 
 regards,
 -don
 
 PS Here's my classpath output from 'tomcat.sh start':
 
 /opt/jakarta-tomcat-3.2.1/lib/ant.jar:/opt/jakarta-tomcat-3.2.1/lib/jasper.jar:/
 opt/jakarta-tomcat-3.2.1/lib/jaxp.jar:/opt/jakarta-tomcat-3.2.1/lib/parser.jar:/
 opt/jakarta-tomcat-3.2.1/lib/servlet.jar:/opt/jakarta-tomcat-3.2.1/lib/test:/opt
 /jakarta-tomcat-3.2.1/lib/webserver.jar:/usr/local/j2sdk1_3_1_02/lib/tools.jar:/
 usr/local/lucene-1.2-rc3/lucene-1.2-rc3.jar:/usr/local/lucene-1.2-rc3/lucene-dem
 os-1.2-rc3.jar:.:/usr/local/j2sdk1_3_1_02/lib/dt.jar:/usr/local/j2sdk1_3_1_02/li
 b/tools.jar:/usr/local/j2sdk1_3_1_02/lib/htmlconverter.jar:/usr/local/jdom-b7/bu
 ild/jdom.jar:/opt/java/jsdk2.2/lib/jsdk.jar
 
 Let me know if I need to provide any other info.
 

Re: Problem in deleteing the documents

2002-01-10 Thread Ian Lea

That's the version I use.


--
Ian.

Thutika, Swamy wrote:
 
 Thanks, Ian, for the solution. It works now.
 In the api doc I am looking at, this is not mentioned.
 I am using lucene-1.2-rc2.
 
 I am just wondering if i am using the current version of
 lucene. Could you let me know where I can find the latest one.
 The api i have says:
 
 
 delete
 public abstract void delete(int docNum)
  throws IOException
 Deletes the document numbered docNum. Once a document is deleted it will not
 appear in TermDocs or TermPostitions enumerations. Attempts to read its
 field with the document(int) method will result in an error. The presence of
 this document may still be reflected in the
 docFreq(org.apache.lucene.index.Term) statistic, though this will be
 corrected eventually as the index is further modified.
 
 
 **
 
 Thanks
 
 Swamy





Re: AW: Can't get a DateFilter to work

2001-12-17 Thread Ian Lea

Sorry, my mistake.  I glanced at your message and leapt to
the wrong conclusion.  Don't know what is wrong in your code
but since I had been meaning to get to grips with Date
searching in Lucene but had never made time to do so, have
knocked up a test program, attached to this message, which
creates a small index and searches on a few ranges, including
0 to 2008269714990, and it seems to work fine.  Perhaps it
may help you track down your problem.

If not, and you still want help, I suggest you post the
simplest possible program that demonstrates the problem.



--
Ian.
[EMAIL PROTECTED]



Jan Stövesand wrote:
 
 Hi,
 
 it is not a typo. I thought that a DateFilter would return everything between
 the from (0) and to (2008269714990) specified in the parameter list.
 
 Jan
 
  -----Original Message-----
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
  behalf of Ian Lea
  Sent: Friday, 14 December 2001 12:40
  To: Lucene Users List
  Subject: Re: Can't get a DateFilter to work
 
 
  Could be just a typo in the email message, but 1008269714990 !=
  2008269714990.
 
 
  --
  Ian.
  [EMAIL PROTECTED]
 
  Jan Stövesand wrote:
  
   Hi,
  
   I really tried everything to get a DateFilter to work but I failed. :-(
  
   What I did was:
  
   Indexing:
  
    doc.add( Field.Keyword("last-modifed",
    DateField.timeToString( timeInMillies )) );
  
   e.g.
   millies: 1008269714990
   field value: 0cv6xr772
  
    If i submit a normal query, looking for "0cv6xr772" I find the
   entry, i.e.
    the entry should be indexed correctly.
    If I search for a text in the body of the element I find about
   30 entries
    including the one mentioned above with last-midified="0cv6xr772".
  
   If I repeat the same query with
  
    DateFilter filter = new DateFilter("last-modified", 0, 2008269714990L);
    Hits hits = searcher.search(query, filter);
  
   I do not get any results. What am I doing wrong?
   Any help appreciated.
  
   Jan
 
  --

import java.util.Date;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.store.*;

/** Simple program to play with DateFilter stuff.
 *
 *  To compile and run on unix, save to e.g. /tmp/LuceneDateTest.java and
 *  <pre>
 *   $ cd /tmp
 *   $ CLASSPATH=./:/some/where/lucene.jar; export CLASSPATH
 *   $ javac LuceneDateTest.java
 *   $ java LuceneDateTest
 *  </pre>
 *
 *  Adjust as appropriate for other platforms.
 *
 *  Creates a RAMDirectory, adds some documents with dates and
 *  searches on date ranges.
 */
public class LuceneDateTest {

    RAMDirectory ramdir;
    Analyzer analyzer;
    IndexWriter writer;
    IndexReader reader;
    Searcher searcher;
    Date start;
    Date middle;
    Date end;

    public LuceneDateTest() {
        analyzer = new StandardAnalyzer();
        ramdir = new RAMDirectory();
    }

    public static void main(String args[]) throws Exception {
        LuceneDateTest ld = new LuceneDateTest();
        ld.load();
        ld.search();
    }

    void load() throws Exception {
        start = new Date();
        writer = new IndexWriter(ramdir, analyzer, true);
        add("doc1");
        add("doc2");
        middle = new Date();
        add("doc3");
        add("doc4");
        writer.close();
        end = new Date();
    }

    void add(String id) throws Exception {
        Document d = new Document();
        d.add(Field.Keyword("id", id));
        String lmod = DateField.timeToString(System.currentTimeMillis());
        String lmod2 = DateField.timeToString(1008269714990L);
        d.add(Field.Keyword("last-modified", lmod));
        d.add(Field.Keyword("last-mod2", lmod2));
        System.out.println("Adding id: " + id +
                           ", last-modified: " + lmod +
                           ", last-mod2: " + lmod2);
        writer.addDocument(d);
    }

    void search() throws Exception {
        reader = IndexReader.open(ramdir);
        searcher = new IndexSearcher(reader);

        DateFilter filter;
        Query query = QueryParser.parse("d*", "id", analyzer);

        filter = new DateFilter("last-modified", start, end);
        search(query, filter, "start to end (4): ");

        filter = new DateFilter("last-modified", middle, end);
        search(query, filter, "middle to end (2): ");

        filter = new DateFilter("last-modified", 0, System.currentTimeMillis());
        search(query, filter, "0 to now (4): ");

        filter = new DateFilter("last-mod2", 0, 2008269714990L);
        search(query, filter, "0 to 2008269714990 (4): ");
    }

    // The archived message was truncated at this point; the helper below is
    // a minimal reconstruction implied by the calls above: run the filtered
    // query and report the number of hits.
    void search(Query query, DateFilter filter, String label) throws Exception {
        Hits hits = searcher.search(query, filter);
        System.out.println(label + hits.length() + " hits");
    }
}

Re: FileNotFoundException

2001-12-05 Thread Ian Lea

I don't have an explanation for this, but if it were me indexing
this large amount of data I'd be running each of the 6 in a
completely separate process.  More control, less damage when
one bit fails, perhaps better performance on a multi-processor
machine.  And perhaps you wouldn't get this problem!
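
If the per-process indexes go to separate directories, here is a hedged
sketch of pulling them together afterwards (the second path is hypothetical;
it uses the addIndexes call that Doug describes elsewhere in this digest):

    // Merge several independently built indexes into one new index.
    IndexWriter merged = new IndexWriter("/lucenetest/medlineIndex/all",
                                         new StandardAnalyzer(), true);
    merged.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/lucenetest/medlineIndex/1976-1977", false),
        FSDirectory.getDirectory("/lucenetest/medlineIndex/1978-1979", false)
    });
    merged.optimize();
    merged.close();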


--
Ian.
[EMAIL PROTECTED]


Chantal Ackermann wrote:
 
 hello all,
 
 I am still trying to find the best way to index a really large amount of
 data. At the moment I am trying to index each of the 29 text files in its
 own thread, each with its own IndexWriter and its own directory in which to
 place the index. There are always six threads working at the same time.
 
 The problem that occurs now is that every second thread stops due to a
 FileNotFoundException or an ArrayIndexOutOfBoundsException (the latter only
 once) while the other half finishes fine. The file's name is different for
 each thread but always has the extension .fnm.
 
 For example:
 
 java.io.FileNotFoundException:
 /lucenetest/medlineIndex/1976-1977/_2zfj.fnm (No such file or directory)
 at java.io.RandomAccessFile.open(Native Method)
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java(Compiled Code))
 at java.io.RandomAccessFile.<init>(RandomAccessFile.java(Compiled Code))
 at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
 at org.apache.lucene.store.FSInputStream.<init>(Unknown Source)
 at org.apache.lucene.store.FSDirectory.openFile(Unknown Source)
 at org.apache.lucene.index.FieldInfos.<init>(Unknown Source)
 at org.apache.lucene.index.SegmentReader.<init>(Unknown Source)
 at org.apache.lucene.index.IndexWriter.mergeSegments(Unknown Source)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments(Unknown Source)
 at org.apache.lucene.index.IndexWriter.maybeMergeSegments(Unknown Source)
 at org.apache.lucene.index.IndexWriter.addDocument(Unknown Source)
 at de.biomax.lucenetest.MedlineRecordIndexer.indexDocs(MedlineRecordIndexer.java(Compiled Code))
 
 Since half of the files are indexed without throwing that kind of exception,
 I'm at a loss where to start debugging. Any ideas?
 
 Thanks a lot,
 Chantal





Re: OutOfMemoryError

2001-11-29 Thread Ian Lea

Doug sent the message below to the list on 3-Nov in response to
a query about file size limits.  There may have been more
related stuff on the thread as well.


--
Ian.



   *** Anyway, is there any way to control how big the indexes grow?

The easiest thing is to set IndexWriter.maxMergeDocs. Since you hit 2GB at
8M docs, set this to 7M.  That will keep Lucene from trying to merge an
index that won't fit in your filesystem.  (It will actually effectively
round this down to the next lower power of IndexWriter.mergeFactor.  So with
the default mergeFactor=10, maxMergeDocs=7M will generate a series of 1M
document indexes, since merging 10 of these would exceed the max.)

Slightly more complex: you could further minimize the number of segments
if, when you've added seven million documents, you optimize the index and
start a new index.  Then use MultiSearcher to search, as sketched below.

Even more complex and optimal: write a version of FSDirectory that, when a
file exceeds 2GB, creates a subdirectory and represents the file as a series
of files.  (I've done this before, and found that, on at least the version
of Solaris that I was using, the files had to be a few 100k less than 2GB
for programs like 'cp' and 'ftp' to operate correctly on them.)

Doug
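
A minimal sketch of the simpler approach (the 7M figure comes from Doug's
message; the paths and everything else are hypothetical):

    // Cap the biggest merged segment so no index file outgrows the 2GB limit.
    IndexWriter writer = new IndexWriter("/indexes/part1",
                                         new StandardAnalyzer(), true);
    writer.maxMergeDocs = 7000000;
    // ... add documents; at ~7M, optimize, close, and start /indexes/part2 ...
    writer.optimize();
    writer.close();

    // Search the whole series of capped indexes as one:
    Searcher searcher = new MultiSearcher(new Searchable[] {
        new IndexSearcher("/indexes/part1"),
        new IndexSearcher("/indexes/part2")
    });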




Chantal Ackermann wrote:
 
 hi Ian, hi Winton, hi all,
 
 Sorry, I meant a heap size of 100Mb. I'm starting java with -Xmx100m. I'm
 not setting -Xms.
 
 From what I know now, I had a bug in my own code. Still, I don't understand
 where these OutOfMemoryErrors came from. I will try to index again in one
 thread without RAMDirectory just to check that the program is sane.
 
 The problem that the files get too big while merging remains. I wonder why
 there is no way to tell Lucene not to create files that are bigger than the
 system limit. How am I supposed to know after how many documents this limit
 is reached? Lucene creates the documents - I just know the average size of
 a piece of text that is the input for a document. Or am I missing something?!
 
 chantal





Re: OutOfMemoryError

2001-11-28 Thread Ian Lea

I've loaded a large (but not as large as yours) index with mergeFactor
set to 1000.  It was substantially faster than with the default setting.
Making it higher didn't seem to make things much faster but did cause
it to use more memory.  In addition I loaded the data in chunks in
separate processes and optimized the index after each chunk, again
in a separate process.  All done straight to disk, no messing about
with RAMDirectories.  A sketch of that follows below.

Didn't play with maxMergeDocs, and I am not sure what you mean by
maximum heap size, but 1MB doesn't sound very large.
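
A hedged sketch of that chunked load (path hypothetical; run once per chunk,
each run in its own process):

    // One chunk: open the index, bulk-add, optimize, close.
    // (Pass create=true instead of false on the very first chunk.)
    IndexWriter writer = new IndexWriter("/some/index",
                                         new StandardAnalyzer(), false);
    writer.mergeFactor = 1000;   // fewer, larger merges; costs more memory
    // ... add this chunk's documents with writer.addDocument(doc) ...
    writer.optimize();
    writer.close();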



--
Ian.
[EMAIL PROTECTED]


Chantal Ackermann wrote:
 
 hi to all,
 
 please help! I think I mixed my brain up already with this stuff...
 
 I'm trying to index about 29 text files where the biggest one is ~700Mb and
 the smallest ~300Mb. I once managed to run the whole index, with a merge
 factor = 10 and maxMergeDocs=1. That took more than 35 hours I think
 (I don't know exactly) and it didn't use much RAM (though it could have).
 Unfortunately I had a call to optimize at the end, and during optimization
 an IOException (File too big) occurred (while merging).
 
 As I run the program on a multi-processor machine, I have now changed the
 code to index each file in its own thread and write to one single
 IndexWriter. The merge factor is still at 10; maxMergeDocs is at 1,000,000.
 I set the maximum heap size to 1MB.
 
 I tried to use RAMDirectory (as mentioned on the mailing list) and just use
 IndexWriter.addDocument(). At the moment it seems not to make any
 difference. After a while _all_ the threads exit one after another (not all
 at once!) with an OutOfMemoryError. The priority of all of them is at the
 minimum.
 
 Even if the multithreading doesn't increase performance, I would be glad if
 I could just get it running once again.
 
 I would be even happier if someone could give me a hint about the best way
 to index this amount of data. (The average size of an entry that gets
 parsed into a Document is about 1Kb.)
 
 thanx for any help!
 chantal





Re: Efficient document spooling and indexing

2001-11-22 Thread Ian Lea

Data may not be committed to disk, buffers flushed, files
closed, etc. until IndexWriter.close() is called, but file
IO does happen before then.  So I would expect the answer
to your question to be no.


--
Ian.
[EMAIL PROTECTED]


Otis Gospodnetic wrote:
 
 Hello,
 
 This is from a thread from about 2 weeks ago.
 What is the answer to this question?
 If data is written to disk only when IndexWriter's close() is called,
 wouldn't the sample code below be as efficient as the sample code that
 uses RAMDirectory, further down?
 
 Thanks,
 Otis
 
 
 When using the FSWriter, the actual file IO doesn't occur until I close
 the writer, right?  So wouldn't it be just as efficient to do the
 following:
 
 IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, false);
 while (... more docs to index ...) {
     ... add 100,000 docs to fsWriter ...
 }
 fsWriter.optimize();
 fsWriter.close();
 
 -----Original Message-----
 From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
 Sent: Friday, November 02, 2001 10:47 AM
 To: 'Lucene Users List'
 Subject: RE: Indexing problem
 
Well, I don't know if there's an archive of the list, so this is what Doug
wrote:
A more efficient and slightly more complex approach would be to build large
indexes in RAM, and copy them to disk with IndexWriter.addIndexes:

  IndexWriter fsWriter = new IndexWriter(new File(...), analyzer, true);
  while (... more docs to index ...) {
      RAMDirectory ramDir = new RAMDirectory();
      IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
      ... add 100,000 docs to ramWriter ...
      ramWriter.optimize();
      ramWriter.close();
      fsWriter.addIndexes(new Directory[] { ramDir });
  }
  fsWriter.optimize();
  fsWriter.close();
 
 
 Scott





Re: Database Example

2001-11-20 Thread Ian Lea

 I'm new to Lucene. Does anyone have any good samples or links to tutorial sites?

You can find a tutorial at http://www.darksleep.com/puff/lucene/lucene.html
 
 I'm particularly interested in the ability to index a database.

Someone asked a similar question a week or so back and I posted the
reply and sample code below.


Lucene works on Documents that you fill with data that you get from
anywhere you like.  To index data in a database table you could use
something like:

Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter("dbindex", analyzer, true);
Connection conn = getConnection();
String sql = "select id, firstname, lastname from people";
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(sql);
while (rs.next()) {
    Document d = new Document();
    d.add(Field.Text("id", rs.getString("id")));
    d.add(Field.UnStored("firstname", rs.getString("firstname")));
    d.add(Field.UnStored("lastname", rs.getString("lastname")));
    writer.addDocument(d);
}
writer.close();

The id field is indexed and stored since we will want to extract it from
Lucene. The name fields are not stored since they are already stored in the
database, although you could store them if you wanted to avoid having to go
back to the database for display.


To search and display results, with details coming from the database,
something along the lines of:

Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
Query query = QueryParser.parse("duncan", "firstname", analyzer);
Hits hits = searcher.search(query);
String sql = "select * from people where id = ?";
PreparedStatement pstmt = conn.prepareStatement(sql);
for (int i = 0; i < hits.length(); i++) {
    String id = hits.doc(i).get("id");
    pstmt.setString(1, id);
    displayResults(pstmt);
}



Hope this helps.



--
Ian.
[EMAIL PROTECTED]





Re: Adding New Fields to Document

2001-11-01 Thread Ian Lea

Are you adding the document to the index after you've added the Field to the
document?
Do you see the authorname if you search on some other field in the Document?

I don't know if there are limits or not, but if there are I suspect they
are high enough that you and I are unlikely to hit them.
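
One thing worth checking, sketched on the assumption that authorname is
meant as an untokenized keyword: an untokenized field is indexed as a
single exact term, so it matches a TermQuery even when a parsed, analyzed
query misses it.

    // Index: add the field before adding the document to the writer.
    doc.add(new Field("authorname", authorname, false, true, false));
    writer.addDocument(doc);

    // Search: look the value up as one exact term.
    Hits hits = searcher.search(
        new TermQuery(new Term("authorname", authorname)));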




--
Ian.
[EMAIL PROTECTED]



Vijay Jagannathan wrote:
 
 Hello All,
 I am trying to add a new field to the Document as below:
 
 doc.add(new Field("authorname", authorname, false, true, false));
 
 When I am doing the search after rebuilding the index, I am not getting any
 results if I search on this field.
 
 Is there a size limit, or a limit on the number of fields a Document can have?
 
 Any guidance is much appreciated.
 
 Thanks.
 Vijay
