Re: indexing and searching different file formats

2002-02-14 Thread Pradeep Kumar K

Thanks a lot, Andy.
-Pradeep

On Wednesday, February 13, 2002, at 09:50 PM, Andrew Libby wrote:


 Pradeep,
 Currently Lucene does not provide the ability to convert documents
 to text for indexing.  There is talk of adding this kind of thing to the
 goal of the project, along with providing crawlers to traverse web,
 local disk, ftp, and RDBMS sources of data.

 The problem with indexing irrespective of file type is that each document
 format contains embedded information that must be stripped out (or ignored)
 before the text can be retrieved for indexing.  An extreme example is
 PDF, which has a considerably complicated document format.
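
A minimal sketch of the stripping step described above, in plain Java with no Lucene dependency: choose a converter by file extension and reduce the document to plain text before it is handed to an indexer. The `TextExtractors` class, the extension table, and the regex-based tag removal are assumptions made for illustration and are not part of Lucene; real HTML or PDF would need a proper parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class TextExtractors {
    // Hypothetical registry: file extension -> function turning raw content into plain text.
    private static final Map<String, Function<String, String>> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("txt", raw -> raw);
        BY_EXTENSION.put("html", TextExtractors::stripTags);
        BY_EXTENSION.put("htm", TextExtractors::stripTags);
    }

    // Crude tag removal for illustration only; it ignores scripts, entities, and comments.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns the indexable text, or null when no converter is registered for the type.
    public static String extract(String fileName, String rawContent) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<String, String> converter = BY_EXTENSION.get(ext);
        return converter == null ? null : converter.apply(rawContent);
    }

    public static void main(String[] args) {
        System.out.println(extract("page.html", "<h1>Hello</h1> <p>Lucene users</p>"));
        System.out.println(extract("report.pdf", "%PDF-1.3 ..."));
    }
}
```

A file type without a registered converter (PDF here) yields `null`, which a caller could treat as "skip this file" until a suitable converter is plugged in.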

 On the contributions page there are some pointers that may provide
 information about processing the types of documents you're interested in.

 http://jakarta.apache.org/lucene/docs/contributions.html

 If you've not taken the time to do so, look at the FAQs; they are very
 informative:

 http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
 http://jakarta.apache.org/lucene/docs/gettingstarted.html
 http://www.jguru.com/faq/Lucene

 Good luck!

 Andy


 On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:
 Hi Lucene friends!

 How can files of different formats be indexed and searched?
 (As far as I know, Lucene comes with an HTML indexer and searcher, and
 also an XML indexer, but is there any way to index files irrespective of
 the file type?)
 Any suggestions will be greatly appreciated.

 Thanks in advance.
 Pradeep


 --
 Robosoft Technologies, Mangalore, India



 --
 To unsubscribe, e-mail:   mailto:lucene-user-[EMAIL PROTECTED]
 For additional commands, e-mail: mailto:lucene-user-[EMAIL PROTECTED]


 --
 --
 Andrew Libby
 CommNav, Inc
 [EMAIL PROTECTED]





--
Robosoft Technologies, Mangalore, India







RC3 release

2002-02-14 Thread Aruna Raghavan

Hi,
I have been using an older release from back when Lucene was not under
Jakarta. I just tried the released RC3 version of the apache.lucene libs and
got errors while indexing documents. Usually, a write.lock file is left in
the index dir. I did see some e-mails on a related subject (RE: problems
with last patch (obtain write.lock while deleting documents)), and I think
Doug fixed this on Feb 11th. I am at a point in my development of a search
engine using Lucene where I need to put the new apache.lucene libs in. Are
there any release notes on RC3? Also, how soon will the write.lock fix be
released officially?
Thanks!





RE: using lucene with a very large index

2002-02-14 Thread Doug Cutting

 From: tal blum [mailto:[EMAIL PROTECTED]]

 2) Does the document id change after merging indexes, or after adding
 or deleting documents?

Yes.

 4) Assuming I have a term query with a large number of hits, say 10
 million, is there a way to get, say, the top 10 results without going
 through all the hits?

Your best bet is to use the normal search API.

 From: tal blum [mailto:[EMAIL PROTECTED]]

 One solution to that is to change the implementation and store the docs
 sorted by their term score.

That would make incremental index updates much slower, since every time a
document is added, the list of documents containing each term in that
document would need to be re-sorted.  Currently we only need to append new
entries, which is much simpler.  You could optimize this in various ways
(e.g., instead take the hit at search time) but it would still make things
slower for rapidly changing indexes.

Also, while this would make single term queries faster, multi-term queries
are more complex to accelerate.  The highest scoring match for a two term
query may be in a document where one term has a very high weight and the
other has a very low weight.  There have been papers written (I don't have
the references handy) exploring this issue, and, in general, there isn't an
algorithm that is guaranteed to return the highest scoring documents for
multi-term queries that does not in most cases have to process nearly all of
the documents containing those terms.  That said, it is possible to use such
an index to vastly accelerate searches that *usually* return the highest
scoring documents.
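
A small worked example of the point above, as a sketch in plain Java. It uses no Lucene classes; the `TruncatedPostings` name, the postings, the weights, and the simple "sum of weights" scoring are all made up for illustration. Each term's list is sorted by descending weight, and a heuristic search reads only the first `depth` entries of each list.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TruncatedPostings {
    // One posting: a document id and the term's weight in that document.
    static final class Posting {
        final String doc;
        final double weight;

        Posting(String doc, double weight) {
            this.doc = doc;
            this.weight = weight;
        }
    }

    // Sums weights per document, reading at most `depth` entries from each
    // weight-sorted list, and returns the id of the best-scoring document seen.
    static String bestDoc(List<List<Posting>> lists, int depth) {
        Map<String, Double> scores = new HashMap<>();
        for (List<Posting> list : lists) {
            for (int i = 0; i < Math.min(depth, list.size()); i++) {
                Posting p = list.get(i);
                scores.merge(p.doc, p.weight, Double::sum);
            }
        }
        String best = null;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > scores.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical lists, each sorted by descending weight.
        List<Posting> termA = Arrays.asList(new Posting("d1", 0.9), new Posting("d2", 0.2));
        List<Posting> termB = Arrays.asList(new Posting("d3", 0.8), new Posting("d2", 0.75));
        List<List<Posting>> query = Arrays.asList(termA, termB);

        System.out.println("full scan:  " + bestDoc(query, Integer.MAX_VALUE));
        System.out.println("depth of 1: " + bestDoc(query, 1));
    }
}
```

With these numbers the full scan picks d2 (0.2 + 0.75 beats d1's 0.9), while reading only the top entry of each list picks d1 and never sees d2. With other data the truncated search would often agree with the full scan, which is the sense of "usually" above.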

Such a heuristic search technique is among the things required to scale
Lucene to extremely large collections (e.g., hundreds of millions of
documents).  There are also lower-tech optimizations.  For example, one can
simply keep a small index containing the highest-quality documents that is
always searched first.  If enough hits are found there, you're done.  A real
internet search engine combines lots of tricks in order to scale: segmenting
indexes by quality; heuristic search methods; and distributed searching.
Deploying something like Google is not a small task.
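
The "small index of the highest-quality documents, searched first" idea can be sketched as follows. This is plain Java over in-memory lists, not Lucene's API; `TieredSearch`, the tier contents, and the substring match standing in for a real index lookup are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TieredSearch {
    // Searches tiers in order (best quality first). After each tier, stops if
    // at least `wanted` hits have been collected, so later tiers stay untouched.
    static List<String> search(List<List<String>> tiers, String term, int wanted) {
        List<String> hits = new ArrayList<>();
        for (List<String> tier : tiers) {
            for (String doc : tier) {
                // Substring match stands in for a real index lookup.
                if (doc.contains(term)) {
                    hits.add(doc);
                }
            }
            if (hits.size() >= wanted) {
                break;
            }
        }
        // Trim to the requested number of results.
        return hits.size() > wanted ? new ArrayList<>(hits.subList(0, wanted)) : hits;
    }

    public static void main(String[] args) {
        List<String> highQuality = Arrays.asList("lucene overview", "lucene faq");
        List<String> everythingElse = Arrays.asList("lucene mirror page", "unrelated page");
        List<List<String>> tiers = Arrays.asList(highQuality, everythingElse);

        System.out.println(search(tiers, "lucene", 2)); // answered from the first tier alone
        System.out.println(search(tiers, "lucene", 3)); // falls through to the second tier
    }
}
```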

I would someday like to add a heuristic search component to Lucene, that
uses a special index format (possibly with term document lists sorted by
normalized frequency, as you suggest).  I have some experience doing this at
Excite, and it pays off big time.  But it would take me several weeks
full-time to implement this, and I don't currently have that time.  Perhaps
(with the support of an interested sponsor) I could make time this summer to
implement this.

In the meantime, if you encounter performance problems with a very large
index, you might try segmenting your index by document quality and/or
distributed search.

Doug





RE: write.lock file

2002-02-14 Thread Doug Cutting

I cannot replicate the problem you are having.

Can you please submit a complete, self-contained test case illustrating the
problem you are having with the write lock?

Please test this against the latest nightly build of Lucene, from:
  http://jakarta.apache.org/builds/jakarta-lucene/nightly/

Thanks,

Doug





RE: Searching multiple fields in one Index of Documents

2002-02-14 Thread Mark Tucker

Can you zip up those files or change the .js extension to .txt?  My mail server strips 
out potentially harmful files.

Thanks,

Mark

-Original Message-
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 10:32 PM
To: Lucene Users List
Subject: Re: Searching multiple fields in one Index of Documents


Peter,

As advised, re-released under APL. :) There were some changes to QueryParser
constructors in rc3, and these are reflected here as well.

FWIW, I've also attached a javascript lib and accompanying HTML which
constructs a Lucene multi-field query using a HTML form.

Regards,
Kelvin

- Original Message -
From: Peter Carlson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, February 13, 2002 10:56 PM
Subject: Re: Searching multiple fields in one Index of Documents


 This is great Kelvin,
 Sorry I didn't see it before.
 I'll add it to the list of contributions.

 --Peter

 On 2/13/02 12:43 AM, Kelvin Tan [EMAIL PROTECTED] wrote:

  Charles,
 
  See
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html
 
  Regards,
  K
 
  - Original Message -
  From: Charles Harvey [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Tuesday, February 12, 2002 8:39 AM
  Subject: Searching multiple fields in one Index of Documents
 
 
  I have a working installation of Lucene running against indexes created
  by a database query.
  Each Document in the Index contains fifteen or twenty fields. I am
  currently searching only one field (that contains concatenated database
  columns) because I cannot figure out how to search multiple fields. So:
 
  How can I use Lucene to search more than one field in an Index of
  Documents?
 
  eg:
  field CATEGORY is(or contains) 'bar'
  AND
  field BODY contains 'foo'
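
In Lucene's query-parser syntax a clause can be tied to a field with `field:term`, and clauses can be joined with `AND`, so the example above would read `CATEGORY:bar AND BODY:foo`. The helper below is a hedged sketch in plain Java that only assembles such a string. `FieldQueryBuilder` is a hypothetical name, and the code assumes single-word terms with no characters that need escaping; the resulting string would still have to be passed to Lucene's QueryParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FieldQueryBuilder {
    // Joins field/term pairs into "FIELD:term AND FIELD:term" form.
    static String allOf(Map<String, String> termsByField) {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (Map.Entry<String, String> e : termsByField.entrySet()) {
            joiner.add(e.getKey() + ":" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clauses in insertion order.
        Map<String, String> clauses = new LinkedHashMap<>();
        clauses.put("CATEGORY", "bar");
        clauses.put("BODY", "foo");
        System.out.println(allOf(clauses));
    }
}
```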
 
 
 
 
  _
 
  The trouble with the rat-race is that even if you win you're still a
  rat.
  --Lily Tomlin
  _
  Charles Harvey
  Developer
  http://www.philly.com
  Wk: 215 789 6057
  Cell: 215 588 0851
 
 




RE: My own stemmer (Brazilian)

2002-02-14 Thread Bizu de Anúncio

I know this has nothing to do with this list, but please give some help!

I downloaded ANT and installed it setting the classpath with all its jar
files. Then I tried to compile lucene using the suggested command:

ANT COMPILE

and I got the following message:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
D:\Java\lucene-1.2-rc2>..\jakarta-ant-1.4.1\bin\ant compile
Buildfile: build.xml

init:

javacc_check:

compile:

BUILD FAILED

D:\Java\lucene-1.2-rc2\build.xml:92: Could not create task of type: javacc.
Common solutions are to use taskdef to declare your task, or, if this is an
optional task, to put the optional.jar in the lib directory of your ant
installation (ANT_HOME).

Total time: 2 seconds
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


I'm absolutely ignorant about Ant. What is missing? Am I too far from the
solution (if so, I promise to study more)? Where can I find the
'optional.jar' file? Please, can someone give me a clue?

bye
jk




-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 9:33 PM
To: Lucene Users List
Subject: Re: My own stemmer (Brazilian)


That file is created during the build process.
Try building Lucene by typing 'ant compile'.

Otis

--- Bizu_de_Anúncio [EMAIL PROTECTED] wrote:
   My Brazilian stemmer has the same structure as the German stemmer, except
 for the inner logic.

   I created it, tested it, and now I'm trying to compile it with no success.
 The problem is the 'StandardTokenizer.java' class! I can't find it in the
 package org.apache.lucene.analysis.standard.

   The only file that exists there is a file named 'StandardTokenizer.jj'.
 What is this file for?

   I have lucene-1.2-rc2. Can someone help me?

 thanks,

   jk







Re: indexing and searching different file formats

2002-02-14 Thread Kelvin Tan

Uhmmm, I can contribute something which does a pretty decent job if anyone's
interested...

Just have to clean it up a little...

Regards,
Kelvin
- Original Message -
From: W. Eliot Kimber [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 15, 2002 1:10 AM
Subject: Re: indexing and searching different file formats


 Andrew Libby wrote:

  and the text needs to be retrieved for indexing.  An extreme example is
  a PDF which has a considerably complicated document format.

 The PJ library from www.etymon.com provides a pretty complete and
 easy-to-use API for getting info from PDF docs. It wouldn't be too hard
 to write a PDF indexer for Lucene using this library. The main challenge
 would be guessing word boundaries in strings where spaces have been
 replaced with explicit shift values by the formatter.
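
One way to picture the word-boundary guess, as a sketch in plain Java. It does not use the PJ API. The idea is that a formatter emits text fragments separated by horizontal shifts, and a shift that is wide enough probably replaced a space. The `Piece` type, the convention that a larger `gapBefore` means a wider gap, and the threshold of 200 are assumptions chosen for illustration; a real indexer would have to derive the threshold from the font and its size.

```java
import java.util.Arrays;
import java.util.List;

public class WordBoundaryGuesser {
    // A text fragment plus the horizontal gap that precedes it (hypothetical units).
    static final class Piece {
        final double gapBefore;
        final String text;

        Piece(double gapBefore, String text) {
            this.gapBefore = gapBefore;
            this.text = text;
        }
    }

    // Concatenates fragments, inserting a space wherever the gap reaches the threshold.
    static String join(List<Piece> pieces, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        for (Piece p : pieces) {
            if (out.length() > 0 && p.gapBefore >= spaceThreshold) {
                out.append(' ');
            }
            out.append(p.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Piece> line = Arrays.asList(
            new Piece(0, "Lu"), new Piece(15, "cene"),       // small gap: kerning inside a word
            new Piece(280, "in"), new Piece(10, "dex"), new Piece(-5, "es"),
            new Piece(310, "PDF"));                          // large gap: treated as a space
        System.out.println(join(line, 200));
    }
}
```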

 Cheers,

 Eliot
 --
 W. Eliot Kimber, [EMAIL PROTECTED]
 Consultant, ISOGEN International

 1016 La Posada Dr., Suite 240
 Austin, TX  78752 Phone: 512.656.4139





Re: Searching multiple fields in one Index of Documents

2002-02-14 Thread Kelvin Tan

As requested,

http://www.relevanz.com/lucene_contrib.zip

Regards,
Kelvin
- Original Message -
From: Mark Tucker [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 15, 2002 2:03 AM
Subject: RE: Searching multiple fields in one Index of Documents


Can you zip up those files or change the .js extension to .txt?  My mail
server strips out potentially harmful files.

Thanks,

Mark





Re: excluding files / refining search

2002-02-14 Thread Steven J. Owens

Brian Rook [EMAIL PROTECTED] writes:
 The site I'm working on has a lot of small html files that are used for page
 construction (nav bars, footers, etc) and they're being returned high in the
 results because they contain the search term(s) I'm looking for and are
 small so they rank higher than larger documents.
 
 I want to exclude them from the index and I've come up with two ideas:
 
 1) move them to a directory, which I will exclude from the index, but I'll
 have to change a bunch of links
 
 2) detect them with some sort of flag and exclude them from the index.  We
 were thinking that we could have a fake tag that lucene would detect and not
 index those pages.

Why not just have an exclude list of some sort?  In the code you
wrote to select files for indexing, just have it check against a list
of files you want to exclude.  In the demo application, you would edit

 jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java


 The quick and dirty method would be to edit this section of code:

   public static void indexDocs(IndexWriter writer, File file)
        throws Exception {
     if (file.isDirectory()) {
       String[] files = file.list();
       for (int i = 0; i < files.length; i++)
         indexDocs(writer, new File(file, files[i]));
     } else {
       System.out.println("adding " + file);
       writer.addDocument(FileDocument.Document(file));
     }
   }
 
 
 To something like this:
 
 
   public static void indexDocs(IndexWriter writer, File file)
        throws Exception {
     if (file.isDirectory()) {
       String[] files = file.list();
       for (int i = 0; i < files.length; i++)
         indexDocs(writer, new File(file, files[i]));
     } else {
       // checkFileName() returns true for files that should be indexed
       if (!checkFileName(file)) {
         System.out.println("skipping " + file);
       } else {
         System.out.println("adding " + file);
         writer.addDocument(FileDocument.Document(file));
       }
     }
   }

   public static boolean checkFileName(File file) {
     String name = file.getName();
     // compare with equals(); == would compare object references
     if (name.equals("footer.html") ||
         name.equals("header.html") ||
         name.equals("menu.html") ||
         name.equals("navbar.html")) {
       return false;
     }
     return true;
   }
 
 
 A more realistic implementation would use an exclude file of
filenames to ignore, load them into a collection (probably a HashSet)
and keep that collection around as an instance variable.  Then
checkFileName() just returns !excludedSet.contains(name).
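
A sketch of that more realistic version, in plain Java. The class name `ExcludeList` and the one-name-per-line file format are assumptions for illustration; only `java.io` and `java.util` classes are used, so nothing here depends on Lucene itself.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ExcludeList {
    // Names of files that must not be indexed.
    private final Set<String> excludedSet = new HashSet<>();

    // Reads one file name per line; blank lines are ignored.
    public ExcludeList(File excludeFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(excludeFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    excludedSet.add(name);
                }
            }
        }
    }

    // True when the file should be indexed, i.e. its name is not on the list.
    public boolean checkFileName(File file) {
        return !excludedSet.contains(file.getName());
    }

    public static void main(String[] args) throws IOException {
        // Write a throwaway exclude file so the example runs on its own.
        File list = File.createTempFile("exclude", ".txt");
        list.deleteOnExit();
        try (FileWriter out = new FileWriter(list)) {
            out.write("footer.html\nnavbar.html\n");
        }
        ExcludeList excludes = new ExcludeList(list);
        System.out.println(excludes.checkFileName(new File("site/footer.html")));
        System.out.println(excludes.checkFileName(new File("site/index.html")));
    }
}
```

Because only the bare file name is compared, a `footer.html` in any directory is skipped; matching on full paths instead would be a small change to `checkFileName()`.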

Steven J. Owens
[EMAIL PROTECTED]
