Re: indexing and searching different file formats
Thanks a lot, Andy.

-Pradeep

On Wednesday, February 13, 2002, at 09:50 PM, Andrew Libby wrote:

> Pradeep,
>
> Currently Lucene does not provide the ability to convert documents to text for indexing. There is talk of adding this kind of thing to the goals of the project, along with providing crawlers to traverse web, local disk, FTP, and RDBMS sources of data.
>
> The problem with indexing irrespective of file type is that each document format contains embedded information that must be stripped out (or ignored), and the text needs to be retrieved for indexing. An extreme example is PDF, which has a considerably complicated document format.
>
> On the contributions page there are some pointers that may provide information about processing the types of documents you're interested in.
>
> http://jakarta.apache.org/lucene/docs/contributions.html
>
> If you've not taken the time to do so, look at the FAQs; they are very informative:
>
> http://www.lucene.com/cgi-bin/faq/faqmanager.cgi
> http://jakarta.apache.org/lucene/docs/gettingstarted.html
> http://www.jguru.com/faq/Lucene
>
> Good luck!
>
> Andy
>
> On Wed, Feb 13, 2002 at 09:24:33PM +0530, Pradeep Kumar K wrote:
>
>> Hi Lucene friends!
>>
>> How can files of different formats be indexed and searched? (As far as I know, Lucene comes with an HTML indexer and searcher and also an XML indexer, but is there any way to index files irrespective of the file type?)
>>
>> Any suggestions will be greatly appreciated.
>>
>> Thanks in advance.
>> Pradeep
>> --
>> Robosoft Technologies, Mangalore, India
>
> --
> Andrew Libby
> CommNav, Inc
> [EMAIL PROTECTED]

--
Robosoft Technologies, Mangalore, India

--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
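Andy's point, that each format needs its own step to strip embedded markup before the text can be indexed, can be sketched roughly as below. This is plain Java and not Lucene API: the TextExtractor interface, the ExtractorRegistry class, and the crude regex-based HTML stripping are hypothetical names and simplifications chosen for illustration. A real HTML or PDF extractor would need a proper parser.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: none of these types are part of Lucene.
interface TextExtractor {
    // Turns the raw content of one document into plain text for indexing.
    String extract(String raw);
}

final class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    ExtractorRegistry() {
        // Plain text needs no conversion.
        byExtension.put("txt", raw -> raw);
        // Very crude HTML handling: replace tags with spaces, then collapse whitespace.
        byExtension.put("html",
                raw -> raw.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim());
    }

    // Returns the extracted text, or null when no extractor is registered for the file type.
    String extract(String fileName, String raw) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        return extractor == null ? null : extractor.extract(raw);
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // An HTML page is reduced to its visible text.
        System.out.println(registry.extract("page.html", "<p>Hello <b>Lucene</b></p>"));
        // No PDF extractor is registered in this sketch, so the result is null.
        System.out.println(registry.extract("report.pdf", "%PDF-1.3 ..."));
    }
}
```

The extracted string is what would then be handed to the indexer; unsupported types are reported as null so the caller can decide whether to skip or log them.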
RC3 release
Hi,

I have been using an older release from back when Lucene was not under Jakarta. I just tried the released RC3 version of the apache.lucene libs, and I was getting errors while indexing documents. Usually, there is a write.lock file left in the index dir.

I did see some e-mails on a related subject (RE: problems with last patch (obtain write.lock while deleting documents)). I think Doug fixed this on Feb 11th.

I am at a point in my development of a search engine using Lucene where I need to put the new apache.lucene libs in. Are there any release notes on RC3? Also, how soon will the write.lock fix be released officially?

Thanks!
RE: using lucene with a very large index
> From: tal blum [mailto:[EMAIL PROTECTED]]
>
> 2) Does the Document id change after merging indexes, or after adding or deleting documents?

Yes.

> 4) Assuming I have a term query that has a large number of hits, say 10 million, is there a way to get, say, the top 10 results without going through all the hits?

Your best bet is to use the normal search API.

> From: tal blum [mailto:[EMAIL PROTECTED]]
>
> One solution to that is to change the implementation and store the docs sorted by their term score.

That would make incremental index updates much slower, since every time a document is added, the list of documents containing each term in that document would need to be re-sorted. Currently we only need to append new entries, which is much simpler. You could optimize this in various ways (e.g., instead take the hit at search time) but it would still make things slower for rapidly changing indexes.

Also, while this would make single-term queries faster, multi-term queries are more complex to accelerate. The highest scoring match for a two-term query may be in a document where one term has a very high weight and the other has a very low weight. There have been papers written (I don't have the references handy) exploring this issue, and, in general, there isn't an algorithm that is guaranteed to return the highest scoring documents for multi-term queries that does not in most cases have to process nearly all of the documents containing those terms.

That said, it is possible to use such an index to vastly accelerate searches that *usually* return the highest scoring documents. Such a heuristic search technique is among the things required to scale Lucene to extremely large collections (e.g., hundreds of millions of documents).

There are also lower-tech optimizations. For example, one can simply keep a small index containing the highest-quality documents that is always searched first. If enough hits are found there, you're done.
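The "search a small high-quality index first" idea can be sketched as follows. This is a minimal illustration, assuming each index tier can be modelled as a function from a query string to a ranked list of hits; the TieredSearch class and its parameters are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of tiered searching; each tier stands in for a real index searcher.
final class TieredSearch {
    // Tiers ordered from the highest-quality (smallest) index to the lowest-quality (largest).
    private final List<Function<String, List<String>>> tiers;

    TieredSearch(List<Function<String, List<String>>> tiers) {
        this.tiers = tiers;
    }

    // Searches the tiers in order and stops as soon as enough hits have been collected.
    List<String> search(String query, int wanted) {
        List<String> hits = new ArrayList<>();
        for (Function<String, List<String>> tier : tiers) {
            for (String hit : tier.apply(query)) {
                if (hits.size() == wanted) {
                    return hits;
                }
                hits.add(hit);
            }
            if (hits.size() >= wanted) {
                break; // enough hits: the larger tiers are never searched
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Function<String, List<String>> best = q -> Arrays.asList("best-1", "best-2");
        Function<String, List<String>> rest = q -> Arrays.asList("rest-1", "rest-2", "rest-3");
        TieredSearch search = new TieredSearch(Arrays.asList(best, rest));
        System.out.println(search.search("lucene", 2)); // satisfied by the first tier alone
        System.out.println(search.search("lucene", 4)); // falls through to the second tier
    }
}
```

Note that this sketch simply concatenates tiers; it does not merge hits by score across tiers, which is part of why such a scheme only *usually* returns the best documents.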
A real internet search engine combines lots of tricks in order to scale: segmenting indexes by quality; heuristic search methods; and distributed searching. Deploying something like Google is not a small task.

I would someday like to add a heuristic search component to Lucene that uses a special index format (possibly with term document lists sorted by normalized frequency, as you suggest). I have some experience doing this at Excite, and it pays off big time. But it would take me several weeks full-time to implement this, and I don't currently have that time. Perhaps (with the support of an interested sponsor) I could make time this summer to implement this.

In the meantime, if you encounter performance problems with a very large index, you might try segmenting your index by document quality and/or distributed search.

Doug
RE: write.lock file
I cannot replicate the problem you are having. Can you please submit a complete, self-contained test case illustrating the problem you are having with the write lock?

Please test this against the latest nightly build of Lucene, from:

http://jakarta.apache.org/builds/jakarta-lucene/nightly/

Thanks,
Doug
RE: Searching multiple fields in one Index of Documents
Can you zip up those files or change the .js extension to .txt? My mail server strips out potentially harmful files.

Thanks,
Mark

-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 10:32 PM
To: Lucene Users List
Subject: Re: Searching multiple fields in one Index of Documents

Peter,

As advised, re-released under APL. :)

There were some changes to QueryParser constructors in rc3, and these are reflected here as well.

FWIW, I've also attached a JavaScript lib and accompanying HTML which constructs a Lucene multi-field query using an HTML form.

Regards,
Kelvin

----- Original Message -----
From: Peter Carlson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, February 13, 2002 10:56 PM
Subject: Re: Searching multiple fields in one Index of Documents

This is great, Kelvin. Sorry I didn't see it before. I'll add it to the list of contributions.

--Peter

On 2/13/02 12:43 AM, Kelvin Tan [EMAIL PROTECTED] wrote:

Charles,

See http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html

Regards,
K

----- Original Message -----
From: Charles Harvey [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, February 12, 2002 8:39 AM
Subject: Searching multiple fields in one Index of Documents

I have a working installation of Lucene running against indexes created by a database query. Each Document in the Index contains fifteen or twenty fields. I am currently searching only one field (that contains concatenated database columns) because I cannot figure out how to search multiple fields.

So: How can I use Lucene to search more than one field in an Index of Documents?

eg: field CATEGORY is (or contains) 'bar' AND field BODY contains 'foo'

_
The trouble with the rat-race is that even if you win you're still a rat.
--Lily Tomlin
_
Charles Harvey
Developer
http://www.philly.com
Wk: 215 789 6057
Cell: 215 588 0851
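For Charles's example, Lucene's query syntax lets each clause name its field as field:term, so the search he describes can be written as CATEGORY:bar AND BODY:foo. The helper below is only a sketch that assembles such a string from field/term pairs. The MultiFieldQuery class is hypothetical (it is not Kelvin's contribution and not part of Lucene), it does no escaping of special characters, and handing the string to QueryParser is left out because, as noted in the thread, the QueryParser constructors changed in rc3.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: builds a field-qualified query string.
final class MultiFieldQuery {
    // Joins field:term clauses with AND so that every clause must match.
    static String allOf(Map<String, String> termsByField) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> clause : termsByField.entrySet()) {
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(clause.getKey()).append(':').append(clause.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the clauses in insertion order.
        Map<String, String> clauses = new LinkedHashMap<>();
        clauses.put("CATEGORY", "bar");
        clauses.put("BODY", "foo");
        System.out.println(allOf(clauses)); // prints CATEGORY:bar AND BODY:foo
    }
}
```

The field names must match the names used when the Documents were indexed, and each field has to be indexed (not merely stored) for its clause to match anything.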
RE: My own stemmer (brazilian)
I know this has nothing to do with this list, but please give some help!

I downloaded ANT and installed it, setting the classpath with all its jar files. Then I tried to compile Lucene using the suggested command, ANT COMPILE, and I got the following message:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
D:\Java\lucene-1.2-rc2>..\jakarta-ant-1.4.1\bin\ant compile
Buildfile: build.xml

init:

javacc_check:

compile:

BUILD FAILED

D:\Java\lucene-1.2-rc2\build.xml:92: Could not create task of type: javacc. Common solutions are to use taskdef to declare your task, or, if this is an optional task, to put the optional.jar in the lib directory of your ant installation (ANT_HOME).

Total time: 2 seconds
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

I'm absolutely ignorant about ANT. What is missing? Am I too far from the solution (if so, I promise to study more)? Where can I find the 'optional.jar' file? Please, can someone give me some clue?

bye
jk

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 9:33 PM
To: Lucene Users List
Subject: Re: My own stemmer (brazilian)

That file is created during the build process. Try building Lucene by typing 'ant compile'.

Otis

--- Bizu_de_Anúncio [EMAIL PROTECTED] wrote:

My Brazilian stemmer has the same structure as the German stemmer, except for the inner logic. I created it, tested it, and now I'm trying to compile it with no success. The problem is the 'StandardTokenizer.java' class! I can't find it in the package org.apache.lucene.analysis.standard. The only file that exists there is a file named 'StandardTokenizer.jj'. What is this file for? I have lucene-1.2-rc2.

Can someone help me?

thanks,
jk
Re: indexing and searching different file formats
Uhmmm, I can contribute something which does a pretty decent job if anyone's interested... Just have to clean it up a little...

Regards,
Kelvin

----- Original Message -----
From: W. Eliot Kimber [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 15, 2002 1:10 AM
Subject: Re: indexing and searching different file formats

Andrew Libby wrote:

> and the text needs to be retrieved for indexing. An extreme example is PDF, which has a considerably complicated document format.

The PJ library from www.etymon.com provides a pretty complete and easy-to-use API for getting info from PDF docs. It wouldn't be too hard to write a PDF indexer for Lucene using this library. The main challenge would be guessing word boundaries in strings where spaces have been replaced with explicit shift values by the formatter.

Cheers,
Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International
1016 La Posada Dr., Suite 240
Austin, TX 78752
Phone: 512.656.4139
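Eliot's word-boundary problem can be illustrated with a simple threshold heuristic. The sketch assumes the PDF library hands back a run of text fragments, each preceded by a horizontal gap measured in thousandths of the font size, with a positive value meaning extra space. The Fragment type, that sign convention, and the 200-unit threshold are all assumptions made for illustration; none of them is taken from the PJ API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Fragment is not a type from the PJ library.
final class WordBoundaryGuesser {
    static final class Fragment {
        final int gapBefore; // extra horizontal space before this fragment, in 1/1000 of the font size
        final String text;

        Fragment(int gapBefore, String text) {
            this.gapBefore = gapBefore;
            this.text = text;
        }
    }

    // Rebuilds a line of text, inserting a space wherever the gap reaches the threshold.
    static String join(List<Fragment> fragments, int spaceThreshold) {
        StringBuilder line = new StringBuilder();
        for (Fragment fragment : fragments) {
            if (line.length() > 0 && fragment.gapBefore >= spaceThreshold) {
                line.append(' ');
            }
            line.append(fragment.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Fragment> run = Arrays.asList(
                new Fragment(0, "Se"),
                new Fragment(30, "arch"),  // small kerning adjustment inside a word
                new Fragment(280, "en"),   // large gap: probably a word break
                new Fragment(15, "gine"));
        System.out.println(join(run, 200)); // prints Search engine
    }
}
```

A fixed threshold is the weak point: a sensible value depends on the font and on how tightly the formatter justified the line, so a real indexer would probably need to tune it per document or per font.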
Re: Searching multiple fields in one Index of Documents
As requested,

http://www.relevanz.com/lucene_contrib.zip

Regards,
Kelvin

----- Original Message -----
From: Mark Tucker [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, February 15, 2002 2:03 AM
Subject: RE: Searching multiple fields in one Index of Documents

Can you zip up those files or change the .js extension to .txt? My mail server strips out potentially harmful files.

Thanks,
Mark

-----Original Message-----
From: Kelvin Tan [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 13, 2002 10:32 PM
To: Lucene Users List
Subject: Re: Searching multiple fields in one Index of Documents

Peter,

As advised, re-released under APL. :)

There were some changes to QueryParser constructors in rc3, and these are reflected here as well.

FWIW, I've also attached a JavaScript lib and accompanying HTML which constructs a Lucene multi-field query using an HTML form.

Regards,
Kelvin

----- Original Message -----
From: Peter Carlson [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, February 13, 2002 10:56 PM
Subject: Re: Searching multiple fields in one Index of Documents

This is great, Kelvin. Sorry I didn't see it before. I'll add it to the list of contributions.

--Peter

On 2/13/02 12:43 AM, Kelvin Tan [EMAIL PROTECTED] wrote:

Charles,

See http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00176.html

Regards,
K

----- Original Message -----
From: Charles Harvey [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, February 12, 2002 8:39 AM
Subject: Searching multiple fields in one Index of Documents

I have a working installation of Lucene running against indexes created by a database query. Each Document in the Index contains fifteen or twenty fields. I am currently searching only one field (that contains concatenated database columns) because I cannot figure out how to search multiple fields.

So: How can I use Lucene to search more than one field in an Index of Documents?

eg: field CATEGORY is (or contains) 'bar' AND field BODY contains 'foo'

_
The trouble with the rat-race is that even if you win you're still a rat.
--Lily Tomlin
_
Charles Harvey
Developer
http://www.philly.com
Wk: 215 789 6057
Cell: 215 588 0851
Re: excluding files / refining search
Brian Rook [EMAIL PROTECTED] writes:

> The site I'm working on has a lot of small HTML files that are used for page construction (nav bars, footers, etc.), and they're being returned high in the results because they contain the search term(s) I'm looking for and are small, so they rank higher than larger documents. I want to exclude them from the index, and I've come up with two ideas:
>
> 1) move them to a directory, which I will exclude from the index, but I'll have to change a bunch of links
> 2) detect them with some sort of flag and exclude them from the index. We were thinking that we could have a fake tag that Lucene would detect and not index those pages.

Why not just have an exclude list of some sort? In the code you wrote to select files for indexing, just have it check against a list of files you want to exclude.

In the demo application, you would edit jakarta-lucene/src/demo/org/apache/lucene/demo/IndexFiles.java

The quick and dirty method would be to edit this section of code:

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
      if (file.isDirectory()) {
        String[] files = file.list();
        for (int i = 0; i < files.length; i++)
          indexDocs(writer, new File(file, files[i]));
      } else {
        System.out.println("adding " + file);
        writer.addDocument(FileDocument.Document(file));
      }
    }

To something like this:

    public static void indexDocs(IndexWriter writer, File file) throws Exception {
      if (file.isDirectory()) {
        String[] files = file.list();
        for (int i = 0; i < files.length; i++)
          indexDocs(writer, new File(file, files[i]));
      } else {
        if (checkFileName(file)) {
          System.out.println("adding " + file);
          writer.addDocument(FileDocument.Document(file));
        } else {
          System.out.println("skipping " + file);
        }
      }
    }

    // Returns true if the file should be indexed, false if it is on the exclude list.
    public static boolean checkFileName(File file) {
      String name = file.getName();
      // Use equals(): == compares object references, not string contents.
      if (name.equals("footer.html") || name.equals("header.html")
          || name.equals("menu.html") || name.equals("navbar.html")) {
        return false;
      }
      return true;
    }

A more realistic implementation would use an exclude file of filenames
to ignore, load them into a collection (probably a HashSet) and keep that collection around as an instance variable. Then checkFileName() just returns !excludedSet.contains(name).

Steven J. Owens
[EMAIL PROTECTED]
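The HashSet variant Steven describes might look roughly like this. It is a sketch under stated assumptions: the ExcludeList class name, the one-name-per-line file format, and the '#' comment convention are made up for illustration and are not part of the Lucene demo.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: ExcludeList is not part of Lucene or its demo code.
final class ExcludeList {
    private final Set<String> excludedSet = new HashSet<>();

    // Reads one file name per line; blank lines and lines starting with '#' are ignored
    // (both conventions are assumptions of this sketch).
    ExcludeList(File excludeFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(excludeFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty() && !name.startsWith("#")) {
                    excludedSet.add(name);
                }
            }
        }
    }

    // Returns true if the file should be indexed, matching on the bare file name only.
    boolean checkFileName(File file) {
        return !excludedSet.contains(file.getName());
    }

    public static void main(String[] args) throws IOException {
        // Write a small exclude file to a temporary location for the demonstration.
        File list = File.createTempFile("exclude", ".txt");
        list.deleteOnExit();
        try (PrintWriter out = new PrintWriter(list)) {
            out.println("footer.html");
            out.println("navbar.html");
        }
        ExcludeList excludes = new ExcludeList(list);
        System.out.println(excludes.checkFileName(new File("site/footer.html"))); // false
        System.out.println(excludes.checkFileName(new File("site/index.html")));  // true
    }
}
```

Because the lookup uses only file.getName(), a footer.html in any directory is skipped; matching on full paths instead would be a small change if that is too broad.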