Re: Lucene crawler plan

2003-06-30 Thread Peter Becker
Peter Becker wrote: [...about the UNIX "file" command...] The idea is to recognize files by certain parts in them instead of using the extensions. The result of the classic file command is a user-readable string, although there have been extensions to MIME types. Unfortunately I can't find a p

Re: Lucene crawler plan

2003-06-30 Thread Peter Becker
Sorry Jack, after I sent my mail I realized that even many Unix users don't know that command, and even fewer people on other platforms do. Here are the relevant UNIX man pages: http://unixhelp.ed.ac.uk/CGI/man-cgi?file http://unixhelp.ed.ac.uk/CGI/man-cgi?magic+5 The idea is to recognize fi
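
To illustrate the idea discussed in this thread (recognizing a file by its leading bytes rather than by its extension), here is a minimal Java sketch. It is not an implementation of file(1) and not code from the thread: the class name MagicSniffer, its methods and the four-entry signature table are hypothetical, chosen only for illustration, and a real magic(5) database is far larger and supports offsets and typed comparisons.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: guess a MIME type from leading "magic" bytes. */
final class MagicSniffer {
    // Tiny illustrative table (assumption); checked in insertion order.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png",
                new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("image/gif", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns the first matching type, or a generic fallback if nothing matches. */
    static String guessMimeType(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    /** Reads up to 16 leading bytes of a file and classifies them. */
    static String guessMimeType(Path file) throws IOException {
        byte[] buffer = new byte[16];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.read(buffer); // may be -1 for an empty file
            return guessMimeType(Arrays.copyOf(buffer, Math.max(read, 0)));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler could call MagicSniffer.guessMimeType(path) before choosing a document handler, so that a PDF saved under the wrong extension is still routed to the PDF parser.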

Re: Lucene crawler plan

2003-06-30 Thread Victor Hadianto
> >does anyone know of a Java implementation for file(1) magic? > > Peter, > Can you explain what file(1) magic is? From: $man 1 file File tests each argument in an attempt to classify it. There are three sets of tests, performed in this order: filesystem tests, magic number tests, and langua

Re: Lucene crawler plan

2003-06-30 Thread Jack Park
At 07:21 PM 6/30/2003, you wrote: does anyone know of a Java implementation for file(1) magic? Peter, Can you explain what file(1) magic is? I feel dense. I'd like to help if I can. Thanks Jack

Re: Lucene crawler plan

2003-06-30 Thread Peter Becker
Thanks Erik, this is far closer to what we are looking for. Using Ant is an interesting idea, although it probably won't help us for the UI tool. But we could try to layer things so we could use them for both -- we want to get some more sophisticated index management anyway. The option to crea

Re: Lucene crawler plan

2003-06-30 Thread Erik Hatcher
If you are after a pure file system indexing abstraction, check out the 'ant' project in the sandbox. It's got a DocumentHandler abstraction allowing it to be a bit pluggable. It's not perfect, but it has worked for me for quite some time quite sufficiently. Erik On Monday, June 30, 2003,

Re: Lucene crawler plan

2003-06-30 Thread Peter Becker
Clemens Marschner wrote: There's an experimental webcrawler in the lucene-sandbox area called larm-webcrawler (see http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html), and a project on Sourceforge (http://larm.sf.net) that tries to leverage this on a higher level. I want to en

DO NOT REPLY [Bug 21189] - Hits.length() returns to large value

2003-06-30 Thread bugzilla
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT . ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE. http://nagoya.apache.org/bugzilla/show_bu

Re: Similarity byteToFloat() and floatToByte()

2003-06-30 Thread Doug Cutting
Jim Hargrave wrote: This brings up a general question. Why all the 'final' classes? Is it a performance trick? Personally I would trade a little performance to have all the classes open to inheritance. More liberal use of public and protected would also be appreciated. Final declarations made some

Re: Similarity byteToFloat() and floatToByte()

2003-06-30 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I would like to remove our mod to Lucene by taking advantage of the scorer API and writing our own Similarity class that has a method to transform an int rank to a float boost. Unfortunately, byteToFloat() is private, so my Similarity class cannot make use of it. Try Simil

Re: Wildcard prefix

2003-06-30 Thread Doug Cutting
Pete Lewis wrote: The only real functionality that Lucene lacks that is supplied by other search engines is the wildcard prefix. The most efficient way to do this is to, when indexing, add text to two fields, one the normal field, and another using an analyzer that reverses the text of each token
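
The reversed-token approach described above can be sketched in plain Java, without the Lucene API. This is only an illustration under stated assumptions: the class ReversedTokens and its methods are hypothetical, tokenization is a simple whitespace split with lowercasing, and in Lucene the reversal would be done by a custom analyzer feeding a second field. The point shown is that a leading-wildcard pattern such as *ing becomes an ordinary prefix pattern (gni*) against the reversed tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the "reversed field" trick for leading wildcards. */
final class ReversedTokens {

    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    /** Tokens as they would be stored in the second, reversed field (assumed tokenizer). */
    static List<String> reversedField(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(reverse(token));
            }
        }
        return tokens;
    }

    /** Rewrites a leading-wildcard pattern such as "*ing" into the prefix pattern "gni*". */
    static String toReversedPrefixQuery(String pattern) {
        if (pattern.length() < 2 || !pattern.startsWith("*") || pattern.indexOf('*', 1) >= 0) {
            throw new IllegalArgumentException("expected exactly one leading '*': " + pattern);
        }
        return reverse(pattern.substring(1)) + "*";
    }

    /** True if any reversed token starts with the prefix in front of the trailing '*'. */
    static boolean matches(List<String> reversedTokens, String reversedPrefixQuery) {
        String prefix = reversedPrefixQuery.substring(0, reversedPrefixQuery.length() - 1);
        for (String token : reversedTokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

The cost of this design is a second copy of every token in the index; the benefit is that the suffix search runs as a prefix scan instead of enumerating every term.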

DO NOT REPLY [Bug 21128] - Mixed case and keyword fields don't mix?

2003-06-30 Thread bugzilla
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT . ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE. http://nagoya.apache.org/bugzilla/show_bu

Re: Query for term frequency

2003-06-30 Thread Doug Cutting
Giulio Cesare Solaroli wrote: One of our users would like to find all documents where a given term is present more than a given number of times. This is not supported by the existing query classes. Doug

DO NOT REPLY [Bug 21189] New: - Hits.length() returns to large value

2003-06-30 Thread bugzilla
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT . ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE. http://nagoya.apache.org/bugzilla/show_bu

Re: Lucene crawler plan

2003-06-30 Thread Clemens Marschner
There's an experimental webcrawler in the lucene-sandbox area called larm-webcrawler (see http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html), and a project on Sourceforge (http://larm.sf.net) that tries to leverage this on a higher level. I want to encourage you to go on that

ANN: Docco-0.1

2003-06-30 Thread Peter Becker
Hello all, since I couldn't find any rules against it in the mailing list guide, I assume that announcements for project releases are ok on this list. Please correct me if I am wrong. We just released a first version of a little personal document retrieval tool based on Lucene, POI, PDFBox and