XML Lucene Indexing Package Updated

2002-05-15 Thread W. Eliot Kimber
Client/runLuceneClient.bat script (on Windows) and it should just work. If it doesn't, let me know. Cheers, Eliot -- W. Eliot Kimber, [EMAIL PROTECTED] Consultant, ISOGEN International 1016 La Posada Dr., Suite 240 Austin, TX 78752 Phone: 512.656.4139 -- To unsubscribe, e-mail: <mailto:[E

Re: PDF4J Project: Gathering Feature Requests

2002-05-07 Thread W. Eliot Kimber
y experience using JNI to expose C libraries). Thanks for the tip. Cheers, Eliot

Re: PDF4J Project: Gathering Feature Requests

2002-05-06 Thread W. Eliot Kimber
from various non-PDF inputs). Our main writing use case is the rewriting of existing PDFs following some amount of manipulation through our API. A caution: I am still waiting to get approval from my employers to do this work as open source--it may be a while before I can even start on the coding.

PDF4J Project: Gathering Feature Requests

2002-05-06 Thread W. Eliot Kimber
Lucene integrators would want from a PDF access library. Thanks, Eliot

Re: indexing PDF files

2002-05-03 Thread W. Eliot Kimber
have to implement Adobe's layout logic. However, you need this functionality in order to correlate PDF annotations (links, bookmarks, notes) to the page objects they relate to--it's all done with bounding boxes. Cheers, Eliot
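
The bounding-box correlation mentioned above can be illustrated with a rough sketch. This is not code from the message or from any PDF library: the `Box` type, the `objectsUnder` method, and the coordinates are hypothetical, and the sketch assumes that each page object's bounding box has already been computed in the same coordinate space as the annotation's rectangle (lower-left and upper-right corners).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: match an annotation rectangle to the page
// objects whose bounding boxes overlap it. Not part of any PDF library.
class AnnotationMatcher {

    /** Axis-aligned box given by lower-left (llx, lly) and upper-right (urx, ury) corners. */
    static final class Box {
        public final String label;
        public final double llx, lly, urx, ury;

        public Box(String label, double llx, double lly, double urx, double ury) {
            this.label = label;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        /** True when the two boxes share some interior area. */
        public boolean overlaps(Box other) {
            return llx < other.urx && other.llx < urx
                && lly < other.ury && other.lly < ury;
        }
    }

    /** Returns the page objects whose boxes overlap the annotation's box. */
    static List<Box> objectsUnder(Box annotation, List<Box> pageObjects) {
        List<Box> matches = new ArrayList<>();
        for (Box candidate : pageObjects) {
            if (annotation.overlaps(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Box link = new Box("link", 100, 100, 200, 120);
        List<Box> objects = new ArrayList<>();
        objects.add(new Box("word A", 110, 105, 150, 115));
        objects.add(new Box("word B", 300, 100, 350, 115));
        for (Box hit : objectsUnder(link, objects)) {
            System.out.println(hit.label);
        }
    }
}
```

Computing the page objects' boxes in the first place is the hard part the message refers to; the sketch only covers the comparison step once those boxes exist.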

Re: XML indexer

2002-03-21 Thread W. Eliot Kimber
the hard work of integrating this technique into one of our customer's systems will be presenting a paper on his experience at the XML Europe conference in Barcelona, Spain in May (http://www.idealliance.org/). Cheers, E.

Re: indexing and searching different file formats

2002-02-14 Thread W. Eliot Kimber
be too hard to write a PDF indexer for Lucene using this library. The main challenge would be guessing word boundaries in strings where spaces have been replaced with explicit shift values by the formatter. Cheers, Eliot
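
One possible heuristic for the word-boundary problem described above, sketched under stated assumptions: the text has already been extracted as the operand list of a PDF `TJ` operator (strings interleaved with numeric adjustments in thousandths of a text-space unit, where a negative number widens the gap before the next glyph), and any gap at or beyond a chosen threshold is treated as a space. The class name, method name, and the threshold of 200 are hypothetical; a real extractor would need to tune the threshold per font and also handle positioning done through other operators.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of guessing spaces from TJ-style shift values.
// Not taken from the library discussed in the thread.
class TjWordGuesser {

    /**
     * Joins the string elements of a TJ-style operand list, inserting a
     * space wherever a numeric adjustment opens a gap of at least
     * gapThreshold (in thousandths of a text-space unit).
     */
    static String joinWithGuessedSpaces(List<Object> elements, double gapThreshold) {
        StringBuilder text = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof String) {
                text.append((String) element);
            } else if (element instanceof Number) {
                // Negative adjustments move the next glyph to the right.
                double gap = -((Number) element).doubleValue();
                boolean endsWithSpace =
                    text.length() > 0 && text.charAt(text.length() - 1) == ' ';
                if (gap >= gapThreshold && text.length() > 0 && !endsWithSpace) {
                    text.append(' ');
                }
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Object> operands =
            Arrays.<Object>asList("Hel", -20, "lo", -300, "wor", 15, "ld");
        System.out.println(joinWithGuessedSpaces(operands, 200));
    }
}
```

Small adjustments (kerning) are ignored and only large rightward shifts become spaces, which is why the choice of threshold matters so much in practice.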

XML Indexing With Lucene: New Location For Package

2002-02-01 Thread W. Eliot Kimber
You can now find our package for doing XML indexing with Lucene on the ISOGEN web site: http://www.isogen.com/papers/lucene_xml_indexing.html The package (lucene_xml_indexing.zip) includes all the 3rd-party libraries it depends on (Lucene, Xerces 1.4.4, junit). This package is provided as-is an

Re: Zones

2002-01-25 Thread W. Eliot Kimber
"Ogren, Philip V." wrote: > We are indexing a large corpus of XML documents (~10M). One thing that > Verity does with XML notes is that it indexes each XML tag as a zone.* > What's cool about it is that the zones are nested so that it mirrors the > schema of your XML document. You can limit your

Re: Efficient doc information retrieval.

2001-11-14 Thread W. Eliot Kimber
Winton Davies wrote: > > Hi Eliot, > > Not really, all documents have an accountID, but I need to search > all the documents > first, and each document that is returned has an accountID, but I > just want one document > per accountID. I see the problem. Can't think of any other way to solve i

Re: Efficient doc information retrieval.

2001-11-14 Thread W. Eliot Kimber
Winton Davies wrote: > > Hi all, > In my application, I have to be able to return a list of documents, > that have been uniqified according to an accountID. The most relevant > document for an accountID is returned, and then subsequent hits that > have the same accountID are dropped. Do you me
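
The de-duplication described in the quoted question (keep the most relevant document per accountID, drop later hits for the same account) can be sketched as a post-processing pass over the ranked results. This is only an illustration, not the solution proposed in the thread: the `Hit` class is a hypothetical stand-in rather than a Lucene type, and the sketch assumes the hits arrive already sorted by descending relevance with the accountID available on each hit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration: keep only the first (most relevant) hit per account.
class AccountDedup {

    /** Minimal stand-in for a search hit; not a Lucene class. */
    static final class Hit {
        public final String accountId;
        public final String docId;
        public final float score;

        public Hit(String accountId, String docId, float score) {
            this.accountId = accountId;
            this.docId = docId;
            this.score = score;
        }
    }

    /**
     * Assumes rankedHits is sorted by descending score. Returns the first
     * hit seen for each accountId, preserving the original ranking order.
     */
    static List<Hit> bestPerAccount(List<Hit> rankedHits) {
        Set<String> seenAccounts = new HashSet<>();
        List<Hit> unique = new ArrayList<>();
        for (Hit hit : rankedHits) {
            // Set.add returns false when the account was already seen.
            if (seenAccounts.add(hit.accountId)) {
                unique.add(hit);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        ranked.add(new Hit("acct-1", "doc-a", 0.9f));
        ranked.add(new Hit("acct-2", "doc-b", 0.8f));
        ranked.add(new Hit("acct-1", "doc-c", 0.7f));
        for (Hit hit : bestPerAccount(ranked)) {
            System.out.println(hit.accountId + " " + hit.docId);
        }
    }
}
```

The pass is linear in the number of hits, but it still requires reading the accountID for every hit, which is the retrieval cost the thread title is concerned with.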

XML Indexing Samples

2001-10-16 Thread W. Eliot Kimber
I have put together a hopefully useful package that demonstrates our current experiments with using Lucene for XML indexing. You can get the files by anonymous ftp from che.isogen.com, /outgoing/lucene. There are two zip files: - lucene_xml_indexing.zip This is the core indexing code and a l

Re: Trying To Understand Query Syntax Details

2001-10-16 Thread W. Eliot Kimber
er 1, 2001. Cool--thanks! E. -- . . . . . . . . . . . . . . . . . . . . . . . . W. Eliot Kimber | Lead Brain 1016 La Posada Dr. | Suite 240 | Austin TX 78752 T 512.656.4139 | F 512.419.1860 | [EMAIL PROTECTED] w w w . d a t a c h a n n e l . c o m

Trying To Understand Query Syntax Details

2001-10-16 Thread W. Eliot Kimber
hat Lucene supports date matching, but I don't see how to specify this in a query. Also, is there a description of the algorithm "~" uses? Thanks, E.

Indexing XML With Lucene: Some Initial Results

2001-10-14 Thread W. Eliot Kimber
ome back (for example, organizing the hits by XML document or doing additional context-based filtering that can't be done at the Lucene level). Cheers, Eliot

Another Indexing Question: Case Sensitivity

2001-10-13 Thread W. Eliot Kimber
E.

Re: Index Optimization: Which is Better?

2001-10-12 Thread W. Eliot Kimber
hing I know how to do with Verity, Fulcrum, Excalibur, etc. and it was freaky easy to do once we got the idea for the approach. I just hope it performs adequately. Cheers, E.

Index Optimization: Which is Better?

2001-10-11 Thread W. Eliot Kimber
the text (thus ignoring element-specific searching) might incur a performance penalty. In a related question, is there anything we can or need to do to optimize Lucene to handle lots of little Lucene documents? Thanks, Eliot