Re: MultiFieldQueryParser seems broken... Fix attached.
Hi Bill, I think more people are waiting for this patch to MultiFieldQueryParser. It would be nice if it were included in the next release candidate.

All the best, Sergiu

Bill Janssen wrote:
> René, Thanks for your note. I'd think that if a user specified a query "cutting lucene", with an implicit AND and the default fields "title" and "author", they'd expect to see a match in which both "cutting" and "lucene" appear. That is,
>   (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
> Instead, what they'd get using the current (broken) strategy of outer combination used by the current MultiFieldQueryParser would be
>   (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> Note that this would match even if only "lucene" occurred in the document, as long as it occurred both in the title field and in the author field. Or, for that matter, it would also match "Cutting on Cutting", by Doug Cutting :-).
> > http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
> Yes, the approach there is similar. I attempted to complete the solution and provide a working replacement for MultiFieldQueryParser.
> Bill

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: maximum index size
Chris Fraschetti wrote:
> I've seen throughout the list mentions of millions of documents... 8 million, 20 million, etc., but can Lucene potentially handle billions of documents and still efficiently search through them?

Lucene can currently handle up to 2^31 documents in a single index. To a large degree this is limited by Java ints and arrays (which are accessed by ints). There are also a few places where the file format limits things to 2^32.

On typical PC hardware, 2-3 word searches of an index with 10M documents, each with around 10k of text, require around 1 second, including index i/o time. Performance is more-or-less linear, so a 100M document index might require nearly 10 seconds per search. Thus, as indexes grow, folks tend to distribute searches in parallel to many smaller indexes. That's what Nutch and Google (http://www.computer.org/micro/mi2003/m2022.pdf) do.

Doug
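Doug's scaling strategy (split one huge index into many smaller ones and fan each search out to all of them) can be sketched in plain Java. This is only the shard/merge logic, with Maps standing in for the per-shard indexes; the names here are illustrative, not Lucene API, and a real deployment would use Lucene's searchers over each shard:

```java
import java.util.*;

// Minimal sketch of index sharding: documents are hash-partitioned
// across N shards, and a query is answered by searching every shard
// and merging the per-shard hits.
public class ShardedSearch {
    // Stable hash partition; keeps each shard well under the 2^31 doc limit.
    // (Ignores the Integer.MIN_VALUE hashCode edge case for brevity.)
    static int shardFor(String docId, int numShards) {
        return Math.abs(docId.hashCode() % numShards);
    }

    // Search every shard and merge the matching doc IDs.
    static List<String> searchAll(List<Map<String, String>> shards, String term) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> shard : shards) {   // in production: in parallel
            for (Map.Entry<String, String> e : shard.entrySet()) {
                if (e.getValue().contains(term)) hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int numShards = 3;
        List<Map<String, String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) shards.add(new HashMap<>());

        shards.get(shardFor("doc1", numShards)).put("doc1", "lucene in action");
        shards.get(shardFor("doc2", numShards)).put("doc2", "java performance");
        shards.get(shardFor("doc3", numShards)).put("doc3", "lucene scaling");

        List<String> hits = searchAll(shards, "lucene");
        Collections.sort(hits);                      // deterministic merge order
        System.out.println(hits);
    }
}
```

Because every shard is searched, a document's placement never affects recall; only latency depends on how evenly the hash spreads documents.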
Re: maximum index size
Given adequate hardware, it can. Take a look at nutch.org. Nutch uses Lucene at its core.

Otis

--- Chris Fraschetti <[EMAIL PROTECTED]> wrote:
> I know the index size is very dependent on the content being indexed, but running on a Unix-based machine w/o a filesize limit, best case scenario, what is the largest number of documents that can be indexed?
> I've seen throughout the list mentions of millions of documents... 8 million, 20 million, etc., but can Lucene potentially handle billions of documents and still efficiently search through them?
maximum index size
I know the index size is very dependent on the content being indexed, but running on a Unix-based machine w/o a filesize limit, best case scenario, what is the largest number of documents that can be indexed?

I've seen throughout the list mentions of millions of documents... 8 million, 20 million, etc., but can Lucene potentially handle billions of documents and still efficiently search through them?
Re: indexing size
Niraj Alok wrote:
> Hi PA, Thanks for the detail! Since we are using Lucene to store the data also, I guess I would not be able to use it.

By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with 35% was, I think, to illustrate that the index data structures used for searching by Lucene are efficient. But Lucene does nothing special about stored content: no compression or anything like that. So you end up with the pure size of your data plus the 35% of the indexed data.

Cheers. Dmitry.

> Regards, Niraj
> - Original Message - From: "petite_abeille" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 01, 2004 1:14 PM Subject: Re: indexing size
>
> Hi Niraj,
>
> On Sep 01, 2004, at 06:45, Niraj Alok wrote:
> > If I make some of them Field.Unstored, I can see from the javadocs that it will be indexed and tokenized but not stored. If it is not stored, how can I use it while searching?
>
> The different types of fields don't impact how you do your search. This is always the same. Using Unstored fields simply means that you use Lucene as a pure index, for search purposes only, not for storing any data. Specifically, the assumption is that your original data lives somewhere else, outside of Lucene. If this assumption is true, then you can index everything as Unstored with the addition of one Keyword per document. The Keyword field holds some sort of unique identifier which allows you to retrieve the original data if necessary (e.g. a primary key, a URI, what not).
Here is an example of this approach:

(1) For indexing, check the indexValuesWithID() method: http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZIndex.java?view=markup. Note the addition of a Field.Keyword for each document and the use of Field.UnStored for everything else.

(2) For fetching, check objectsWithSpecificationAndHitsInStore(): http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/SZFinder.java?view=markup

HTH. Cheers, PA.
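The "pure index" pattern PA describes can be outlined without any Lucene types at all: the index maps terms to document IDs only, and the real content is resolved from an external store by that ID. The sketch below uses plain Maps as stand-ins for both the inverted index and the external database; a real implementation would use Field.UnStored for the text and Field.Keyword for the ID:

```java
import java.util.*;

// Sketch of "Lucene as a pure index": the index holds terms -> doc IDs
// only; the actual content is fetched from an external store by ID.
public class PureIndexPattern {
    // term -> set of document IDs (stand-in for the inverted index)
    static Map<String, Set<String>> index = new HashMap<>();
    // primary key -> original record (stand-in for the external database)
    static Map<String, String> externalStore = new HashMap<>();

    static void add(String id, String text) {
        externalStore.put(id, text);                 // data lives outside the index
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
    }

    static List<String> fetch(String term) {
        List<String> results = new ArrayList<>();
        for (String id : index.getOrDefault(term, Collections.emptySet())) {
            results.add(externalStore.get(id));      // resolve ID -> original data
        }
        return results;
    }

    public static void main(String[] args) {
        add("row-1", "Lucene indexing basics");
        add("row-2", "Database storage tips");
        System.out.println(fetch("lucene"));
    }
}
```

The point of the separation: the search side never needs the stored content, so the index stays small, and the ID is the only bridge back to wherever the data actually lives.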
Re: MultiFieldQueryParser seems broken... Fix attached.
René,

Thanks for your note. I'd think that if a user specified a query "cutting lucene", with an implicit AND and the default fields "title" and "author", they'd expect to see a match in which both "cutting" and "lucene" appear. That is,

  (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Instead, what they'd get using the current (broken) strategy of outer combination used by the current MultiFieldQueryParser would be

  (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)

Note that this would match even if only "lucene" occurred in the document, as long as it occurred both in the title field and in the author field. Or, for that matter, it would also match "Cutting on Cutting", by Doug Cutting :-).

> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar. I attempted to complete the solution and provide a working replacement for MultiFieldQueryParser.

Bill
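Bill's point can be made concrete with a tiny query-expansion routine: combine the fields per *term* (the inner level) and then require every per-term group, rather than combining terms per field. This is plain string manipulation to show the shape of the resulting query, not the actual parser code:

```java
import java.util.*;

// Demonstrates the inner-level combination Bill advocates: for terms
// t1..tn and fields f1..fm, build
//   +(f1:t1 f2:t1 ...) +(f1:t2 f2:t2 ...)
// so that EVERY term must match in SOME field.
public class InnerCombination {
    static String expand(String[] terms, String[] fields) {
        StringBuilder q = new StringBuilder();
        for (String term : terms) {
            List<String> clause = new ArrayList<>();
            for (String field : fields) clause.add(field + ":" + term);
            if (q.length() > 0) q.append(" ");
            // "+" marks the group as required (AND semantics)
            q.append("+(").append(String.join(" ", clause)).append(")");
        }
        return q.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};
        System.out.println(expand(terms, fields));
        // -> +(title:cutting author:cutting) +(title:lucene author:lucene)
    }
}
```

Swapping the two loops reproduces the broken outer combination, +(title:cutting title:lucene) +(author:cutting author:lucene), which is satisfied by a single term appearing in both fields.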
Re: IndexSearcher.close() and aborting searches in progress
Dave,

I haven't tried this, but I think this would be messy. Lucene needs to keep index files open, so that when you pull a Document from Hits, it can read this stuff from those files. If you close index files, you are likely to get some NPEs or some such. I don't think you'll find a ready-to-use API for this use case in Lucene. Instead, my guess is that you will have to manually keep track of your IndexSearcher's status (open/closed), and allow searches to return results only if status == open.

Otis

--- David Spencer <[EMAIL PROTECTED]> wrote:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
> What is the intent of IndexSearcher.close()?
> I want to know how, in a web app, one can stop a search that's in progress. The use case is a user limited to one search at a time; when one (expensive) search is running they decide it's taking too long, so they elaborate on the query and resubmit it. The goal is for the server to stop the search that's in progress and to start a new one. I know how to deal w/ session vars and so on in a web container - but can one stop a search that's in progress, and is that the intent of close()?
> I haven't done the obvious experiment but regardless, the javadoc is kinda terse so I wanted to hear from the all-knowing people on the list.
> thx, Dave
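Otis's suggestion (track the searcher's open/closed status yourself, and only deliver results while it is still open) can be sketched without any Lucene types. The wrapper class and its search method are hypothetical stand-ins; a real version would delegate to an IndexSearcher:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of manual open/closed tracking around a searcher. A search
// that finishes after close() has been called discards its results
// instead of returning them. This does not abort the work in progress;
// it only prevents a cancelled search from delivering results.
public class TrackedSearcher {
    private final AtomicBoolean open = new AtomicBoolean(true);

    public void close() { open.set(false); }

    // Stand-in for a real search; returns null if the searcher was closed.
    public String search(String query) {
        String results = "hits for " + query;   // ... expensive work here ...
        return open.get() ? results : null;     // only deliver if still open
    }

    public static void main(String[] args) {
        TrackedSearcher s = new TrackedSearcher();
        System.out.println(s.search("lucene"));  // delivered
        s.close();
        System.out.println(s.search("lucene"));  // null: searcher closed
    }
}
```

Note the limitation Otis implies: the expensive work still runs to completion; the status flag just keeps a superseded search from touching closed resources or returning stale hits.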
RE: Compound File Format question
Ahh - two new discoveries:

(1) You have to add a document, remove a document, and then call optimize. Then everything works (nearly as expected).

(2) The version of Lucene that ships with Luke still has the broken optimize code in it that didn't clean up after itself - so you need to just download Luke and then run it with Lucene 1.4.1, rather than with what it ships with (which the website indicates is 1.4 RC4).

Dan
Re: Full web search engine package using Lucene
Thanks a lot!

Ya

- Original Message - From: "Bernhard Messer" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 3:38 PM Subject: Re: Full web search engine package using Lucene

> Anne Y. Zhang wrote:
> > Thanks, David. But it seems that this is downloadable. Could you please provide me the link for download? Thank you very much!
>
> http://www.nutch.org/release/
RE: Compound File Format question
Hmm, I tried that in Luke - but it doesn't seem to take. When I uncheck the "use compound file" checkbox and then select optimize, it doesn't change anything. I guess I should just write some code already :)

Dan

-Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 2:37 PM To: Lucene Users List Subject: Re: Compound File Format question

> Armbrust, Daniel C. wrote:
> > Is it safe to change the compound file format option at any time during the life of an index? Can I build an index with it off, then turn it on, call optimize, and have a compound-file-formatted index? And then later, turn it off, call optimize again, and go back the other way?
>
> In my experience it's safe. I've been doing this in a couple of real applications, and also in Luke there is an option to re-pack the index using compound or not.
Re: Full web search engine package using Lucene
Anne Y. Zhang wrote:
> Thanks, David. But it seems that this is downloadable. Could you please provide me the link for download? Thank you very much!

http://www.nutch.org/release/

> Ya
> - Original Message - From: "David Spencer" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 2:43 PM Subject: Re: Full web search engine package using Lucene
> > Anne Y. Zhang wrote:
> > > Hi, I am assisting a professor for an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by Lucene. Since I am new to Lucene, can anyone provide me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much!
> > http://www.nutch.org/
Re: Compound File Format question
Armbrust, Daniel C. wrote:
> Is it safe to change the compound file format option at any time during the life of an index?
> Can I build an index with it off, then turn it on, call optimize, and have a compound-file-formatted index?
> And then later, turn it off, call optimize again, and go back the other way?

In my experience it's safe. I've been doing this in a couple of real applications, and also in Luke there is an option to re-pack the index using compound or not.

-- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org)
Re: Full web search engine package using Lucene
Thanks, David. But it seems that this is downloadable. Could you please provide me the link for download? Thank you very much!

Ya

- Original Message - From: "David Spencer" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 2:43 PM Subject: Re: Full web search engine package using Lucene

> Anne Y. Zhang wrote:
> > Hi, I am assisting a professor for an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by Lucene. Since I am new to Lucene, can anyone provide me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much!
>
> http://www.nutch.org/
>
> > Ya
Compound File Format question
Is it safe to change the compound file format option at any time during the life of an index?

Can I build an index with it off, then turn it on, call optimize, and have a compound-file-formatted index?

And then later, turn it off, call optimize again, and go back the other way?

The JavaDocs don't say much of anything about it. (Oh - and PS - there is a copy-and-paste error in the description for the getUseCompoundFile() method.)

Thanks, Dan
Re: Full web search engine package using Lucene
Anne Y. Zhang wrote:
> Hi, I am assisting a professor for an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by Lucene. Since I am new to Lucene, can anyone provide me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much!

http://www.nutch.org/
Full web search engine package using Lucene
Hi, I am assisting a professor for an IR course. We need to provide the students with a fully-functional search engine package, and the professor prefers it being powered by Lucene. Since I am new to Lucene, can anyone provide me some information on where I can get the package? We also want the package to contain the crawling function. Thank you very much!

Ya
IndexSearcher.close() and aborting searches in progress
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()

What is the intent of IndexSearcher.close()?

I want to know how, in a web app, one can stop a search that's in progress. The use case is a user limited to one search at a time; when one (expensive) search is running they decide it's taking too long, so they elaborate on the query and resubmit it. The goal is for the server to stop the search that's in progress and to start a new one. I know how to deal w/ session vars and so on in a web container - but can one stop a search that's in progress, and is that the intent of close()?

I haven't done the obvious experiment but regardless, the javadoc is kinda terse so I wanted to hear from the all-knowing people on the list.

thx, Dave
Re: where is the SnowBallAnalyzer?
It's in snowball-1.0.jar. I sent it to you in a private email.

Bye, Ernesto.

- Original Message - From: "Wermus Fernando" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 1:12 PM Subject: where is the SnowBallAnalyzer?

> I have to look better, but why isn't the SnowballAnalyzer in the org.apache.lucene.analysis.snowball package? I have Lucene 1.4. I'm doing my own Spanish stemmer.
Re: PDF->Text Performance comparison
Yes, that and a few other adjectives, but I didn't want to get carried away.

Ben

On Wed, 8 Sep 2004, Doug Cutting wrote:
> Ben Litchfield wrote:
> > PDFBox: slow PDF text extraction for Java applications
> > http://www.pdfbox.org
>
> Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java applications, with Lucene integration"?
>
> Doug
Re: PDF->Text Performance comparison
Ben Litchfield wrote:
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org

Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java applications, with Lucene integration"?

Doug
where is the SnowBallAnalyzer?
I have to look better, but why isn't the SnowballAnalyzer in the org.apache.lucene.analysis.snowball package? I have Lucene 1.4. I'm doing my own Spanish stemmer.
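For anyone writing their own stemmer, as Fernando is: a stemmer in the Snowball family is, at its core, ordered suffix stripping with length guards. Here is a toy Spanish-flavored sketch of the idea; the suffix list is a tiny illustrative sample and in no way the real Spanish Snowball rule set:

```java
// Toy suffix-stripping stemmer, illustrating the kind of rule-based
// reduction a SnowballAnalyzer applies per language. The suffix list
// below is illustrative only, NOT the actual Spanish Snowball rules.
public class ToyStemmer {
    // Longer/rarer suffixes first, so "mente" is tried before "es".
    static final String[] SUFFIXES = {
        "aciones", "amiento", "adora", "ando", "mente", "os", "as", "es"
    };

    static String stem(String word) {
        for (String s : SUFFIXES) {
            // Strip the first matching suffix, but keep a minimum stem length
            // so short words aren't mangled.
            if (word.endsWith(s) && word.length() - s.length() >= 3) {
                return word.substring(0, word.length() - s.length());
            }
        }
        return word; // no rule matched: leave the word alone
    }

    public static void main(String[] args) {
        System.out.println(stem("rapidamente")); // "mente" stripped -> rapida
        System.out.println(stem("libros"));      // "os" stripped    -> libr
    }
}
```

Real Snowball stemmers add vowel-region constraints (R1/R2/RV) and multi-pass rules; this sketch only shows the ordered suffix-matching skeleton they share.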
RE: -- TomCat/Lucene, filesystem
I think you might be referring to the xml files you keep in C:\Program Files\Apache\Tomcat\conf\Catalina\localhost. I have a file with the contents (myapp.xml):

-Original Message- From: Rupinder Singh Mazara [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 31, 2004 12:36 PM To: Lucene Users List; [EMAIL PROTECTED] Subject: RE: -- TomCat/Lucene, filesystem

> I have a web application using Lucene via Tomcat. You may need to set the correct permissions in your catalina.policy file. I use a blanket policy of
>
>   grant { permission java.io.FilePermission "/", "read"; };
>
> to allow access to Lucene.
>
> >-Original Message- From: J.Ph DEGLETAGNE [mailto:[EMAIL PROTECTED] Sent: 31 August 2004 17:12 To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: -- TomCat/Lucene, filesystem
> >
> >Hello Somebody, ...I beg your pardon... Under Windows XP / Tomcat, how do I "customize" the Lucene webapp to access a filesystem directory outside Tomcat? Like this: D:\Program Files\Apache Software Foundation\Tomcat 5.0\.. needing to access E:\Data.
> >
> >Thanks a lot, JPhD
Re: Moving from a single server to a cluster
We went through the same scenario as yours. We recently made our application clusterable, and I wrote our own version of a JDBC directory (similar to the SQLDirectory posted by someone) with our own caching. It was great for searching, but indexing had become a real bottleneck. So we have decided to move back to the filesystem for non-clustered apps.

I am still trying to figure out the best way (whether to use a RemoteSearcher or manage multiple indexes). I already tried multiple indexes, and we didn't really like the solution of maintaining multiple copies. It requires more space, more maintenance, all indexes need to be in sync, etc. I will be glad if I can get the best answer for this. Did anyone try RemoteSearchable, and how does it compare to the multiple-index solution?

Nader: I would appreciate it if you could send me the docs.

Praveen

- Original Message - From: "David Townsend" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 10:42 AM Subject: RE: Moving from a single server to a cluster

> Would it be cheeky to ask you to post the docs to the group? It would be interesting to read how you've tackled this.
-Original Message- From: Nader Henein [mailto:[EMAIL PROTECTED] Sent: 08 September 2004 13:57 To: Lucene Users List Subject: Re: Moving from a single server to a cluster

> Hey Ben,
> We've been using a distributed environment with three servers and three separate indices for the past 2 years, since the first stable Lucene release, and it has been great. Recently, and for the past two months, I've been working on a redesign for our Lucene app, and I've shared my findings and plans with Otis, Doug and Erik. They pointed out a few faults in my logic which you will probably come across soon enough, mainly to do with keeping your updates atomic (not too hard) and your deletes atomic (a little more tricky). Give me a few days and I'll send you both the early document and the newer version that deals squarely with Lucene in a distributed environment with a high-volume index.
> Regards, Nader Henein
>
> Ben Sinclair wrote:
> > My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index.
> > I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project.
> > What are other people doing to solve this problem?
Re: Moving from a single server to a cluster
It would be a pleasure; I just didn't want to mislead someone down the wrong path. Give me a few days and I'll have the new version up.

Nader
Re: PDF->Text Performance comparison
Ben,

Wow, thanks for the plug! :-) Truthfully, I was worried that our open-source brethren might feel slighted by the comparison -- that's partially why we wanted to make sure it was as thorough and transparent as possible, so that anyone could review the results for themselves. I'm glad that you're not at all sore.

Chas Emerick | [EMAIL PROTECTED] PDFTextStream: fast PDF text extraction for Java applications http://snowtide.com/home/PDFTextStream/

On Sep 8, 2004, at 10:41 AM, Ben Litchfield wrote:
> On Wed, 8 Sep 2004, Chas Emerick wrote:
> > PDFTextStream: fast PDF text extraction for Java applications
> > http://snowtide.com/home/PDFTextStream/
>
> For those that have not seen, snowtide.com has done a performance comparison against several Java PDF->Text libraries, including Snowtide's PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well done. http://snowtide.com/home/PDFTextStream/Performance
>
> PDFBox: slow PDF text extraction for Java applications
> http://www.pdfbox.org
>
> :) Ben
RE: Moving from a single server to a cluster
Would it be cheeky to ask you to post the docs to the group? It would be interesting to read how you've tackled this.

-Original Message- From: Nader Henein [mailto:[EMAIL PROTECTED] Sent: 08 September 2004 13:57 To: Lucene Users List Subject: Re: Moving from a single server to a cluster

> Hey Ben,
> We've been using a distributed environment with three servers and three separate indices for the past 2 years, since the first stable Lucene release, and it has been great. Recently, and for the past two months, I've been working on a redesign for our Lucene app, and I've shared my findings and plans with Otis, Doug and Erik. They pointed out a few faults in my logic which you will probably come across soon enough, mainly to do with keeping your updates atomic (not too hard) and your deletes atomic (a little more tricky). Give me a few days and I'll send you both the early document and the newer version that deals squarely with Lucene in a distributed environment with a high-volume index.
> Regards, Nader Henein
>
> Ben Sinclair wrote:
> > My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index.
> > I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project.
> > What are other people doing to solve this problem?
PDF->Text Performance comparison
On Wed, 8 Sep 2004, Chas Emerick wrote:
> PDFTextStream: fast PDF text extraction for Java applications
> http://snowtide.com/home/PDFTextStream/

For those that have not seen, snowtide.com has done a performance comparison against several Java PDF->Text libraries, including Snowtide's PDFTextStream, PDFBox, Etymon PJ and JPedal. It appears to be fairly well done. http://snowtide.com/home/PDFTextStream/Performance

PDFBox: slow PDF text extraction for Java applications
http://www.pdfbox.org

:) Ben
Re: pdf in Chinese
I'm not aware of any Java library that can reliably extract Chinese text from PDF documents. We're planning on supporting Chinese, Japanese, and Korean in version 2 of PDFTextStream, but there's no doubt that it's a huge challenge.

Chas Emerick | [EMAIL PROTECTED] PDFTextStream: fast PDF text extraction for Java applications http://snowtide.com/home/PDFTextStream/

On Sep 8, 2004, at 5:58 AM, [EMAIL PROTECTED] wrote:
> It is not about the analyzer; I need to read the text from the PDF file first.
>
> - Original Message - From: "Chandan Tamrakar" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 4:15 PM Subject: Re: pdf in Chinese
> > Which analyzer are you using to index Chinese PDF documents? I think you should use CJKAnalyzer.
> >
> > - Original Message - From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, September 08, 2004 11:27 AM Subject: pdf in Chinese
> > > Hi all, I use PDFBox to parse PDF files into Lucene documents. When I parse Chinese PDF files, PDFBox is not always successful. Does anyone have some advice?
Re: MultiFieldQueryParser seems broken... Fix attached.
The class is at the end of the message. But I think that a better solution is the one suggested by René: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Wermus Fernando wrote:
> Bill, I didn't receive any .java. Could you send it again? Thanks.

-Original Message- From: Bill Janssen [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 7, 2004 10:06 PM To: Lucene Users List CC: Ali Rouhi Subject: MultiFieldQueryParser seems broken... Fix attached.

Hi! I'm using Lucene for an application which has lots of fields per document, in which the users can specify in their config files which fields they wish to be included by default in a search. I'd been happily using MultiFieldQueryParser to do the searches, but the darn users started demanding more Google-like searches; that is, they want the search terms to be implicitly AND-ed instead of implicitly OR-ed. No problem, thinks I, I'll just set the "operator". Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I found that MultiFieldQueryParser combines the clauses at the wrong level: it combines them at the outermost level instead of the innermost level. This means that if you have two fields, "author" and "title", and the search string "cutting lucene", you'll get the final query

  (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem. But if it is "AND", you have two problems. The first is that MultiFieldQueryParser seems to ignore the operator entirely. But even if it didn't, the second problem is that the query formed would be

  +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the title field, the match would fit. This clearly isn't what the searcher intended. You can rewrite MultiFieldQueryParser, as I've done in the example code which I append here.
This little program allows you to run either my parser (-DSearchTest.QueryParser=new) or the old parser (-DSearchTest.QueryParser=old). It allows you to use either OR (-DSearchTest.QueryDefaultOperator=or) or AND (-DSearchTest.QueryDefaultOperator=and) as the operator. And it allows you to pick your favorite set of default search terms (-DSearchTest.QueryDefaultFields=author:title:body, for example). It takes one argument, a query string, and outputs the rewritten query after running it through the query parser. So to evaluate the above query:

  % java -classpath /import/lucene/lucene-1.4.1.jar:. \
      -DSearchTest.QueryDefaultFields="title:author" \
      -DSearchTest.QueryDefaultOperator=AND \
      -DSearchTest.QueryParser=old \
      SearchTest "cutting lucene"
  query is (title:cutting title:lucene) (author:cutting author:lucene)
  %

The class NewMultiFieldQueryParser does the combination at the inner level, using an override of "addClause", instead of the outer level. Note that it can't cover all cases (notably PhrasePrefixQuery, because that class has no access methods which allow one to introspect over it, and SpanQueries, because I don't understand them well enough :-). I post it here in advance of filing a formal bug report, for early feedback. But it will show up in a bug report in the near future. Running the above query with the new parser gives:

  % java -classpath /import/lucene/lucene-1.4.1.jar:. \
      -DSearchTest.QueryDefaultFields="title:author" \
      -DSearchTest.QueryDefaultOperator=AND \
      -DSearchTest.QueryParser=new \
      SearchTest "cutting lucene"
  query is +(title:cutting author:cutting) +(title:lucene author:lucene)
  %

which I claim is what the user is expecting. In addition, the new class uses an API more similar to QueryParser, so that the user has less to learn when using it. The code in it could probably just be folded into QueryParser, in fact.
Bill

The code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;
import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;
Re: Moving from a single server to a cluster
Hey Ben,

We've been using a distributed environment with three servers and three separate indices for the past two years, since the first stable Lucene release, and it has been great. Recently, over the past two months, I've been working on a redesign of our Lucene app, and I've shared my findings and plans with Otis, Doug and Erik. They pointed out a few faults in my logic which you will probably come across soon enough; they mainly have to do with keeping your updates atomic (not too hard) and your deletes atomic (a little more tricky). Give me a few days and I'll send you both the early document and the newer version that deals squarely with Lucene in a distributed environment with a high-volume index.

Regards,
Nader Henein

Ben Sinclair wrote:

My application currently uses Lucene with an index living on the filesystem, and it works fine. I'm moving to a clustered environment soon and need to figure out how to keep my indexes together. Since the index is on the filesystem, each machine in the cluster will end up with a different index. I looked into JDBC Directory, but it's not tested under Oracle and doesn't seem like a very mature project. What are other people doing to solve this problem?
RE: MultiFieldQueryParser seems broken... Fix attached.
Bill, I don't receive any .java. Could you send it again? Thanks.

-Original Message-
From: Bill Janssen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 07, 2004 10:06 p.m.
To: Lucene Users List
CC: Ali Rouhi
Subject: MultiFieldQueryParser seems broken... Fix attached.

Hi! I'm using Lucene for an application which has lots of fields per document, in which the users can specify in their config files which fields they wish to be included by default in a search. I'd been happily using MultiFieldQueryParser to do the searches, but the darn users started demanding more Google-like searches; that is, they want the search terms to be implicitly AND-ed instead of implicitly OR-ed. No problem, thinks I, I'll just set the "operator". Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I found that MultiFieldQueryParser combines the clauses at the wrong level -- it combines them at the outermost level instead of the innermost level. This means that if you have two fields, "author" and "title", and the search string "cutting lucene", you'll get the final query

(title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem. But if it is "AND", you have two problems. The first is that MultiFieldQueryParser seems to ignore the operator entirely. But even if it didn't, the second problem is that the query formed would be

+(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" appeared in both the author field and the title field, the document would match. This clearly isn't what the searcher intended. You can re-write MultiFieldQueryParser, as I've done in the example code which I append here. This little program allows you to run either my parser (-DSearchTest.QueryParser=new) or the old parser (-DSearchTest.QueryParser=old). It allows you to use either OR (-DSearchTest.QueryDefaultOperator=or) or AND (-DSearchTest.QueryDefaultOperator=and) as the operator.
And it allows you to pick your favorite set of default search fields (-DSearchTest.QueryDefaultFields=author:title:body, for example). It takes one argument, a query string, and outputs the re-written query after running it through the query parser. So to evaluate the above query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=old \
    SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner level, using an override of "addClause", instead of the outer level. Note that it can't cover all cases (notably PhrasePrefixQuery, because that class has no access methods which allow one to introspect over it, and SpanQueries, because I don't understand them well enough :-). I post it here in advance of filing a formal bug report for early feedback. But it will show up in a bug report in the near future. Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=new \
    SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting. In addition, the new class uses an API more similar to QueryParser, so that the user has less to learn when using it. The code in it could probably just be folded into QueryParser, in fact.
Bill

The code for SearchTest:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.RangeQuery;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.FastCharStream;
import org.apache.lucene.queryParser.TokenMgrError;
import org.apache.lucene.queryParser.ParseException;
import java.io.File;
import java.io.StringReader;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.StringTokenizer;

class S
Re: pdf in Chinese
This appears to be more of a PDFBox issue than a Lucene issue; please post an issue to the PDFBox site. Also note that, because of certain encodings a PDF writer can use, it is impossible to extract text from some PDF documents.

Ben

On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote:
> it is not about the analyzer; I need to read text from the pdf file first.
>
> - Original Message -
> From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, September 08, 2004 4:15 PM
> Subject: Re: pdf in Chinese
>
> > Which analyzer are you using to index Chinese pdf documents? I think
> > you should use CJKAnalyzer.
> > - Original Message -
> > From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, September 08, 2004 11:27 AM
> > Subject: pdf in Chinese
> >
> > > Hi all, I use pdfbox to parse pdf files into lucene documents. When I
> > > parse Chinese pdf files, pdfbox does not always succeed.
> > > Does anyone have some advice?
Re: *term search
.. and here is the way to do it: (See attached file: SUPPOR~1.RAR)

Erik Hatcher wrote to the Lucene Users List:

On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
> I want to discuss a little problem: lucene doesn't support *Term-like queries.

First of all, this is untrue. WildcardQuery itself most definitely supports wildcards at the beginning.

> I would like to use "*schreiben".

The dilemma you've encountered is that QueryParser prevents queries that begin with a wildcard.

> So my question is if there is a simple solution for implementing the functionality mentioned above. Maybe subclassing one class and overriding some methods will suffice.

It will require more than that in this case. You will need to create a custom parser that allows the grammar you'd like. Feel free to use the JavaCC source code to QueryParser as a basis for your customizations.

Erik
Re: *term search
sergiu gordea writes:
> Hi all, I want to discuss a little problem: lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and therefore it is restricted.

That's not the reason for the restriction; that would be possible with a* as well. The problem is that Lucene has to check all terms to see whether they end with "Term", which makes the performance pretty poor. A prefix, by contrast, allows the search to be restricted efficiently to words with that prefix, since the word list is ordered.

> So my question is if there is a simple solution for implementing the functionality mentioned above.

Sure. Just follow the way wildcard query is implemented. Actually, I'm not sure if the restriction you mention is in the wildcard query itself or only in the query parser. In the latter case, you might just create the query yourself. A better way for postfix queries is to create an additional search field where all words are reversed, and search for mreT* on that field. How important such an optimization is depends on the size of your index.

Morus
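The reversed-field trick can be sketched without Lucene. The toy class below (names and data are mine, purely illustrative) stands in for the sorted term dictionary with a sorted String array: to find every word ending in a suffix, reverse the words, sort them, and do an ordinary prefix scan on the reversals:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReversedFieldDemo {

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Find all words ending with `suffix` by prefix-searching the reversed
    // word list -- the same idea as indexing a reversed field in Lucene.
    static List<String> endingWith(String[] words, String suffix) {
        String[] rev = new String[words.length];
        for (int i = 0; i < words.length; i++) rev[i] = reverse(words[i]);
        Arrays.sort(rev);                 // Lucene's term dictionary is sorted too

        String prefix = reverse(suffix);  // "*schreiben" becomes "nebierhcs*"
        int pos = Arrays.binarySearch(rev, prefix);
        if (pos < 0) pos = -pos - 1;      // convert to insertion point

        List<String> out = new ArrayList<>();
        for (int i = pos; i < rev.length && rev[i].startsWith(prefix); i++)
            out.add(reverse(rev[i]));     // un-reverse for display
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"schreiben", "beschreiben", "verschreiben", "lesen"};
        System.out.println(endingWith(words, "schreiben"));
        // [schreiben, beschreiben, verschreiben]
    }
}
```

The binary search plus bounded scan is why the prefix form is cheap while a leading wildcard forces a pass over every term.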
Re: *term search
On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
> I want to discuss a little problem: lucene doesn't support *Term-like queries.

First of all, this is untrue. WildcardQuery itself most definitely supports wildcards at the beginning.

> I would like to use "*schreiben".

The dilemma you've encountered is that QueryParser prevents queries that begin with a wildcard.

> So my question is if there is a simple solution for implementing the functionality mentioned above. Maybe subclassing one class and overriding some methods will suffice.

It will require more than that in this case. You will need to create a custom parser that allows the grammar you'd like. Feel free to use the JavaCC source code to QueryParser as a basis for your customizations.

Erik
*term search
Hi all,

I want to discuss a little problem: lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and is therefore restricted. I think that allowing this kind of search and limiting the number of returned results would be a more useful approach.

The German language has a lot of words that are compounded or derived from other words by adding a prefix. I'm not a good German speaker, but I can say that maybe half of all German words fall into this category. For example Himbeer, Erdbeer, Johannisbeer -- all of them are fruits from a certain category, so it would make sense to search for "*beer". Also, if I know that a word ends in "beer" but I don't know the exact word, "*beer" would help me a lot. Likewise:

schreiben = to write
beschreiben = to describe
verschreiben = to prescribe

I would like to use "*schreiben".

So my question is if there is a simple solution for implementing the functionality mentioned above. Maybe subclassing one class and overriding some methods will suffice.

Thanks in advance,
Sergiu
Re: pdf in Chinese
it is not about the analyzer; I need to read text from the pdf file first.

- Original Message -
From: "Chandan Tamrakar" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 4:15 PM
Subject: Re: pdf in Chinese

> Which analyzer are you using to index Chinese pdf documents? I think you
> should use CJKAnalyzer.
> - Original Message -
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, September 08, 2004 11:27 AM
> Subject: pdf in Chinese
>
> > Hi all, I use pdfbox to parse pdf files into lucene documents. When I
> > parse Chinese pdf files, pdfbox does not always succeed.
> > Does anyone have some advice?
Re: Use of explain() vs search()
Could you create a simple piece of code (using a RAMDirectory) that demonstrates this issue?

Erik

On Sep 8, 2004, at 12:35 AM, Minh Kama Yie wrote:

Hi all, sorry, I should clarify my last point. The search() returns no hits, but explain(), using the apparently invalid docId, returns a value greater than 0. For what it's worth, it's performing a PhraseQuery. Thanks in advance, Minh

Minh Kama Yie wrote:

Hi all, I was wondering if anyone could tell me what the expected behaviour is for calling explain() without calling search() first on a particular query. Would it effectively do a search, so that I can examine the Explanation to check whether the document matches? I'm currently looking at some existing code to this effect:

Explanation exp = searcher.explain(myQuery, docId); // where docId was _not_ returned by a search on myQuery
if (exp.getValue() > 0.0f) {
    // assume the document for docId matched the query
}

Is the assumption wrong? I ask because the result of this code is inconsistent with

Hits h = searcher.search(myQuery); // there are no hits returned

Thanks in advance, Minh
Re: MultiFieldQueryParser seems broken... Fix attached.
Hi Bill,

> But even if it didn't, the second problem is that the query formed would be +(title:cutting title:lucene) +(author:cutting author:lucene) That is, if the word "Lucene" was in both the author field and the title field, the match would fit. This clearly isn't what the searcher intended.

As far as my understanding of the query syntax goes, this would be interpreted as (A OR B) AND (C OR D), which would produce the same set as (A OR C) AND (B OR D) == +(title:cutting author:cutting) +(title:lucene author:lucene). But it would only be true for this special case with 2 terms and 2 fields. I reckon there has been a discussion (and solution :-) on how to achieve the functionality you've been after: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116 I'm not sure if this would be the same, though.

Best regards,
René
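The claimed equivalence is easy to check mechanically. This small sketch (plain Java, hypothetical field contents, names mine; not Lucene code) evaluates both boolean forms against a document that contains only "lucene" in both fields, and shows the two forms can in fact disagree:

```java
import java.util.Set;

public class QueryEquivalenceCheck {

    // Outer combination: +(title:cutting title:lucene) +(author:cutting author:lucene)
    static boolean outerMatches(Set<String> title, Set<String> author) {
        return (title.contains("cutting") || title.contains("lucene"))
            && (author.contains("cutting") || author.contains("lucene"));
    }

    // Inner combination: +(title:cutting author:cutting) +(title:lucene author:lucene)
    static boolean innerMatches(Set<String> title, Set<String> author) {
        return (title.contains("cutting") || author.contains("cutting"))
            && (title.contains("lucene") || author.contains("lucene"));
    }

    public static void main(String[] args) {
        // A document mentioning only "lucene" in both fields: the outer form
        // accepts it, while the per-term inner form correctly rejects it
        // because "cutting" appears nowhere.
        Set<String> title = Set.of("lucene");
        Set<String> author = Set.of("lucene");
        System.out.println(outerMatches(title, author)); // true
        System.out.println(innerMatches(title, author)); // false
    }
}
```

So the two forms coincide only on documents where each matched term is confined to one field; in general they select different result sets.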
RE: pdf in Chinese
Hi, can you please advise me of any solution for a Hebrew analyzer?

-Original Message-
From: Chandan Tamrakar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 08, 2004 11:15 AM
To: Lucene Users List
Subject: Re: pdf in Chinese

Which analyzer are you using to index Chinese pdf documents? I think you should use CJKAnalyzer.

- Original Message -
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese

> Hi all, I use pdfbox to parse pdf files into lucene documents. When I
> parse Chinese pdf files, pdfbox does not always succeed.
> Does anyone have some advice?
Re: pdf in Chinese
Which analyzer are you using to index Chinese pdf documents? I think you should use CJKAnalyzer.

- Original Message -
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 11:27 AM
Subject: pdf in Chinese

> Hi all, I use pdfbox to parse pdf files into lucene documents. When I
> parse Chinese pdf files, pdfbox does not always succeed.
> Does anyone have some advice?
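For context on why a CJK-aware analyzer matters here: CJK text has no spaces between words, so one common approach (the one CJKAnalyzer takes) is to index overlapping character bigrams instead of whitespace-delimited tokens. A rough plain-Java sketch of that tokenization idea (illustrative only, not the actual analyzer code):

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    // Emit overlapping two-character tokens from a run of CJK text,
    // e.g. "ABCD" -> [AB, BC, CD]. A single character is kept as-is.
    static List<String> bigrams(String cjk) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < cjk.length(); i++)
            tokens.add(cjk.substring(i, i + 2));
        if (cjk.length() == 1) tokens.add(cjk);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("中文分析")); // [中文, 文分, 分析]
    }
}
```

Because every adjacent character pair is indexed, a two-character query word will match wherever it occurs, without needing a real word-segmentation dictionary. Note this only helps once PDFBox has actually extracted the text, which was the original poster's problem.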