Re: Lucene with English and Spanish Best Practice?

2004-08-21 Thread Grant Ingersoll
 I think the Snowball stuff works well, although I have only used the
English Porter stemmer implementation.

As for indexes, do you anticipate adding more fields later in Spanish? 
Is the content just a translation of the English, or do you have
separate content in Spanish?  Are your users querying in only one
language (cross-lingual) or are the Spanish speakers only querying
against Spanish content?

I am doing Arabic and English (and have done Spanish, French, and
Japanese in the past), although our cross-lingual system supports any
languages that you have resources for.  We lean towards separate
indexes, but mostly b/c they are based on separate content.  The key is
you have to be able to match up the analysis of the query with the
analysis of the index.  Having a mixed index may make this more
difficult.  If you have a mixed index, would you filter out Spanish
results that had hits from an English query?  For instance, what if the
query was a term common to both languages (banana, mosquito, etc.)?
Or are you requiring the user to specify which fields they are
searching against?  I guess we really need to know more about how your
users are going to interact with the system.
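
To make the point about matching query analysis to index analysis concrete, here is a minimal sketch in plain Java. It does not use Lucene or Snowball: the two "stemmers" are toy stand-ins that only strip a plural suffix, and the class and method names are made up for illustration. It only shows that a term indexed with one language's analysis can be missed when the query goes through another's.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class PerLanguageAnalysis {
    // Toy stand-ins for real stemmers (NOT Snowball): lowercase, then strip a plural suffix.
    static String toyEnglish(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return (t.endsWith("s") && t.length() > 3) ? t.substring(0, t.length() - 1) : t;
    }

    static String toySpanish(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.endsWith("es") && t.length() > 4) return t.substring(0, t.length() - 2);
        if (t.endsWith("s") && t.length() > 3) return t.substring(0, t.length() - 1);
        return t;
    }

    // One analysis function per language code (hypothetical registry).
    static final Map<String, Function<String, String>> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("en", PerLanguageAnalysis::toyEnglish);
        ANALYZERS.put("es", PerLanguageAnalysis::toySpanish);
    }

    // Split on whitespace and run every token through the language's analysis.
    static List<String> analyze(String lang, String text) {
        Function<String, String> stem = ANALYZERS.get(lang);
        if (stem == null) throw new IllegalArgumentException("no analyzer for " + lang);
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) out.add(stem.apply(tok));
        }
        return out;
    }

    // True if every analyzed query term occurs among the analyzed document terms.
    static boolean matches(String indexLang, String queryLang, String docText, String queryText) {
        return analyze(indexLang, docText).containsAll(analyze(queryLang, queryText));
    }

    public static void main(String[] args) {
        System.out.println(matches("es", "es", "Las flores", "flores"));
        System.out.println(matches("es", "en", "Las flores", "flores"));
    }
}
```

With these toy rules, "flores" becomes "flor" under the Spanish analysis but "flore" under the English one, so the second call misses a document that the first one finds.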

-Grant

>>> [EMAIL PROTECTED] 8/20/2004 5:27:40 PM >>>
Hello,

I'm interested in any feedback from anyone who has worked through
implementing Internationalization (I18N) search with Lucene or has ideas
for this requirement.  Currently, we're using Lucene with straight
English and are looking to add Spanish to the mix (with maybe more
languages to follow).  

This is our current IndexWriter setup utilizing the
PerFieldAnalyzerWrapper:

   PerFieldAnalyzerWrapper analyzer =
       new PerFieldAnalyzerWrapper(new StandardAnalyzer());
   analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
   analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
   IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are
English and Spanish Analyzers and IndexWriters?  Something like this:

PerFieldAnalyzerWrapper analyzerEnglish =
    new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

PerFieldAnalyzerWrapper analyzerSpanish =
    new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
// Note: only one IndexWriter can be open on a given directory at a time,
// so each language would need its own indexDir here.
IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);


Are multiple indexes or mirrors of each index then usually created for
every language?  We currently have 4 indexes that are all English. 
Would we then create 4 more that are Spanish?  Then at search time we
would determine the language and which set of indexes to search against,
English or Spanish.

Or another approach could be to add a Spanish field to the existing 4
indexes since most of the indexes have only one field that will be
translated from English to Spanish.
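
The two layouts described above can be sketched side by side in plain Java. This is only an illustration of the naming and routing decision, not Lucene code; the index names, the "_es" suffix convention, and the class name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class IndexRouting {
    // Hypothetical base names standing in for the four existing English indexes.
    static final String[] BASE_INDEXES = {"products", "articles", "faq", "glossary"};

    // Option 1: a separate set of index directories per language, e.g. "products_es".
    static List<String> indexDirsFor(String lang) {
        List<String> dirs = new ArrayList<>();
        for (String base : BASE_INDEXES) {
            dirs.add(base + "_" + lang.toLowerCase(Locale.ROOT));
        }
        return dirs;
    }

    // Option 2: one shared index with a per-language field, e.g. "title_es".
    static String fieldFor(String baseField, String lang) {
        return baseField + "_" + lang.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(indexDirsFor("es"));
        System.out.println(fieldFor("title", "es"));
    }
}
```

Either way, the language detected at search time selects both where to search and which analyzer to parse the query with.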


thanks a bunch,
chad.


-
To unsubscribe, e-mail: [EMAIL PROTECTED] 
For additional commands, e-mail: [EMAIL PROTECTED] 





Index of terms hierarchy

2004-08-21 Thread Gayo Diallo
Hello,
I wish to build a second index "on top" of the Lucene index of simple terms,
using a list of predefined terms (each of which can be more than one word)
arranged in a hierarchy like a thesaurus.
Has anyone already done that?
Best regards
Gayo


Re: Index and Search question in Lucene.

2004-08-21 Thread Ernesto De Santis
Hi Dimitri

Which analyzer do you use?

You need to be careful with Keyword fields and analyzers. When you
index a Document, fields that have tokenized = false, like Keyword
fields, are not analyzed.
At search time you need to parse the query with your analyzer, but the
untokenized fields, like your filename, must not be analyzed.

> I can do a search as this
> "+contents:SomeWord  +filename:SomePath"
> 

The syntax is right, but if you search +filename:somepath, you will
find only that one file.

For example,
+contents:version +filename:/my/path/myfile.ext

can only find myfile.ext, and if that file doesn't contain "version",
the search will find nothing. This is because you use +, which makes a
term required.
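
As a rough illustration of both points (required terms, and a Keyword field holding its whole value as one unanalyzed term), here is a toy model in plain Java. It is not Lucene's API; the class, the methods, and the way the text field is split are assumptions made for the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class RequiredClauses {
    // A document modeled as field -> set of indexed terms (toy model).
    static Map<String, Set<String>> doc(String filename, String contents) {
        Map<String, Set<String>> d = new HashMap<>();
        // Keyword-style field: the whole value is one term, no analysis.
        d.put("filename", new HashSet<>(Arrays.asList(filename)));
        // Text-style field: lowercased and split on non-alphanumeric characters.
        String[] terms = contents.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        d.put("contents", new HashSet<>(Arrays.asList(terms)));
        return d;
    }

    // Every clause is {field, term}; all of them are required, like "+field:term".
    static boolean matchesAll(Map<String, Set<String>> d, String[][] required) {
        for (String[] clause : required) {
            Set<String> terms = d.get(clause[0]);
            if (terms == null || !terms.contains(clause[1])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> d = doc("/my/path/myfile.ext", "Notes for version 2");
        System.out.println(matchesAll(d, new String[][] {
            {"contents", "version"}, {"filename", "/my/path/myfile.ext"}}));
        System.out.println(matchesAll(d, new String[][] {{"filename", "/my/path"}}));
    }
}
```

In this model a partial path such as /my/path does not match, because the filename was stored as a single term.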

You can see the query syntax on the Lucene site.

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q5

good luck.

Bye
Ernesto.


On Sun, 15 Aug 2004 at 17:13, Dmitrii PapaGeorgio wrote:
> Ok so when I index a file such as below
> 
> Document doc = new Document();
> doc.Add(Field.Text("contents", new StreamReader(dataDir)));
> doc.Add(Field.Keyword("filename", dataDir));
> 
> I can do a search as this
> "+contents:SomeWord  +filename:SomePath"
> 
> Correct?
> 





Re: speeding up queries (MySQL faster)

2004-08-21 Thread Bernhard Messer
Yonik,
there is another "synchronized" block in CSInputStream which could be
locking out your second CPU. Is there a chance you could recreate the
index (maybe a smaller subset) with the compound file option disabled and
run your test again, so that we can see whether this helps?

regards
Bernhard
Otis Gospodnetic wrote:
Ah, you may be right (no stack trace in email any more).  Somebody
recently identified a few bottlenecks that, if I recall correctly, were
related to synchronized blocks.  I believe Doug committed some
improvements, but I can't remember which version of Lucene that is in. 
It's definitely in 1.4.1.

Otis
--- Yonik Seeley <[EMAIL PROTECTED]> wrote:
 

--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:
   

The bottleneck seems to be disk IO.
 

But it's not.  Linux is caching the whole file, and
there really isn't any disk activity at all.  Most of
the threads are blocked on InputStream.refill, not
waiting for the disk, but waiting for their turn into
the synchronized block to read from the disk (which is
why I asked about caching above that level).
CPU is a constant 50% on a dual CPU system (meaning
100% of 1 cpu).
-Yonik
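
The effect described here, several threads taking turns on one synchronized read path, can be reproduced with a small plain-Java sketch. It is not Lucene code: refill() is a stand-in for a read from one shared file handle, and the class and method names are hypothetical. The sketch records how many threads are ever inside the critical section at once.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedStreamContention {
    private final Object lock = new Object();
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    // Stand-in for a refill() that reads from a single shared file handle.
    void refill() {
        synchronized (lock) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max); // remember the peak occupancy
            inside.decrementAndGet();
        }
    }

    // Runs the given number of threads against one shared stream and
    // returns the largest number of threads seen inside refill() at once.
    static int maxConcurrentReaders(int threads, int callsPerThread) {
        SharedStreamContention stream = new SharedStreamContention();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < callsPerThread; j++) stream.refill();
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stream.maxInside.get();
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentReaders(4, 100000));
    }
}
```

However many threads are started, the peak is 1, which is consistent with one CPU being busy while the other threads wait for the lock.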
   


 




Re: pdfboxhelp

2004-08-21 Thread Santosh
Yes, I did the same.
I copied all the classes into the classes folder, but
now when I build the index using IndexHTML, the PDFs are not added to
the index; only text and HTML files are added.
What changes should I make to IndexHTML.java so that PDFs are indexed too?
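
One possible explanation, offered as an assumption rather than something checked against the IndexHTML source for this release, is that the demo only passes along the file extensions it knows about, so PDFs are silently skipped. A plain-Java sketch of the kind of dispatch that would need a PDF branch follows; the enum and method names are hypothetical.

```java
import java.util.Locale;

class DocumentHandlers {
    enum Handler { HTML, TEXT, PDF, SKIP }

    // Picks a handler from the file extension, ignoring case.
    static Handler handlerFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".html") || name.endsWith(".htm")) return Handler.HTML;
        if (name.endsWith(".txt")) return Handler.TEXT;
        // A PDF branch is where LucenePDFDocument.getDocument(file) would be called.
        if (name.endsWith(".pdf")) return Handler.PDF;
        return Handler.SKIP;
    }

    public static void main(String[] args) {
        System.out.println(handlerFor("manual.pdf"));
        System.out.println(handlerFor("index.html"));
    }
}
```

If the indexing loop has no case for ".pdf", adding one next to the existing HTML and text cases is the natural place for the PDFBox call.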

RE: pdfboxhelp

2004-08-21 Thread Karthik N S
Hi

If you are using the jar file with a web interface for JSP/servlet
development:

1) Place the jar file in "webapps///lib"
and also correct the classpath for the present modification.

2) Create your own package and copy all your Java files to
/Web-inf/Classes/


Then use the same. ;{


Karthik


Re: pdfboxhelp

2004-08-21 Thread Santosh
Thanks, Natarajan and Karthik.

I corrected the classpath,

but where should I write your code?
Should I write it in IndexHTML.java, which comes with Lucene, or
somewhere else?
One more thing:
I put the pdfbox jar file on the classpath. Is this enough, or do I have to
build pdfbox?

Thank you

RE: pdfboxhelp

2004-08-21 Thread Karthik N S
Hi

  First try to echo the classpath and see if you are able to find the relevant
jar file on it, then proceed ... ;+)
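
A small plain-Java check along these lines prints the runtime classpath and tries to load a class by name. Note the assumption: this tests the classpath of the running JVM, whereas "package ... does not exist" is a compile-time error, so the same jar also has to be on the classpath given to javac. The class and method names in the sketch are hypothetical.

```java
class ClasspathCheck {
    // Returns true if the named class can be found by this class's loader.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        String name = "org.pdfbox.searchengine.lucene.LucenePDFDocument";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```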

Karthik

-Original Message-
From: Santosh [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 3:14 PM
To: Lucene Users List
Subject: Re: pdfboxhelp


Hi Don,

Your idea is nice, but whenever I write the following code in
Lucene's IndexHTML.java


import org.pdfbox.searchengine.lucene.*;

File pdfFile = new File("/path/to/the/file.pdf");

// Below returns a parsed PDF file as a Lucene Document object.
Document doc = LucenePDFDocument.getDocument(pdfFile);

I am getting the following error:

package org.pdfbox.searchengine.lucene does not exist

I have downloaded the pdfbox source code and put the jar file on the classpath.
Please help me with this.
  - Original Message -
  From: Don Vaillancourt
  To: Lucene Users List
  Sent: Friday, August 20, 2004 7:37 PM
  Subject: Re: pdfboxhelp


  Here is the super simple code required.

  import org.pdfbox.searchengine.lucene.*;

  File pdfFile = new File("/path/to/the/file.pdf");

  // Below returns a parsed PDF file as a Lucene Document object.
  Document doc = LucenePDFDocument.getDocument(pdfFile);


  Santosh wrote:

Exactly, that is what I need.
  - Original Message -
  From: Don Vaillancourt
  To: Lucene Users List
  Sent: Friday, August 20, 2004 6:39 PM
  Subject: Re: pdfboxhelp


  What are your intentions with PDFBox?

  You want to use it to index PDF files?

  Santosh wrote:

hi,

I have downloaded the pdfbox zip, but I am unsure where to
start. How can I try the demo? I don't see any help document with this
download. Please help me.


regards
Santosh kumar
SoftPro Systems
Hyderabad


"The harder you train in peace, the lesser you bleed in war"

---SOFTPRO DISCLAIMER--

Information contained in this E-MAIL and any attachments are
confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
and 'confidential'.

If you are not an intended or authorised recipient of this E-MAIL or
have received it in error, You are notified that any use, copying or
dissemination  of the information contained in this E-MAIL in any
manner whatsoever is strictly prohibited. Please delete it immediately
and notify the sender by E-MAIL.

In such a case reading, reproducing, printing or further dissemination
of this E-MAIL is strictly prohibited and may be unlawful.

SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
hereto is free from computer viruses or other defects.

The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
those of the author and are not necessarily those of SOFTPRO SYSTEMS.






  --
  Don Vaillancourt
  Director of Software Development

  WEB IMPACT INC.
  phone: 416-815-2000 ext. 245
  fax: 416-815-2001
  email: [EMAIL PROTECTED]
  web: http://www.web-impact.com



  This email message is intended only for the addressee(s)
  and contains information that may be confidential and/or
  copyright. If you are not the intended recipient please
  notify the sender by reply email and immediately delete
  this email. Use, disclosure or reproduction of this email
  by anyone other than the intended recipient(s) is strictly
  prohibited. No representation is made that this email or
  any attachments are free of viruses. Virus scanning is
  recommended and is the responsibility of the recipient.




RE: pdfboxhelp

2004-08-21 Thread Natarajan.T
Hi Santhosh,

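Try out the code below (the pdfbox.jar file must be on your classpath).

import java.io.IOException;
import java.io.InputStream;

// Package names as in the PDFBox 0.6.x releases; adjust if your version differs.
import org.pdfbox.encryption.DecryptDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Extracts the plain text of a PDF read from the given stream.
public String getContent(InputStream reader) throws IOException {
    PDDocument pdDoc = null;
    String pdftext = "";
    try {
        PDFParser parser = new PDFParser(reader);
        parser.parse();
        pdDoc = parser.getPDDocument();

        // Try the empty password on encrypted documents.
        if (pdDoc.isEncrypted()) {
            DecryptDocument decryptor = new DecryptDocument(pdDoc);
            decryptor.decryptDocument("");
        }
        PDFTextStripper stripper = new PDFTextStripper();
        pdftext = stripper.getText(pdDoc);

        // "info" is assumed to be a PDDocumentInformation field of the
        // enclosing class; it is not declared in this snippet.
        info = pdDoc.getDocumentInformation();
    } catch (Exception err) {
        System.out.println(err.getMessage());
    } finally {
        // pdDoc is still null if parsing failed, so guard the close.
        if (pdDoc != null) {
            pdDoc.close();
        }
    }
    return pdftext;
}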

Natarajan.


Re: pdfboxhelp

2004-08-21 Thread Santosh
Hi Don,

Your idea is nice, but whenever I write the following code in
Lucene's IndexHTML.java


import org.pdfbox.searchengine.lucene.*;

File pdfFile = new File("/path/to/the/file.pdf");

// Below returns a parsed PDF file as a Lucene Document object.
Document doc = LucenePDFDocument.getDocument(pdfFile);

I am getting the following error:

package org.pdfbox.searchengine.lucene does not exist

I have downloaded the pdfbox source code and put the jar file on the classpath.
Please help me with this.
  - Original Message - 
  From: Don Vaillancourt 
  To: Lucene Users List 
  Sent: Friday, August 20, 2004 7:37 PM
  Subject: Re: pdfboxhelp


  Here is the super simple code required.

  import org.pdfbox.searchengine.lucene.*;

  File pdfFile = new File("/path/to/the/file.pdf"); 

  // Below returns a parse PDF file in a Lucene Document object.
  Document doc = LucenePDFDocument.getDocument(pdfFile);

  
  Santosh wrote:

exactly, the same is required to me
  - Original Message - 
  From: Don Vaillancourt 
  To: Lucene Users List 
  Sent: Friday, August 20, 2004 6:39 PM
  Subject: Re: pdfboxhelp


  What are your intensions with PDFBox?

  You want to use it to index PDF files?

  Santosh wrote:

hi,

I have downloaded pdfbox zip. but i am in ambigous state that where to start. how can 
I check with demo, I dont see any help document with this download, please help me.


regards
Santosh kumar
SoftPro Systems
Hyderabad


"The harder you train in peace, the lesser you bleed in war"

---SOFTPRO DISCLAIMER--

Information contained in this E-MAIL and any attachments are
confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
and 'confidential'.

If you are not an intended or authorised recipient of this E-MAIL or
have received it in error, You are notified that any use, copying or
dissemination  of the information contained in this E-MAIL in any
manner whatsoever is strictly prohibited. Please delete it immediately
and notify the sender by E-MAIL.

In such a case reading, reproducing, printing or further dissemination
of this E-MAIL is strictly prohibited and may be unlawful.

SOFTPRO SYSTEMS does not REPRESENT or WARRANT that an attachment
hereto is free from computer viruses or other defects. 

The opinions expressed in this E-MAIL and any ATTACHMENTS may be
those of the author and are not necessarily those of SOFTPRO SYSTEMS.


  



  -- 
  Don Vaillancourt
  Director of Software Development

  WEB IMPACT INC.
  phone: 416-815-2000 ext. 245
  fax: 416-815-2001
  email: [EMAIL PROTECTED]
  web: http://www.web-impact.com







  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]


Re: speeding up queries (MySQL faster)

2004-08-21 Thread Otis Gospodnetic
Ah, you may be right (I no longer have the stack trace in email).  Somebody
recently identified a few bottlenecks that, if I recall correctly, were
related to synchronized blocks.  I believe Doug committed some
improvements, but I can't remember which version of Lucene first had them. 
They are definitely in 1.4.1.
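
For anyone following along, the kind of contention Yonik describes in the quoted message (threads queuing to enter a synchronized read rather than waiting on the disk) can be pictured with the rough sketch below. It is not Lucene's code: SharedInput, readAt and sumAcrossThreads are made-up names, and the byte array stands in for a file the OS has already cached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedInputContention {
    /** Hypothetical stand-in for one buffered input shared by all searcher threads. */
    static class SharedInput {
        private final byte[] data;

        SharedInput(byte[] data) { this.data = data; }

        // Every reader takes the same lock, so reads are serialized
        // even though the bytes are already in memory.
        synchronized int readAt(int pos) { return data[pos] & 0xff; }

        int length() { return data.length; }
    }

    /** Each worker sums every byte through the shared, locked input. */
    public static long sumAcrossThreads(final SharedInput in, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    long sum = 0;
                    for (int i = 0; i < in.length(); i++) {
                        sum += in.readAt(i); // threads take turns here
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1_000_000];
        Arrays.fill(data, (byte) 1);
        System.out.println(sumAcrossThreads(new SharedInput(data), 4));
    }
}
```

The result is correct however many threads are used, but only one thread at a time can be inside readAt, which matches the picture of several threads blocked while only one CPU does useful work. Giving each thread its own input (and so its own lock) is one way such a bottleneck is usually avoided; whether that is what the committed improvements do is not something this sketch claims.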

Otis


--- Yonik Seeley <[EMAIL PROTECTED]> wrote:

> 
> --- Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
> 
> > The bottleneck seems to be disk IO.
> 
> But it's not.  Linux is caching the whole file, and
> there really isn't any disk activity at all.  Most of
> the threads are blocked on InputStream.refill, not
> waiting for the disk, but waiting for their turn to enter
> the synchronized block to read from the disk (which is
> why I asked about caching above that level).
> 
> CPU is a constant 50% on a dual CPU system (meaning
> 100% of 1 cpu).
> 
> -Yonik
> 
> 

