Re: Zip Files

2005-03-01 Thread Ernesto De Santis
Hello

First, you need a parser for each file type: PDF, txt, Word, etc.
Then use the Java API to iterate over the zip content; see:
http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html
and use the getNextEntry() method.

A little example:
ZipInputStream zis = new ZipInputStream(fileInputStream);
ZipEntry zipEntry;
while((zipEntry = zis.getNextEntry()) != null){
   //use zipEntry to get name, etc.
   //get properly parser for current entry
   //use parser with zis (ZipInputStream)
}
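A self-contained version of the loop above, using only java.util.zip. The per-type parser dispatch from the advice above is left as a comment, and the class and method names here are just for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipWalk {
    // Walk every entry of a zip stream; a real indexer would pick a
    // parser per entry (by file extension) and feed it `zis` here.
    public static List listEntries(InputStream in) throws IOException {
        List names = new ArrayList();
        ZipInputStream zis = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {
            names.add(entry.getName());
            // here: choose a parser for entry.getName() and let it
            // read from `zis` until the entry is exhausted
            zis.closeEntry();
        }
        zis.close();
        return names;
    }

    public static void main(String[] args) throws IOException {
        // build a tiny zip in memory just to exercise the loop
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        ZipOutputStream zout = new ZipOutputStream(bout);
        zout.putNextEntry(new ZipEntry("a.txt"));
        zout.write("hello".getBytes("UTF-8"));
        zout.closeEntry();
        zout.putNextEntry(new ZipEntry("b.pdf"));
        zout.closeEntry();
        zout.close();

        System.out.println(listEntries(new ByteArrayInputStream(bout.toByteArray())));
        // prints: [a.txt, b.pdf]
    }
}
```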
good luck
Ernesto
Luke Shannon wrote:
Hello;
Anyone have any ideas on how to index the contents within zip files?
Thanks,
Luke
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



Re: Disk space used by optimize - no space on disk corrupts index.

2005-02-04 Thread Ernesto De Santis
Hi all

We have a big index and little disk space.
When we optimize and all the space is consumed, our index is corrupted:
the segments file points to nonexistent files.

Environment:
Java 1.4.2_04
Windows 2000 SP4
Tomcat 5.5.4

Bye,
Ernesto.
Yura Smolsky wrote:
Hello, Otis.
There is a big difference between the compound index format and
multiple files. I have tested it on a big index (45 GB). When I used
the compound format, optimize takes 3 times more space, because the
*.cfs needs to be unpacked.
Now I use the non-compound file format. It needs about twice as much
disk space.
OG> Have you tried using the multifile index format?  Now I wonder if there
OG> is actually a difference in disk space consumed by optimize() when you
OG> use multifile and compound index format...
OG> Otis
OG> --- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:
 

Our copy of LIA is "in the mail" ;)
Yes the final three files are: the .cfs (46.8MB), deletable (4
bytes),
and segments (29 bytes).
--Leto

 

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 

Hello,

Yes, that is how optimize works - it copies all existing index
segments into one unified index segment, thus optimizing it.

See hit #1:
http://www.lucenebook.com/search?query=optimize+disk+space

However, three times the space sounds a bit too much, or I made a
mistake in the book. :)

You said you end up with 3 files - .cfs is one of them, right?
Otis
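As a rough sketch of the operation being discussed (Lucene 1.4-era API; the index path is a placeholder): optimize() merges every segment into one, so while it runs the directory holds both the old segments and the new merged one, and free space on the order of the index size (more with the compound format) is needed:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeSketch {
    public static void main(String[] args) throws Exception {
        // open an existing index (create = false)
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), false);
        // multifile format needs less temporary space during optimize
        // than the compound (.cfs) format, per the thread above
        writer.setUseCompoundFile(false);
        writer.optimize(); // merges all segments into one
        writer.close();
    }
}
```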
--- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:
   

Just a quick question:  after writing an index and then calling
optimize(), is it normal for the index to expand to about 
 

three times 
   

the size before finally compressing?
In our case the optimise grinds the disk, expanding the index
 

into 
 

many files of about 145MB total, before compressing down to three
 

files of about 47MB total.  That must be a lot of disk activity
 

for 
 

the people with multi-gigabyte indexes!
Regards,
Leto
 

 


Yura Smolsky,

 


--
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.300 / Virus Database: 265.8.5 - Release Date: 03/02/2005


Re: Optimize not deleting all files

2005-02-04 Thread Ernesto De Santis
Hi all

We have the same problem.
We guess that the problem is that Windows locks the files.

Our environment:
Windows 2000
Tomcat 5.5.4

Ernesto.
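A sketch of the usual workaround for the locking described here (Lucene 1.4-era API; the class name and index path are placeholders): make sure nothing still has the index open before optimizing, since Windows will not let the old segment files be deleted while a reader holds them:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class SafeOptimize {
    // Close the searcher first: on Windows, open files stay locked,
    // so IndexWriter cannot delete the old segment files.
    public static IndexSearcher optimizeAndReopen(IndexSearcher searcher,
                                                  String indexDir) throws Exception {
        searcher.close();
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.optimize();
        writer.close(); // old segments become deletable once nothing holds them open
        return new IndexSearcher(indexDir); // reopen for searching
    }
}
```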
[EMAIL PROTECTED] wrote:
Hi,
When I run an optimize in our production environment, old index files are
left in the directory and are not deleted.

My understanding is that an optimize will create new index files, and all
pre-existing index files should be deleted.  Is this correct?

We are running Lucene 1.4.2 on Windows.

Any help is appreciated.  Thanks!

 




Re: Lucene and multiple languages

2005-01-21 Thread Ernesto De Santis
I sent you the source code in a private mail.
Ernesto.
aurora wrote:
Thanks. I would like to give it a try. Is the source code available? I'm
using a Python version of Lucene, so it would need to be wrapped or
ported :)

Thanks for any guidance.





Re: Lucene and multiple languages

2005-01-20 Thread Ernesto De Santis
Hi Aurora

I developed a tool for this multiple-languages issue. I found a Nutch
library, "language-identifier", very useful. The jar has Nutch
dependencies, but I deleted all the code that was unnecessary for me.
The language-identifier I use works fine and is very simple.
For example:

LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);

It works for 11 languages: English, Spanish, Portuguese, Dutch, German,
French, Italian, and others.
I can send you this trimmed jar, but remember that the jar is from
Nutch; mind the copyright (or left :).
http://www.nutch.org/LICENSE.txt
More comments inline...
aurora wrote:
I'm trying to build some web search tool that could work for multiple
languages. I understand that Lucene is shipped with StandardAnalyzer plus
German and Russian analyzers, and some more in the sandbox, and that
indexing and searching should use the same analyzer.

Now let's say I have an index with documents in multiple languages,
analyzed by an assortment of analyzers. When a user enters a query, what
analyzer should be used? Should the user be asked for the language
upfront? What should I expect when the analyzer and the document don't
match? Let's say the query is parsed using StandardAnalyzer. Would it
match any documents indexed with the German analyzer at all, or would it
end up in poor results?

When this happens, in most cases you do not get matches.
Also, is there a good way to find out the language used in a web page?
There is a 'content-language' header in HTTP and a 'lang' attribute in
HTML. Looks like people don't really use them. How can we recognize the
language?

With a language identifier. :)
Even more interesting is multiple languages used in one document, let's
say half English and half French. Is there a good way to deal with those
cases?

The language identifier only returns one language. I looked into
language-identifier: it works with a score for each language and returns
the language with the greatest value.
Maybe you can modify language-identifier to take the top few languages
instead.
Bye
Ernesto.
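The pattern described here can be sketched as: detect the language, then pick a matching analyzer for both indexing and querying. This is only an illustration; LanguageIdentifier is the Nutch class mentioned above, SnowballAnalyzer is from the Lucene Snowball contrib, and the AnalyzerChooser name and the fallback choice are my assumptions:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerChooser {
    // Map a detected language to an analyzer; anything unknown falls
    // back to the generic StandardAnalyzer. The same choice must be
    // used at index time and at query time.
    public static Analyzer forLanguage(String language) {
        if ("es".equals(language)) {
            return new SnowballAnalyzer("Spanish"); // no stop words by default
        }
        if ("de".equals(language)) {
            return new GermanAnalyzer();
        }
        return new StandardAnalyzer();
    }
}
```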
Thanks for any guidance.




Re: where is the SnowBallAnalyzer?

2004-09-08 Thread Ernesto De Santis
It is in snowball-1.0.jar.

I sent it to you in a private email.

Bye
Ernesto.

- Original Message - 
From: "Wermus Fernando" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 1:12 PM
Subject: where is the SnowBallAnalyzer?


I have to look better, but why isn't the SnowballAnalyzer in the
org.apache.lucene.analysis.snowball package?

I have Lucene 1.4.

I'm doing my own Spanish stemmer.







Re: spanish stemmer

2004-08-23 Thread Ernesto De Santis
Hi Chad

> One more question to the group.  From what I have gathered, my choices for
> indexing and querying Spanish content are:

> 1.  StandardAnalyzer (I read that this analyzer could be used for
> "European" languages)

The StandardAnalyzer is not for European languages; it is more of a generic
analyzer.

> 2.  SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);  <--custom stop words
> from Ernesto's class below

> Can I assume that choice 2 would be the better for Spanish content?

Yes, it is much better.

For example:
With StandardAnalyzer, caminar, caminantes, camino, etc. are different
words; it only returns a hit if the match is exact.
With SpanishAnalyzer they are the same word: those three words are
conjugations of caminar. If one document in your index has the word
"caminante", you can get a hit with the different conjugations of this verb.

Stemmers work by stripping words according to the rules of the language
(Spanish for us): caminar, caminantes, and camino are all stored as camin
(camin does not exist in Spanish).

This improves the quality of the hits.


>thanks,
> chad.

Bye, Ernesto.


-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Because the SnowballAnalyzer and SpanishStemmer don't have a default
stopword set.

SnowballAnalyzer constructor:

  /** Builds the named analyzer with no stop words. */
  public SnowballAnalyzer(String name) {
    this.name = name;
  }

Note the comment.

Bye,
Ernesto.

- Original Message - 
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer


Excellent Ernesto.

Was there a reason you used your own stop word list and not just the default
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Yes, it is quite easy.

You need to do a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.


--
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class SpanishAnalyzer extends Analyzer {

    // one delegate per instance (a static field here would let each new
    // instance clobber the analyzer shared by the others)
    private SnowballAnalyzer analyzer;

    private static final String[] SPANISH_STOP_WORDS = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos",
        "fueron", "hacer", "hago", "hace", "hacemos", "haceis", "hacen",
        "cada", "fin", "incluso", "primero", "desde", "conseguir", "consigo",
        "consigue", "consigues", "conseguimos", "consiguen", "ir", "voy", "va",
        "vamos", "vais", "van", "vaya", "bueno", "ha", "tener", "tengo",
        "tiene", "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los",
        "su", "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros",
        "vosotros", "vosotras", "si", "dentro", "solo", "solamente", "saber",
        "sabes", "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo",
        "bastante", "haces", "muchos", "aquellos", "aquellas", "sus",
        "entonces", "tiempo", "verdad", "verdadero", "verdadera", "cierto",
        "ciertos", "cierta", "ciertas", "intentar", "intento", "intenta",
        "intentas", "intentamos", "intentais", "intentan", "dos", "bajo",
        "arriba", "encima", "usar", "uso", "usas", "usa", "usamos", "usais",
        "usan", "emplear", "empleo", "empleas", "emplean", "ampleamos",
        "empleais", "valor", "muy", "era", "eras", "eramos", "eran", "modo",
        "bien", "cual", "cuando", "donde", "mientras", "quien", "con", "entre",
        "sin", "trabajo", "trabajar", "trabajas", "trabaja", "trabajamos",
        "trabajais", "trabajan", "podria", "podrias", "podriamos", "podrian",
        "podriais", "yo", "aquel", "mi", "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String[] stopWords) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}















Re: spanish stemmer

2004-08-23 Thread Ernesto De Santis
Because the SnowballAnalyzer and SpanishStemmer don't have a default
stopword set.

SnowballAnalyzer constructor:

  /** Builds the named analyzer with no stop words. */
  public SnowballAnalyzer(String name) {
    this.name = name;
  }

Note the comment.

Bye,
Ernesto.

- Original Message - 
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer


Excellent Ernesto.

Was there a reason you used your own stop word list and not just the default
constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-----Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer


Yes, it is quite easy.

You need to do a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.


--
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class SpanishAnalyzer extends Analyzer {

    // one delegate per instance (a static field here would let each new
    // instance clobber the analyzer shared by the others)
    private SnowballAnalyzer analyzer;

    private static final String[] SPANISH_STOP_WORDS = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos",
        "fueron", "hacer", "hago", "hace", "hacemos", "haceis", "hacen",
        "cada", "fin", "incluso", "primero", "desde", "conseguir", "consigo",
        "consigue", "consigues", "conseguimos", "consiguen", "ir", "voy", "va",
        "vamos", "vais", "van", "vaya", "bueno", "ha", "tener", "tengo",
        "tiene", "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los",
        "su", "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros",
        "vosotros", "vosotras", "si", "dentro", "solo", "solamente", "saber",
        "sabes", "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo",
        "bastante", "haces", "muchos", "aquellos", "aquellas", "sus",
        "entonces", "tiempo", "verdad", "verdadero", "verdadera", "cierto",
        "ciertos", "cierta", "ciertas", "intentar", "intento", "intenta",
        "intentas", "intentamos", "intentais", "intentan", "dos", "bajo",
        "arriba", "encima", "usar", "uso", "usas", "usa", "usamos", "usais",
        "usan", "emplear", "empleo", "empleas", "emplean", "ampleamos",
        "empleais", "valor", "muy", "era", "eras", "eramos", "eran", "modo",
        "bien", "cual", "cuando", "donde", "mientras", "quien", "con", "entre",
        "sin", "trabajo", "trabajar", "trabajas", "trabaja", "trabajamos",
        "trabajais", "trabajan", "podria", "podrias", "podriamos", "podrian",
        "podriais", "yo", "aquel", "mi", "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String[] stopWords) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}

Re: spanish stemmer

2004-08-23 Thread Ernesto De Santis
Hello Grant

Thanks for your response.

I have a basic understanding of analyzers. The problem is that I think
words ending in 'bol' need to be stripped, like:

original -> generated word
tornillos -> tornill

I need:

basquetbol -> basquet

Bye, Ernesto.
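For checking individual cases like these, the generated Snowball stemmer class can be driven directly on single words. This is a sketch; the net.sf.snowball.ext.SpanishStemmer class name is from the snowball-1.0.jar mentioned elsewhere in this list and should be treated as an assumption, as should the StemCheck class name:

```java
import net.sf.snowball.ext.SpanishStemmer;

public class StemCheck {
    // Stem one word with the Snowball Spanish stemmer.
    public static String stem(String word) {
        SpanishStemmer stemmer = new SpanishStemmer();
        stemmer.setCurrent(word);
        stemmer.stem();
        return stemmer.getCurrent();
    }

    public static void main(String[] args) {
        // compare the stems of the variants discussed above
        System.out.println(stem("basquet"));
        System.out.println(stem("basquetbol"));
        System.out.println(stem("tornillos"));
    }
}
```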


- Original Message - 
From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:09 PM
Subject: Re: spanish stemmer


Ernesto,


http://snowball.tartarus.org/texts/introduction.html might help with your
understanding.  The link provides basic info on why stemmers are valuable
(not necessarily any insight on how the Spanish version works).  Of course,
they don't solve every problem and in some cases may make things worse.

A stemmer is not required to return a whole word.

Hope this helps.

>>> [EMAIL PROTECTED] 8/23/2004 9:29:30 AM >>>
Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words
ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say basquet or basquetbol, but for the
SpanishStemmer they are different words.
The same with voley and voleybol.

Not so with futbol (football): we don't say fut for futbol, but then 'fut'
doesn't exist in Spanish either.

Do you think I am correct?

Can you change this?

Ernesto.














Re: spanish stemmer

2004-08-23 Thread Ernesto De Santis
Yes, it is quite easy.

You need to do a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.


--
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

public class SpanishAnalyzer extends Analyzer {

    // one delegate per instance (a static field here would let each new
    // instance clobber the analyzer shared by the others)
    private SnowballAnalyzer analyzer;

    private static final String[] SPANISH_STOP_WORDS = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos",
        "fueron", "hacer", "hago", "hace", "hacemos", "haceis", "hacen",
        "cada", "fin", "incluso", "primero", "desde", "conseguir", "consigo",
        "consigue", "consigues", "conseguimos", "consiguen", "ir", "voy", "va",
        "vamos", "vais", "van", "vaya", "bueno", "ha", "tener", "tengo",
        "tiene", "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los",
        "su", "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros",
        "vosotros", "vosotras", "si", "dentro", "solo", "solamente", "saber",
        "sabes", "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo",
        "bastante", "haces", "muchos", "aquellos", "aquellas", "sus",
        "entonces", "tiempo", "verdad", "verdadero", "verdadera", "cierto",
        "ciertos", "cierta", "ciertas", "intentar", "intento", "intenta",
        "intentas", "intentamos", "intentais", "intentan", "dos", "bajo",
        "arriba", "encima", "usar", "uso", "usas", "usa", "usamos", "usais",
        "usan", "emplear", "empleo", "empleas", "emplean", "ampleamos",
        "empleais", "valor", "muy", "era", "eras", "eramos", "eran", "modo",
        "bien", "cual", "cuando", "donde", "mientras", "quien", "con", "entre",
        "sin", "trabajo", "trabajar", "trabajas", "trabaja", "trabajamos",
        "trabajais", "trabajan", "podria", "podrias", "podriamos", "podrian",
        "podriais", "yo", "aquel", "mi", "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String[] stopWords) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}



- Original Message - 
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 3:49 PM
Subject: RE: spanish stemmer


Do you mind sharing how you implemented your SpanishAnalyzer using Snowball?

Sorry I can't help with your question.  I am trying to implement Snowball
Spanish or a Spanish Analyzer in Lucene.

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer


Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words
ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say basquet or basquetbol, but for the
SpanishStemmer they are different words.
The same with voley and voleybol.

Not so with futbol (football): we don't say fut for futbol, but then 'fut'
doesn't exist in Spanish either.

Do you think I am correct?

Can you change this?

Ernesto.













spanish stemmer

2004-08-23 Thread Ernesto De Santis
Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words
ending in 'bol' are not stripped.
For example:

In Spanish, for basketball you can say basquet or basquetbol, but for the
SpanishStemmer they are different words.
The same with voley and voleybol.

Not so with futbol (football): we don't say fut for futbol, but then 'fut'
doesn't exist in Spanish either.

Do you think I am correct?

Can you change this?

Ernesto.







Re: Index and Search question in Lucene.

2004-08-21 Thread Ernesto De Santis
Hi Dmitrii

What analyzer do you use?

You need to be careful with Keyword fields and analyzers. When you
index a Document, the fields that have tokenized = false, like
Keyword fields, are not analyzed.
At search time you need to parse the query with your analyzer, but not
analyze the untokenized fields, like your filename.

> I can do a search as this
> "+contents:SomeWord  +filename:SomePath"

The syntax is right, but if you search +filename:somepath, you find only
that file.

For example,
+contents:version +filename:/my/path/myfile.ext
can only find myfile.ext, and if this file doesn't contain "version", it
is not going to find anything. This is because you use +; + makes the
term required.

You can see the query syntax on the Lucene site:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q5

Good luck.

Bye
Ernesto.
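One way to apply this advice is to analyze only the tokenized field and add the untokenized keyword field as a raw TermQuery (Lucene 1.4-era API; the field names come from Dmitrii's mail, the class name is illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class KeywordSearch {
    // Parse the free-text part with the analyzer, but match the
    // untokenized "filename" keyword field with an exact TermQuery.
    public static Query build(String words, String path) throws Exception {
        Query contents = QueryParser.parse(words, "contents", new StandardAnalyzer());
        Query filename = new TermQuery(new Term("filename", path));
        BooleanQuery query = new BooleanQuery();
        query.add(contents, true, false); // required
        query.add(filename, true, false); // required
        return query;
    }
}
```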


On Sun, 15 Aug 2004 at 17:13, Dmitrii PapaGeorgio wrote:
> Ok so when I index a file such as below
> 
> Document doc = new Document();
> doc.Add(Field.Text("contents", new StreamReader(dataDir)));
> doc.Add(Field.Keyword("filename", dataDir));
> 
> I can do a search as this
> "+contents:SomeWord  +filename:SomePath"
> 
> Correct?
> 





javadoc api

2004-08-17 Thread Ernesto De Santis
Hello Lucene developers

A little issue about the Field documentation.

In the Field class, the getBoost() method says:

"Returns the boost factor for hits on any field of this document."

I think this comment was copied from the Document class and someone
forgot to change it.

Bye
Ernesto.







parse Query

2004-08-05 Thread Ernesto De Santis
Hello

What is the best practice to parse a Query object?

QueryParser only works with Strings, but what if I have a Query?

I want other applications to build their Lucene Querys, and I want to
process them when those applications search through my server application.
In my server application I store the configuration: languages, analyzers,
IndexSearchers, how each field is indexed (Keyword or not), etc.

So I need to re-parse a Query into a Query with the appropriate analyzer
over the appropriate terms (fields).

Thanks for your attention.
Ernesto.







Re: Weighting database fields

2004-07-21 Thread Ernesto De Santis
Hi Erik

> On Jul 21, 2004, at 11:40 AM, Anson Lau wrote:
> > Is there any benefit to set the boost during indexing rather than set
> > it
> > during query?
>
> It allows setting each document differently.  For example,
> TheServerSide is using field-level boosts at index time to control
> ordering by date, such that newer articles come up first.  This could
> not be done at query time since each document gets a different field
> boost.
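The index-time variant described here might look like this (Lucene 1.4-era API; the date-based factor and the class name are illustrative assumptions):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostedDoc {
    // Give newer articles a bigger field boost at index time, so they
    // tend to rank first. Unlike a query-time boost, this can differ
    // per document.
    public static Document make(String title, String body, float recency) {
        Document doc = new Document();
        Field titleField = Field.Text("title", title);
        titleField.setBoost(recency); // e.g. 1.0f for old, 2.0f for new
        doc.add(titleField);
        doc.add(Field.Text("body", body));
        return doc;
    }
}
```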

If some field has a boost value set at index time, and at search time
the query has another boost value for this field, what happens?
Which value is used for the boost?

Bye,
Ernesto.







Re: languages lucene can support

2004-07-01 Thread Ernesto De Santis



Hi Praveen

You can develop your SpanishAnalyzer (or one for another language)
easily with SnowballAnalyzer.

I sent you my SpanishAnalyzer.

Bye, Ernesto.
 
- Original Message - 
From: "Praveen Peddi" <[EMAIL PROTECTED]>
To: "lucenelist" <[EMAIL PROTECTED]>
Sent: Thursday, July 01, 2004 6:13 PM
Subject: languages lucene can support

I have read many emails in the lucene mailing list regarding analyzers.
Following is the list of languages lucene supports out of the box, so they
will be supported with no change in our code but just a configuration
change:
English
German
Russian

Following is the list of languages that are available as external
downloads on lucene's site:
Chinese
Japanese
Korean (all of the above come as a single download)
Brazilian
Czech
French
Dutch

I also read that lucene's StandardAnalyzer supports most of the European
languages. Does it mean it supports Spanish also, or is there a separate
analyzer for that? I didn't see any Spanish analyzer in the sandbox or the
lucene release.

Another question regarding FrenchAnalyzer. I downloaded FrenchAnalyzer and
some methods do not throw IOException where they are supposed to, for
example the constructor. I am using 1.4 final (I know it's released only
today :)). What's the fix for it?

Praveen

**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email: [EMAIL PROTECTED]
Tel: 401.854.3475
Fax: 401.861.3596
web: http://www.contextmedia.com
**
Context Media - "The Leader in Enterprise Content Integration"
 

Re: syntax of queries.

2003-12-19 Thread Ernesto De Santis
Erik, thanks!

The article is very good.

I have new questions:

 - apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true);

new Term("contents", "dot")

Does the Term class work for only one word? Is this right:
new Term("contents", "dot java")
to search for dot OR java in contents?

My problem is that the user enters a phrase, and I search for any word in
the phrase, not the entire phrase.
Do I need to parse the string, take it word by word, and add a TermQuery
for each word?

Bye, Ernesto.
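The word-by-word approach asked about here can be sketched with the same BooleanQuery.add(query, required, prohibited) signature as the apiQuery line above (Lucene 1.x era; the class name and the simple lowercasing are my assumptions, standing in for real analysis):

```java
import java.util.StringTokenizer;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AnyWordQuery {
    // OR together one TermQuery per word of the user's phrase:
    // required=false, prohibited=false makes each clause optional,
    // so any matching word produces a hit.
    public static BooleanQuery build(String field, String phrase) {
        BooleanQuery query = new BooleanQuery();
        StringTokenizer words = new StringTokenizer(phrase.toLowerCase());
        while (words.hasMoreTokens()) {
            query.add(new TermQuery(new Term(field, words.nextToken())),
                      false, false);
        }
        return query;
    }
}
```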




- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, December 13, 2003 4:07 AM
Subject: Re: syntax of queries.


Try out the toString("fieldName") trick on your Query instances and
pair them up with what you have below - this will be quite insightful
for the issue - i promise!  :)

Look at my QueryParser article and search for "toString" on that page:
<http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html>

On Friday, December 12, 2003, at 10:38  PM, Ernesto De Santis wrote:

> Thanks Otis, but I haven't resolved my problem.
>
> I looked at the query syntax page and the FAQ's search section.
> I tried many alternatives:
>
> body:(imprimir teclado) title:base = 451 hits
>
> body:(imprimir teclado)^5.1 title:base = 248 hits (* under 451)
>
> body:(imprimir teclado^5.1) title:base = 451 hits - first document:
> 3287.html
>
> body:(imprimir^5.1 teclado) title:base = 451 hits - first document:
> 1545.html
>
> Conclusion:
> I think the boost is only applicable to one word, not to parentheses,
> and not to a field.
>
> I want to make the boost apply to a field.
> For me, a hit in the title is more important than one in the body, for example.
>
> In the FAQ's search section:
>
> Clause  ::=  [ Modifier ] [ FieldName ':' ] BasicClause  [ Boost ]
> BasicClause ::= ( Term | Phrase | | PrefixQuery '(' Query ')'
>
> So in my example BasicClause = (imprimir teclado) and Boost = ^5.1,
> but it does not work.
>
> Regards, Ernesto.
>
> - Original Message -
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>; "Ernesto De
> Santis" <[EMAIL PROTECTED]>
> Sent: Friday, December 12, 2003 7:18 PM
> Subject: Re: syntax of queries.
>
>
>> Maybe it's the spaces after title:?
>> Try title:importar ... instead.
>>
>> Maybe it's the spaces before ^5.0?
>> Try title:importar^5 instead
>>
>> You shouldn't need the parentheses in this case either, I believe.
>>
>> See Query Synax page on Lucene's site.
>>
>> Otis
>>
>>
>> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
>>> Hello
>>>
>>> I am not understanding the syntax of queries.
>>> I search with this string:
>>>
>>> title: (importar) ^5.0 OR title: (arquivos)
>>>
>>> returns 6 hits.
>>>
>>> and with this:
>>>
>>> title: (arquivos) OR title: (importar) ^5.0
>>>
>>> 27 hits.
>>>
>>> Why? In the first, I think it works like AND, but why? :-(
>>>
>>> Regards, Ernesto.
>>>
>>
>>
>> __
>> Do you Yahoo!?
>> New Yahoo! Photos - easier uploading and sharing.
>> http://photos.yahoo.com/
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





Re: syntax of queries.

2003-12-12 Thread Ernesto De Santis
Thanks Otis, but I haven't resolved my problem.

I looked at the query syntax page and the FAQ's search section.
I tried many alternatives:

body:(imprimir teclado) title:base = 451 hits

body:(imprimir teclado)^5.1 title:base = 248 hits (* under 451)

body:(imprimir teclado^5.1) title:base = 451 hits - first document:
3287.html

body:(imprimir^5.1 teclado) title:base = 451 hits - first document:
1545.html

Conclusion:
I think the boost is only applicable to one word, not to parentheses,
and not to a field.

I want to make the boost apply to a field.
For me, a hit in the title is more important than one in the body, for example.

In the FAQ's search section:

Clause  ::=  [ Modifier ] [ FieldName ':' ] BasicClause  [ Boost ]
BasicClause ::= ( Term | Phrase | | PrefixQuery '(' Query ')'

So in my example BasicClause = (imprimir teclado) and Boost = ^5.1,
but it does not work.

Regards, Ernesto.
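Otis's suggestion quoted below is the key point: the boost has to sit directly against the term, with no spaces around ':' or '^', i.e. title:importar^5.0 rather than title: (importar) ^5.0. A tiny sketch (helper name made up) that builds such a clause string; to favor title over body, boost every title clause this way, or call setBoost() on the title sub-query when building queries through the API:

```java
public class BoostedClause {

    // Build a field clause with the boost glued to the term,
    // e.g. title:importar^5.0 - spaces around ':' or '^' change
    // how QueryParser reads the query.
    public static String clause(String field, String term, float boost) {
        return field + ":" + term + "^" + boost;
    }

    public static void main(String[] args) {
        System.out.println(clause("title", "importar", 5.0f)
                + " OR " + clause("title", "arquivos", 1.0f));
        // prints title:importar^5.0 OR title:arquivos^1.0
    }
}
```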

- Original Message -
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; "Ernesto De
Santis" <[EMAIL PROTECTED]>
Sent: Friday, December 12, 2003 7:18 PM
Subject: Re: syntax of queries.


> Maybe it's the spaces after title:?
> Try title:importar ... instead.
>
> Maybe it's the spaces before ^5.0?
> Try title:importar^5 instead
>
> You shouldn't need the parentheses in this case either, I believe.
>
> See Query Synax page on Lucene's site.
>
> Otis
>
>
> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
> > Hello
> >
> > I am not understanding the syntax of queries.
> > I search with this string:
> >
> > title: (importar) ^5.0 OR title: (arquivos)
> >
> > returns 6 hits.
> >
> > and with this:
> >
> > title: (arquivos) OR title: (importar) ^5.0
> >
> > 27 hits.
> >
> > Why? In the first, I think it works like AND, but why? :-(
> >
> > Regards, Ernesto.
> >
>
>
> __
> Do you Yahoo!?
> New Yahoo! Photos - easier uploading and sharing.
> http://photos.yahoo.com/
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>





syntax of queries.

2003-12-12 Thread Ernesto De Santis
Hello

I am not understanding the syntax of queries.
I search with this string:

title: (importar) ^5.0 OR title: (arquivos)

returns 6 hits.

And with this:

title: (arquivos) OR title: (importar) ^5.0

27 hits.

Why? In the first, I think it works like AND, but why? :-(

Regards, Ernesto.


Re: Index pdf files with your content in lucene.

2003-11-12 Thread Ernesto De Santis
Hello

Well, zipping the files did not work.

I can send the files by personal email to anybody who wants them.

And if somebody can post them on a web site, that would be very cool;
I can't post them on a web site myself.

Ernesto.





Re: Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
I'll try again, zipping the files.

Afterwards I will post the files on a web site.

> Could you also tell us a bit about this code?  Is it better than
> existing PDF/Word parsing solutions?  Pure Java?  Uses POI?

This code uses existing parsing solutions.
The intent is to make a Lucene Document to index PDF and Word files with
their content.
It is pure Java.
It uses the TextExtraction library (tm-extractors-0.2.jar),
which uses POI and PDFBox.

Ernesto
Sorry for my bad English.

>
> Thanks,
> Otis
>
>
> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
> > Classes for index Pdf and word files in lucene.
> > Ernesto.
> >
> > ----- Original Message -
> > From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 29, 2003 12:04 PM
> > Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> >
> >
> > Hello all,
> >
> > Thanks very much, Stephan, for your valuable help.
> > Attached you will find the PDFDocument and WordDocument class source
> > code
> >
> > Ernesto.
> >
> >
> > - Original Message -
> > From: "Hartmann, Waehrisch & Feykes GmbH"
> > <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 28, 2003 11:10 AM
> > Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> >
> >
> > > Hi Ernesto,
> > >
> > > the IndexManager retrieves a list of files of a folder by calling
> > the
> > method
> > > getFilesInFolder of CmsObject. This method returns only empty
> > files, i.e.
> > > with empty content. To get the content of a pdf file you have to
> > reread
> > the
> > > file:
> > > f = cms.readFile(f.getAbsolutePath());
> > >
> > > Bye,
> > > Stephan
> > >
> > > Am Montag, 27. Oktober 2003 19:18 schrieben Sie:
> > >
> > > > > Hello
> > > >
> > > > Thanks for the previous reply.
> > > >
> > > > Now, i use
> > > > - version 1.4 of lucene searche module. (the version attached in
> > this
> > list)
> > > > - new version of registry.xml format for module. (like you write
> > me)
> > > > - the pdf files are stored with the binary type.
> > > >
> > > > But i have the next problem:
> > > > i can´t make a InputStream for the cmsfile content.
> > > > For this i write this code in de Document method of my class
> > PDFDocument:
> > > >
> > > > -
> > > >
> > > > InputStream in = new ByteArrayInputStream(f.getContents()); //f
> > is the
> > > > parameter CmsFile of the Document method
> > > >
> > > > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is
> > lib i
> > use.
> > > > in file system work fine.
> > > >
> > > >
> > > > bodyText = extractor.extractText(in);
> > > >
> > > > 
> > > >
> > > > Is correct use ByteArrayInputStream for make a InputStream for a
> > CmsFile?
> > > >
> > > > The error ocurr in the third line.
> > > > In the PDFParcer.
> > > > the error menssage in tomcat is:
> > > >
> > > > java.io.IOException: Error: Header is corrupt ''
> > > > at PDFParcer.parse
> > > > at PDFExtractor.extractText
> > > > at PDFDocument.Document (my class)
> > > > at.
> > > >
> > > > By, and thanks.
> > > > Ernesto.
> > > >
> > > >
> > > > - Original Message -
> > > >   From: Hartmann, Waehrisch & Feykes GmbH
> > > >   To: [EMAIL PROTECTED]
> > > >   Sent: Friday, October 24, 2003 4:45 AM
> > > >   Subject: Re: [opencms-dev] Index pdf files with your content in
> > lucene.
> > > >
> > > >
> > > >   Hello Ernesto,
> > > >
> > > >   i assume you are using the unpatched version 1.3 of the search
> > module.
> > > >   As i mentioned yesterday, the plainDocFactory does only index
> > cmsFiles
> > of
> > > > type "plain" but not of type "binary". PDF files are stored as
> > binary. I
> > > > suggest to use the version i posted yesterday. Then your
> > registry.xml
> > would
> > > > have to look like this: ...

Index pdf files with your content in lucene.

2003-11-11 Thread Ernesto De Santis
Classes to index PDF and Word files in Lucene.
Ernesto.

- Original Message -
From: "Ernesto De Santis" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 29, 2003 12:04 PM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.


Hello all,

Thanks very much, Stephan, for your valuable help.
Attached you will find the PDFDocument and WordDocument class source code.

Ernesto.


- Original Message -
From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, October 28, 2003 11:10 AM
Subject: Re: [opencms-dev] Index pdf files with your content in lucene.


> Hi Ernesto,
>
> the IndexManager retrieves a list of files of a folder by calling the
method
> getFilesInFolder of CmsObject. This method returns only empty files, i.e.
> with empty content. To get the content of a pdf file you have to reread
the
> file:
> f = cms.readFile(f.getAbsolutePath());
>
> Bye,
> Stephan
>
> Am Montag, 27. Oktober 2003 19:18 schrieben Sie:
>
> > > Hello
> >
> > Thanks for the previous reply.
> >
> > Now, i use
> > - version 1.4 of lucene searche module. (the version attached in this
list)
> > - new version of registry.xml format for module. (like you write me)
> > - the pdf files are stored with the binary type.
> >
> > But I have the next problem:
> > I can't make an InputStream from the CmsFile content.
> > For this I wrote this code in the Document method of my class
> > PDFDocument:
> >
> > -
> >
> > InputStream in = new ByteArrayInputStream(f.getContents()); //f is the
> > parameter CmsFile of the Document method
> >
> > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is the lib I
> > use; on the file system it works fine.
> >
> > bodyText = extractor.extractText(in);
> >
> > Is it correct to use ByteArrayInputStream to make an InputStream from a
> > CmsFile?
> >
> > The error occurs in the third line,
> > in the PDFParcer.
> > The error message in Tomcat is:
> >
> > java.io.IOException: Error: Header is corrupt ''
> > at PDFParcer.parse
> > at PDFExtractor.extractText
> > at PDFDocument.Document (my class)
> > at.
> >
> > Bye, and thanks.
> > Ernesto.
> >
> >
> > - Original Message -
> >   From: Hartmann, Waehrisch & Feykes GmbH
> >   To: [EMAIL PROTECTED]
> >   Sent: Friday, October 24, 2003 4:45 AM
> >   Subject: Re: [opencms-dev] Index pdf files with your content in
lucene.
> >
> >
> >   Hello Ernesto,
> >
> >   i assume you are using the unpatched version 1.3 of the search module.
> >   As i mentioned yesterday, the plainDocFactory does only index cmsFiles
of
> > type "plain" but not of type "binary". PDF files are stored as binary. I
> > suggest to use the version i posted yesterday. Then your registry.xml
would
> > have to look like this: ...
> >   
> >   ...
> >  
> >   ...
> >  
> >  
> > 
> >.pdf
> >
net.grcomputing.opencms.search.lucene.PDFDocument
> > 
> >  
> >   ...
> >   
> >
> >   Important: The type attribute must match the file types of OpenCms
(also
> > defined in the registry.xml).
> >
> >   Bye,
> >   Stephan
> >
> > - Original Message -
> > From: Ernesto De Santis
> > To: Lucene Users List
> > Cc: [EMAIL PROTECTED]
> > Sent: Thursday, October 23, 2003 4:16 PM
> > Subject: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> > Hello
> >
> > I am new in opencms and lucene tecnology.
> >
> > I won index pdf files, and index de content of this files.
> >
> > I work in this way:
> >
> > Make a PDFDocument class like JspDocument class.
> > use org.textmining.text.extraction.PDFExtractor class, this class
work
> > fine out of vfs.
> >
> > and write my registry.xml for pdf document, in plainDocFactory tag.
> >
> > 
> > .pdf
> > 
> >
> > net.grcomputing.opencms.search.lucene.PDFDocument
> > 
> >
> > my PDFDocument content this code:
> > I think that the probrem is how take the content from CmsFile?, what
> > InputStream use? PDFExtractor work with extractText(InputStream) method.
> >
> > public clas

Index pdf files with your content in lucene.

2003-10-23 Thread Ernesto De Santis
Hello

I am new to OpenCms and Lucene technology.

I want to index PDF files, and index the content of these files.

I work in this way:

Make a PDFDocument class like the JspDocument class.
Use the org.textmining.text.extraction.PDFExtractor class; this class works fine outside the VFS.

And write my registry.xml for pdf documents, in the plainDocFactory tag.


.pdf


net.grcomputing.opencms.search.lucene.PDFDocument


My PDFDocument contains this code.
I think the problem is how to take the content from the CmsFile, and what InputStream to use?
PDFExtractor works with the extractText(InputStream) method.

public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

    public PDFDocument() {
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile)
        throws CmsException
    {
        return Document(cmsobject, cmsfile, null);
    }

    public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap)
        throws CmsException
    {
        Document document = (new BodylessDocument()).Document(cmsobject, cmsfile);

        // put the content of the pdf file in the document
        String contenido = new String(cmsfile.getContents());

        StringBufferInputStream in = new StringBufferInputStream(contenido);
        // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());

        /* try {
            FileInputStream in = new FileInputStream(cmsfile.getPath() + cmsfile.getName());
        */

        PDFExtractor extractor = new PDFExtractor();

        String body = extractor.extractText(in);

        document.add(Field.Text("body", body));

        /* } catch (FileNotFoundException e) {
            e.toString();
            throw new CmsException();
        }
        */

        return (document);
    }
}


thanks
Ernesto
PD: Sorry for my poor english.
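As the replies in this thread point out, the "Header is corrupt" error comes from pushing the PDF bytes through a String: StringBufferInputStream reads only the low eight bits of each char, and the byte[]-to-String conversion itself can mangle binary data. Wrapping the raw byte[] from CmsFile.getContents() in a ByteArrayInputStream keeps every byte intact. A small stdlib sketch of the byte-safe path (class name made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BinaryStreamDemo {

    // Drain an InputStream into a byte[], the way a parser such as
    // PDFExtractor consumes its input.
    public static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e.toString());
        }
    }

    public static void main(String[] args) {
        // A fake "PDF" header including bytes outside the ASCII range.
        byte[] contents = {'%', 'P', 'D', 'F', (byte) 0xE2, (byte) 0xE3};

        // ByteArrayInputStream hands back exactly the bytes it was given,
        // so the parser would see an intact header.
        InputStream in = new ByteArrayInputStream(contents);
        System.out.println(java.util.Arrays.equals(contents, readAll(in)));
        // prints true
    }
}
```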




- Original Message - 
From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, October 22, 2003 3:50 AM
Subject: Re: [opencms-dev] (no subject)


> Hi Ben,
> 
> i think this won't work since the plainDocFactory will only be used for
> files of type "plain" but not for files of type "binary".
> Recently we have done some additions to the module - by order of Lenord,
> Bauer & Co. GmbH - that could meet your needs. It introduces a more flexible
> way of defining docFactories that you can add new factories without having
> to recompile the whole module. So other modules (like the news) can bring
> their own docFactory and all you have to do is to edit the registry.xml.
> Here is an example:
> 
> 
> 
> 
> .txt
> 
> net.grcomputing.opencms.search.lucene.PlainDocument
> 
> 
> 
> 
> net.grcomputing.opencms.search.lucene.NewsDocument
> 
> 
> 
> To index binary files all you need to add is this:
> 
>
> 
> net.grcomputing.opencms.search.lucene.BodylessDocument
>
> 
> There should be no need for an extension mapping.
> 
> For the interested people:
> For ContentDefinitions (like news) i introduced the following:
> 
>  
> 
> com.opencms.modules.homepage.news.NewsContentDefinition
> 
net.grcomputing.opencms.search.lucene.NewsInitialization
> 
> 1
> -1
> 
> 
> 
> 
> 
> 
> In short:
> initClass is optional: For the news the news classes have to be loaded to
> initialize the db pool.
> listMethod: a method of the content definition class that returns a List of
> elements
> page: the page that can display an entry. Here a jsp that has a template
> element "entry". It also needs the id of the news item.
> getIntId is a method of the content definition class and newsid is the url
> parameter the page needs. A link like
> news.html?__element=entry&newsid=xy
> will be generated.
> 
> Best regards,
> Stephan
> 
> 
> - Original Message - 
> From: "Ben Rometsch" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Wednesday, October 22, 2003 6:15 AM
> Subject: [opencms-dev] (no subject)
> 
> 
> > Hi Matt,
> >
> > I am not having any joy! I've updated my registry.xml file, with the
> > appropriate section reading:
> >
> > 
> > 10
> > true
> > c:\search
> >
> > org.apache.lucene.analysis.standard.StandardAnalyzer
> > true
> > online
> > 
> > 
> >
> > net.grcomputing.opencms.search.lucene.PageDocument
> > 
> > 
> > 
> > .txt
> >
> > net.grcomputing.opencms.search.lucene.PlainDocument
> > 
> > 
> > .html
> > .htm
> > .xml
> > 
> >
> > net.grcomputing.opencms.search.lucene.TaggedPlainDocument
> > 
> >
> > 
> > 
> > .doc
> > .xls
> > .pdf
> >
> > net.grcomputing.opencms.search.lucene.BodylessDocument
> > 
> >
> > 
> > 
> >
> > net.grcomputing.opencms.search.lucene.JspDocument
> > 
> > 
> > 
> > 
> > 
> > Test
> > true
> > 
> > 
> > Test2
> > true
> > 
> > 
> > 
> >
> > Notice the section beginnin