Re: Zip Files
Hello. First, you need a parser for each file type (PDF, TXT, Word, etc.), and a Java API to iterate over the zip content. See:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/zip/ZipInputStream.html

Use the getNextEntry() method. A little example:

ZipInputStream zis = new ZipInputStream(fileInputStream);
ZipEntry zipEntry;
while ((zipEntry = zis.getNextEntry()) != null) {
    // use zipEntry to get the name, etc.
    // get the proper parser for the current entry
    // use the parser with zis (the ZipInputStream)
}

Good luck,
Ernesto

Luke Shannon wrote:
> Hello; Anyone have ideas on how to index the contents within zip files?
> Thanks, Luke

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.
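A self-contained sketch of the loop above, using only java.util.zip. It builds a small archive in memory so it runs anywhere; the per-type parser dispatch is left as a comment, since that part depends on your PDF/Word/TXT libraries:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipWalk {
    // Collect the entry names of a zip stream. A real indexer would look at
    // each name's extension here and hand zis to the matching parser.
    static List<String> entryNames(ZipInputStream zis) throws Exception {
        List<String> names = new ArrayList<String>();
        ZipEntry zipEntry;
        // Note the extra parentheses: assignment first, then the null test.
        while ((zipEntry = zis.getNextEntry()) != null) {
            names.add(zipEntry.getName());
            zis.closeEntry();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny zip in memory so the example is runnable without files.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream zos = new ZipOutputStream(buf);
        zos.putNextEntry(new ZipEntry("a.txt"));
        zos.write("hello".getBytes("UTF-8"));
        zos.closeEntry();
        zos.putNextEntry(new ZipEntry("b.pdf"));
        zos.closeEntry();
        zos.close();

        ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(entryNames(zis)); // [a.txt, b.pdf]
        zis.close();
    }
}
```

The ZipInputStream positions itself at each entry's uncompressed bytes after getNextEntry(), so the same stream object can be passed straight to a parser.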
Re: Disk space used by optimize - no free disk space corrupts index.
Hi all,

We have a big index and little free disk space. When we optimize and all the space is consumed, our index is corrupted: the segments file points to nonexistent files.

Environment: Java 1.4.2_04, W2000 SP4, Tomcat 5.5.4

Bye,
Ernesto.

Yura Smolsky wrote:

Hello, Otis. There is a big difference between the compound index format and multiple files. I have tested it on a big index (45 GB). When I used the compound format, optimize took 3 times more space, because the *.cfs needs to be unpacked. Now I use the non-compound file format. It needs about twice as much disk space.

OG> Have you tried using the multifile index format? Now I wonder if there
OG> is actually a difference in the disk space consumed by optimize() when you
OG> use the multifile and compound index formats...
OG> Otis

OG> --- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:

Our copy of LIA is "in the mail" ;)

Yes, the final three files are: the .cfs (46.8MB), deletable (4 bytes), and segments (29 bytes).

--Leto

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]

Hello,

Yes, that is how optimize works: it copies all existing index segments into one unified index segment, thus optimizing it. See hit #1:
http://www.lucenebook.com/search?query=optimize+disk+space

However, three times the space sounds a bit too much, or I made a mistake in the book. :) You said you end up with 3 files, and .cfs is one of them, right?

Otis

--- "Kauler, Leto S" <[EMAIL PROTECTED]> wrote:

Just a quick question: after writing an index and then calling optimize(), is it normal for the index to expand to about three times its size before finally compressing?

In our case the optimise grinds the disk, expanding the index into many files of about 145MB total, before compressing down to three files of about 47MB total. That must be a lot of disk activity for people with multi-gigabyte indexes!
Regards, Leto

Yura Smolsky,

--
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.300 / Virus Database: 265.8.5 - Release Date: 03/02/2005
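The numbers reported in this thread can be turned into a back-of-the-envelope free-space check before calling optimize(). The factors below are the empirical observations from the thread (compound format peaked near 3x the index size, multifile near 2x), not guarantees:

```java
public class OptimizeDiskEstimate {
    // Rough peak disk usage during optimize(), per this thread's reports:
    // compound (.cfs) format peaked at about 3x the index size because the
    // .cfs must be unpacked; multifile format needed about 2x. Empirical
    // rules of thumb only, not guarantees.
    static double peakBytes(double indexBytes, boolean compoundFormat) {
        return indexBytes * (compoundFormat ? 3.0 : 2.0);
    }

    public static void main(String[] args) {
        double indexGb = 45.0; // the 45 GB index Yura mentions
        System.out.println("compound peak ~ " + peakBytes(indexGb, true) + " GB");  // 135.0 GB
        System.out.println("multifile peak ~ " + peakBytes(indexGb, false) + " GB"); // 90.0 GB
    }
}
```

Checking the estimate against free disk space up front would avoid the corrupted-segments situation Ernesto describes.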
Re: Optimize not deleting all files
Hi all,

We have the same problem. We suspect the problem is that Windows locks the files.

Our environment: Windows 2000, Tomcat 5.5.4

Ernesto.

[EMAIL PROTECTED] wrote:

Hi,

When I run an optimize in our production environment, old index files are left in the directory and are not deleted. My understanding is that an optimize will create new index files and that all pre-existing index files should be deleted. Is this correct? We are running Lucene 1.4.2 on Windows. Any help is appreciated. Thanks!
Re: Lucene and multiple languages
I sent you the source code in a private mail.

Ernesto.

aurora wrote:

Thanks. I would like to give it a try. Is the source code available? I'm using a Python version of Lucene, so it would need to be wrapped or ported :)

Hi Aurora

I developed a tool dealing with this multiple-languages issue. I found the Nutch library "language-identifier" very useful. The jar has Nutch dependencies, but I deleted all the code that was unnecessary for me. The language-identifier I use works fine and is very simple. For example:

LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);

It works for 11 languages: English, Spanish, Portuguese, Dutch, German, French, Italian, and others. I can send you this touched-up jar, but remember that it is from Nutch, for copyright (or left :).
http://www.nutch.org/LICENSE.txt

More comments below...

aurora wrote:

I'm trying to build a web search tool that could work for multiple languages. I understand that Lucene ships with StandardAnalyzer plus German and Russian analyzers and some more in the sandbox, and that indexing and searching should use the same analyzer. Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When a user enters a query, which analyzer should be used? Should the user be asked for the language upfront? What should I expect when the analyzer and the document don't match? Let's say the query is parsed using StandardAnalyzer: would it match any documents analyzed with the German analyzer at all, or would it just give poor results?

When this happens, in most cases you do not get matches.

Also, is there a good way to find out the languages used in a web page? There is a 'content-language' header in HTTP and a 'lang' attribute in HTML. It looks like people don't really use them. How can we recognize the language?

With the language identifier.
:) Even more interesting is multiple languages used in one document, say half English and half French. Is there a good way to deal with those cases?

The language identifier only returns one language. I looked into language-identifier: it works with a score for each language and returns the language with the greatest value. Maybe you can modify language-identifier to take the languages with the greatest scores.

Bye,
Ernesto.

Thanks for any guidance.
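The score-per-language, pick-the-maximum idea Ernesto describes can be sketched without the Nutch jar. The word lists below are made-up stand-ins for the identifier's real n-gram profiles; only the arg-max structure matches what the thread describes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ToyLanguageIdentifier {
    // Tiny illustrative profiles; the real identifier uses n-gram statistics.
    private static final Map<String, Set<String>> PROFILES =
            new HashMap<String, Set<String>>();
    static {
        PROFILES.put("en", new HashSet<String>(Arrays.asList("the", "and", "of")));
        PROFILES.put("es", new HashSet<String>(Arrays.asList("el", "la", "de", "y")));
        PROFILES.put("fr", new HashSet<String>(Arrays.asList("le", "la", "et", "de")));
    }

    // Score each language by profile-word hits, return the highest scorer.
    public static String identify(String text) {
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int score = 0;
            for (String token : text.toLowerCase().split("\\W+")) {
                if (profile.getValue().contains(token)) score++;
            }
            if (score > bestScore) { bestScore = score; best = profile.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(identify("the house of the rising sun")); // en
        System.out.println(identify("el perro y la casa"));          // es
    }
}
```

For the mixed-language case Aurora asks about, keeping the whole score map instead of just the maximum would let you report every language above a threshold, which is essentially the modification Ernesto suggests.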
Re: Lucene and multiple languages
Hi Aurora

I developed a tool dealing with this multiple-languages issue. I found the Nutch library "language-identifier" very useful. The jar has Nutch dependencies, but I deleted all the code that was unnecessary for me. The language-identifier I use works fine and is very simple. For example:

LanguageIdentifier languageIdentifier = LanguageIdentifier.getInstance();
String userInputText = "free text";
String language = languageIdentifier.identify(userInputText);

It works for 11 languages: English, Spanish, Portuguese, Dutch, German, French, Italian, and others. I can send you this touched-up jar, but remember that it is from Nutch, for copyright (or left :).
http://www.nutch.org/LICENSE.txt

More comments below...

aurora wrote:

I'm trying to build a web search tool that could work for multiple languages. I understand that Lucene ships with StandardAnalyzer plus German and Russian analyzers and some more in the sandbox, and that indexing and searching should use the same analyzer. Now let's say I have an index with documents in multiple languages, analyzed by an assortment of analyzers. When a user enters a query, which analyzer should be used? Should the user be asked for the language upfront? What should I expect when the analyzer and the document don't match? Let's say the query is parsed using StandardAnalyzer: would it match any documents analyzed with the German analyzer at all, or would it just give poor results?

When this happens, in most cases you do not get matches.

Also, is there a good way to find out the languages used in a web page? There is a 'content-language' header in HTTP and a 'lang' attribute in HTML. It looks like people don't really use them. How can we recognize the language?

With the language identifier.

:) Even more interesting is multiple languages used in one document, say half English and half French. Is there a good way to deal with those cases?

The language identifier only returns one language. I looked into language-identifier: it works with a score for each language and returns the language with the greatest value. Maybe you can modify language-identifier to take the languages with the greatest scores.

Bye,
Ernesto.

Thanks for any guidance.
Re: where is the SnowBallAnalyzer?
It is in snowball-1.0.jar. I sent it to you in a private email.

Bye,
Ernesto.

- Original Message -
From: "Wermus Fernando" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, September 08, 2004 1:12 PM
Subject: where is the SnowBallAnalyzer?

I have to look closer, but why isn't the SnowballAnalyzer (org.apache.lucene.analysis.snowball.SnowballAnalyzer) included? I have Lucene 1.4. I'm writing my own Spanish stemmer.
Re: spanish stemmer
Hi Chad

> One more question to the group. From what I have gathered, my choices for indexing and querying Spanish content are:
> 1. StandardAnalyzer (I read that this analyzer could be used for "European" languages)

The StandardAnalyzer is not specifically for European languages; it is a generic analyzer.

> 2. SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS); <-- custom stop words from Ernesto's class below
> Can I assume that choice 2 would be the better for Spanish content?

Yes, it is much better. For example: to StandardAnalyzer, caminar, caminantes, and camino are different words; it only returns a hit if the match is exact. To SpanishAnalyzer, they are the same word: these three words are all forms of caminar. If one document in your index has the word "caminante", you can get a hit with the different conjugations of that verb.

Stemmers work by stripping words according to the rules of the language (Spanish, for us): caminar, caminantes, and camino are all stored as camin. (Camin does not exist in Spanish.) This improves the quality of the hits.

> thanks,
> chad.

Bye,
Ernesto.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Because the SnowballAnalyzer and SpanishStemmer don't have a default stop-word set. SnowballAnalyzer constructor:

/** Builds the named analyzer with no stop words. */
public SnowballAnalyzer(String name) {
    this.name = name;
}

Note the comment.

Bye,
Ernesto.

- Original Message -
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer

Excellent, Ernesto. Was there a reason you used your own stop word list and not just the default constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Yes, it is quite easy.
You need to write a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye,
Ernesto.

--

public class SpanishAnalyzer extends Analyzer {
    private static SnowballAnalyzer analyzer;

    private String SPANISH_STOP_WORDS[] = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
        "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
        "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
        "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
        "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
        "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
        "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
        "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
        "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
        "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
        "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
        "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
        "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
        "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
        "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
        "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
        "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
        "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
        "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
        "de", "a", "e", "i", "o", "u"};
    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String stopWords[]) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}
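What the SPANISH_STOP_WORDS list buys you can be shown with plain collections, no Lucene required. This is a minimal sketch of stop-word removal, using a few entries from the list above and assuming whitespace-delimited, lower-cased input:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // A few entries taken from the SPANISH_STOP_WORDS list above.
    static final Set<String> STOP = new HashSet<String>(
        Arrays.asList("un", "una", "el", "la", "de", "para", "en", "con", "sin"));

    // Drop stop words, keeping only the content-bearing tokens that are
    // worth indexing (the analyzer would additionally stem them).
    static List<String> filter(String text) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!STOP.contains(token)) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter("el camino para la casa")); // [camino, casa]
    }
}
```

In the real SpanishAnalyzer this filtering happens inside SnowballAnalyzer's token stream, before stemming.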
Re: spanish stemmer
Because the SnowballAnalyzer and SpanishStemmer don't have a default stop-word set. SnowballAnalyzer constructor:

/** Builds the named analyzer with no stop words. */
public SnowballAnalyzer(String name) {
    this.name = name;
}

Note the comment.

Bye,
Ernesto.

- Original Message -
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer

Excellent, Ernesto. Was there a reason you used your own stop word list and not just the default constructor SnowballAnalyzer("Spanish")?

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Yes, it is quite easy. You need to write a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye,
Ernesto.

--

public class SpanishAnalyzer extends Analyzer {
    private static SnowballAnalyzer analyzer;

    private String SPANISH_STOP_WORDS[] = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
        "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
        "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
        "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
        "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
        "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
        "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
        "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
        "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
        "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
        "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
        "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
        "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
        "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
        "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
        "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
        "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
        "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
        "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
        "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String stopWords[]) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}
Re: spanish stemmer
Hello Grant

Thanks for your response. I have a basic understanding of analyzers. The problem is that I think the words ending in 'bol' need to be stripped, like:

original -> generated word
tornillos -> tornill

I need:

basquetbol -> basquet

Bye,
Ernesto.

- Original Message -
From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:09 PM
Subject: Re: spanish stemmer

Ernesto,

http://snowball.tartarus.org/texts/introduction.html might help with your understanding. The link provides basic info on why stemmers are valuable (not necessarily any insight on how the Spanish version works). Of course, they don't solve every problem, and in some cases may make things worse. A stemmer is not required to return a whole word.

Hope this helps.

>>> [EMAIL PROTECTED] 8/23/2004 9:29:30 AM >>>

Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words ending in 'bol' are not stripped. For example: in Spanish, to say basketball you can say basquet or basquetbol, but to the SpanishStemmer they are different words. The same goes for voley and voleybol. Not so for futbol (football): we do not say fut for futbol, and 'fut' does not exist in Spanish anyway.

Do you think I am correct? Can you change this?

Ernesto.
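The behavior Ernesto wants (conflate basquetbol/basquet and voleybol/voley, but leave futbol alone) amounts to a suffix rule with an exception list. The sketch below is a toy illustration of that idea, not the Snowball algorithm; the exception set is hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BolSuffixRule {
    // Hypothetical exception list: words where 'bol' must NOT be stripped,
    // because the remainder (fut) is not a word on its own.
    static final Set<String> KEEP_BOL = new HashSet<String>(Arrays.asList("futbol"));

    // Toy rule in the spirit of the thread: strip a trailing 'bol' unless
    // the word is on the exception list. This is NOT what SpanishStemmer
    // does; it only illustrates the rule-plus-exceptions shape of stemmers.
    static String stem(String word) {
        if (word.endsWith("bol") && !KEEP_BOL.contains(word)) {
            return word.substring(0, word.length() - 3);
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("basquetbol")); // basquet
        System.out.println(stem("voleybol"));   // voley
        System.out.println(stem("futbol"));     // futbol
    }
}
```

As Grant notes, a real stemmer is driven by general language rules rather than word lists, which is exactly why edge cases like these exist.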
Re: spanish stemmer
Yes, it is quite easy. You need to write a wrapper for the Spanish Snowball initialization:

analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye,
Ernesto.

--

public class SpanishAnalyzer extends Analyzer {
    private static SnowballAnalyzer analyzer;

    private String SPANISH_STOP_WORDS[] = {
        "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
        "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
        "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
        "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
        "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
        "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
        "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
        "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
        "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
        "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
        "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
        "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
        "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
        "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
        "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
        "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
        "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
        "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
        "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
        "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
        "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
        "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
        "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
        "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
        "de", "a", "e", "i", "o", "u"};

    public SpanishAnalyzer() {
        analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
    }

    public SpanishAnalyzer(String stopWords[]) {
        analyzer = new SnowballAnalyzer("Spanish", stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return analyzer.tokenStream(fieldName, reader);
    }
}

- Original Message -
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 3:49 PM
Subject: RE: spanish stemmer

Do you mind sharing how you implemented your SpanishAnalyzer using Snowball? Sorry I can't help with your question. I am trying to implement Snowball Spanish or a Spanish Analyzer in Lucene.

thanks,
chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer

Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words ending in 'bol' are not stripped. For example: in Spanish, to say basketball you can say basquet or basquetbol, but to the SpanishStemmer they are different words. The same goes for voley and voleybol. Not so for futbol (football): we do not say fut for futbol, and 'fut' does not exist in Spanish anyway.

Do you think I am correct? Can you change this?

Ernesto.
spanish stemmer
Hello

I use the Snowball jar to implement my SpanishAnalyzer. I found that words ending in 'bol' are not stripped. For example: in Spanish, to say basketball you can say basquet or basquetbol, but to the SpanishStemmer they are different words. The same goes for voley and voleybol. Not so for futbol (football): we do not say fut for futbol, and 'fut' does not exist in Spanish anyway.

Do you think I am correct? Can you change this?

Ernesto.
Re: Index and Search question in Lucene.
Hi Dimitri

What analyzer do you use? You need to be careful with Keyword fields and analyzers. When you index a Document, the fields that have tokenized = false, like Keyword fields, are not analyzed. At search time you need to parse the query with your analyzer, but you must not analyze the untokenized fields, like your filename.

> I can do a search as this
> "+contents:SomeWord +filename:SomePath"

The syntax is right, but if you search +filename:somepath, you find only that file. For example, with +contents:version +filename:/my/path/myfile.ext you can only find myfile.ext, and if that file does not contain "version", you will not find anything. This is because you use +: + makes the term required. You can see the query syntax on the Lucene site:

http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q5

Good luck.

Bye,
Ernesto.

El dom, 15 de 08 de 2004 a las 17:13, Dmitrii PapaGeorgio wrote:

> Ok so when I index a file such as below
>
> Document doc = new Document();
> doc.Add(Field.Text("contents", new StreamReader(dataDir)));
> doc.Add(Field.Keyword("filename", dataDir));
>
> I can do a search as this
> "+contents:SomeWord +filename:SomePath"
>
> Correct?
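The semantics of the two required (+) clauses can be sketched with plain sets: each required clause intersects the set of matching documents. The two maps below are a hypothetical stand-in for Lucene's per-field postings, with the field names and example path taken from the thread:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RequiredClauses {
    // Hypothetical postings: term -> documents containing it, per field.
    static Map<String, Set<String>> contents = new HashMap<String, Set<String>>();
    static Map<String, Set<String>> filename = new HashMap<String, Set<String>>();

    // "+contents:term +filename:path": both clauses are required,
    // so the result is the intersection of the two document sets.
    static Set<String> search(String term, String path) {
        Set<String> hits = new HashSet<String>(
            contents.getOrDefault(term, new HashSet<String>()));
        hits.retainAll(filename.getOrDefault(path, new HashSet<String>()));
        return hits;
    }

    public static void main(String[] args) {
        contents.put("version", new HashSet<String>(Arrays.asList("doc1", "doc2")));
        filename.put("/my/path/myfile.ext", new HashSet<String>(Arrays.asList("doc2")));
        // Only doc2 satisfies both required clauses.
        System.out.println(search("version", "/my/path/myfile.ext")); // [doc2]
        // If the file's document lacked the term, the result would be empty,
        // which is exactly the "you will not find anything" case above.
        System.out.println(search("release", "/my/path/myfile.ext")); // []
    }
}
```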
javadoc api
Hello Lucene developers

A little issue with the Field documentation. In the Field class, the getBoost() method says: "Returns the boost factor for hits on any field of this document." I think this comment was copied from the Document class and someone forgot to change it.

Bye,
Ernesto.
parse Query
Hello

What is the best practice for parsing a Query object? QueryParser only works with a String, but what if I have a Query? I want other applications to build their own Lucene Querys, and I want to parse them when those applications search through my server application. In my server application I store the configuration: languages, analyzers, IndexSearchers, how each field is indexed (Keyword or not), etc. So I need to parse a Query into a Query with the appropriate analyzer over the appropriate terms (fields).

Thanks for your attention.

Ernesto.
Re: Weighting database fields
Hi Erik

> On Jul 21, 2004, at 11:40 AM, Anson Lau wrote:
> > Is there any benefit to setting the boost during indexing rather than
> > setting it during the query?
>
> It allows setting each document differently. For example,
> TheServerSide is using field-level boosts at index time to control
> ordering by date, such that newer articles come up first. This could
> not be done at query time, since each document gets a different field
> boost.

If a field has a boost value set at index time, and at search time the query has another boost value for that field, what happens? Which value is used for the boost?

Bye,
Ernesto.
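To Ernesto's question: in Lucene's classic scoring, neither boost "wins" over the other; the index-time boost (folded into the field norm) and the query-time boost both multiply into the score. A toy calculation with made-up numbers, simplified down to just the multiplicative structure:

```java
public class BoostCombination {
    // Simplified illustration: the index-time field boost (stored in the
    // field norm) and the query-time boost are both factors of the score.
    // The raw similarity value and the boosts below are made up.
    static float score(float rawSimilarity, float indexTimeBoost, float queryTimeBoost) {
        return rawSimilarity * indexTimeBoost * queryTimeBoost;
    }

    public static void main(String[] args) {
        float raw = 0.5f;
        System.out.println(score(raw, 2.0f, 1.0f)); // index boost only -> 1.0
        System.out.println(score(raw, 1.0f, 3.0f)); // query boost only -> 1.5
        System.out.println(score(raw, 2.0f, 3.0f)); // both multiply    -> 3.0
    }
}
```

One caveat: index-time boosts are encoded into a single byte in the norms, so they are applied at reduced precision compared to query-time boosts.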
Re: languages lucene can support
Hi Praveen

You can develop your SpanishAnalyzer (or one for another language) easily with SnowballAnalyzer. I am sending you my SpanishAnalyzer.

Bye,
Ernesto.

- Original Message -
From: "Praveen Peddi" <[EMAIL PROTECTED]>
To: "lucenelist" <[EMAIL PROTECTED]>
Sent: Thursday, July 01, 2004 6:13 PM
Subject: languages lucene can support

I have read many emails on the Lucene mailing list regarding analyzers. Following is the list of languages Lucene supports out of the box, so they will be supported with no change in our code but just a configuration change: English, German, Russian.

Following is the list of languages that are available as external downloads on Lucene's site: Chinese, Japanese, Korean (all of the above come as a single download), Brazilian, Czech, French, Dutch.

I also read that Lucene's StandardAnalyzer supports most of the European languages. Does that mean it supports Spanish also, or is there a separate analyzer for that? I didn't see any Spanish analyzer in the sandbox or the Lucene release.

Another question regarding the FrenchAnalyzer. I downloaded the FrenchAnalyzer, and some methods do not throw IOException where they are supposed to, for example the constructor. I am using 1.4 final (I know it was released only today :)). What's the fix for it?

Praveen

**
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email: [EMAIL PROTECTED]
Tel: 401.854.3475
Fax: 401.861.3596
web: http://www.contextmedia.com
**
Context Media - "The Leader in Enterprise Content Integration"
Re: syntax of queries.
Erik,

Thanks! The article is very good. I have new questions:

- apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true);

Does the Term class work with only one word? Is that right? For example, new Term("contents", "dot java") to search for dot OR java in contents. My problem is that the user enters a phrase, and I search for any word in the phrase, not the entire phrase. Do I need to parse the string, take it word by word, and add a TermQuery for each word?

Bye,
Ernesto.

- Original Message -
From: "Erik Hatcher" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Saturday, December 13, 2003 4:07 AM
Subject: Re: syntax of queries.

Try out the toString("fieldName") trick on your Query instances and pair them up with what you have below - this will be quite insightful for the issue, I promise! :) Look at my QueryParser article and search for "toString" on that page:
<http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html>

On Friday, December 12, 2003, at 10:38 PM, Ernesto De Santis wrote:

> Thanks Otis, but I haven't resolved my problem.
>
> I saw the Query syntax page and the FAQ's search section.
> I tried many alternatives:
>
> body:(imprimir teclado) title:base = 451 hits
> body:(imprimir teclado)^5.1 title:base = 248 hits (* under 451)
> body:(imprimir teclado^5.1) title:base = 451 hits - first document: 3287.html
> body:(imprimir^5.1 teclado) title:base = 451 hits - first document: 1545.html
>
> Conclusion: I think the boost is only applicable to one word, not to
> parentheses and not to a field.
>
> I want to make the boost apply to a field. For me, a hit in title is
> more important than one in body, for example.
>
> In the FAQ's search section:
>
> Clause ::= [ Modifier ] [ FieldName ':' ] BasicClause [ Boost ]
> BasicClause ::= ( Term | Phrase | | PrefixQuery '(' Query ')'
>
> Then, in my example, BasicClause = (imprimir teclado) and Boost = ^5.1,
> but it does not work.
>
> Regards, Ernesto.
>
> - Original Message -
> From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>; "Ernesto De
> Santis" <[EMAIL PROTECTED]>
> Sent: Friday, December 12, 2003 7:18 PM
> Subject: Re: syntax of queries.
>
>> Maybe it's the spaces after title:?
>> Try title:importar ... instead.
>>
>> Maybe it's the spaces before ^5.0?
>> Try title:importar^5 instead.
>>
>> You shouldn't need the parentheses in this case either, I believe.
>>
>> See the Query Syntax page on Lucene's site.
>>
>> Otis
>>
>> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
>>> Hello
>>>
>>> I am not understanding the syntax of queries.
>>> I search with this string:
>>>
>>> title: (importar) ^5.0 OR title: (arquivos)
>>>
>>> It returns 6 hits.
>>>
>>> And with this:
>>>
>>> title: (arquivos) OR title: (importar) ^5.0
>>>
>>> 27 hits.
>>>
>>> Why? In the first, I think it works like AND, but why? :-(
>>>
>>> Regards, Ernesto.
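Splitting the user's phrase into one clause per word can be done with plain string handling before each word goes into a TermQuery (a Term does hold a single indexed token). A sketch that just builds the clause strings, using the "contents" field from the thread; the lower-casing assumes an analyzer that lower-cases at index time:

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseToTerms {
    // Split a free-text phrase into one field:word clause per word, with
    // OR semantics: each clause would become one TermQuery added to a
    // BooleanQuery with required=false.
    static List<String> clauses(String field, String phrase) {
        List<String> out = new ArrayList<String>();
        for (String word : phrase.trim().toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(field + ":" + word);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(clauses("contents", "dot java")); // [contents:dot, contents:java]
    }
}
```

Alternatively, handing the raw phrase to QueryParser with the same analyzer used at index time produces the equivalent word-by-word query automatically.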
Re: syntax of queries.
Thanks Otis, but I didn't resolve my problem.

I looked at the query syntax page and the FAQ's search section, and tried many alternatives:

body:(imprimir teclado) title:base = 451 hits

body:(imprimir teclado)^5.1 title:base = 248 hits (fewer than 451)

body:(imprimir teclado^5.1) title:base = 451 hits - first document: 3287.html

body:(imprimir^5.1 teclado) title:base = 451 hits - first document: 1545.html

Conclusion: I think the boost is only applicable to a single word, not to a parenthesized group, and not to a field. I want to make the boost apply to a field; for me, a hit in the title is more important than a hit in the body, for example.

In the FAQ's search section: Clause ::= [ Modifier ] [ FieldName ':' ] BasicClause [ Boost ] BasicClause ::= ( Term | Phrase | | PrefixQuery '(' Query ')' So in my example BasicClause = (imprimir teclado) and Boost = ^5.1, but it does not work.

Regards, Ernesto.

- Original Message - From: "Otis Gospodnetic" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]>; "Ernesto De Santis" <[EMAIL PROTECTED]> Sent: Friday, December 12, 2003 7:18 PM Subject: Re: syntax of queries. > Maybe it's the spaces after title:? > Try title:importar ... instead. > > Maybe it's the spaces before ^5.0? > Try title:importar^5 instead > > You shouldn't need the parentheses in this case either, I believe. > > See Query Syntax page on Lucene's site. > > Otis > > > --- Ernesto De Santis <[EMAIL PROTECTED]> wrote: > > Hello > > > > I not undertanding the syntax of queries. > > I search with this string: > > > > title: (importar) ^5.0 OR title: (arquivos) > > > > return 6 hits. > > > > and with this: > > > > title: (arquivos) OR title: (importar) ^5.0 > > > > 27 hits. > > > > why? > > in the first, I think that work like AND, but, why? :-( > > > > Regards, Ernesto. > > > > > __ > Do you Yahoo!? > New Yahoo! Photos - easier uploading and sharing.
> http://photos.yahoo.com/ > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
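A note on the boost question in this thread: a clause boost by itself changes how matching documents are ranked, not which documents match, which is consistent with the hit count above staying at 451. A toy scorer (this is not Lucene's real scoring formula; the weights are invented for illustration) shows why a boosted title clause can reorder results without changing the result set:

```java
public class BoostDemo {

    // Toy scoring only: each matching clause contributes 1.0, and a
    // boost multiplies that clause's contribution. Real Lucene scoring
    // also involves term frequency, idf, and length normalization.
    static double score(boolean titleHit, boolean bodyHit, double titleBoost) {
        double s = 0.0;
        if (titleHit) s += titleBoost * 1.0;
        if (bodyHit)  s += 1.0;
        return s;
    }

    public static void main(String[] args) {
        // Both documents match (score > 0), but with titleBoost = 5.0
        // the title-only hit ranks above the body-only hit.
        System.out.println(score(true, false, 5.0));  // 5.0
        System.out.println(score(false, true, 5.0));  // 1.0
    }
}
```

So a result set that changes size when a boost is added (451 vs. 248 hits above) points at a query-parsing difference, not at the boost itself.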
syntax of queries.
Hello, I am not understanding the syntax of queries. I search with this string:

title: (importar) ^5.0 OR title: (arquivos)

and it returns 6 hits. With this:

title: (arquivos) OR title: (importar) ^5.0

I get 27 hits. Why? In the first case I think it works like AND, but why? :-(

Regards, Ernesto.
Re: Index pdf files with their content in lucene.
Hello. Well, zipping the files did not work. I can send the files to a personal email address if somebody wants them. And if somebody can post them on a web site, very cool; I can't post them on a web site myself. Ernesto. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index pdf files with their content in lucene.
I will try zipping the files again; afterwards I'll post the files on the web site.

> Could you also tell us a bit about this code? Is it better than > existing PDF/Word parsing solutions? Pure Java? Uses POI?

This code uses existing parsing solutions. The intent is to build a Lucene Document to index PDF and Word files with their content. It is pure Java and uses the TextExtraction library (tm-extractors-0.2.jar), which uses POI and PDFBox.

Ernesto. Sorry for my bad English.

> > Thanks, > Otis > > > --- Ernesto De Santis <[EMAIL PROTECTED]> wrote: > > Classes for index Pdf and word files in lucene. > > Ernesto. > > > > - Original Message - > > From: "Ernesto De Santis" <[EMAIL PROTECTED]> > > To: <[EMAIL PROTECTED]> > > Sent: Wednesday, October 29, 2003 12:04 PM > > Subject: Re: [opencms-dev] Index pdf files with your content in > > lucene. > > > > > > Hello all, > > > > Thans very much Stephan for your valuable help. > > Attached you will find the PDFDocument, and WordDocument class source > > code > > > > Ernesto. > > > > > > - Original Message - > > From: "Hartmann, Waehrisch & Feykes GmbH" > > <[EMAIL PROTECTED]> > > To: <[EMAIL PROTECTED]> > > Sent: Tuesday, October 28, 2003 11:10 AM > > Subject: Re: [opencms-dev] Index pdf files with your content in > > lucene. > > > > > > > Hi Ernesto, > > > > > > the IndexManager retrieves a list of files of a folder by calling > > the > > method > > > getFilesInFolder of CmsObject. This method returns only empty > > files, i.e. > > > with empty content. To get the content of a pdf file you have to > > reread > > the > > > file: > > > f = cms.readFile(f.getAbsolutePath()); > > > > > > Bye, > > > Stephan > > > > > > On Monday, 27 October 2003 19:18 you wrote: > > > > > > > > Hello > > > > > > > > Thanks for the previous reply. > > > > > > > > Now, i use > > > > - version 1.4 of lucene searche module. (the version attached in > > this > > list) > > > > - new version of registry.xml format for module.
(like you write > > me) > > > > - the pdf files are stored with the binary type. > > > > > > > > But i have the next problem: > > > > i can´t make a InputStream for the cmsfile content. > > > > For this i write this code in de Document method of my class > > PDFDocument: > > > > > > > > - > > > > > > > > InputStream in = new ByteArrayInputStream(f.getContents()); //f > > is the > > > > parameter CmsFile of the Document method > > > > > > > > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is > > lib i > > use. > > > > in file system work fine. > > > > > > > > > > > > bodyText = extractor.extractText(in); > > > > > > > > > > > > > > > > Is correct use ByteArrayInputStream for make a InputStream for a > > CmsFile? > > > > > > > > The error ocurr in the third line. > > > > In the PDFParcer. > > > > the error menssage in tomcat is: > > > > > > > > java.io.IOException: Error: Header is corrupt '' > > > > at PDFParcer.parse > > > > at PDFExtractor.extractText > > > > at PDFDocument.Document (my class) > > > > at. > > > > > > > > By, and thanks. > > > > Ernesto. > > > > > > > > > > > > - Original Message - > > > > From: Hartmann, Waehrisch & Feykes GmbH > > > > To: [EMAIL PROTECTED] > > > > Sent: Friday, October 24, 2003 4:45 AM > > > > Subject: Re: [opencms-dev] Index pdf files with your content in > > lucene. > > > > > > > > > > > > Hello Ernesto, > > > > > > > > i assume you are using the unpatched version 1.3 of the search > > module. > > > > As i mentioned yesterday, the plainDocFactory does only index > > cmsFiles > > of > > > > type "plain" but not of type "binary". PDF files are stored as > > binary. I > > > > suggest to use the version i posted yesterday. Then your > > registry.xml > > would > > > > have to look like this: ...
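On the ByteArrayInputStream question quoted above: yes, wrapping the raw bytes from CmsFile.getContents() in a ByteArrayInputStream is the right way to get an InputStream over in-memory binary content, because it hands the parser every byte unchanged. Going through a String first (as with the deprecated StringBufferInputStream) can corrupt binary data such as a PDF header, since byte-to-char conversion is lossy for non-text bytes. A small stdlib sketch (the sample bytes are invented):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class BinaryStreamDemo {

    // Drain an InputStream into a byte array, as a parser would read it.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A PDF-like header followed by a non-ASCII binary byte.
        byte[] contents = {'%', 'P', 'D', 'F', '-', '1', '.', '4', (byte) 0xE2};

        // ByteArrayInputStream delivers exactly the original bytes,
        // so the "Header is corrupt" class of errors cannot come from here.
        byte[] copy = readAll(new ByteArrayInputStream(contents));
        System.out.println(Arrays.equals(contents, copy)); // true
    }
}
```

The remaining failure mode, as Stephan points out in this thread, is that the byte array itself is empty because getFilesInFolder returns files without content; rereading via cms.readFile fixes that.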
Index pdf files with their content in lucene.
Classes to index PDF and Word files in Lucene. Ernesto. - Original Message - From: "Ernesto De Santis" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, October 29, 2003 12:04 PM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. Hello all, Thanks very much, Stephan, for your valuable help. Attached you will find the PDFDocument and WordDocument class source code. Ernesto. - Original Message - From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Tuesday, October 28, 2003 11:10 AM Subject: Re: [opencms-dev] Index pdf files with your content in lucene. > Hi Ernesto, > > the IndexManager retrieves a list of files of a folder by calling the method > getFilesInFolder of CmsObject. This method returns only empty files, i.e. > with empty content. To get the content of a pdf file you have to reread the > file: > f = cms.readFile(f.getAbsolutePath()); > > Bye, > Stephan > > On Monday, 27 October 2003 19:18 you wrote: > > > > Hello > > > > Thanks for the previous reply. > > > > Now, i use > > - version 1.4 of lucene searche module. (the version attached in this list) > > - new version of registry.xml format for module. (like you write me) > > - the pdf files are stored with the binary type. > > > > But i have the next problem: > > i can´t make a InputStream for the cmsfile content. > > For this i write this code in de Document method of my class PDFDocument: > > > > - > > > > InputStream in = new ByteArrayInputStream(f.getContents()); //f is the > > parameter CmsFile of the Document method > > > > PDFExtractor extractor = new PDFExtractor(); //PDFExtractor is lib i use. > > in file system work fine. > > > > > > bodyText = extractor.extractText(in); > > > > > > > > Is correct use ByteArrayInputStream for make a InputStream for a CmsFile? > > > > The error ocurr in the third line. > > In the PDFParcer.
> > the error menssage in tomcat is: > > > > java.io.IOException: Error: Header is corrupt '' > > at PDFParcer.parse > > at PDFExtractor.extractText > > at PDFDocument.Document (my class) > > at. > > > > By, and thanks. > > Ernesto. > > > > > > - Original Message - > > From: Hartmann, Waehrisch & Feykes GmbH > > To: [EMAIL PROTECTED] > > Sent: Friday, October 24, 2003 4:45 AM > > Subject: Re: [opencms-dev] Index pdf files with your content in lucene. > > > > > > Hello Ernesto, > > > > i assume you are using the unpatched version 1.3 of the search module. > > As i mentioned yesterday, the plainDocFactory does only index cmsFiles of > > type "plain" but not of type "binary". PDF files are stored as binary. I > > suggest to use the version i posted yesterday. Then your registry.xml would > > have to look like this: ... > > > > ... > > > > ... > > > > > > > >.pdf > > net.grcomputing.opencms.search.lucene.PDFDocument > > > > > > ... > > > > > > Important: The type attribute must match the file types of OpenCms (also > > defined in the registry.xml). > > > > Bye, > > Stephan > > > > - Original Message - > > From: Ernesto De Santis > > To: Lucene Users List > > Cc: [EMAIL PROTECTED] > > Sent: Thursday, October 23, 2003 4:16 PM > > Subject: [opencms-dev] Index pdf files with your content in lucene. > > > > > > Hello > > > > I am new in opencms and lucene tecnology. > > > > I won index pdf files, and index de content of this files. > > > > I work in this way: > > > > Make a PDFDocument class like JspDocument class. > > use org.textmining.text.extraction.PDFExtractor class, this class work > > fine out of vfs. > > > > and write my registry.xml for pdf document, in plainDocFactory tag. > > > > > > .pdf > > > > > > net.grcomputing.opencms.search.lucene.PDFDocument > > > > > > my PDFDocument content this code: > > I think that the probrem is how take the content from CmsFile?, what > > InputStream use? PDFExtractor work with extractText(InputStream) method. > > > > public clas
Index pdf files with their content in lucene.
Hello, I am new to OpenCms and Lucene technology. I want to index PDF files, including the content of those files.

I work this way: I made a PDFDocument class like the JspDocument class, using the org.textmining.text.extraction.PDFExtractor class; this class works fine outside the VFS. And I wrote my registry.xml entry for PDF documents in the plainDocFactory tag: .pdf net.grcomputing.opencms.search.lucene.PDFDocument

My PDFDocument contains the code below. I think the problem is how to take the content from the CmsFile: what InputStream should I use? PDFExtractor works with an extractText(InputStream) method.

public class PDFDocument implements I_DocumentConstants, I_DocumentFactory {

public PDFDocument(){ }

public Document Document(CmsObject cmsobject, CmsFile cmsfile) throws CmsException { return Document(cmsobject, cmsfile, null); }

public Document Document(CmsObject cmsobject, CmsFile cmsfile, HashMap hashmap) throws CmsException { Document document = (new BodylessDocument()).Document(cmsobject, cmsfile); // put the content of the pdf file String contenido = new String(cmsfile.getContents()); StringBufferInputStream in = new StringBufferInputStream(contenido); // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes()); /* try{ FileInputStream in = new FileInputStream(cmsfile.getPath() + cmsfile.getName()); */ PDFExtractor extractor = new PDFExtractor(); String body = extractor.extractText(in); document.add(Field.Text("body", body)); /* }catch(FileNotFoundException e){ e.toString(); throw new CmsException(); } */ return (document); }

Thanks, Ernesto. PS: Sorry for my poor English.

- Original Message - From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Wednesday, October 22, 2003 3:50 AM Subject: Re: [opencms-dev] (no subject) > Hi Ben, > > i think this won't work since the plainDocFactory will only be used for > files of type "plain" but not for files of type "binary". > Recently we have done some additions to the module - by order of Lenord, > Bauer & Co.
GmbH - that could meet your needs. It introduces a more flexible > way of defining docFactories that you can add new factories without having > to recompile the whole module. So other modules (like the news) can bring > their own docFactory and all you have to do is to edit the registry.xml. > Here is an example: > > > > > .txt > > net.grcomputing.opencms.search.lucene.PlainDocument > > > > > net.grcomputing.opencms.search.lucene.NewsDocument > > > > To index binary files all you need to add is this: > > > > net.grcomputing.opencms.search.lucene.BodylessDocument > > > There should be no need for an extension mapping. > > For the interested people: > For ContentDefinitions (like news) i introduced the following: > > > > com.opencms.modules.homepage.news.NewsContentDefinition > > net.grcomputing.opencms.search.lucene.NewsInitialization ss> > > 1 > -1 > > > > > > > In short: > initClass is optional: For the news the news classes have to be loaded to > initialize the db pool. > listMethod: a method of the content definition class that returns a List of > elements > page: the page that can display an entry. Here a jsp that has a template > element "entry". It also needs the id of the news item. > getIntId is a method of the content definition class and newsid is the url > parameter the page needs. A link like > news.html?__element=entry&newsid=xy > will be generated. > > Best regards, > Stephan > > > - Original Message - > From: "Ben Rometsch" <[EMAIL PROTECTED]> > To: <[EMAIL PROTECTED]> > Sent: Wednesday, October 22, 2003 6:15 AM > Subject: [opencms-dev] (no subject) > > > > Hi Matt, > > > > I am not having any joy! 
I've updated my registry.xml file, with the > > appropriate section reading: > > > > > > 10 > > true > > c:\search > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > true > > online > > > > > > > > net.grcomputing.opencms.search.lucene.PageDocument > > > > > > > > .txt > > > > net.grcomputing.opencms.search.lucene.PlainDocument > > > > > > .html > > .htm > > .xml > > > > > > net.grcomputing.opencms.search.lucene.TaggedPlainDocument > > > > > > > > > > .doc > > .xls > > .pdf > > > > net.grcomputing.opencms.search.lucene.BodylessDocument > > > > > > > > > > > > net.grcomputing.opencms.search.lucene.JspDocument > > > > > > > > > > > > Test > > true > > > > > > Test2 > > true > > > > > > > > > > Notice the section beginnin