Re: Question about indexing (BrazilianAnalyzer)

2008-06-04 Thread Thomas Arni
Hi, First of all, please always make sure that you use exactly the same Analyzer during indexing and searching. I am not familiar with the BrazilianAnalyzer, but I saw in the source code that it does not use an ISOLatin1AccentFilter, which replaces accented characters (ç -> c). Probab
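
To illustrate what accent folding (ç -> c) does to a token, here is a minimal sketch in plain Java. It is not Lucene's ISOLatin1AccentFilter: it swaps in the JDK's java.text.Normalizer instead, and the class name AccentFolding and the method foldAccents are made up for this example. Whatever folding is chosen would have to be applied identically at index time and at search time.

```java
import java.text.Normalizer;

public class AccentFolding {

    // Hypothetical helper, not part of Lucene.
    // Step 1: decompose each character (NFD), so "c with cedilla" becomes
    //         a plain "c" followed by a combining cedilla mark.
    // Step 2: delete all combining marks, leaving only the base letters.
    static String foldAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e7 is c with cedilla, \u00e3 is a with tilde.
        System.out.println(foldAccents("informa\u00e7\u00e3o")); // prints "informacao"
    }
}
```

Characters that have no canonical decomposition (for example ø or ß) pass through this sketch unchanged, so it is only an approximation of a dedicated accent filter.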

Re: Common Words ignoring problem

2007-03-19 Thread thomas arni
You can adapt the source code of StopAnalyzer.java in the analysis package, or I suppose you can use the default constructor with an empty stop word list (but please check this). If you don't know "Luke", use this small tool to display your index and verify your indexing process. http://www.getopt

Re: PorterStemFilter

2007-03-27 Thread thomas arni
Write your own analyzer, which calls the appropriate Filter in the method "tokenStream". In the method "tokenStream" you can define how the input should be analyzed and parsed. Your analyzer must extend the abstract class Analyzer. The easiest way is to create a new class (Analyzer), which

Re: TF-IDF API

2007-03-28 Thread thomas arni
Have a look at the "TermDocs" interface in the API. You can get the term frequency with an open IndexReader: TermDocs termDocs = reader.termDocs(term); where "term" represents the current Term. Now you can call termDocs.freq() to get the frequency of the term within the current document. For th
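
For readers unsure what these numbers mean, here is a minimal sketch in plain Java that computes them over a tiny in-memory collection. It does not use Lucene at all: the class TermStats and its methods are hypothetical, and the weighting tf * ln(N / df) is just one common textbook variant, assumed here for illustration, not Lucene's own scoring formula. The per-document count corresponds to what termDocs.freq() reports for the current document.

```java
import java.util.Arrays;
import java.util.List;

public class TermStats {

    // Term frequency: how often `term` occurs in one tokenized document.
    static int termFrequency(List<String> docTokens, String term) {
        int count = 0;
        for (String token : docTokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Document frequency: how many documents contain `term` at least once.
    static int documentFrequency(List<List<String>> docs, String term) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Assumed weighting: tf * ln(N / df). Returns 0 when the term is in no document.
    static double tfIdf(List<List<String>> docs, int docIndex, String term) {
        int df = documentFrequency(docs, term);
        if (df == 0) {
            return 0.0;
        }
        int tf = termFrequency(docs.get(docIndex), term);
        return tf * Math.log((double) docs.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "search"));
        System.out.println(termFrequency(docs.get(0), "lucene")); // prints 2
        System.out.println(documentFrequency(docs, "index"));     // prints 2
        System.out.println(tfIdf(docs, 0, "lucene"));
    }
}
```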

Re: Get the total term frequency vector of a specific field from the hit results

2007-04-10 Thread thomas arni
Hello Sengly, First of all, you have to make sure that you create the new Fields, which you add to a Document, with the appropriate constructor. You have to specify the usage of term vectors (Field.TermVector.YES): new Field("text", "your text...", Field.Store.YES, Field.Index.TOKENIZED,Field.Ter

Indexing PDF documents with structure information

2007-08-13 Thread Thomas Arni
Hello Luceners, I have started a new project and need to index PDF documents. There are several projects around that can extract the content, such as pdfbox, xpdf and pjclassic. As far as I can tell from the FAQs and examples, all these tools allow simple text extraction. Which of these open sour

Re: Searching Diacritics

2007-08-27 Thread thomas arni
You can extend the DefaultAnalyzer. The only thing you have to do is to override the method tokenStream like this: /** Constructs a {@link StandardTokenizer} filtered by a {@link StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}.