Re: StopWords problem

N. Hira Thu, 27 Dec 2007 00:46:37 -0800

Hi Liaqat,

Are you sure that the Urdu characters are being correctly interpreted by the JVM even during the file I/O operation?

I would expect Unicode characters to be encoded as multi-byte sequences, so the string-matching operations would fail if the encoding of the literals differs from the file encoding.
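To illustrate the kind of failure I mean, here is a minimal sketch using only the JDK. The class and method names are made up for this example, and the stop word is written with Unicode escapes so the result does not depend on how the compiler reads the source file. The same UTF-8 bytes are decoded twice, once with the right charset and once with the wrong one:

```java
import java.nio.charset.Charset;

public class EncodingMismatchDemo {

    // The first stop word from the list, spelled as escapes (U+06A9 U+06CC)
    // so that the compiler's source encoding cannot affect it.
    static final String STOP_WORD = "\u06A9\u06CC";

    /** Decodes raw bytes with the named charset and searches for the stop word. */
    static int findStopWord(byte[] raw, String charsetName) {
        String decoded = new String(raw, Charset.forName(charsetName));
        return decoded.indexOf(STOP_WORD);
    }

    public static void main(String[] args) {
        // Stands in for the bytes of a UTF-8 text file that contains the stop word.
        byte[] raw = ("abc " + STOP_WORD + " xyz").getBytes(Charset.forName("UTF-8"));

        System.out.println(findStopWord(raw, "UTF-8"));      // prints 4
        System.out.println(findStopWord(raw, "ISO-8859-1")); // prints -1
    }
}
```

With the wrong charset every byte becomes its own character, so the two-character stop word never appears in the decoded string. The same thing can happen on the other side: if the .java file holding the literals is saved as UTF-8 but compiled with a different default encoding, the literals themselves are garbled (javac's -encoding option controls this).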


Can you try out a simple indexOf() to confirm that this is not going on?

E.g., for a document where you know a stop word occurs, print out the value of:

        line.indexOf(URDU_STOP_WORDS[1])
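If that prints -1, it helps to see what the JVM actually holds on each side. A small sketch, assuming nothing beyond the JDK (the helper name hex is made up for this example), that dumps a string's UTF-16 code units so you can compare the literal against the text read from the file:

```java
import java.nio.charset.Charset;

public class StopWordDiagnostics {

    /** Returns the UTF-16 code units of s as space-separated hex, e.g. "06A9 0627". */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // URDU_STOP_WORDS[1] from the posted list, written as escapes.
        String stopWord = "\u06A9\u0627";
        System.out.println(hex(stopWord)); // prints 06A9 0627

        // The same word after its UTF-8 bytes are wrongly decoded as ISO-8859-1.
        String garbled = new String(stopWord.getBytes(Charset.forName("UTF-8")),
                                    Charset.forName("ISO-8859-1"));
        System.out.println(hex(garbled));  // prints 00DA 00A9 00D8 00A7
    }
}
```

In the indexer, calling hex(URDU_STOP_WORDS[1]) and hex(line) on a line known to contain that word should show the same run of values in the 06xx range. Values in the 00xx range on either side point to a decoding problem on that side.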

Regards,

-h
----------------------------------------------------------------------
Hira, N.R.
Solutions Architect
Cognocys, Inc.

On 27-Dec-2007, at 2:22 AM, Liaqat Ali wrote:

Doron Cohen wrote:

Hi Liaqat,

This part of the code seems correct and should work, so the problem
must be elsewhere.

Can you post a short program that demonstrates the problem?

You can start with something like this:
      Document doc = new Document();
      doc.add(new Field("text",URDU_STOP_WORDS[0] +
                  " regular text",Store.YES, Index.TOKENIZED));
      indexWriter.addDocument(doc);

Now URDU_STOP_WORDS[0] should not appear within the index terms.
You can easily verify this by iterating IndexReader.terms();

Regards, Doron

On Dec 27, 2007 9:36 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote:

Hi, Grant
I think I did not make myself clear. I am trying to pass a list of Urdu stop words as an argument to the StandardAnalyzer, but it does not work well for me.
public static final String[] URDU_STOP_WORDS ={ "کی" ,"کا" ,"کو" ,"ہے"
,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی"
,"ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Kindly give some guidelines.

Regards,
Liaqat
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

The whole program is given below. But it does not eliminate stopwords from the index.



import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;


public class urduIndexer1  {


   Reader file;
   BufferedReader buff;
   String line;
   IndexWriter writer;
   String indexDir;
   Directory dir;

public static final String[] URDU_STOP_WORDS ={ "کی" ,"کا" ,"کو" ,"ہے" ,"کے" ,"نے" ,"پر" ,"اور" ,"سے","میں" ,"بھی","ان" ,"ایک" ,"تھا" ,"تھی" ,"کیا" ,"ہیں" ,"کر" ,"وہ" ,"جس" ,"نہں" ,"تک" };



   public void index() throws IOException,
    UnsupportedEncodingException {

       indexDir = "D:\\UIR\\index";
       Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
       boolean createFlag = true;

       dir = FSDirectory.getDirectory(indexDir);
       writer = new IndexWriter(dir, analyzer, createFlag);

       for (int i=1;i<=201;i++)  {

           file = new InputStreamReader(new FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");

           StringBuffer sb = new StringBuffer();

           buff = new BufferedReader(file);

           //line = buff.readLine();

           while( (line = buff.readLine()) != null) {
               sb.append(line);
           }

           boolean eof = false;

           Document document  = new Document();

           document.add(new Field("contents",sb.toString(),Field.Store.NO, Field.Index.TOKENIZED));


