Hi Liaqat,

Are you sure that the Urdu characters are being correctly interpreted by the JVM even during the file I/O operation?

I would expect the Unicode characters to be encoded as multi-byte sequences, so the string-matching operations would fail if the string literals in your source were compiled with an encoding different from the file's.
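For instance, the two-letter stop word "کی" occupies four bytes in UTF-8 (two bytes per character). You can confirm this with a one-liner (getBytes(String) declares UnsupportedEncodingException, so call it from a method that throws it):

        System.out.println("کی".getBytes("UTF-8").length);  // prints 4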

Can you try a simple indexOf() to confirm that this is not what is going on?
E.g., for a document where you know a stop word occurs, print out the value of:
        line.indexOf(URDU_STOP_WORDS[1])
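
A minimal sketch of that check (the file name is just an example; it assumes the java.io imports and the UTF-8 corpus files from your program below):

        // Read one line of a document known to contain the stop word,
        // decoding it explicitly as UTF-8, then test the literal against it.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("corpus\\doc1.txt"), "UTF-8"));
        String line = in.readLine();
        System.out.println(line.indexOf(URDU_STOP_WORDS[1]));  // -1 means no match
        in.close();

If this prints -1 for a line that visibly contains the stop word, the suspect is the encoding of the source literals rather than Lucene; check the -encoding option passed to javac.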

Regards,

-h
----------------------------------------------------------------------
Hira, N.R.
Solutions Architect
Cognocys, Inc.

On 27-Dec-2007, at 2:22 AM, Liaqat Ali wrote:

Doron Cohen wrote:
Hi Liaqat,

This part of the code seems correct and should work, so the problem
must be elsewhere.

Can you post a short program that demonstrates the problem?

You can start with something like this:
      Document doc = new Document();
      doc.add(new Field("text",URDU_STOP_WORDS[0] +
                  " regular text",Store.YES, Index.TOKENIZED));
      indexWriter.addDocument(doc);

Now URDU_STOP_WORDS[0] should not appear within the index terms.
You can easily verify this by iterating IndexReader.terms();
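
A minimal sketch of that loop (Lucene 2.x TermEnum; dir here stands for the same Directory the IndexWriter wrote to, opened after the writer is closed):

      IndexReader reader = IndexReader.open(dir);
      TermEnum terms = reader.terms();
      while (terms.next()) {
          // every term in the index; the stop words should not appear here
          System.out.println(terms.term().field() + ": " + terms.term().text());
      }
      terms.close();
      reader.close();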

Regards, Doron

On Dec 27, 2007 9:36 AM, Liaqat Ali <[EMAIL PROTECTED]> wrote:


Hi Grant,

I think I did not make myself clear. I am trying to pass a list of Urdu stop words as an argument to the StandardAnalyzer, but it does not work for me.

public static final String[] URDU_STOP_WORDS = { "کی", "کا", "کو", "ہے", "کے", "نے", "پر", "اور", "سے", "میں", "بھی", "ان", "ایک", "تھا", "تھی", "کیا", "ہیں", "کر", "وہ", "جس", "نہں", "تک" };
Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);


Kindly give some guidelines.

Regards,
Liaqat


The whole program is given below, but it does not eliminate stop words from the index.


import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;


public class urduIndexer1 {

    Reader file;
    BufferedReader buff;
    String line;
    IndexWriter writer;
    String indexDir;
    Directory dir;

    public static final String[] URDU_STOP_WORDS = { "کی", "کا", "کو", "ہے", "کے", "نے", "پر", "اور", "سے", "میں", "بھی", "ان", "ایک", "تھا", "تھی", "کیا", "ہیں", "کر", "وہ", "جس", "نہں", "تک" };

    public void index() throws IOException {

        indexDir = "D:\\UIR\\index";
        Analyzer analyzer = new StandardAnalyzer(URDU_STOP_WORDS);
        boolean createFlag = true;

        dir = FSDirectory.getDirectory(indexDir);
        writer = new IndexWriter(dir, analyzer, createFlag);

        for (int i = 1; i <= 201; i++) {

            // read each corpus file, decoding it explicitly as UTF-8
            file = new InputStreamReader(
                    new FileInputStream("corpus\\doc" + i + ".txt"), "UTF-8");
            buff = new BufferedReader(file);

            StringBuffer sb = new StringBuffer();
            while ((line = buff.readLine()) != null) {
                // append a separator so words at line breaks are not glued together
                sb.append(line).append(' ');
            }
            buff.close();

            Document document = new Document();
            document.add(new Field("contents", sb.toString(),
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(document);
        }

        writer.optimize();
        writer.close();
    }
}
