Brilliant - I see the problem now, thank you very much.

As you say, in my first example I was passing whatever read() returned
straight to StringReader, and that is not necessarily utf-8.  My own
StringReader class didn't specify utf-8 either.

I've simply added

uniText=unicode(textString, 'utf8','ignore' )

and passed this into PyLucene.StringReader and getBestFragment and it works!
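
For illustration, here is a minimal sketch of what the 'ignore' argument does
to bytes that are not valid utf-8. It is written for Python 3, where the
equivalent of unicode(textString, 'utf8', 'ignore') is
textString.decode('utf-8', 'ignore'). The helper name decode_lossy and the
sample string are made up for this example and are not part of PyLucene.

```python
# Python 3 sketch; in Python 2 the equivalent call is
# unicode(raw, 'utf8', 'ignore').
def decode_lossy(raw, encoding="utf-8", errors="ignore"):
    """Decode raw bytes to text, handling undecodable bytes per `errors`."""
    return raw.decode(encoding, errors)

# "café" encoded as Latin-1: the byte 0xE9 is not valid utf-8 on its own.
raw = "café".encode("latin-1")

ignored = decode_lossy(raw)                      # invalid byte is dropped
replaced = decode_lossy(raw, errors="replace")   # invalid byte becomes U+FFFD
correct = decode_lossy(raw, encoding="latin-1")  # right codec, nothing lost

print(repr(ignored), repr(replaced), repr(correct))
```

Note that 'ignore' silently discards the offending bytes, so the highlighted
text can lose characters. If the real encoding of the files is known, decoding
with that codec avoids the loss.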

Thanks again,

Phil.



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andi Vajda
Sent: 28 November 2006 17:05
To: '[email protected]'
Subject: Re: [pylucene-dev] Problems with StringReader()


On Tue, 28 Nov 2006, BEADLING, Philip, GBM wrote:

>    def highlight( self, searchText, searchResultFilenames ):
>        for filename in searchResultFilenames:
>            # Find text directory from documents directory and convert network fileshare to local mount
>            textFile = filename.replace("\\Documents\\","\\Text\\") + ".txt"
>            textFile = textFile.replace("\\", "/")
>            textFile = textFile.replace("//networkshare/IRDcaf/Documentation", "/Documentation")
>
>
>            print "<br>", searchText, "<br>", textFile
>            if os.path.isfile( textFile ):
>                filen = open( textFile, 'r' )
>                textString = filen.read()
>                filen.close()
>                term = Term( "field", searchText )
>                termQuery = TermQuery( term )
>                scorer = QueryScorer( termQuery )
>                highlighter = Highlighter( scorer )
>                simpAn = SimpleAnalyzer()
>                # PROBLEM IS HERE!!!!
>                reader = PyLucene.StringReader( textString )
>                tokenStream = simpAn.tokenStream("field", reader )
>                print highlighter.getBestFragment( tokenStream, textString )
>

At first quick glance, it doesn't look like 'textString' is going to be of
type 'unicode' in the above code sample. What comes out of a python file's
read method is an object of type 'str'. I believe PyLucene will try to convert
the 'str' into a 'unicode' object by assuming 'utf-8' encoding. If your 'str'
is not 'utf-8' encoded then that is going to fail.
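
The failure mode described above can be sketched without PyLucene at all: a
strict utf-8 decode raises UnicodeDecodeError on bytes written in another
encoding. This is a Python 3 sketch (in Python 2 the input would be a 'str'),
and the helper is_valid_utf8 is a hypothetical name used only here. It says
nothing about how PyLucene performs the conversion internally.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as strict utf-8."""
    try:
        raw.decode("utf-8")  # strict mode is the default
        return True
    except UnicodeDecodeError:
        return False

# The same text stored in two different encodings.
utf8_bytes = "naïve".encode("utf-8")
latin1_bytes = "naïve".encode("latin-1")

print(is_valid_utf8(utf8_bytes))    # decodes cleanly
print(is_valid_utf8(latin1_bytes))  # 0xEF followed by 'v' is not valid utf-8
```

A check like this, run on the bytes returned by read(), shows whether the
file needs an explicit decode before it is handed to StringReader.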

If you send in a piece of code that runs (with the required data) that 
reproduces the problem you're experiencing, I might be able to help you 
better.

Andi..
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
