Brilliant - I see the problem now, thank you very much. As you say, in my first example I was calling StringReader directly with whatever read() returned, which is not necessarily utf-8. My own StringReader class didn't specify utf-8 either.
I've simply added uniText=unicode(textString, 'utf8','ignore' ) and passed this into PyLucene.StringReader and getBestFragment, and it works!

Thanks again,
Phil.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andi Vajda
Sent: 28 November 2006 17:05
To: '[email protected]'
Subject: Re: [pylucene-dev] Problems with StringReader()

On Tue, 28 Nov 2006, BEADLING, Philip, GBM wrote:

> def highlight( self, searchText, searchResultFilenames ):
>     for filename in searchResultFilenames:
>         # Find text directory from documents directory and convert network fileshare to local mount
>         textFile = filename.replace("\\Documents\\","\\Text\\") + ".txt"
>         textFile = textFile.replace("\\", "/")
>         textFile = textFile.replace("//networkshare/IRDcaf/Documentation", "/Documentation")
>
>         print "<br>", searchText, "<br>", textFile
>         if os.path.isfile( textFile ):
>             filen = open( textFile, 'r' )
>             textString = filen.read()
>             filen.close()
>             term = Term( "field", searchText )
>             termQuery = TermQuery( term )
>             scorer = QueryScorer( termQuery )
>             highlighter = Highlighter( scorer )
>             simpAn = SimpleAnalyzer()
>             # PROBLEM IS HERE!!!!
>             reader = PyLucene.StringReader( textString )
>             tokenStream = simpAn.tokenStream("field", reader )
>             print highlighter.getBestFragment( tokenStream, textString )
>

At first quick glance, it doesn't look like 'textString' is going to be of type 'unicode' in the above code sample. What comes out of a python file's read method is an object of type 'str'. I believe PyLucene will try to convert the 'str' into a 'unicode' object by assuming 'utf-8' encoding. If your 'str' is not 'utf-8' encoded then that is going to fail.

If you send in a piece of code that runs (with the required data) that reproduces the problem you're experiencing, I might be able to help you better.

Andi..
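For illustration, here is a minimal sketch of the decoding step discussed in this thread. It leaves PyLucene out and only shows how raw file bytes become text before they would be handed to something like StringReader. The helper names `decode_lossy` and `read_text_as_unicode` are hypothetical, as is the sample data. The post uses the Python 2 spelling `unicode(textString, 'utf8', 'ignore')`; the sketch uses `.decode(...)` on the byte string, which should behave the same way.

```python
import io


def decode_lossy(raw, encoding="utf-8"):
    """Decode raw bytes, silently dropping bytes that are invalid in `encoding`.

    Mirrors the 'ignore' error handler used in the post; note that
    dropped bytes are lost, so the result may differ from the file.
    """
    return raw.decode(encoding, "ignore")


def read_text_as_unicode(path, encoding="utf-8"):
    """Read a file as bytes and decode it explicitly (hypothetical helper)."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    return decode_lossy(raw, encoding)


# Sample data (an assumption): Latin-1 encoded bytes, which are not valid UTF-8.
latin1_bytes = b"caf\xe9 au lait"

try:
    latin1_bytes.decode("utf-8")  # strict decoding, the default
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

lossy_text = decode_lossy(latin1_bytes)
```

Because 'ignore' discards whatever cannot be decoded, text highlighted from the result may be missing characters. If the real encoding of the files is known, decoding with that encoding instead would preserve them.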
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
