[pylucene-dev] Problems with StringReader()

BEADLING, Philip, GBM Tue, 28 Nov 2006 07:37:39 -0800

Hi,


I've been using PyLucene to interface with a documentation index I've been
writing.  The basic index reader works like a charm, but I'm having issues
using the Highlighter class as shown below.

 

What I've tried to do is pretty simple, if not overly elegant.  My (separate)
indexer creates a bog standard Lucene index of pdf, doc and html files and
also dumps out the stripped text from each into a directory structure
similar to my documentation share.  The idea is that once I've located a hit
using my index, I can then read in the equivalent text file and use the
Highlighter class to give me a snippet of the best search result.

 

If, instead of reading in one of these text files with read(), I pass in a
dummy string, the whole thing works.  But when I pass in the contents of the
text file as shown below, the whole thing freezes when we try to create a
StringReader from that text.  When stripping text from PDFs etc., I'm using
the standard Java tools like PDFBox recommended for use with Lucene, and they
work fine in my Jython indexer app (a separate tool), creating indexes that
work on their own through PyLucene (I'm pretty certain my build of PyLucene
is healthy, as the rest of my app works).

 

These text files are by no means huge: a few pages of text each, which I
wouldn't expect to be too much to pass around as a string or to cause any
issue.


The example is pretty much straight out of the Lucene In Action book
(converted to Python, of course):

 

 

    def highlight( self, searchText, searchResultFilenames ):
        for filename in searchResultFilenames:
            # Find text directory from documents directory and convert
            # network fileshare to local mount
            textFile = filename.replace("\\Documents\\","\\Text\\") + ".txt"
            textFile = textFile.replace("\\", "/")
            textFile = textFile.replace("//networkshare/IRDcaf/Documentation", "/Documentation")

            print "<br>", searchText, "<br>", textFile
            if os.path.isfile( textFile ):
                filen = open( textFile, 'r' )
                textString = filen.read()
                filen.close()
                term = Term( "field", searchText )
                termQuery = TermQuery( term )
                scorer = QueryScorer( termQuery )
                highlighter = Highlighter( scorer )
                simpAn = SimpleAnalyzer()
                # PROBLEM IS HERE!!!!
                reader = PyLucene.StringReader( textString )
                tokenStream = simpAn.tokenStream("field", reader )
                print highlighter.getBestFragment( tokenStream, textString )

 

 

Having had no joy with this I have tried writing my own StringReader
function rather than using the one that comes with PyLucene.

 

I've tried a few things, but the code is basically as below.  My own
string reader succeeds (in that it instantiates), but I then get a barf on
getBestFragment, usually telling me that textString is type <str> and not
type unicode, which is weird because of course it is (I haven't changed this
part of the functionality).  I tried wrapping textString using the Python
unicode() call, with no joy.
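
One thing that may be worth ruling out is the file being read as raw bytes
(a str) that contain non-ASCII characters left over from the PDF/doc
stripping.  The following is only a minimal sketch of decoding at read time;
it is not confirmed to cure the freeze in PyLucene.StringReader.  It assumes
the text files are UTF-8 (the encoding is not stated anywhere above), and
read_text_as_unicode is a hypothetical helper name, not part of PyLucene:

```python
import io


def read_text_as_unicode(path, encoding="utf-8"):
    """Return the contents of `path` as a unicode string.

    Hypothetical helper; the UTF-8 default is an assumption and should be
    changed to whatever encoding the text dump was actually written in.
    """
    # io.open decodes while reading. errors="replace" substitutes U+FFFD
    # for any byte sequence that is not valid in the chosen encoding,
    # instead of raising UnicodeDecodeError.
    with io.open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read()
```

The returned value would then stand in for textString, so that the reader and
getBestFragment are both given the same unicode object rather than a str.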

 

I'm now totally out of ideas, Google is pretty much no help on this, and I'm
gutted because if I can't get this working I'm going to have to ditch
PyLucene and use the full blown Java version (or Jython), neither of which
is really suitable for the environment I'm working in (just a pokey little
cgi-bin script).

 

So if anyone can be of any help I'd be very grateful.

 

Many thanks,

 

Phil.

 

 

***My StringReader attempt***

 

class StringReader(object):

    def __init__(self, text):
        self.text = unicode(text, errors='ignore')

    def read(self, length = -1):
        text = self.text
        if text is None:
            return ''

        if length == -1 or length >= len(text):
            self.text = None
            return text

        text = text[0:length]
        self.text = self.text[length:]

        return text

    def close(self):
        pass

 

 

 

 


