Hi,
I've been using PyLucene to interface with a documentation index I've been
writing. The basic index reader works like a charm, but I'm having issues
using the Highlighter class as shown below.
What I've tried to do is pretty simple, if not overly elegant. My (separate)
indexer creates a bog-standard Lucene index of PDF, DOC and HTML files and
also dumps out the stripped text from each into a directory structure that
mirrors my documentation share. The idea is that once I've located a hit
using my indexer, I can then read in the equivalent text file and use the
Highlighter class to give me a snippet of the best search result.
If, instead of reading in one of these text files with read(), I pass in a
dummy string, the whole thing works. But when I pass in the contents of the
text file as shown below, the whole thing freezes when we try to create a
StringReader from that text. For stripping text from PDFs etc., I'm using
the standard Java tools such as PDFBox that are recommended for use with
Lucene, and they work fine in my Jython indexer app (a separate tool),
creating indexes that work on their own through PyLucene (I'm pretty
certain my build of PyLucene is healthy, as the rest of my app works).
These text files are by no means huge: a few pages of text each, which I
wouldn't expect to be too much to pass around as a string, or to cause any
issue on size alone.
The example is pretty much straight out of the Lucene In Action book
(converted to Python, of course):
def highlight( self, searchText, searchResultFilenames ):
    for filename in searchResultFilenames:
        # Find text directory from documents directory and convert
        # network fileshare to local mount
        textFile = filename.replace("\\Documents\\","\\Text\\") + ".txt"
        textFile = textFile.replace("\\", "/")
        textFile = textFile.replace("//networkshare/IRDcaf/Documentation", "/Documentation")
        print "<br>", searchText, "<br>", textFile
        if os.path.isfile( textFile ):
            filen = open( textFile, 'r' )
            textString = filen.read()
            filen.close()
            term = Term( "field", searchText )
            termQuery = TermQuery( term )
            scorer = QueryScorer( termQuery )
            highlighter = Highlighter( scorer )
            simpAn = SimpleAnalyzer()
            # PROBLEM IS HERE!!!!
            reader = PyLucene.StringReader( textString )
            tokenStream = simpAn.tokenStream("field", reader )
            print highlighter.getBestFragment( tokenStream, textString )
Having had no joy with this, I have tried writing my own StringReader
class rather than using the one that comes with PyLucene.
I've tried a few things, but the code is basically as shown below. My own
string reader succeeds (in that it instantiates), but I then get a barf on
getBestFragment, usually telling me that textString is type <str> and not
type unicode, which is weird because of course it is (I haven't changed this
part of the functionality). I tried wrapping textString using the Python
unicode() call, with no joy.
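One thing I haven't ruled out is the encoding of the text files themselves.
Here is a minimal sketch of reading a file straight into a unicode string, so
that the reader and getBestFragment would both be handed the same type. The
helper name read_text_as_unicode is made up for illustration, and the UTF-8
default is an assumption (I haven't confirmed what encoding my text dump
actually uses). I don't know whether this cures the freeze; it only removes
the str/unicode mismatch as a variable.

```python
import io


def read_text_as_unicode(path, encoding="utf-8"):
    """Hypothetical helper: return the file contents as decoded text.

    The encoding default is an assumption, not something known about
    the stripped text files.
    """
    # io.open decodes while reading, so the result is a unicode string
    # (unicode on Python 2, str on Python 3) rather than raw bytes.
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    with io.open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read()
```

The returned value would then take the place of textString in the highlight
method above.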
I'm now totally out of ideas, Google is pretty much no help on this, and
I'm gutted because if I can't get this working I'm going to have to ditch
PyLucene and use the full-blown Java version (or Jython), neither of which
is really suitable for the environment I'm working in (just a pokey little
cgi-bin script).
So if anyone can be of any help I'd be very grateful.
Many thanks,
Phil.
***My StringReader attempt***
class StringReader(object):
    def __init__(self, text):
        self.text = unicode(text, errors='ignore')
    def read(self, length = -1):
        text = self.text
        if text is None:
            return ''
        if length == -1 or length >= len(text):
            self.text = None
            return text
        text = text[0:length]
        self.text = self.text[length:]
        return text
    def close(self):
        pass
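For what it's worth, the read() logic can be exercised on its own, without
PyLucene in the picture. The sketch below is a stand-in with the same
chunking behaviour; the names ChunkedTextReader and read_all_in_chunks are
made up for this check, and it assumes the text handed in is already decoded.

```python
class ChunkedTextReader(object):
    """Hypothetical stand-in mirroring the read() logic of my StringReader."""

    def __init__(self, text):
        # Assumes text is already a decoded (unicode) string.
        self.text = text

    def read(self, length=-1):
        text = self.text
        if text is None:
            # Everything has been consumed already.
            return ''
        if length == -1 or length >= len(text):
            # Hand back whatever is left and mark the reader as drained.
            self.text = None
            return text
        # Hand back the first `length` characters and keep the rest.
        self.text = text[length:]
        return text[:length]

    def close(self):
        pass


def read_all_in_chunks(reader, size):
    """Call read(size) until it returns an empty string; collect the chunks."""
    chunks = []
    while True:
        chunk = reader.read(size)
        if not chunk:
            break
        chunks.append(chunk)
    return chunks
```

Reading a short string in fixed-size chunks this way returns every character
exactly once and then an empty string, which is why I don't think the reader
logic itself is where things go wrong.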
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev