On 08.09.2016 at 11:10, Andi Vajda <va...@apache.org> wrote:


On Thu, 8 Sep 2016, Dirk Rothe wrote:

On 05.09.2016 at 21:27, Andi Vajda <va...@apache.org> wrote:

class _Tokenizer(PythonTokenizer):
    def __init__(self, INPUT):
        super(_Tokenizer, self).__init__(INPUT)
        # prepare INPUT
    def incrementToken(self):
        # stuff into termAtt/offsetAtt/posIncrAtt

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(_Tokenizer())

The PositionIncrementTestCase is pretty similar, but it is initialized with static input. It would be a nice place for an example with dynamic input, I think.

This was our 3.6 approach:
class Analyzer3(PythonAnalyzer):
    def tokenStream(self, fieldName, reader):
        data = data_from_reader(reader)
        class _tokenStream(PythonTokenStream):
            def __init__(self):
                super(_tokenStream, self).__init__()
                # prepare termAtt/offsetAtt/posIncrAtt
            def incrementToken(self):
                # stuff from data into termAtt/offsetAtt/posIncrAtt
        return _tokenStream()
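
Spelled out, the "prepare"/"stuff" placeholders expand to roughly this (attribute class names as in the Lucene test suite; token, start, end and increment are hypothetical):

    from org.apache.lucene.analysis.tokenattributes import \
        CharTermAttribute, OffsetAttribute, PositionIncrementAttribute

    # in __init__: register the attributes once
    self.termAtt = self.addAttribute(CharTermAttribute.class_)
    self.offsetAtt = self.addAttribute(OffsetAttribute.class_)
    self.posIncrAtt = self.addAttribute(PositionIncrementAttribute.class_)

    # in incrementToken: fill them per token
    self.clearAttributes()
    self.termAtt.append(token)                       # term text
    self.offsetAtt.setOffset(start, end)             # character offsets into the input
    self.posIncrAtt.setPositionIncrement(increment)  # usually 1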

Any hints on how to get Analyzer6 working?

I've lost track of the countless API changes since 3.x.

The Lucene project does a good job at tracking them in the CHANGES.txt file, usually pointing at the issue that tracked it, often with examples about how to accomplish the same in the new way and the rationale behind the change.

I guess we are here:
https://issues.apache.org/jira/browse/LUCENE-5388
https://svn.apache.org/viewvc?view=revision&revision=1556801

You can also look at the PyLucene tests I just ported to 6.x. For example, in test_Analyzers.py, you can see that a Tokenizer no longer takes a reader; one can be set with setReader() after construction.

Yes, I've done that pretty carefully. I think this quote points in the right direction: "The tokenStream method takes a String or Reader and will pass this to Tokenizer#setReader()." from: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201502.mbox/%3C021701d04f86$55331f10$ff995d30$@thetaphi.de%3E

I've checked the Lucene source, and this happens automatically and cannot be overridden.
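
Roughly, Analyzer.tokenStream(fieldName, reader) does the following (paraphrasing the 6.x Java source in Python for brevity):

    components = reuseStrategy.getReusableComponents(self, fieldName)
    r = self.initReader(fieldName, reader)
    if components is None:
        # only on first use per thread/field
        components = self.createComponents(fieldName)
        reuseStrategy.setReusableComponents(self, fieldName, components)
    components.setReader(r)  # tokenStream() is final, so this wiring is fixed
    return components.getTokenStream()

So initReader() runs on every tokenStream() call, while createComponents() typically runs only once per reused component, which is what the hack below relies on.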

So I've hacked something ugly together which seems to work.

class _Tokenizer(PythonTokenizer):
    def __init__(self, getReader):
        super(_Tokenizer, self).__init__()
        self.getReader = getReader
        self.i = 0
        self.data = []

    def incrementToken(self):
        if self.i == 0:
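            # first call after construction or reuse: pull data from the captured reader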
            self.data = data_from_reader(self.getReader())
        if self.i == len(self.data):
            # we are reused - reset
            self.i = 0
            return False
        # stuff from self.data into termAtt/offsetAtt/posIncrAtt
        self.i += 1
        return True

class Analyzer6(PythonAnalyzer):
    def createComponents(self, fieldName):
        return Analyzer.TokenStreamComponents(_Tokenizer(lambda: self._reader))

    def initReader(self, fieldName, reader):
        # capture reader
        self._reader = reader
        return reader
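
To exercise it (sketch; StringReader from java.io, field name arbitrary):

    from java.io import StringReader

    analyzer = Analyzer6()
    stream = analyzer.tokenStream("f", StringReader("some dynamic input"))
    stream.reset()
    while stream.incrementToken():
        pass  # termAtt/offsetAtt/posIncrAtt hold the current token here
    stream.end()
    stream.close()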

I've made initReader() python-overridable (see patch). What do you think?

--dirk
