It's still buggy, so not ready to share.

I keep a copy of the Solr source, which I use for this type of development. I
don't see PatternCaptureGroupTokenFilterFactory in the Solr 6.2 code base
at all. I was hoping to see how it handles the positions, etc.
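
The pattern Emir suggested can be checked outside Solr with plain
java.util.regex. The following is only a minimal sketch: the class and method
names (AtomCountPattern, capture) are hypothetical, and it shows what the
regular expression captures, not how the Solr filter assigns positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountPattern {
    // Pattern quoted from the thread: an element symbol followed by a count.
    static final Pattern ATOM_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    // Hypothetical helper: collects every capture-group match in order.
    static List<String> capture(String formula) {
        List<String> groups = new ArrayList<>();
        Matcher m = ATOM_COUNT.matcher(formula);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(capture("C2H6O1")); // prints [C2, H6, O1]
        System.out.println(capture("C2H6O"));  // prints [C2, H6]
    }
}
```

Because the pattern requires at least one digit, an atom with an implicit
count (the O in C2H6O) is not captured, which is why the input has to be
normalized before any pattern-based splitting.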

My code now looks reasonable in the Analysis tool, but doesn't seem to
create searchable Lucene data. I've changed it considerably since my first
post, so I now see output in the tool, which is an improvement.
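
A minimal sketch of the token layout a filter like this would usually aim
for, written as plain Java with no Lucene dependency. Everything here is an
assumption for illustration: Tok stands in for the term, position increment
and offset attributes, atoms() is a hypothetical stand-in for the real
normalizer (it only appends a count of 1 where none is given), and expand()
assumes the common convention that the first token keeps the incoming
position increment while the fragments are stacked on it with an increment
of 0, all sharing the offsets of the original token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormulaTokens {
    /** Plain stand-in for the term, position increment and offsets of one token. */
    static final class Tok {
        final String term;
        final int posInc;
        final int start;
        final int end;

        Tok(String term, int posInc, int start, int end) {
            this.term = term;
            this.posInc = posInc;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " posInc=" + posInc + " offsets=" + start + "-" + end;
        }
    }

    // Element symbol with an optional count.
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** Hypothetical normalizer: makes an implicit count of 1 explicit. */
    static List<String> atoms(String formula) {
        List<String> out = new ArrayList<>();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            String count = m.group(2).isEmpty() ? "1" : m.group(2);
            out.add(m.group(1) + count);
        }
        return out;
    }

    /** Expands one incoming token into the normalized formula plus its atom counts. */
    static List<Tok> expand(String term, int posInc, int start, int end) {
        List<String> atoms = atoms(term);
        List<Tok> out = new ArrayList<>();
        // The normalized formula takes the place of the original token.
        out.add(new Tok(String.join("", atoms), posInc, start, end));
        // Fragments are stacked at the same position and keep the original offsets.
        for (String atom : atoms) {
            out.add(new Tok(atom, 0, start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : expand("C2H6O", 1, 0, 5)) {
            System.out.println(t);
        }
    }
}
```

In this layout the offsets never point past the end of the original text,
and only the first token of each group advances the position.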


On Wed, Sep 27, 2017 at 10:30 AM, Stefan Matheis <matheis.ste...@gmail.com>
wrote:

> > In any case I figured out my problem. I was over thinking it.
>
> Mind to share?
>
> -Stefan
>
> On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:
>
> > There is a need for a special filter since the input has to be normalized.
> > That is the main requirement; splitting into pieces is optional. As far as
> > I know, there is nothing in Solr that knows about molecular formulas.
> >
> > In any case I figured out my problem. I was over thinking it.
> >
> > On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> > > Hi Homer,
> > > There is no need for a special filter; there is one that is for some reason
> > > not part of the documentation (I will ask why, so follow that thread if you
> > > decide to go this way). You can use something like:
> > > <filter class="solr.PatternCaptureGroupTokenFilterFactory"
> > > pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
> > >
> > > This will capture all atom counts as separate tokens.
> > >
> > > HTH,
> > > Emir
> > >
> > > > On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com> wrote:
> > > >
> > > > > I am trying to create a filter that normalizes an input token, but also
> > > > > splits it into multiple pieces. Sort of like what the WordDelimiterFilter
> > > > > does.
> > > > >
> > > > > It's meant to take a molecular formula like C2H6O and normalize it to
> > > > > C2H6O1.
> > > > >
> > > > > That part works. However, I was also going to have it put out the
> > > > > individual atom counts as tokens:
> > > > > C2H6O1
> > > > > C2
> > > > > H6
> > > > > O1
> > > > >
> > > > > When I enable this feature in the factory, I don't get any output at all.
> > > > >
> > > > > I looked over a couple of filters that do what I want and it's not
> > > > > entirely clear what they're doing. So I have some questions:
> > > > > Looking at ShingleFilter and WordDelimiterFilter,
> > > > > they both set several attributes:
> > > > > CharTermAttribute: Seems to be the actual terms being set. Seemed
> > > > > straightforward, works fine when I only have one term to add.
> > > > >
> > > > > PositionIncrementAttribute: What does this do? It appears that
> > > > > WordDelimiterFilter sets this to 0 most of the time. This has decent
> > > > > documentation.
> > > > >
> > > > > OffsetAttribute: I think that this tracks offsets for each term being
> > > > > processed. Not really sure though. The documentation mentions tokens. So if
> > > > > I have multiple variations for a token, is this for each variation?
> > > > >
> > > > > TypeAttribute: default is "word". Don't know what this is for.
> > > > >
> > > > > PositionLengthAttribute: WordDelimiterFilter doesn't use this but Shingle
> > > > > does. It defaults to 1. What's it good for, and when should I use it?
> > > >
> > > > Here is my incrementToken method.
> > > >
> > > > >    @Override
> > > > >    public boolean incrementToken() throws IOException {
> > > > >      while (true) {
> > > > >        if (!hasSavedState) {
> > > > >          if (!input.incrementToken()) {
> > > > >            return false;
> > > > >          }
> > > > >          if (!generateFragments) { // This part works fine!
> > > > >            String normalizedFormula =
> > > > >                molFormula.normalize(new String(termAttribute.buffer()));
> > > > >            char[] newBuffer = normalizedFormula.toCharArray();
> > > > >            termAttribute.setEmpty();
> > > > >            termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
> > > > >            return true;
> > > > >          }
> > > > >          formulas = molFormula.normalizeToList(new String(termAttribute.buffer()));
> > > > >          iterator = formulas.listIterator();
> > > > >          savedPositionIncrement += posIncAttribute.getPositionIncrement();
> > > > >          hasSavedState = true;
> > > > >          first = true;
> > > > >          saveState();
> > > > >        }
> > > > >        if (!iterator.hasNext()) {
> > > > >          posIncAttribute.setPositionIncrement(savedPositionIncrement);
> > > > >          savedPositionIncrement = 0;
> > > > >          hasSavedState = false;
> > > > >          continue;
> > > > >        }
> > > > >        String formula = iterator.next();
> > > > >        int startOffset = savedStartOffset;
> > > > >
> > > > >        if (first) {
> > > > >          termAttribute.setEmpty();
> > > > >        }
> > > > >        int endOffset = savedStartOffset + formula.length();
> > > > >        System.out.printf("Writing formula %s %d to %d%n", formula, startOffset, endOffset);
> > > > >        termAttribute.append(formula);
> > > > >        offsetAttribute.setOffset(startOffset, endOffset);
> > > > >        savedStartOffset = endOffset + 1;
> > > > >        if (first) {
> > > > >          posIncAttribute.setPositionIncrement(0);
> > > > >        } else {
> > > > >          first = false;
> > > > >          posIncAttribute.setPositionIncrement(0);
> > > > >        }
> > > > >        typeAttribute.setType(savedType);
> > > > >        return true;
> > > > >      }
> > > > >    }
> > > >
> > > > --
> > > >
> > > >
> > > > > This message and any attachment are confidential and may be privileged or
> > > > > otherwise protected from disclosure. If you are not the intended recipient,
> > > > > you must not copy this message or attachment or disclose the contents to
> > > > > any other person. If you have received this transmission in error, please
> > > > > notify the sender immediately and delete the message and any attachment
> > > > > from your system. Merck KGaA, Darmstadt, Germany and any of its
> > > > > subsidiaries do not accept liability for any omissions or errors in this
> > > > > message which may arise as a result of E-Mail-transmission or for damages
> > > > > resulting from any unauthorized changes of the content of this message and
> > > > > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> > > > > subsidiaries do not guarantee that this message is free of viruses and does
> > > > > not accept liability for any damages caused by any virus transmitted
> > > > > therewith.
> > > > >
> > > > > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > > > > Spanish and Portuguese versions of this disclaimer.
> > >
> > >
> >
> > --
> >
> >
> > This message and any attachment are confidential and may be privileged or
> > otherwise protected from disclosure. If you are not the intended
> recipient,
> > you must not copy this message or attachment or disclose the contents to
> > any other person. If you have received this transmission in error, please
> > notify the sender immediately and delete the message and any attachment
> > from your system. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not accept liability for any omissions or errors in this
> > message which may arise as a result of E-Mail-transmission or for damages
> > resulting from any unauthorized changes of the content of this message
> and
> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not guarantee that this message is free of viruses and
> does
> > not accept liability for any damages caused by any virus transmitted
> > therewith.
> >
> > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > Spanish and Portuguese versions of this disclaimer.
> >
>
