It is still on master:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.java
Emir

> On 28 Sep 2017, at 17:32, Erick Erickson <erickerick...@gmail.com> wrote:
>
> PatternCaptureGroupTokenFilter has been around since 2013 (at least
> that's the earliest revision in Git). I located it even in 5x so it
> should be there in
> ...lucene/analysis/common/src/java/org/apache/lucene/analysis/pattern
>
> Best,
> Erick
>
> On Thu, Sep 28, 2017 at 7:45 AM, Webster Homer <webster.ho...@sial.com> wrote:
>> It's still buggy, so not ready to share.
>>
>> I keep a copy of Solr source which I use for this type of development. I
>> don't see PatternCaptureGroupTokenFilterFactory in the Solr 6.2 code base
>> at all. I was thinking of seeing how it treated the positions etc...
>>
>> My code now looks reasonable in the Analysis tool, but doesn't seem to
>> create searchable lucene data. I've changed it considerably since my first
>> post so I see output in the tool which was an improvement
>>
>>
>> On Wed, Sep 27, 2017 at 10:30 AM, Stefan Matheis <matheis.ste...@gmail.com>
>> wrote:
>>
>>>> In any case I figured out my problem. I was over thinking it.
>>>
>>> Mind to share?
>>>
>>> -Stefan
>>>
>>> On Sep 27, 2017 4:34 PM, "Webster Homer" <webster.ho...@sial.com> wrote:
>>>
>>>> There is a need for a special filter since the input has to be normalized.
>>>> That is the main requirement, splitting into pieces is optional. As far as
>>>> I know there is nothing in solr that knows about molecular formulas.
>>>>
>>>> In any case I figured out my problem. I was over thinking it.
>>>>
>>>> On Wed, Sep 27, 2017 at 3:52 AM, Emir Arnautović <
>>>> emir.arnauto...@sematext.com> wrote:
>>>>
>>>>> Hi Homer,
>>>>> There is no need for a special filter. There is one that is, for some
>>>>> reason, not part of the documentation (I will ask why, so follow that
>>>>> thread if you decide to go this way). You can use something like:
>>>>> <filter class="solr.PatternCaptureGroupTokenFilterFactory"
>>>>> pattern="([A-Z][a-z]?\d+)" preserveOriginal="true" />
>>>>>
>>>>> This will capture all atom counts as separate tokens.
>>>>>
>>>>> HTH,
>>>>> Emir
>>>>>
>>>>>> On 26 Sep 2017, at 23:14, Webster Homer <webster.ho...@sial.com> wrote:
>>>>>>
>>>>>> I am trying to create a filter that normalizes an input token, but also
>>>>>> splits it into multiple pieces. Sort of like what the WordDelimiterFilter
>>>>>> does.
>>>>>>
>>>>>> It's meant to take a molecular formula like C2H6O and normalize it to
>>>>>> C2H6O1
>>>>>>
>>>>>> That part works. However I was also going to have it put out the
>>>>>> individual atom counts as tokens.
>>>>>> C2H6O1
>>>>>> C2
>>>>>> H6
>>>>>> O1
>>>>>>
>>>>>> When I enable this feature in the factory, I don't get any output at all.
>>>>>>
>>>>>> I looked over a couple of filters that do what I want and it's not
>>>>>> entirely clear what they're doing. So I have some questions:
>>>>>> Looking at ShingleFilter and WordDelimiterFilter
>>>>>> They both set several attributes:
>>>>>> CharTermAttribute: Seems to be the actual terms being set. Seemed
>>>>>> straightforward, works fine when I only have one term to add.
>>>>>>
>>>>>> PositionIncrementAttribute: What does this do? It appears that
>>>>>> WordDelimiterFilter sets this to 0 most of the time. This has decent
>>>>>> documentation.
>>>>>>
>>>>>> OffsetAttribute: I think that this tracks offsets for each term being
>>>>>> processed. Not really sure though. The documentation mentions tokens.
>>>>>> So if I have multiple variations for a token, is this for each variation?
>>>>>>
>>>>>> TypeAttribute: default is "word". Don't know what this is for.
>>>>>>
>>>>>> PositionLengthAttribute: WordDelimiterFilter doesn't use this but Shingle
>>>>>> does. It defaults to 1. What's it good for, and when should I use it?
>>>>>>
>>>>>> Here is my incrementToken method.
>>>>>>
>>>>>> @Override
>>>>>> public boolean incrementToken() throws IOException {
>>>>>>   while (true) {
>>>>>>     if (!hasSavedState) {
>>>>>>       if (!input.incrementToken()) {
>>>>>>         return false;
>>>>>>       }
>>>>>>       if (!generateFragments) { // This part works fine!
>>>>>>         String normalizedFormula = molFormula.normalize(new
>>>>>>             String(termAttribute.buffer()));
>>>>>>         char[] newBuffer = normalizedFormula.toCharArray();
>>>>>>         termAttribute.setEmpty();
>>>>>>         termAttribute.copyBuffer(newBuffer, 0, newBuffer.length);
>>>>>>         return true;
>>>>>>       }
>>>>>>       formulas = molFormula.normalizeToList(new
>>>>>>           String(termAttribute.buffer()));
>>>>>>       iterator = formulas.listIterator();
>>>>>>       savedPositionIncrement += posIncAttribute.getPositionIncrement();
>>>>>>       hasSavedState = true;
>>>>>>       first = true;
>>>>>>       saveState();
>>>>>>     }
>>>>>>     if (!iterator.hasNext()) {
>>>>>>       posIncAttribute.setPositionIncrement(savedPositionIncrement);
>>>>>>       savedPositionIncrement = 0;
>>>>>>       hasSavedState = false;
>>>>>>       continue;
>>>>>>     }
>>>>>>     String formula = iterator.next();
>>>>>>     int startOffset = savedStartOffset;
>>>>>>
>>>>>>     if (first) {
>>>>>>       termAttribute.setEmpty();
>>>>>>     }
>>>>>>     int endOffset = savedStartOffset + formula.length();
>>>>>>     System.out.printf("Writing formula %s %d to %d%n", formula,
>>>>>>         startOffset, endOffset);
>>>>>>     termAttribute.append(formula);
>>>>>>     offsetAttribute.setOffset(startOffset, endOffset);
>>>>>>     savedStartOffset = endOffset + 1;
>>>>>>     if (first) {
>>>>>>       posIncAttribute.setPositionIncrement(0);
>>>>>>     } else {
>>>>>>       first = false;
>>>>>>       posIncAttribute.setPositionIncrement(0);
>>>>>>     }
>>>>>>     typeAttribute.setType(savedType);
>>>>>>     return true;
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>> This message and any attachment are confidential and may be privileged or
>>>>>> otherwise protected from disclosure. If you are not the intended recipient,
>>>>>> you must not copy this message or attachment or disclose the contents to
>>>>>> any other person. If you have received this transmission in error, please
>>>>>> notify the sender immediately and delete the message and any attachment
>>>>>> from your system. Merck KGaA, Darmstadt, Germany and any of its
>>>>>> subsidiaries do not accept liability for any omissions or errors in this
>>>>>> message which may arise as a result of E-Mail-transmission or for damages
>>>>>> resulting from any unauthorized changes of the content of this message and
>>>>>> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>>>>>> subsidiaries do not guarantee that this message is free of viruses and does
>>>>>> not accept liability for any damages caused by any virus transmitted
>>>>>> therewith.
>>>>>>
>>>>>> Click http://www.emdgroup.com/disclaimer to access the German, French,
>>>>>> Spanish and Portuguese versions of this disclaimer.
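
For anyone following the thread, here is a minimal plain-Java sketch of the two steps discussed above: normalizing C2H6O to C2H6O1, and then applying the pattern from the suggested filter configuration, ([A-Z][a-z]?\d+), to pull out the atom counts. It uses only java.util.regex and does not touch Lucene or Solr. The class name FormulaSketch and its methods are hypothetical, and the normalization rule (append 1 where an element has no count) is an assumption based on the single example in the thread, not the poster's actual molFormula code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
class FormulaSketch {

    // One element symbol followed by an optional count, e.g. "C2", "O", "Na".
    private static final Pattern ATOM = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    // The pattern quoted in the thread: an element symbol with an explicit count.
    private static final Pattern ATOM_WITH_COUNT = Pattern.compile("([A-Z][a-z]?\\d+)");

    // Assumed rule: write an explicit count of 1 for every atom that has none.
    // Characters that do not belong to an element symbol or count are dropped.
    static String normalize(String formula) {
        StringBuilder out = new StringBuilder();
        Matcher m = ATOM.matcher(formula);
        while (m.find()) {
            String count = m.group(2);
            out.append(m.group(1)).append(count.isEmpty() ? "1" : count);
        }
        return out.toString();
    }

    // Every match of the thread's pattern in the given text, in order.
    static List<String> captures(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ATOM_WITH_COUNT.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    // The normalized formula first, then one entry per atom count.
    static List<String> tokens(String formula) {
        String normalized = normalize(formula);
        List<String> result = new ArrayList<>();
        result.add(normalized);
        result.addAll(captures(normalized));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("C2H6O"));   // prints [C2H6O1, C2, H6, O1]
        System.out.println(captures("C2H6O")); // prints [C2, H6]
    }
}
```

The second line of output shows why normalization has to come first: on the raw input the pattern skips O, because \d+ requires at least one digit.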