[jira] [Comment Edited] (SOLR-7981) term based ValueSourceParsers should support an option to run an analyzer for the specified field on the input

Jason Gerlowski (JIRA) Tue, 03 Nov 2015 16:36:06 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988595#comment-14988595 ]


Jason Gerlowski edited comment on SOLR-7981 at 11/4/15 12:35 AM:
-----------------------------------------------------------------

Haha, funny; I've definitely been there.

I also don't have a huge opinion about adding this option.  I didn't pick this 
up because I wanted the feature in Solr; I just wanted to learn how to work on 
Solr.  And it's been a good first introduction, so "SUCCESS" on that front.  If 
there's a consensus that this is a thing people would like to have, I'm happy 
to keep working on it (should I assign myself on this JIRA? Or is that only for 
committers?)  If we *do* think this would be useful for people, I could use a 
bit of clarification on what the desired behavior actually is.  If not, should 
I close this JIRA?

Questions about 'Desired' Behavior:

1.) Currently, analysis is only done on things that ValueSourceParser 
identifies as being TextFields.  Are numeric/date/other fields typically 
analyzed?  If so, do we want them to be analyzed here too?  Even among fields 
containing text, this doesn't cover as much as I'd expect.  For example, I was 
writing some tests for this stuff and tried to use a field like:

    <!-- A text field with mismatched analyzers for query/index..used for testing. -->
    <fieldType name="text_different_analyzers" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="query"> <!-- Whitespace only for query-analysis -->
        <tokenizer class="solr.MockTokenizerFactory"/>
      </analyzer>
      <analyzer type="index">
        <tokenizer class="solr.MockTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="text_analysis_mismatch" type="text_different_analyzers" indexed="true" stored="true"/>

(Sorry, I couldn't figure out how to format that as code; I used "{{ }}" but it 
didn't seem to work.)

but it turns out that it wasn't being analyzed by the current ValueSourceParser 
code.  Maybe this is just me being new to Solr, but I expected this to be 
considered a "TextField" by the code.

2.) Do we care whether the input-value gets analyzed to > 1 token?  The initial 
bug description mentioned error handling for this, but I didn't see any special 
error-handling for this in the default-to-query-analyzer case that's already in 
the code.
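
To make question 2 concrete, here is a minimal sketch of what a "more than one token" check might look like. It is plain Java and does not use the Solr or Lucene APIs: the `SingleTermCheck` class, the `analyzeToSingleTerm` method, and the `toyAnalyzer` stand-in (whitespace split plus lowercasing) are all hypothetical names made up for illustration, not anything in the attached patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of Solr. */
final class SingleTermCheck {
    private SingleTermCheck() {}

    /**
     * Runs a stand-in analyzer over the input and returns the single resulting token.
     * Throws if analysis yields zero tokens or more than one token.
     */
    static String analyzeToSingleTerm(Function<String, List<String>> analyzer, String input) {
        List<String> tokens = analyzer.apply(input);
        if (tokens.size() != 1) {
            throw new IllegalArgumentException(
                "Analysis of '" + input + "' produced " + tokens.size()
                    + " tokens; exactly one is required");
        }
        return tokens.get(0);
    }

    /** Toy analyzer (assumption): whitespace split plus lowercasing, standing in for a field's analyzer. */
    static List<String> toyAnalyzer(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.stream(trimmed.split("\\s+"))
            .map(token -> token.toLowerCase(Locale.ROOT))
            .collect(Collectors.toList());
    }
}
```

With this sketch, an input such as 'Bicycles' resolves to one term, while 'red bicycles' is rejected with an exception instead of silently using only the first token. Whether rejecting is the desired behavior is exactly the open question.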

Thanks for any clarification anyone can give.  Still getting used to the 
process of working on these things.


was (Author: gerlowskija):
Haha, funny; I've definitely been there.

I also don't have a huge opinion about adding this option.  I didn't pick this 
up because I wanted the feature in Solr; I just wanted to learn how to work on 
Solr.  And it's been a good first introduction, so "SUCCESS" on that front.  If 
there's a consensus that this is a thing people would like to have, I'm happy 
to keep working on it (should I assign myself on this JIRA? Or is that only for 
committers?)  If we *do* think this would be useful for people, I could use a 
bit of clarification on what the desired behavior actually is.  If not, should 
I close this JIRA?

Questions about 'Desired' Behavior:

1.) Currently, analysis is only done on things that ValueSourceParser 
identifies as being TextFields.  Are numeric/date/other fields typically 
analyzed?  If so, do we want them to be analyzed here too?  Even among fields 
containing text, this doesn't cover as much as I'd expect.  For example, I was 
writing some tests for this stuff and tried to use a field like:

{{<!-- A text field with mismatched analyzers for query/index..used for testing. -->
    <fieldType name="text_different_analyzers" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="query"> <!-- Whitespace only for query-analysis -->
        <tokenizer class="solr.MockTokenizerFactory"/>
      </analyzer>
      <analyzer type="index">
        <tokenizer class="solr.MockTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="text_analysis_mismatch" type="text_different_analyzers" indexed="true" stored="true"/>}}

but it turns out that it wasn't being analyzed by the current ValueSourceParser 
code.  Maybe this is just me being new to Solr, but I expected this to be 
considered a "TextField" by the code.

2.) Do we care whether the input-value gets analyzed to > 1 token?  The initial 
bug description mentioned error handling for this, but I didn't see any special 
error-handling for this in the default-to-query-analyzer case that's already in 
the code.

Thanks for any clarification anyone can give.  Still getting used to the 
process of working on these things.

> term based ValueSourceParsers should support an option to run an analyzer for 
> the specified field on the input
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7981
>                 URL: https://issues.apache.org/jira/browse/SOLR-7981
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>              Labels: newdev
>         Attachments: SOLR-7981.patch
>
>
> The following functions all take exactly 2 arguments: a field name, and a 
> term value...
> * idf
> * termfreq
> * tf
> * totaltermfreq
> ...we should consider adding an optional third argument to indicate if an 
> analyzer for the specified field should be used on the input to find the real 
> "Term" to consider.
> For example, the following might all result in equivalent numeric values for 
> all docs assuming simple plural stemming and lowercasing...
> {noformat}
> termfreq(foo_t,'Bicycles',query) // use the query analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'Bicycles',index) // use the index analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'bicycle',none) // no analyzer used to construct Term
> termfreq(foo_t,'bicycle') // legacy 2 arg syntax, same as 'none'
> {noformat}
> (Special error checking needed if analyzer creates more than one term for the 
> given input string)
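
The optional third argument described above could be modeled roughly as follows. This is only a sketch in plain Java under stated assumptions: the `AnalysisModeSketch` class, the `Mode` enum, `parseMode`, `resolveTerm`, and the `toyStem` stand-in (lowercasing plus stripping a trailing "s") are all hypothetical and do not correspond to existing Solr classes or to the attached patch. It follows the issue description in treating a missing third argument as 'none'.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of the proposed query/index/none option; not Solr code. */
final class AnalysisModeSketch {
    private AnalysisModeSketch() {}

    enum Mode { QUERY, INDEX, NONE }

    /** Parses the optional third argument; null models the legacy 2 arg syntax and maps to NONE. */
    static Mode parseMode(String arg) {
        if (arg == null) {
            return Mode.NONE;
        }
        switch (arg.toLowerCase(Locale.ROOT)) {
            case "query": return Mode.QUERY;
            case "index": return Mode.INDEX;
            case "none":  return Mode.NONE;
            default:
                throw new IllegalArgumentException("Unknown analysis mode: " + arg);
        }
    }

    /** Picks the analyzer named by the mode and applies it, or returns the raw input for NONE. */
    static String resolveTerm(String input, String modeArg,
                              UnaryOperator<String> queryAnalyzer,
                              UnaryOperator<String> indexAnalyzer) {
        switch (parseMode(modeArg)) {
            case QUERY: return queryAnalyzer.apply(input);
            case INDEX: return indexAnalyzer.apply(input);
            default:    return input;
        }
    }

    /** Toy analyzer (assumption): lowercases and strips one trailing "s" as a crude plural stemmer. */
    static String toyStem(String input) {
        String lower = input.toLowerCase(Locale.ROOT);
        return (lower.length() > 1 && lower.endsWith("s"))
            ? lower.substring(0, lower.length() - 1)
            : lower;
    }
}
```

Under these assumptions, 'Bicycles' with mode query or index and 'bicycle' with mode none (or no mode at all) all resolve to the same term, which matches the equivalence the description is aiming for.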



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

