[ https://issues.apache.org/jira/browse/SOLR-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988595#comment-14988595 ]
Jason Gerlowski edited comment on SOLR-7981 at 11/4/15 12:35 AM:
-----------------------------------------------------------------

Haha, funny; I've definitely been there.

I also don't have a strong opinion about adding this option. I didn't pick this up because I wanted the feature in Solr; I just wanted to learn how to work on Solr. And it's been a good first introduction, so "SUCCESS" on that front.

If there's a consensus that this is a thing people would like to have, I'm happy to keep working on it (should I assign myself on this JIRA? Or is that only for committers?). If we *do* think this would be useful for people, I could use a bit of clarification on what the desired behavior actually is. If not, should I close this JIRA?

Questions about 'Desired' Behavior:

1.) Currently, analysis is only done on things that ValueSourceParser identifies as being TextFields. Are numeric/date/other fields typically analyzed? If so, do we want them to be analyzed here too?

Even among fields containing text, this doesn't cover as much as I'd expect. For example, I was writing some tests for this stuff and tried to use a field like:

<!-- A text field with mismatched analyzers for query/index..used for testing. -->
<fieldType name="text_different_analyzers" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <!-- Whitespace only for query-analysis -->
    <tokenizer class="solr.MockTokenizerFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.MockTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
<field name="text_analysis_mismatch" type="text_different_analyzers" indexed="true" stored="true"/>

(Sorry, couldn't figure out how to format that as code; I used "{{ }}" but it didn't seem to work.)

But it turns out that this field wasn't being analyzed by the current ValueSourceParser code. Maybe this is just me being new to Solr, but I expected this to be considered a "TextField" by the code.

2.) Do we care whether the input value gets analyzed to > 1 token? The initial bug description mentioned error handling for this, but I didn't see any special error handling for this in the default-to-query-analyzer case that's already in the code.

Thanks for any clarification anyone can give. Still getting used to the process of working on these things.

> term based ValueSourceParsers should support an option to run an analyzer for the specified field on the input
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7981
>                 URL: https://issues.apache.org/jira/browse/SOLR-7981
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>              Labels: newdev
>         Attachments: SOLR-7981.patch
>
>
> The following functions all take exactly 2 arguments: a field name, and a term value...
> * idf
> * termfreq
> * tf
> * totaltermfreq
> ...we should consider adding an optional third argument to indicate if an analyzer for the specified field should be used on the input to find the real "Term" to consider.
> For example, the following might all result in equivalent numeric values for all docs assuming simple plural stemming and lowercasing...
> {noformat}
> termfreq(foo_t,'Bicycles',query) // use the query analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'Bicycles',index) // use the index analyzer for field foo_t on input Bicycles
> termfreq(foo_t,'bicycle',none) // no analyzer used to construct Term
> termfreq(foo_t,'bicycle') // legacy 2 arg syntax, same as 'none'
> {noformat}
> (Special error checking needed if analyzer creates more than one term for the given input string)
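
To make the proposed semantics concrete, here is a rough Python sketch of the optional third argument and of the "more than one term" check raised in question 2. It is a toy model only and not Solr or Lucene code: `toy_analyzer`, `ANALYZERS`, and `resolve_term` are hypothetical names, and the "analyzer" is just lowercasing plus naive plural stripping, matching the "simple plural stemming and lowercasing" assumption in the issue description.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an analysis chain: text in, list of tokens out.
Analyzer = Callable[[str], List[str]]


def toy_analyzer(text: str) -> List[str]:
    """Split on whitespace, lowercase, and strip a trailing 's' (naive stemming)."""
    tokens = []
    for raw in text.split():
        token = raw.lower()
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        tokens.append(token)
    return tokens


# Assumed per-field registry; a real schema could use different query/index chains.
ANALYZERS: Dict[str, Dict[str, Analyzer]] = {
    "foo_t": {"query": toy_analyzer, "index": toy_analyzer},
}


def resolve_term(field: str, value: str, mode: str = "none") -> str:
    """Return the single term to look up for `value` in `field`.

    mode "none" (the default, like the legacy 2-arg syntax) uses the raw value;
    "query" or "index" runs the matching toy analyzer and requires exactly one token.
    """
    if mode == "none":
        return value
    analyzers = ANALYZERS.get(field)
    if analyzers is None or mode not in analyzers:
        raise ValueError(f"no '{mode}' analyzer for field '{field}'")
    tokens = analyzers[mode](value)
    if len(tokens) != 1:
        raise ValueError(
            f"expected exactly one term for '{value}', got {len(tokens)}: {tokens}"
        )
    return tokens[0]
```

Under these assumptions, `resolve_term("foo_t", "Bicycles", "query")` and `resolve_term("foo_t", "bicycle")` resolve to the same term, while an input such as `"red Bicycles"` is rejected because it analyzes to two tokens. Whether rejecting is the right behavior is exactly the open question above.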