Thanks for your input, David. I won't accept the patch because I think there's a more appropriate way to go about this: have the Tagger constructor take an Analyzer instead of a TokenStream, and have the process method take the InputStream and/or String (the fundamental input to the tagger), thus allowing repeated use of the same Tagger. "How do I tag in bulk?" has been a long-standing FAQ, and this change would help with that, at least at the low level you need.

I filed a JIRA issue:
SOLR-14292 - Refactor Tagger for re-use, thus aiding bulk-tagging
<https://issues.apache.org/jira/browse/SOLR-14292>

I don't plan on doing this anytime soon, so feel free to take it up if you wish.

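To make the shape of that refactoring a bit more concrete, here is a rough sketch. It is only an illustration, not the actual Tagger code: it deliberately avoids Lucene classes so it runs on a plain JDK, the `analyze` function is a hypothetical stand-in for a Lucene Analyzer, the `Set<String>` dictionary is a hypothetical stand-in for the indexed terms, and the real Tagger carries more configuration (reduce strategy and so on) than shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

public class ReusableTaggerSketch {

    /** Hypothetical stand-in for an Analyzer: lowercases and splits on non-word characters. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** Configured once in the constructor; process() can then be called for any number of inputs. */
    static final class Tagger {
        private final Function<String, List<String>> analyzer; // assumption: replaces the TokenStream argument
        private final Set<String> dictionary;                  // assumption: replaces the indexed Terms

        Tagger(Function<String, List<String>> analyzer, Set<String> dictionary) {
            this.analyzer = analyzer;
            this.dictionary = dictionary;
        }

        /** Analyzes one input and returns the tokens found in the dictionary, in input order. */
        List<String> process(String input) {
            List<String> tags = new ArrayList<>();
            for (String token : analyzer.apply(input)) {
                if (dictionary.contains(token)) {
                    tags.add(token);
                }
            }
            return tags;
        }
    }

    public static void main(String[] args) {
        // One Tagger instance, many inputs: the bulk-tagging case.
        Tagger tagger = new Tagger(ReusableTaggerSketch::analyze, Set.of("solr", "lucene"));
        for (String doc : List.of("Apache Solr is built on Lucene.", "No matches here.")) {
            System.out.println(tagger.process(doc));
        }
    }
}
```

The point of the design is that everything expensive or fixed lives in the constructor, while the per-document input is the only argument to process, so a streaming expression could call it once per tuple.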
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Feb 28, 2020 at 4:12 AM David '-1' Schmid <[email protected]> wrote:

> On 27.02.20 19:01, David Smiley wrote:
> > I'm glad you got it working! It's sad you felt the need to copy-paste
> > the tagger; perhaps you can recommend changes to make it more extensible
> > so that you or others needn't fork it.
>
> Don't need to feel sad, just as I mentioned: it's quick, dirty and I did
> not know better.
> I was wondering how to feed multiple Strings into the tagger w/o
> creating new instances of everything, but as I don't know much about how
> the tokenizers work, I just slapped everything together.
>
> I had planned to maybe use an InputStream that blocks once one string
> was exhausted, so I can feed the tags back into the stream and feed the
> InputStream new data, once TupleStream::read is called again.
> But since I wanted to get this done quickly, ... yeah. That happened.
> Not happy with it, but I learned a lot.
>
> I'm not sure if I'm qualified enough to recommend changes about the
> tagger. I'd maybe change the constructor to not accept a TokenStream,
> but just the configuration (reduce strategy, terms, ...). And provide a
> setter for the TokenStream. (patch attached)
> But that implies that a TokenStream is cheap to construct and use, which
> I don't know.
>
> > I'm not sure if something like this should be contributed back to Solr
> > itself. I don't even know the bigger picture of why you are doing this,
> > so I am pessimistic :-).
> Which is completely fine :D
> Thank you for the guidance!
>
> best regards,
> David
>
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, Feb 27, 2020 at 8:01 AM David '-1' Schmid
> > <[email protected]> wrote:
> >
> >     Hello again!
> >
> >     On 25.02.20 22:39, David Smiley wrote:
> > > I haven't worked on streaming expressions yet but I did a little bit of
> > > digging around. I think the ClassifyStream might be somewhat similar to
> > > learn from. It takes a stream of docs, not unlike what you want. And
> > > crucially it implements setStreamContext with an implementation which
> > > demonstrates how to get access to a SolrCore. From a core, you can get
> > > a SolrIndexSearcher. [...]
> >
> >     That worked beautifully! Or let's say: I got it working, the code is not
> >     beautiful, as is.
> >     Would this be interesting/relevant enough to be adopted upstream?
> >
> >     If so, should I open up a JIRA ticket?
> >
> >     best regards,
> >     David
> >
> > >
> > > On Fri, Feb 21, 2020 at 8:05 AM David '-1' Schmid
> > > <[email protected]> wrote:
> > >
> > >     Hello dear developers!
> > >
> > >     I've been wondering if I'd be able to adapt the current
> > >     TaggerRequestHandler for using it within the /stream request handler.
> > >
> > >     Starting out is a tad confusing, which I expected since I have almost no
> > >     experience with the solr/lucene codebase.
> > >
> > >     My goal is as follows: I want to use the result of a previous
> > >     select(coll1, ...) as input for adding tags to the result document.
> > >
> > >     Possibly:
> > >     tag(
> > >       select(...), field_to_analyze_for_tags,
> > >       collection_with_tag_dict, tag_dict_field,
> > >       ... // remaining tagger configuration options
> > >     )
> > >
> > >     I'm currently stuck at some steps in writing a
> > >     'public class TaggerStream extends TupleStream implements Expressible'
> > >     at two points:
> > >
> > >     == Problem 1: Getting 'terms' ==
> > >
> > >     The TaggerRequestHandler gets a SolrIndexSearcher via the request
> > >
> > >     > final SolrIndexSearcher searcher = req.getSearcher();
> > >
> > >     Which in turn is used to acquire the terms
> > >
> > >     > Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
> > >
> > >     which are used for tagging.
> > >
> > >     I've tried finding something that will yield the equivalent, but as you
> > >     might have guessed: I didn't find anything so far.
> > >
> > >
> > >     == Problem 2: Multiple Shards ==
> > >
> > >     I guess, this might come up sooner or later, hence this is related to
> > >     SOLR-14190 (requesting the tagger to work across multiple shards).
> > >     I suspect (mind: I really don't know) that acquiring the terms will have
> > >     to do something with that, at least when we need to merge the results
> > >     from multiple shards, but I have not yet found any code that does that.
> > >     Might have been blinded by my confusion, tho.
> > >
> > >
> > >     I'd be thankful if someone can help with any pointers regarding this.
> > >
> > >     best regards,
> > >     David
> > >
> > >     ---------------------------------------------------------------------
> > >     To unsubscribe, e-mail: [email protected]
> > >     For additional commands, e-mail: [email protected]
