Re: Streaming Tagger

David Smiley Fri, 28 Feb 2020 05:13:40 -0800

Thanks for your input, David.  I won't accept the patch because I think
there's a more appropriate way to go about this: have the Tagger
constructor take an Analyzer instead of a TokenStream, and have the
process method take the InputStream and/or String (the fundamental input
to the tagger), thus allowing repeated use of the same Tagger.  It's a
long-standing FAQ: "how do I tag in bulk?"  This change would help with
that, at least at the low level you need.  I've filed a JIRA issue:
SOLR-14292 - Refactor Tagger for re-use, thus aiding bulk-tagging
<https://issues.apache.org/jira/browse/SOLR-14292>.  I don't plan on
doing this anytime soon, so feel free to take it up if you wish.
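
A minimal sketch of that shape, assuming stand-in types rather than the
real Lucene/Solr classes: the names ReusableTaggerSketch, Tagger, Tag and
whitespaceLowercase are hypothetical, a plain Function plays the role of
the Analyzer, and a single-token dictionary lookup stands in for the real
tagger's phrase matching.  It only illustrates the proposed split
(configuration in the constructor, text passed to process) and is not the
SOLR-14292 implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: none of these types are Solr's or Lucene's classes.
final class ReusableTaggerSketch {

    /** One tag: the token position and the dictionary term that matched. */
    static final class Tag {
        final int position;
        final String term;

        Tag(int position, String term) {
            this.position = position;
            this.term = term;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    /** Holds only configuration, so one instance can serve many inputs. */
    static final class Tagger {
        private final Set<String> dictionary;
        // Stand-in for an Analyzer: turns raw text into a list of tokens.
        private final Function<String, List<String>> analyzer;

        Tagger(Set<String> dictionary, Function<String, List<String>> analyzer) {
            this.dictionary = dictionary;
            this.analyzer = analyzer;
        }

        /** Analyzes a fresh input on every call; no per-input state is kept. */
        List<Tag> process(String input) {
            List<String> tokens = analyzer.apply(input);
            List<Tag> tags = new ArrayList<>();
            for (int i = 0; i < tokens.size(); i++) {
                if (dictionary.contains(tokens.get(i))) {
                    tags.add(new Tag(i, tokens.get(i)));
                }
            }
            return tags;
        }
    }

    /** Toy analyzer (assumption): lowercase, then split on whitespace. */
    static Function<String, List<String>> whitespaceLowercase() {
        return text -> {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<String>();
            }
            return Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
        };
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("solr", "lucene"));
        Tagger tagger = new Tagger(dictionary, whitespaceLowercase());
        // The same Tagger instance is reused for every input string.
        for (String doc : new String[] {"Solr builds on Lucene", "no match here"}) {
            System.out.println(doc + " -> " + tagger.process(doc));
        }
    }
}
```

Because the Tagger in this sketch keeps no per-input state, one instance
can be applied to each string of a batch in turn.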


~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Feb 28, 2020 at 4:12 AM David '-1' Schmid <
[email protected]> wrote:

> On 27.02.20 19:01, David Smiley wrote:
> > I'm glad you got it working!  It's sad you felt the need to copy-paste
> > the tagger; perhaps you can recommend changes to make it more extensible
> > so that you or others needn't fork it.
>
> Don't need to feel sad, just as I mentioned: it's quick, dirty and I did
> not know better.
> I was wondering how to feed multiple Strings into the tagger w/o
> creating new instances of everything, but as I don't know much about how
> the tokenizers work, I just slapped everything together.
>
> I had planned to maybe use an InputStream that blocks once one string
> was exhausted, so I can feed the tags back into the stream and feed the
> InputStream new data, once TupleStream::read is called again.
> But since I wanted to get this done quickly, ... yeah. That happened.
> Not happy with it, but I learned a lot.
>
> I'm not sure if I'm qualified enough to recommend changes about the
> tagger. I'd maybe change the constructor to not accept a TokenStream,
> but just the configuration (reduce strategy, terms, ...). And provide a
> setter for the TokenStream. (patch attached)
> But that implies that a TokenStream is cheap to construct and use, which
> I don't know.
>
> > I'm not sure if something like this should be contributed back to Solr
> > itself.  I don't even know the bigger picture of why you are doing this,
> > so I am pessimistic :-).
> Which is completely fine :D
> Thank you for the guidance!
>
> best regards,
> David
>
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Thu, Feb 27, 2020 at 8:01 AM David '-1' Schmid
> > <[email protected]> wrote:
> >
> >     Hello again!
> >
> >     On 25.02.20 22:39, David Smiley wrote:
> >      > I haven't worked on streaming expressions yet but I did a little
> >      > bit of digging around.  I think the ClassifyStream might be
> >      > somewhat similar to learn from.  It takes a stream of docs, not
> >      > unlike what you want.  And crucially it implements
> >      > setStreamContext with an implementation which demonstrates how to
> >      > get access to a SolrCore.  From a core, you can get a
> >      > SolrIndexSearcher. [...]
> >
> >     That worked beautifully! Or let's say: I got it working, the code is
> >     not beautiful, as is.
> >     Would this be interesting/relevant enough to be adopted upstream?
> >
> >     If so, should I open up a JIRA ticket?
> >
> >     best regards,
> >     David
> >
> >
> >
> >      > On Fri, Feb 21, 2020 at 8:05 AM David '-1' Schmid
> >      > <[email protected]> wrote:
> >      >
> >      >     Hello dear developers!
> >      >
> >      >     I've been wondering if I'd be able to adapt the current
> >      >     TaggerRequestHandler for using it within the /stream request
> >      >     handler.
> >      >
> >      >     Starting out is a tad confusing, which I expected since I have
> >      >     almost no experience with the solr/lucene codebase.
> >      >
> >      >     My goal is as follows: I want to use the result of a previous
> >      >     select(coll1, ...) as input for adding tags to the result
> >      >     document.
> >      >
> >      >     Possibly:
> >      >     tag(
> >      >         select(...), field_to_analyze_for_tags,
> >      >         collection_with_tag_dict, tag_dict_field,
> >      >         ... // remaining tagger configuration options
> >      >     )
> >      >
> >      >     I'm currently stuck at two points in writing a
> >      >     'public class TaggerStream extends TupleStream implements Expressible':
> >      >
> >      >     == Problem 1: Getting 'terms' ==
> >      >
> >      >     The TaggerRequestHandler gets a SolrIndexSearcher via the request
> >      >
> >      >       > final SolrIndexSearcher searcher = req.getSearcher();
> >      >
> >      >     Which in turn is used to acquire the terms
> >      >
> >      >       > Terms terms = searcher.getSlowAtomicReader().terms(indexedField);
> >      >
> >      >     which are used for tagging.
> >      >
> >      >     I've tried finding something that will yield the equivalent,
> >      >     but as you might have guessed: I didn't find anything so far.
> >      >
> >      >
> >      >     == Problem 2: Multiple Shards ==
> >      >
> >      >     I guess, this might come up sooner or later, hence this is
> >      >     related to SOLR-14190 (requesting the tagger to work across
> >      >     multiple shards).
> >      >     I suspect (mind: I really don't know) that acquiring the terms
> >      >     will have to do something with that, at least when we need to
> >      >     merge the results from multiple shards, but I have not yet
> >      >     found any code that does that.
> >      >     Might have been blinded by my confusion, tho.
> >      >
> >      >
> >      >     I'd be thankful if someone can help with any pointers
> >      >     regarding this.
> >      >
> >      >     best regards,
> >      >     David
> >      >
> >      >
> >
> >      >     ---------------------------------------------------------------------
> >      >     To unsubscribe, e-mail: [email protected]
> >      >     For additional commands, e-mail: [email protected]
> >      >
> >
> >
>
