Re: Improving GetSolr processor of Apache NiFi

Joel Bernstein Sun, 24 Sep 2017 17:58:26 -0700

The "topic" streaming expression may be exactly what you are looking for:


http://lucene.apache.org/solr/guide/6_6/stream-sources.html
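
For illustration, a minimal sketch of how a client might assemble a topic expression and the corresponding /stream request URL. The parameter names (checkpointCollection, collection, id, q, fl) follow the stream-sources page linked above; the collection names, topic id, field list, and base URL are hypothetical placeholders, and the two helper functions are not part of any Solr or NiFi API.

```python
from urllib.parse import urlencode


def build_topic_expr(checkpoint_collection, collection, topic_id, query, fields):
    """Build a topic(...) streaming expression as a string.

    All argument values are supplied by the caller; nothing is validated here.
    """
    field_list = ",".join(fields)
    return (
        f"topic({checkpoint_collection}, {collection}, "
        f'id="{topic_id}", q="{query}", fl="{field_list}")'
    )


def build_stream_url(solr_base_url, collection, expr):
    """Build the URL for sending the expression to the /stream handler."""
    return f"{solr_base_url}/{collection}/stream?" + urlencode({"expr": expr})


# Hypothetical names, for illustration only.
expr = build_topic_expr("checkpoints", "docs", "getsolr-1", "*:*", ["id", "title"])
url = build_stream_url("http://localhost:8983/solr", "docs", expr)
```

Each call with the same topic id is meant to return only documents that are new or updated since the last checkpoint, so no timestamp field has to be configured by the user.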

Joel Bernstein
http://joelsolr.blogspot.com/

On Sun, Sep 24, 2017 at 8:17 PM, Erick Erickson <[email protected]>
wrote:

> bq: The processor requires a field with a timestamp depending on indexing
> time.
>
> What are you referring to? CursorMark does not require this at all. Is
> this a GetSolr/NiFi requirement?
>
> "...sort must include at least one field that is unique per document –
> typically just the uniqueKey field" , see:
> https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> Best,
> Erick
>
> On Sun, Sep 24, 2017 at 1:51 PM, Johannes Peter
> <[email protected]> wrote:
> > Hello Solr / Lucene developers,
> >
> > I am currently considering ways to improve the integration of Apache Solr
> > and Apache NiFi with respect to the GetSolr processor. This processor aims
> > to retrieve all Solr documents so that Solr can be used as a source.
> > However, the current implementation has several problems that prevent
> > Solr's documents from being retrieved reliably:
> >
> > 1. The documents are retrieved in batches by successively increasing the
> > "start" parameter. The problem here is that documents are skipped or the
> > same documents are fetched twice if the result set changes due to updates
> > (besides, this strategy is likely to cause performance issues when
> > querying large collections).
> >
> > 2. The processor requires a field with a timestamp depending on indexing
> > time. The GetSolr processor internally stores the time of the latest Solr
> > request. This time is used within a filter query for subsequent requests
> > (something like fq=dateField:[latestTime TO NOW]). This causes problems
> > because of commit delays.
> >
> > 3. Furthermore, I think requiring the user to specify a timestamp field is
> > far from optimal, as this makes the processor suitable only for
> > collections that include such a field.
> >
> > Problem 1 can be addressed quite well using the cursorMark parameter and
> > fixing the sort parameter, e.g. to sort=dateField asc, id asc. By doing so,
> > documents are not skipped due to deletions, and documents are only fetched
> > twice if they are updated (setting a new timestamp).
> >
> > Problem 2 could be addressed by storing the latest timestamp of the
> > documents within the result set. However, this requires the dateField to
> > be included if the fl parameter is set (otherwise, the field has to be
> > added to the fl parameter and then removed from each result document to
> > get the originally desired results). Alternatively, the stats component
> > could be used. However, I expect this to lower the performance
> > significantly for large collections.
> >
> > For problem 3, I first considered using the _version_ field, as I thought
> > this field was a transformed timestamp that increases monotonically over
> > a collection. However, Cassandra Targett already helped me clear up this
> > misconception (it cannot be transformed to a timestamp, and it is
> > monotonically increasing only within a shard). Theoretically the processor
> > could iterate over shards, but I expect this to bring several
> > complications: shard names would have to be figured out by query or
> > specified by the user, and shard splits would have to be addressed...
> >
> > Any ideas?
> > Thank you in advance!!
>
>
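
The cursorMark iteration discussed in the quoted messages can be sketched roughly as follows. The request parameters (q, sort, rows, cursorMark) and the nextCursorMark response field are Solr's; fetch_page is a hypothetical stand-in for whatever sends the request to a /select handler and returns the parsed JSON response, and the default sort assumes the uniqueKey field is named id.

```python
def iterate_with_cursor(fetch_page, query="*:*", sort="id asc", rows=100):
    """Yield every matching document using cursor-based paging.

    fetch_page is a caller-supplied (hypothetical) function that takes a dict
    of request parameters and returns the parsed Solr JSON response.
    The sort must include the uniqueKey field so the ordering is unambiguous.
    """
    cursor_mark = "*"  # "*" starts a new cursor
    while True:
        params = {
            "q": query,
            "sort": sort,
            "rows": rows,
            "cursorMark": cursor_mark,
        }
        response = fetch_page(params)
        yield from response["response"]["docs"]

        next_mark = response["nextCursorMark"]
        if next_mark == cursor_mark:
            # Solr returns the same mark once there are no more results.
            break
        cursor_mark = next_mark
```

Because the loop only depends on the sort order and the returned mark, it does not need a timestamp field; a sort such as "dateField asc, id asc" can still be passed in when one is available.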
