Re: Improving GetSolr processor of Apache NiFi

Erick Erickson Sun, 24 Sep 2017 17:18:08 -0700

bq: The processor requires a field with a timestamp depending on indexing time.


What are you referring to? CursorMark does not require this at all. Is
this a GetSolr/NiFi requirement?

"...sort must include at least one field that is unique per document –
typically just the uniqueKey field" , see:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Best,
Erick

On Sun, Sep 24, 2017 at 1:51 PM, Johannes Peter
<[email protected]> wrote:
> Hello Solr / Lucene developers,
>
> I am currently considering ways to improve the integration of Apache Solr and
> Apache NiFi with respect to the GetSolr processor. This processor aims to
> retrieve all Solr documents, to use Solr as a source. However, the current
> implementation has several problems that prevent Solr documents from being
> retrieved reliably:
>
> 1. The documents are retrieved in batches by successively increasing the
> "start" parameter. The problem here is that documents are skipped or the
> same documents are fetched twice if the result set changes due to updates
> (besides, this strategy should lead to performance issues querying large
> collections).
>
> 2. The processor requires a field with a timestamp depending on indexing
> time. The GetSolr processor internally stores the time of the latest Solr
> request. This time is used within a filter query for subsequent requests
> (something like fq=dateField:[latestTime TO NOW]). This causes problems with
> commit delays: a document indexed before latestTime but committed after the
> request is never fetched.
>
> 3. Furthermore, I think requiring the user to specify a timestamp field is
> far from optimal as this makes the processor only suitable for collections
> that include such a field.
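
To make problem 1 concrete, here is a minimal sketch of how paging by an increasing "start" offset skips a document when the result set shrinks between requests. It uses an in-memory set as a stand-in for a Solr index; fetch_page is a hypothetical helper, not part of Solr or NiFi.

```python
# In-memory stand-in for a Solr result set sorted by id (not a real Solr client).
def fetch_page(index, start, rows):
    """Mimic start/rows paging over the current state of the index."""
    return sorted(index)[start:start + rows]


index = {1, 2, 3, 4, 5, 6}
rows = 2
seen = []

# First batch: start=0 returns ids 1 and 2.
seen += fetch_page(index, 0, rows)

# A document that was already fetched is deleted between two requests.
index.discard(1)

# Remaining batches keep increasing "start" by the batch size.
start = rows
while True:
    page = fetch_page(index, start, rows)
    if not page:
        break
    seen += page
    start += rows

print(seen)  # id 3 never shows up: the deletion shifted it into the page already read
```

The mirror case (a document inserted ahead of the current offset) shifts results the other way, so an already fetched document is returned a second time.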
>
> Problem 1 can be addressed quite well using the cursorMark parameter and
> fixing the sort parameter, e.g. to sort=dateField asc, id asc. By doing so,
> documents are not skipped due to deletions and documents are only fetched
> twice if they are updated (setting a new timestamp).
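
The behaviour described here can be sketched with an in-memory simulation. This is only an approximation under stated assumptions: Solr's real cursorMark is an opaque token, whereas this sketch uses the last (dateField, id) sort key directly as the cursor, and fetch_after is a hypothetical helper.

```python
def sort_key(doc):
    # Mirrors sort=dateField asc, id asc
    return (doc["dateField"], doc["id"])


def fetch_after(docs, cursor, rows):
    """Return up to `rows` docs sorted strictly after `cursor`, plus the next cursor."""
    ordered = sorted(docs, key=sort_key)
    page = [d for d in ordered if cursor is None or sort_key(d) > cursor][:rows]
    next_cursor = sort_key(page[-1]) if page else cursor
    return page, next_cursor


docs = [{"id": i, "dateField": 100 + i} for i in range(1, 7)]
seen = []

# First batch returns ids 1 and 2.
page, cursor = fetch_after(docs, None, 2)
seen += [d["id"] for d in page]

# Between requests: doc 1 is deleted, doc 2 is updated and gets a new timestamp.
docs = [d for d in docs if d["id"] != 1]
for d in docs:
    if d["id"] == 2:
        d["dateField"] = 200

# Continue from the cursor until an empty page comes back.
while True:
    page, cursor = fetch_after(docs, cursor, 2)
    if not page:
        break
    seen += [d["id"] for d in page]

print(seen)  # nothing is skipped; doc 2 appears twice because it was updated
```

In this simulation the deletion does not shift the remaining documents, and only the updated document is seen again.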
>
> Problem 2 could be addressed by storing the latest timestamp of the
> documents within the result set. However, this requires the dateField to be
> included if the fl parameter is set (otherwise, the field has to be added to
> the fl parameter, and the field has to be removed for each result document
> to get the originally desired results). Alternatively, the stats component
> could be used. However, I expect this to lower the performance significantly
> for large collections.
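
A rough sketch of the fl handling described above, assuming fl is a plain comma- or space-separated list of field names (no aliases, globs, or functions) and that timestamps are ISO 8601 UTC strings, which compare correctly as text. Both helper names are hypothetical.

```python
import re


def with_date_field(fl, date_field):
    """Return (fl to send, whether date_field had to be added)."""
    if not fl:
        # No fl set: leave the default field list untouched.
        return fl, False
    names = [n for n in re.split(r"[,\s]+", fl.strip()) if n]
    if "*" in names or date_field in names:
        return fl, False
    return ",".join(names + [date_field]), True


def latest_and_strip(docs, date_field, was_added):
    """Return the latest timestamp seen and the docs as the user asked for them."""
    latest = max((d[date_field] for d in docs if date_field in d), default=None)
    if was_added:
        # Remove the field again so the output matches the user's original fl.
        docs = [{k: v for k, v in d.items() if k != date_field} for d in docs]
    return latest, docs


fl_sent, added = with_date_field("id,title", "dateField")
response_docs = [
    {"id": 1, "title": "a", "dateField": "2017-09-24T10:00:00Z"},
    {"id": 2, "title": "b", "dateField": "2017-09-24T11:00:00Z"},
]
latest, cleaned = latest_and_strip(response_docs, "dateField", added)
print(fl_sent, latest, cleaned)
```

The latest value would then be stored as the lower bound for the next request, instead of the time of the request itself.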
>
> For problem 3, I first considered using the _version_ field, as I thought
> it was a transformed timestamp that increases monotonically across a
> collection. However, Cassandra Targett already helped me clear up this
> misconception (it cannot be transformed to a timestamp, and it increases
> monotonically only within a shard). Theoretically the processor could
> iterate over shards, but I expect this to bring several complications:
> shard names would have to be discovered by query or specified by the user,
> and shard splits would have to be handled...
>
> Any ideas?
> Thank you in advance!!
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

