bq: The processor requires a field with a timestamp depending on indexing time.
What are you referring to? CursorMark does not require this at all. Is
this a GetSolr/NiFi requirement?

"...sort must include at least one field that is unique per document –
typically just the uniqueKey field", see:
https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

Best,
Erick

On Sun, Sep 24, 2017 at 1:51 PM, Johannes Peter <[email protected]> wrote:
> Hello Solr / Lucene developers,
>
> I am currently considering ways to improve the integration of Apache Solr
> and Apache NiFi with respect to the GetSolr processor. This processor aims
> to retrieve all Solr documents, to use Solr as a source. However, the
> current implementation has several problems that keep Solr's documents
> from being retrieved reliably:
>
> 1. The documents are retrieved in batches by successively increasing the
> "start" parameter. The problem here is that documents are skipped or the
> same documents are fetched twice if the result set changes due to updates
> (besides, this strategy should lead to performance issues when querying
> large collections).
>
> 2. The processor requires a field with a timestamp depending on indexing
> time. The GetSolr processor internally stores the time of the latest Solr
> request. This time is used within a filter query for subsequent requests
> (something like fq=dateField:[latestTime TO NOW]). This causes problems
> due to commit delays.
>
> 3. Furthermore, I think requiring the user to specify a timestamp field is
> far from optimal, as this makes the processor suitable only for
> collections that include such a field.
>
> Problem 1 can be addressed quite well using the cursorMark parameter and
> fixing the sort parameter, e.g. to sort=dateField asc, id asc. By doing so,
> documents are not skipped due to deletions, and documents are only fetched
> twice if they are updated (setting a new timestamp).
>
> Problem 2 could be addressed by storing the latest timestamp of the
> documents within the result set. However, this requires the dateField to
> be included if the fl parameter is set (otherwise, the field has to be
> added to the fl parameter and then removed from each result document to
> get the originally desired results). Alternatively, the stats component
> could be used. However, I expect this to lower the performance
> significantly for large collections.
>
> For problem 3, I first considered using the field _version_, as I thought
> this field was a transformed timestamp that increases monotonically over
> a collection. However, Cassandra Targett already helped me clear up this
> misconception (it cannot be transformed to a timestamp, and it is
> monotonically increasing only within a shard). Theoretically the processor
> could iterate over shards, but I expect this to come with several
> complications: shard names would have to be figured out by query or
> specified by the user, and shard splits would have to be addressed...
>
> Any ideas?
> Thank you in advance!!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
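
For illustration only, here is a minimal sketch of the cursorMark loop discussed above. It is not the GetSolr implementation. It assumes the collection's uniqueKey field is called "id", and the names iterate_with_cursor, make_http_fetch and the select URL are hypothetical. The request parameters (q, sort, rows, cursorMark) and the nextCursorMark field in the response are the ones Solr uses for cursor-based paging. The only hard requirement on the sort is the unique tie-breaker, so sort=id asc is enough; sort=dateField asc, id asc from the quoted message can be passed in the same way.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator


def make_http_fetch(select_url: str) -> Callable[[Dict[str, str]], dict]:
    """Build a fetch function for a Solr /select endpoint.

    select_url is an assumption, e.g. "http://localhost:8983/solr/mycoll/select".
    """
    def fetch(params: Dict[str, str]) -> dict:
        query_string = urllib.parse.urlencode({**params, "wt": "json"})
        with urllib.request.urlopen(f"{select_url}?{query_string}") as resp:
            return json.load(resp)
    return fetch


def iterate_with_cursor(fetch: Callable[[Dict[str, str]], dict],
                        query: str = "*:*",
                        sort: str = "id asc",  # assumes uniqueKey is "id"
                        rows: int = 100) -> Iterator[dict]:
    """Yield all matching documents by following Solr's cursorMark."""
    cursor = "*"  # "*" starts a new cursor
    while True:
        response = fetch({
            "q": query,
            "sort": sort,  # must include the uniqueKey field as tie-breaker
            "rows": str(rows),
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            # Solr returns the same mark once the result set is exhausted.
            break
        cursor = next_cursor
```

Against a live instance this would be used as iterate_with_cursor(make_http_fetch(url)). No "start" parameter is sent, so the skipped and duplicated documents described under problem 1 do not arise from shifting offsets.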
