The "topic" streaming expression may be exactly what you are looking for:
http://lucene.apache.org/solr/guide/6_6/stream-sources.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Sun, Sep 24, 2017 at 8:17 PM, Erick Erickson <[email protected]> wrote:

> bq: The processor requires a field with a timestamp depending on indexing
> time.
>
> What are you referring to? CursorMark does not require this at all. Is
> this a GetSolr/NiFi requirement?
>
> "...sort must include at least one field that is unique per document –
> typically just the uniqueKey field", see:
> https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> Best,
> Erick
>
> On Sun, Sep 24, 2017 at 1:51 PM, Johannes Peter
> <[email protected]> wrote:
> > Hello Solr / Lucene developers,
> >
> > I am currently considering ways to improve the integration of Apache Solr
> > and Apache NiFi with respect to the GetSolr processor. This processor aims
> > to retrieve all Solr documents, so that Solr can be used as a source.
> > However, the current implementation has several problems that prevent
> > Solr's documents from being retrieved reliably:
> >
> > 1. The documents are retrieved in batches by successively increasing the
> > "start" parameter. The problem here is that documents are skipped or the
> > same documents are fetched twice if the result set changes due to updates
> > (besides, this strategy should lead to performance issues when querying
> > large collections).
> >
> > 2. The processor requires a field with a timestamp depending on indexing
> > time. The GetSolr processor internally stores the time of the latest Solr
> > request. This time is used within a filter query for subsequent requests
> > (something like fq=dateField:[latestTime TO NOW]). This causes problems
> > due to commit delays.
> >
> > 3. Furthermore, I think requiring the user to specify a timestamp field is
> > far from optimal, as this makes the processor suitable only for
> > collections that include such a field.
> >
> > Problem 1 can be addressed quite well using the cursorMark parameter and
> > fixing the sort parameter, e.g. to sort=dateField asc, id asc. By doing so,
> > documents are not skipped due to deletions, and documents are only fetched
> > twice if they are updated (setting a new timestamp).
> >
> > Problem 2 could be addressed by storing the latest timestamp of the
> > documents within the result set. However, this requires the dateField to be
> > included if the fl parameter is set (otherwise, the field has to be added
> > to the fl parameter and then removed from each result document to get the
> > originally desired results). Alternatively, the stats component could be
> > used. However, I expect this to lower the performance significantly for
> > large collections.
> >
> > For problem 3, I first considered using the _version_ field, as I thought
> > this field was a transformed timestamp that increases monotonically over
> > a collection. However, Cassandra Targett already helped me clear up this
> > misconception (it cannot be transformed to a timestamp, and it is
> > monotonically increasing only within a shard). Theoretically, the
> > processor could iterate over shards, but I expect this to come with
> > several complications: shard names have to be figured out by query or
> > specified by the user, and shard splits have to be addressed...
> >
> > Any ideas?
> > Thank you in advance!!
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
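
For reference, the cursorMark iteration discussed for problem 1 in the quoted message could look roughly like the sketch below. It is only an illustration, not the GetSolr implementation: the function names iterate_with_cursor and solr_select, the base_url argument, and the default sort "id asc" (which assumes the uniqueKey field is called id) are all made up for this example. The only Solr behaviour it relies on is that the first request sends cursorMark=*, that the sort includes the uniqueKey field, and that iteration ends once the returned nextCursorMark equals the cursorMark that was sent.

```python
import json
import urllib.parse
import urllib.request


def iterate_with_cursor(fetch_page, query="*:*", sort="id asc", rows=100):
    """Yield every matching document using cursorMark paging.

    fetch_page is any callable that takes a dict of request parameters and
    returns the parsed JSON response of a Solr select request.
    """
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        response = fetch_page(params)
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        # An unchanged cursor means there is nothing left to fetch.
        if next_cursor == cursor:
            break
        cursor = next_cursor


def solr_select(base_url):
    """Build a fetch_page callable for a select handler URL (hypothetical helper)."""
    def fetch_page(params):
        # wt=json requests a JSON response body.
        query_string = urllib.parse.urlencode({**params, "wt": "json"})
        with urllib.request.urlopen(base_url + "?" + query_string) as resp:
            return json.load(resp)
    return fetch_page
```

Usage would be something like iterate_with_cursor(solr_select(url), sort="dateField asc, id asc"), where url points at the collection's select handler and dateField stands for whatever timestamp field the collection has. Keeping the HTTP call behind fetch_page makes the paging loop easy to test without a running Solr instance.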
