Hello Solr / Lucene developers,

I am currently looking into ways to improve the integration of Apache Solr and Apache NiFi, specifically the GetSolr processor. This processor is meant to retrieve all documents from Solr, i.e. to use Solr as a data source. However, the current implementation has several problems that prevent documents from being retrieved reliably:

1. The documents are retrieved in batches by successively increasing the "start" parameter. The problem here is that documents are skipped, or the same documents are fetched twice, if the result set changes due to updates between two requests. Besides, this strategy is likely to cause performance issues when querying large collections, since deep paging with large "start" values is expensive.
 
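2. The processor requires a field holding a timestamp that is set at indexing time. The GetSolr processor internally stores the time of the latest Solr request. This time is used within a filter query for subsequent requests (something like fq=dateField:[latestTime TO NOW]). This breaks down with commit delays: a document whose timestamp lies before latestTime but which only becomes visible after the request (because it had not been committed yet) is never picked up by a later request.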
 
3. Furthermore, I think requiring the user to specify a timestamp field is far from optimal, as this makes the processor suitable only for collections that include such a field.
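
To make problem 1 concrete, here is a toy simulation in plain Python. No Solr is involved: the "index" is just a sorted list, and page_by_start is a hypothetical stand-in for paging with the "start" and "rows" parameters. It only illustrates how offsets shift when the result set changes between two requests.

```python
from typing import Callable, List, Optional


def page_by_start(index: List[str], rows: int,
                  change_after_first_page: Optional[Callable[[List[str]], None]] = None
                  ) -> List[str]:
    """Read the whole index in pages of `rows`, advancing an offset each time.

    `change_after_first_page` (hypothetical hook) simulates an update to the
    index that happens between the first and the second request.
    """
    seen: List[str] = []
    start = 0
    while True:
        page = index[start:start + rows]  # same idea as start=<start>&rows=<rows>
        if not page:
            break
        seen.extend(page)
        start += rows
        if change_after_first_page is not None and start == rows:
            change_after_first_page(index)
    return seen


# A deletion before the current offset shifts everything left: doc3 is skipped.
after_delete = page_by_start(
    ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"], 2,
    lambda idx: idx.remove("doc1"))

# An insertion before the current offset shifts everything right: doc2 comes twice.
after_insert = page_by_start(
    ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"], 2,
    lambda idx: idx.insert(0, "doc0"))
```

In the first run doc3 is never returned; in the second run doc2 is returned twice.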
 
Problem 1 can be addressed quite well by using the cursorMark parameter and setting the sort parameter to a fixed value, e.g. sort=dateField asc, id asc. By doing so, documents are no longer skipped due to deletions, and documents are fetched twice only if they are updated (which sets a new timestamp).
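
A minimal sketch of the cursorMark loop, to show what I have in mind. It is not the NiFi implementation: fetch_page is a hypothetical callable that sends the given parameters to Solr's select handler and returns the parsed JSON response, and "dateField" / "id" are placeholder names (cursorMark requires the sort to include the uniqueKey field, assumed to be "id" here). The parameters cursorMark=* for the first request and nextCursorMark in the response are the standard Solr ones.

```python
from typing import Callable, Dict, Iterator


def iter_with_cursor(fetch_page: Callable[[Dict[str, str]], dict],
                     date_field: str = "dateField",
                     unique_key: str = "id",
                     rows: int = 100) -> Iterator[dict]:
    """Yield all documents using cursorMark instead of an increasing start."""
    params = {
        "q": "*:*",
        # Fixed sort; the uniqueKey field acts as the tie-breaker.
        "sort": f"{date_field} asc, {unique_key} asc",
        "rows": str(rows),
        "cursorMark": "*",  # "*" requests the first page
    }
    while True:
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_mark = response["nextCursorMark"]
        # Solr returns the same cursorMark again once there is nothing more to read.
        if next_mark == params["cursorMark"]:
            break
        params["cursorMark"] = next_mark
```

The last cursorMark could also be kept as processor state, so that a later run continues where the previous one stopped instead of starting again from "*".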
 
Problem 2 could be addressed by storing the latest timestamp found among the documents of the result set, instead of the time of the request. However, this requires the dateField to be part of the returned fields: if the user sets the fl parameter without it, the field has to be added to fl and then removed again from each result document to deliver the originally requested fields. Alternatively, the stats component could be used to determine the maximum timestamp. However, I expect this to lower the performance significantly for large collections.
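
A rough sketch of the first variant, under some assumptions: both function names are hypothetical, the results are assumed to be sorted by the date field ascending (so the last document of a page carries the latest timestamp), and fl is treated as a simple comma- or space-separated list. Globs other than "*", function queries and document transformers in fl are not handled.

```python
import re
from typing import List, Optional, Tuple


def with_date_field(fl: Optional[str], date_field: str) -> Tuple[Optional[str], bool]:
    """Return the fl to send to Solr and whether date_field had to be added."""
    if not fl or not fl.strip():
        return fl, False  # no fl set: nothing to add in this sketch
    fields = [f for f in re.split(r"[,\s]+", fl.strip()) if f]
    if "*" in fields or date_field in fields:
        return fl, False  # the field is already returned
    return fl + "," + date_field, True


def consume_page(docs: List[dict], date_field: str, added: bool,
                 previous_latest: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    """Track the latest timestamp and strip date_field again if we added it."""
    latest = previous_latest
    # Assumes sort=<date_field> asc: the last document has the newest timestamp.
    if docs and date_field in docs[-1]:
        latest = docs[-1][date_field]
    if added:
        docs = [{k: v for k, v in d.items() if k != date_field} for d in docs]
    return docs, latest
```

The returned timestamp would then replace the request time as the lower bound of the filter query for the next run.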
 
For problem 3, I first considered using the field _version_, as I believed this field to be a transformed timestamp that increases monotonically across a collection. However, Cassandra Targett already helped me clear up this misconception: it cannot be transformed to a timestamp, and it increases monotonically only within a shard. Theoretically the processor could iterate over shards, but I expect this to bring several complications: shard names would have to be determined by a query or specified by the user, and shard splits would have to be handled...
 
Any ideas?
Thank you in advance!!