[ 
https://issues.apache.org/jira/browse/SOLR-6816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234145#comment-14234145
 ] 

Per Steffensen edited comment on SOLR-6816 at 12/4/14 12:11 PM:
----------------------------------------------------------------

Just want to add my 5 cents on this one. It only concerns indexing when you 
do version-check/optimistic-locking (SOLR-3178). We have a very different 
implementation of SOLR-3178, but the performance problems will be the same for 
"your" implementation.

Doing optimistic-locking, you typically do a lot of this:
* 1) real-time-get document D from Solr
* 2) update D to D' locally on the client
* 3) try to replace D with D' in Solr. In case of a version-conflict error, go to 1)
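
A minimal Java sketch of the retry loop above. The in-memory store and every 
name in it (InMemoryStore, realTimeGet, replace, updateWithRetry) are 
hypothetical stand-ins chosen for illustration; they are not Solr or SolrJ 
APIs, and a real client would talk to Solr instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

class OptimisticUpdateSketch {
    /** A document value together with the version it had when it was read. */
    static final class Versioned {
        final String value;
        final long version;
        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    static final class VersionConflictException extends Exception {}

    /** Hypothetical in-memory stand-in for the server side; not a Solr API. */
    static final class InMemoryStore {
        private final Map<String, Versioned> docs = new HashMap<>();

        synchronized Versioned realTimeGet(String id) {
            return docs.get(id);
        }

        /** Replaces the document only if its version still equals expectedVersion. */
        synchronized void replace(String id, long expectedVersion, String newValue)
                throws VersionConflictException {
            Versioned current = docs.get(id);
            long currentVersion = current == null ? 0L : current.version;
            if (currentVersion != expectedVersion) {
                throw new VersionConflictException();
            }
            docs.put(id, new Versioned(newValue, currentVersion + 1));
        }
    }

    /** Runs steps 1-3 until the replace succeeds; returns the number of attempts. */
    static int updateWithRetry(InMemoryStore store, String id, UnaryOperator<String> change) {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned d = store.realTimeGet(id);              // step 1: read D
            String oldValue = d == null ? "" : d.value;
            long version = d == null ? 0L : d.version;
            String newValue = change.apply(oldValue);          // step 2: build D' locally
            try {
                store.replace(id, version, newValue);          // step 3: conditional replace
                return attempts;
            } catch (VersionConflictException e) {
                // Another writer changed D in the meantime; go back to step 1.
            }
        }
    }
}
```

Note that every attempt costs one lookup in step 1 and one version check in 
step 3 on the same id, which is where the lookup cost discussed here comes from.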

In step 1) you get-by-id document D, and in step 3) you do 
UpdateLog.lookupVersion on the same id.
In our system it is most likely that two processes, both wanting to update 
document D, run at the same time or fairly shortly after each other. It is rare 
that the same document gets updated a long time apart.
To speed up both lookups, we have introduced a "recently looked-up or updated" 
cache, where we store documents that have recently been fetched by 
real-time-get or updated. It has improved our indexing speed significantly. We 
have a mature solution that is running in production.
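
One way such a "recently looked-up or updated" cache could look is a small 
bounded LRU map, sketched below in Java. This is only an assumption about the 
general shape, not the production implementation mentioned above; the class and 
method names (RecentDocCache, recordLookupOrUpdate, getIfRecent) are 
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Bounded cache keyed by document id; the least recently used entry is evicted first. */
class RecentDocCache<V> {
    private final Map<String, V> entries;

    RecentDocCache(final int maxEntries) {
        // accessOrder = true makes iteration order follow recency of access,
        // so the "eldest" entry is the least recently looked-up or updated one.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Call after a real-time-get or a successful update of the document. */
    synchronized void recordLookupOrUpdate(String id, V doc) {
        entries.put(id, doc);
    }

    /** Returns the cached document, or null if it was not touched recently. */
    synchronized V getIfRecent(String id) {
        return entries.get(id);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A caller would check getIfRecent(id) before going to the index and fall back 
to the normal lookup on a miss. Keeping the cache consistent with deletes and 
with updates from other nodes is left out of this sketch.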

In the scenarios above you most often discover that the document you try to 
real-time-get or lookup-version for does NOT exist, but it is relatively 
time-consuming to realize that (looking in the index). We have a PoC that 
introduces a bloom-filter that answers one of "document definitely does not 
exist" (you do not have to search the index) or "document may exist" (you will 
have to search the index to see if it exists). Our PoC shows that this will 
speed up our indexing tremendously (like a 60-80% reduction in indexing time), 
but we haven't prioritized maturing it and putting it into production yet. The 
PoC used a modified version of the Guava bloom-filter - modified to be able to 
work in a memory-mapped file, so that we do not lose bloom-filter information 
when shutting down Solr (it would take some time to build it from scratch every 
time you start Solr). The Guava bloom-filter is currently memory-only - you can 
save it to a file and load it again, but it does not persist continuously, and 
it is not efficient to store it completely to disk at every update :-) Hence 
the "work in memory-mapped file" modification.
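
To make the idea concrete, here is a minimal Java sketch of a bloom filter 
whose bit array lives in a memory-mapped file. It is not the modified Guava 
code from the PoC: it uses only the JDK, a simple FNV-1a hash with double 
hashing, and hypothetical names (MappedBloomFilterSketch, open, put, 
mightContain, flush). Sizing and the number of hash functions are left to the 
caller.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedBloomFilterSketch {
    private final ByteBuffer bits;   // heap buffer or memory-mapped file region
    private final long numBits;
    private final int numHashes;

    MappedBloomFilterSketch(ByteBuffer bits, int numHashes) {
        this.bits = bits;
        this.numBits = (long) bits.capacity() * 8;
        this.numHashes = numHashes;
    }

    /** Maps numBytes of the file; bits set earlier are still there after a restart. */
    static MappedBloomFilterSketch open(Path file, int numBytes, int numHashes)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, numBytes);
            return new MappedBloomFilterSketch(mapped, numHashes);
        }
    }

    private static long fnv1a64(String id) {
        long h = 0xcbf29ce484222325L;
        for (byte b : id.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Double hashing: the i-th bit position is (h1 + i * h2) mod numBits. */
    private long bitIndex(long hash, int i) {
        long h1 = hash & 0xffffffffL;
        long h2 = (hash >>> 32) | 1L;
        return Math.floorMod(h1 + (long) i * h2, numBits);
    }

    /** Records the id; only the touched bytes change, not the whole filter. */
    void put(String id) {
        long hash = fnv1a64(id);
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(hash, i);
            int byteIndex = (int) (bit >>> 3);
            bits.put(byteIndex, (byte) (bits.get(byteIndex) | (1 << (bit & 7))));
        }
    }

    /** false: the id was definitely never put. true: it may exist, search the index. */
    boolean mightContain(String id) {
        long hash = fnv1a64(id);
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(hash, i);
            if ((bits.get((int) (bit >>> 3)) & (1 << (bit & 7))) == 0) {
                return false;
            }
        }
        return true;
    }

    /** Forces mapped changes to disk; a no-op for a plain heap buffer. */
    void flush() {
        if (bits instanceof MappedByteBuffer) {
            ((MappedByteBuffer) bits).force();
        }
    }
}
```

A lookup path would call mightContain(id) first and skip the index search when 
it returns false. Like any plain bloom filter, this sketch cannot remove ids, 
so deleted documents simply keep answering "may exist".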

Of course, let me know if any of this sounds interesting to you.



> Review SolrCloud Indexing Performance.
> --------------------------------------
>
>                 Key: SOLR-6816
>                 URL: https://issues.apache.org/jira/browse/SOLR-6816
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Mark Miller
>            Priority: Critical
>         Attachments: SolrBench.pdf
>
>
> We have never really focused on indexing performance, just correctness and 
> low hanging fruit. We need to vet the performance and try to address any 
> holes.
> Note: A common report is that adding any replication is very slow.



