Peter is probably right: some stats loss could happen in older releases of DSpace,
due to Solr issues, when documents were lost before the autocommit happened.
Starting with DSpace 4 these issues are solved, as we have moved to Solr 4, which
implements a transaction log. This ensures that even uncommitted data is never
lost.
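For reference, both the transaction log and the autocommit interval live in Solr's solrconfig.xml; a minimal sketch (the values here are illustrative, not DSpace's shipped defaults):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log: uncommitted updates are replayed on restart -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <!-- Hard autocommit: flush pending documents to disk at most every 15s -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```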
For extra performance and robustness you could consider implementing a SolrCloud
farm, as we do in our enterprise hosting solution.
Hope this clarifies things a bit,
Andrea
Sent from Samsung Mobile
-------- Original message --------
From: Peter Dietz <pdiet...@gmail.com>
Date: 12/04/2014 03:23 (GMT+01:00)
To: Anja Le Blanc <anja.lebl...@manchester.ac.uk>
Cc: Dspace Tech <dspace-tech@lists.sourceforge.net>
Subject: Re: [Dspace-tech] Solr stats data loss

Hi Anja,
One idea I have is that with Solr, for performance reasons, we have an
auto-commit process: UsageEvents aren't written/committed/persisted into Solr
until a commit gets triggered, so they live only in memory until then.
...so... if these periods had a higher-than-normal, or perhaps even normal,
occurrence of Tomcat restarts, then perhaps pending documents were never
written, and thus lost, upon restart.
Perhaps in the servlet container shutdown process, we could add something to
signal DSpace/Solr to write/save/flush/persist the pending documents before
shutdown.
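A minimal sketch of that failure mode (hypothetical class and method names, not DSpace's actual SolrLogger): events buffer in memory until a commit threshold is reached, and anything still pending when the JVM dies is lost unless something flushes it first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of buffered stats writing (not DSpace's real code):
// events accumulate in memory and only persist once a commit threshold
// is hit, so a hard shutdown loses whatever is still pending.
class BufferedStatsLogger {
    private final List<String> pending = new ArrayList<>();
    private final List<String> persisted = new ArrayList<>();
    private final int commitThreshold;

    BufferedStatsLogger(int commitThreshold) {
        this.commitThreshold = commitThreshold;
    }

    void record(String event) {
        pending.add(event);
        if (pending.size() >= commitThreshold) {
            flush();
        }
    }

    // The "write/save/flush/persist before shutdown" step: a servlet
    // context listener or JVM shutdown hook would call this on a clean stop.
    void flush() {
        persisted.addAll(pending);
        pending.clear();
    }

    int pendingCount() { return pending.size(); }
    int persistedCount() { return persisted.size(); }
}
```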
Off the top of my head I don't recall how I've written to the elastic search
API, but I'm assuming I never made these auto-commit / bulk / batch submit
changes since I never encountered performance issues with elastic search. I'm
guessing one UsageEvent equals one commit to Elastic Search, so no data loss on
shutdown.
This is just my guess at what could be happening. I suppose there could be
other explanations too, such as a corrupt Solr index, but I would guess that
would lose a greater amount of data. Another guess would be a server migration
that didn't sync all the data properly... or an unguarded Solr index against
which a mischievous user ran a delete query... It's also possible that the Solr
and Elastic Search dspace-stats have slightly different robot-rule processing
(unlikely), so if your usage baseline was entirely robots, then GoogleBot
taking a few days off from crawling you could cause a valley...
Stats is tricky; part of me wishes I had just leveraged Google Analytics for
everything, just to have one less system to manage. However, I do like the
flexibility of building it yourself.
On Apr 11, 2014 9:54 AM, "Anja Le Blanc" <anja.lebl...@manchester.ac.uk> wrote:
Hello All,
(We are running on DSpace 1.8.2)
I was looking at our stats data for the last year and a half, and I
noticed periodic drops in views/downloads which are inconsistent with
the overall usage pattern. (I did not filter out bots for that
exercise.) Numbers dropped for 1 to 5 days to below 10, and sometimes
even to 0 (from an average of about 5000 per day). I counted about 8
such events since Jan 2013. (There are possibly more which don't stand
out as much.) Our DSpace was running and being monitored throughout
that period.
In our set-up we record stats in both Solr and ElasticSearch (at least
we have done so for the last half year). The ElasticSearch data do not
show drops on the days where Solr has data gaps, even though ElasticSearch
stats recording is triggered by the same DSpace events as Solr's.
Unfortunately we have not kept log files for the periods with the Solr
data gaps.
Has anyone else seen unexpected fluctuations in their stats?
Does anyone have any idea what could cause it? DSpace and Solr were
running at the time, since there are some data, just not enough.
To look at the data, I use the following query for views:
http://localhost:8080/solr/statistics/select/?q=type+%3A+2+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
and for downloads:
http://localhost:8080/solr/statistics/select/?q=type+%3A+0+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
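For what it's worth, the per-day facet counts those queries return can be scanned for valleys automatically; a rough sketch (hypothetical class name and threshold, not part of DSpace):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flag days whose count falls far below the overall
// daily mean, as with the drops described above (~5000/day falling under 10).
class StatsGapFinder {
    static List<String> findGaps(Map<String, Integer> countsByDay,
                                 double fractionOfMean) {
        double mean = countsByDay.values().stream()
                .mapToInt(Integer::intValue).average().orElse(0);
        List<String> gaps = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countsByDay.entrySet()) {
            // A "gap" day is one well below the baseline, e.g. < 10% of mean
            if (e.getValue() < mean * fractionOfMean) {
                gaps.add(e.getKey());
            }
        }
        return gaps;
    }
}
```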
Interestingly we can prove that there were more events.
Any comments welcome :-)
Best regards,
Anja
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette