Peter is probably right: some stats loss could happen in older releases of DSpace,
due to Solr issues, when documents were lost before the autocommit happened.
Starting with DSpace 4 these issues are solved, as we have moved to Solr 4, which
implements a transaction log. This ensures that even uncommitted data is never
lost.
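For reference, both the transaction log and the autocommit interval live in Solr's solrconfig.xml; a minimal sketch (the values here are illustrative, not DSpace's shipped defaults):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Transaction log: uncommitted updates are replayed on restart -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <!-- Hard autocommit: flush pending documents to disk at most every 15s -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```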
For extra performance and robustness you could consider implementing a SolrCloud
farm, as we do in our enterprise hosting solution.
Hope this clarifies things a bit,
Andrea
Sent from Samsung Mobile
-------- Original message --------
From: Peter Dietz <pdiet...@gmail.com>
Date: 12/04/2014 03:23 (GMT+01:00)
To: Anja Le Blanc <anja.lebl...@manchester.ac.uk>
Cc: Dspace Tech <dspace-tech@lists.sourceforge.net>
Subject: Re: [Dspace-tech] Solr stats data loss

Hi Anja,
One idea I have is that with Solr, for performance reasons, we have an
auto-commit process: UsageEvents aren't written/committed/persisted into Solr
until a commit gets triggered, so they live only in memory until then.
...so... if these periods had a higher-than-normal, or perhaps even normal,
occurrence of Tomcat restarts, then perhaps pending documents were never
written, and thus lost, upon restart.
Perhaps in the servlet container shutdown process, we could add something to
signal DSpace/Solr to write/save/flush/persist the pending documents before
shutdown.
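A minimal sketch of that failure mode (hypothetical class and method names, not DSpace's actual SolrLogger): events buffer in memory until a commit threshold is reached, and anything still pending when the JVM dies is lost unless something flushes it first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of buffered stats writing (not DSpace's real code):
// events accumulate in memory and only persist once a commit threshold
// is hit, so a hard shutdown loses whatever is still pending.
class BufferedStatsLogger {
    private final List<String> pending = new ArrayList<>();
    private final List<String> persisted = new ArrayList<>();
    private final int commitThreshold;

    BufferedStatsLogger(int commitThreshold) {
        this.commitThreshold = commitThreshold;
    }

    void record(String event) {
        pending.add(event);
        if (pending.size() >= commitThreshold) {
            flush();
        }
    }

    // The "write/save/flush/persist before shutdown" step: a servlet
    // context listener or JVM shutdown hook would call this on a clean stop.
    void flush() {
        persisted.addAll(pending);
        pending.clear();
    }

    int pendingCount() { return pending.size(); }
    int persistedCount() { return persisted.size(); }
}
```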
Off the top of my head I don't recall how I've written to the elastic search
API, but I'm assuming I never made these auto-commit / bulk / batch submit
changes since I never encountered performance issues with elastic search. I'm
guessing one UsageEvent equals one commit to Elastic Search, so no data loss on
shutdown.
This is just my guess at what could be happening. I suppose there could be
other explanations too, such as a corrupt Solr index, but I would guess that
would lose a greater amount of data. Another guess would be a server migration
that didn't sync all the data properly... or an unguarded Solr index against
which a mischievous user ran a delete query... It's also possible that the Solr
and Elastic Search dspace-stats have slightly different robot-rule processing
(unlikely), so if your usage baseline was entirely robots, then GoogleBot
taking a few days off from crawling you could cause a valley...
Stats is tricky; part of me wishes I had just leveraged Google Analytics for
everything, just to have one less system to manage. However, I do like the
flexibility of building it yourself.
On Apr 11, 2014 9:54 AM, "Anja Le Blanc" <anja.lebl...@manchester.ac.uk> wrote:
Hello All,
(We are running on DSpace 1.8.2)
I was looking at our stats data for the last year and a half, and I
noticed periodic drops in views/downloads which are inconsistent with
the overall usage pattern. (I did not filter out bots for that
exercise.) Numbers dropped for 1 to 5 days to below 10, and sometimes
even to 0 (from an average of about 5000 per day). I counted about 8
such events since Jan 2013. (There are possibly more which don't stand
out as much.) Our DSpace was running and being monitored throughout
that period.
In our set-up we record stats in both Solr and ElasticSearch (at least
we have done so for the last half year). The ElasticSearch data do not
show drops on the days where Solr has data gaps, even though ElasticSearch
stats recording is triggered by the same DSpace events as Solr's.
Unfortunately we have not kept log files for the periods with the Solr
data gaps.
Has anyone else seen unexpected fluctuations in their stats?
Does anyone have any idea what could cause it? DSpace and Solr were
running at the time, since there are some data, just not enough.
To look at the data, I use the following query for views:
http://localhost:8080/solr/statistics/select/?q=type+%3A+2+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
and for downloads:
http://localhost:8080/solr/statistics/select/?q=type+%3A+0+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
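For what it's worth, the per-day facet counts those queries return can be scanned for valleys automatically; a rough sketch (hypothetical class name and threshold, not part of DSpace):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: flag days whose count falls far below the overall
// daily mean, as with the drops described above (~5000/day falling under 10).
class StatsGapFinder {
    static List<String> findGaps(Map<String, Integer> countsByDay,
                                 double fractionOfMean) {
        double mean = countsByDay.values().stream()
                .mapToInt(Integer::intValue).average().orElse(0);
        List<String> gaps = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countsByDay.entrySet()) {
            // A "gap" day is one well below the baseline, e.g. < 10% of mean
            if (e.getValue() < mean * fractionOfMean) {
                gaps.add(e.getKey());
            }
        }
        return gaps;
    }
}
```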
Interestingly we can prove that there were more events.
Any comments welcome :-)
Best regards,
Anja
------------------------------------------------------------------------------
Put Bad Developers to Shame
Dominate Development with Jenkins Continuous Integration
Continuously Automate Build, Test & Deployment
Start a new project now. Try Jenkins in the cloud.
http://p.sf.net/sfu/13600_Cloudbees
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette