Hi,

On 27/03/15 02:31, Tim Donohue wrote:
> I wonder how hard the "incremental export" is to implement?  If it's 
> really not that complex overall, then it seems like it'd be a quick 
> win for just doing the Solr Stats backups in general.

It's easy-ish, I just can't think of a way that isn't at least a little 
bit ugly. At the moment, my code relies on the natural order of the 
documents in the solr core. This would need to be changed to use a 
timestamp field appropriate for each index (time for statistics, last 
modified for authority), the queries would need to be adjusted a little, 
and the filenames would need to change so that subsequent exports don't 
overwrite previous ones. What I don't like is that the timestamp field 
either needs to be specified (not nice for the user) or hardcoded (i.e., 
it needs to be changed whenever the solr schema changes). I could also 
imagine that the date math will be a bit ugly. From another e-mail by helix:

> For running from cron it's easier to specify a start date and
> duration rather than end date (which you have to calculate). Although
> we'd need to make sure that plays well with differing lengths of
> months, i.e. we should guarantee that if you specify 31 days in a
> month that has 28, the 3 days that overlap will not be duplicated.

Well, see, I think it's still easier not even to have to specify a start 
date. If you look at the stats-reports scripts, they just run (by 
default) for the current month; so I had been thinking along the lines 
of having flags for "yesterday", "last week" and "last month" and/or 
"last n days". Solr is already pretty good at date math, and this is the 
type of date manipulation you can put into a query pretty easily. No 
need for us to keep track of how many days there are in a month. The 
only somewhat tricky thing might be time zone issues, but even that 
isn't insurmountable.
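To illustrate what I mean, something along these lines should work out 
of the box with Solr's built-in date math (the field name "time" matches 
the statistics core; the exact ranges would want double-checking):

```
# "yesterday": midnight to midnight, no duplication at the edges
q=time:[NOW/DAY-1DAY TO NOW/DAY]

# "last month": Solr handles 28- vs 31-day months itself
q=time:[NOW/MONTH-1MONTH TO NOW/MONTH]

# "last n days" (rolling), e.g. n=7
q=time:[NOW/DAY-7DAYS TO NOW/DAY]
```

These can be combined with wt=csv on the select handler to get the CSV 
dump directly, which is more or less what the export code does already.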

> If I want to ensure my Solr Stats are "safe" in DSpace 5, my only real 
> option is to back them up via a CSV export (and a full dump is the 
> only option right now). Since my stats will obviously only grow and 
> grow over time, this full CSV export is going to take longer and 
> longer to perform -- so it may become less possible to perform as an 
> 'overnight backup'.

Agreed.

> But, the question is whether this is something that is a "quick win" 
> or if it requires larger changes (I admit, I'm not well versed on the 
> Solr APIs/queries when it comes to this).

Well, for me personally there's no quick win left: the code I have right 
now already solves the problem I have :) I would really like to have 
something in place that works for the majority of the DSpace user base, 
though.

I thought about this more overnight, and I think one way forward might 
be to recognise that we are trying to solve two different problems; the 
exact same solution won't necessarily work for both, and the entry 
points may need to be different. For avoiding data loss while the 
reindex is running, I have an idea that involves hot-swapping cores: 
keep a temporary core up to catch new data while exporting from the 
actual one, then just dump the actual one and re-import into the 
temporary one (or some similar strategy). And for the incremental export 
(the back-up use case), as I said yesterday, maybe hard-coding the time 
field isn't so bad if we add it to the launcher.
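Roughly, the hot-swap could use Solr's CoreAdmin SWAP action (URL and 
core names here are just examples; it assumes a spare core, say 
statistics-temp, has already been configured):

```
# Point incoming usage events at the temporary core while we
# export and rebuild the real one:
curl "http://localhost:8080/solr/admin/cores?action=SWAP&core=statistics&other=statistics-temp"

# ... export, wipe and re-import the now-idle core ...

# Swap back (or swap again) once the rebuild is done:
curl "http://localhost:8080/solr/admin/cores?action=SWAP&core=statistics&other=statistics-temp"
```

Events written to the temporary core during the rebuild would still need 
to be merged back, so this is only a sketch of the general shape.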

> As a sidenote: with regards to the Authority index, it seems like the 
> data in that index is possible to *repopulate* from the existing 
> metadata in the database (using ./dspace index-authority). So, it 
> seems like that index may not suffer from the same problems as the 
> Stats one (though I haven't tried it -- just reading the docs):

No, as helix pointed out, there is information stored in the authority 
core that is not stored in the database (just compare the authority core 
solr schema with the columns in the metadatavalue table). It definitely 
is affected by the not-stored-elsewhere issue.

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand


_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
