Hi again,

Just a quick update before my weekend starts -- I've updated my pull 
request with code that does a lossless reindex and also uses the time 
field in the export queries. It can't do incremental exports yet, and 
the new reindex functionality has to be run via dsrun for now 
([dspace]/bin/dspace dsrun org.dspace.util.SolrImportExport -a reindex 
-i statistics) in its simplest form.

https://github.com/DSpace/DSpace/pull/894

I've tested the reindex once on data gathered using DSpace 4 and it 
appeared to work well. It could still do with better error handling and, 
obviously, more testing -- in particular with non-standard Solr 
directory set-ups, with actual stats hits recorded during the reindex, 
and with the authority core. Thoughts/code welcome, and thanks to 
everyone who has given input on this issue so far.

cheers,
Andrea

On 27/03/15 09:21, Andrea Schweer wrote:
> Hi,
>
> On 27/03/15 02:31, Tim Donohue wrote:
>> I wonder how hard the "incremental export" is to implement?  If it's 
>> really not that complex overall, then it seems like it'd be a quick 
>> win for just doing the Solr Stats backups in general.
>
> It's easy-ish, I just can't think of a way that isn't at least a 
> little bit ugly. At the moment, my code relies on the natural order of 
> the documents in the Solr core. This would need to be changed to use a 
> timestamp field appropriate for each index (time for statistics, last 
> modified for authority), the queries would need to be adjusted a 
> little, and the filenames would need to change so that subsequent 
> exports don't overwrite previous ones. What I don't like is that the 
> timestamp field either needs to be specified (not nice for the user) or 
> hardcoded (i.e., it needs to be changed when the Solr schema changes). 
> I could also imagine that the date math will be a bit ugly. From 
> another e-mail by helix:
>
>> For running from cron it's easier to specify a start date and
>> duration rather than end date (which you have to calculate). Although
>> we'd need to make sure that plays well with differing lengths of
>> months, i.e. we should guarantee that if you specify 31 days in a
>> month that has 28, the 3 days that overlap will not be duplicated.
>
> Well, see, I think it's easier still not to have to specify a start 
> date at all. If you look at the stats-reports scripts, they just run 
> (by default) for the current month; so I had been thinking along the 
> lines of having flags for "yesterday", "last week" and "last month" 
> and/or "last n days". Solr is already pretty good at date math, and 
> this is the type of date manipulation you can put into a query quite 
> easily. There's no need for us to keep track of how many days there 
> are in a month. The only somewhat tricky part might be time zone 
> issues, but even that isn't insurmountable.
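>
> [To make the above concrete, here is a sketch of what those windows 
> could look like as Solr date-math range queries. The base URL is an 
> assumption about a default local DSpace Solr; "time" is the statistics 
> timestamp field mentioned above. Not what the PR currently does.]

```shell
# Sketch only: Solr date-math ranges for the export windows discussed
# above. Base URL and CSV response writer are illustrative assumptions.
SOLR="http://localhost:8080/solr/statistics/select"

YESTERDAY='time:[NOW/DAY-1DAY TO NOW/DAY]'
LAST_7_DAYS='time:[NOW/DAY-7DAYS TO NOW/DAY]'
LAST_MONTH='time:[NOW/MONTH-1MONTH TO NOW/MONTH]'

# Solr rounds (NOW/MONTH) and rolls (-1MONTH) these itself, so the
# export code never has to count the days in a month.
for range in "$YESTERDAY" "$LAST_7_DAYS" "$LAST_MONTH"; do
  echo "${SOLR}?q=${range}&wt=csv"
done
```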
>
>> If I want to ensure my Solr Stats are "safe" in DSpace 5, my only 
>> real option is to back them up via a CSV export (and a full dump is 
>> the only option right now). Since my stats will obviously only grow 
>> and grow over time, this full CSV export is going to take longer and 
>> longer to perform -- so it may become less possible to perform as an 
>> 'overnight backup'.
>
> Agreed.
>
>> But, the question is whether this is something that is a "quick win" 
>> or if it requires larger changes (I admit, I'm not well versed on the 
>> Solr APIs/queries when it comes to this).
>
> Well, it's not a quick win in the sense that the code I have right now 
> solves the problem I have :) I would really like to have something in 
> place that works for the majority of the DSpace user base though.
>
> I thought about this more overnight, and I think one way forward might 
> be to recognise that we are trying to solve two different problems. 
> The exact same solution won't necessarily work for both; the entry 
> points may need to be different. I have an idea for how to avoid 
> losing data while the reindex is running: hot-swap cores, keeping a 
> temporary one up while exporting data from the actual one, then just 
> dumping the actual one and re-importing into the temporary one (or a 
> similar strategy). And for the incremental export (back-up use case), 
> as I said yesterday, maybe with the time field added to the launcher 
> it isn't so bad to hard-code it.
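>
> [The hot-swap idea could lean on Solr's CoreAdmin API, which has a 
> SWAP action. A sketch under assumptions -- the URLs, the temporary 
> core name and the exact workflow are illustrative, not what the PR 
> implements:]

```shell
# Sketch of the core hot-swap idea via Solr's CoreAdmin API.
# Base URL and the "statistics-tmp" core name are assumptions.
ADMIN="http://localhost:8080/solr/admin/cores"

CREATE_TMP="${ADMIN}?action=CREATE&name=statistics-tmp&instanceDir=statistics-tmp"
SWAP="${ADMIN}?action=SWAP&core=statistics&other=statistics-tmp"

# Rough flow: 1) create the temporary core, 2) SWAP it in so new hits
# land there, 3) reindex the now-offline data, 4) swap back and merge
# the hits recorded in the meantime.
echo "$CREATE_TMP"
echo "$SWAP"
# curl "$CREATE_TMP" && curl "$SWAP"   # only against a live Solr
```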
>
>> As a sidenote: with regards to the Authority index, it seems like the 
>> data in that index is possible to *repopulate* from the existing 
>> metadata in the database (using ./dspace index-authority). So, it 
>> seems like that index may not suffer from the same problems as the 
>> Stats one (though I haven't tried it -- just reading the docs):
>
> No, as helix pointed out, there is information stored in the authority 
> core that is not stored in the database (just compare the authority 
> core's Solr schema with the columns in the metadatavalue table). It 
> is definitely affected by the not-stored-elsewhere issue.
>
> cheers,
> Andrea
>

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand


_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
