Re: [dspace-tech] Re: Questions about the checksum checker in DSpace 6.2

2018-08-15 Thread Tim Donohue
Just a belated follow-up to this thread.  If you are still hitting issues
with the Checksum checker in 6.x, I'd recommend looking at the recently
logged bug (and proposed fix): https://jira.duraspace.org/browse/DS-3975

Here's the proposed fix (which could use some testers to help us verify):
https://github.com/DSpace/DSpace/pull/2169   If you install the fix, please
report back on your findings by adding a comment to either the GitHub PR or
the JIRA ticket.  Reviews & reports back from the community can help us to
approve & merge fixes more quickly.

Thanks,

Tim

On Thu, Jul 5, 2018 at 9:20 AM Evgeni Dimitrov wrote:

> Thank you Mark,
>
> I am afraid that the checker first tries to update the
> most_recent_checksum table by comparing it with the bitstream table, and only
> after that does it look at the options (in my case -c 1000).
>
> That means it will always first try to add 500 000 rows to the
> most_recent_checksum table, regardless of the options.
>
>
> On Thursday, July 5, 2018 at 4:10:19 PM UTC+3, Mark H. Wood wrote:
>>
>> On Thursday, July 5, 2018 at 8:20:28 AM UTC-4, Evgeni Dimitrov wrote:
>>>
>>>
>>> Judging by MostRecentChecksumServiceImpl, first a simple List was
>>> created in memory containing (in this case) 500 000 bitstreams.
>>>
>>> Now, for every element in this list, a row is being added to the
>>> most_recent_checksum table. Perhaps the transaction will be committed when
>>> all 500 000 rows are added . . . in two weeks' time . . .
>>>
>>>
>> I believe that you are correct.  There's a very unfortunate collision of
>> Java culture (it's easy to create elastic Collections, don't worry about
>> the 500 000 members case), ORM (of course you always want all 500 000 rows
>> trapped in one transaction until you are done consuming them sequentially),
>> and layers of service/DAO/holder, which makes it difficult to do
>> large-scale operations efficiently. We may need to augment the storage
>> layer with explicit support for bulk operations, such as the ability to
>> pass an arbitrary instance of a bulk-operator interface to be applied
>> iteratively to each result of a query, so that code which understands the
>> storage model can use it well to avoid memory bloat.  The business logic
>> layer does not and should not have access to these storage details.  There
>> are a number of places in DSpace which might be made less resource-hungry
>> by such means.
>>
>> Until this changes, you may wish to look over the checksum checker's
>> options for limiting the amount of work that it does.  It can be run for a
>> given amount of time or over a specific count of bitstreams, for example,
>> and continue where it stopped when run again.
>>
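For illustration, a bulk-operator hook of the kind Mark describes might look roughly like the sketch below. This is purely hypothetical: the BulkOperator and BitstreamBulkStorage interfaces and the visitBitstreams method do not exist in DSpace; the names are invented for the example.

    // Hypothetical sketch -- not existing DSpace API.
    public interface BulkOperator<T> {
        void apply(T row) throws java.sql.SQLException;
    }

    public interface BitstreamBulkStorage {
        // Stream each query result to the operator one row at a time,
        // so no 500 000-element List is ever materialized in memory.
        void visitBitstreams(BulkOperator<org.dspace.content.Bitstream> op)
            throws java.sql.SQLException;
    }

    // The checker could then hand its row handler to the storage layer:
    //   storage.visitBitstreams(b -> recordMostRecentChecksum(b));

As for the options Mark mentions: the checker can, for example, be run as [dspace]/bin/dspace checker -c 1000 (check at most 1000 bitstreams) or [dspace]/bin/dspace checker -d 2h (run for two hours), and it continues from where it stopped on the next run.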
-- 
Tim Donohue
Technical Lead for DSpace & DSpaceDirect
DuraSpace.org | DSpace.org | DSpaceDirect.org

-- 
All messages to this mailing list should adhere to the DuraSpace Code of 
Conduct: https://duraspace.org/about/policies/code-of-conduct/
--- 
You received this message because you are subscribed to the Google Groups 
"DSpace Technical Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to dspace-tech+unsubscr...@googlegroups.com.
To post to this group, send email to dspace-tech@googlegroups.com.
Visit this group at https://groups.google.com/group/dspace-tech.
For more options, visit https://groups.google.com/d/optout.


Re: [dspace-tech] Re: Questions about the checksum checker in DSpace 6.2

2018-08-16 Thread Evgeni Dimitrov
Hi Tim,

As I understand it, this fix does not change the main design: first create a 
potentially very large list in memory, then update the DB in one single 
transaction. As I understand it, the fix promises to make that single 
transaction shorter/faster. One needs a big repository to test the fix and 
to compare. My only big repository is the production one; I cannot run 
tests on it.

Best regards
Evgeni


On Wednesday, August 15, 2018 at 10:17:57 PM UTC+3, Tim Donohue wrote:
>
> Just a belated follow-up to this thread.  If you are still hitting issues 
> with the Checksum checker in 6.x, I'd recommend looking at the recently 
> logged bug (and proposed fix): https://jira.duraspace.org/browse/DS-3975
>
> Here's the proposed fix (which could use some testers to help us verify): 
> https://github.com/DSpace/DSpace/pull/2169   If you install the fix, 
> please report back on your findings by adding a comment to either the 
> GitHub PR or the JIRA ticket.  Reviews & reports back from the community 
> can help us to approve & merge fixes more quickly.
>
> Thanks,
>
> Tim
>
> On Thu, Jul 5, 2018 at 9:20 AM Evgeni Dimitrov wrote:
>
>> Thank you Mark,
>>
>> I am afraid that the checker first tries to update the 
>> most_recent_checksum table by comparing it with the bitstream table, and only 
>> after that does it look at the options (in my case -c 1000).
>>
>> That means it will always first try to add 500 000 rows to the 
>> most_recent_checksum table, regardless of the options.
>>
>>
>> On Thursday, July 5, 2018 at 4:10:19 PM UTC+3, Mark H. Wood wrote:
>>>
>>> On Thursday, July 5, 2018 at 8:20:28 AM UTC-4, Evgeni Dimitrov wrote:


 Judging by MostRecentChecksumServiceImpl, first a simple List was 
 created in memory containing (in this case) 500 000 bitstreams.

 Now, for every element in this list, a row is being added to the 
 most_recent_checksum table. Perhaps the transaction will be committed when 
 all 500 000 rows are added . . . in two weeks' time . . .


>>> I believe that you are correct.  There's a very unfortunate collision of 
>>> Java culture (it's easy to create elastic Collections, don't worry about 
>>> the 500 000 members case), ORM (of course you always want all 500 000 rows 
>>> trapped in one transaction until you are done consuming them sequentially), 
>>> and layers of service/DAO/holder, which makes it difficult to do 
>>> large-scale operations efficiently. We may need to augment the storage 
>>> layer with explicit support for bulk operations, such as the ability to 
>>> pass an arbitrary instance of a bulk-operator interface to be applied 
>>> iteratively to each result of a query, so that code which understands the 
>>> storage model can use it well to avoid memory bloat.  The business logic 
>>> layer does not and should not have access to these storage details.  There 
>>> are a number of places in DSpace which might be made less resource-hungry 
>>> by such means.
>>>
>>> Until this changes, you may wish to look over the checksum checker's 
>>> options for limiting the amount of work that it does.  It can be run for a 
>>> given amount of time or over a specific count of bitstreams, for example, 
>>> and continue where it stopped when run again.
>>>
> -- 
> Tim Donohue
> Technical Lead for DSpace & DSpaceDirect
> DuraSpace.org | DSpace.org | DSpaceDirect.org
>



Re: [dspace-tech] Re: Questions about the checksum checker in DSpace 6.2

2018-08-16 Thread Tim Donohue
Hello Evgeni,

The fix in https://github.com/DSpace/DSpace/pull/2169 is a major change in
behavior (it reverts to the way this was done in 5.x).  In the
current 6.x code, the giant list of objects is loaded into JVM memory (in a
big Java loop), which is a significant performance hit in Hibernate.

You can see this behavior here:
https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace-api/src/main/java/org/dspace/checker/MostRecentChecksumServiceImpl.java#L121
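
In rough outline, that 6.x code does something like the following (a simplified sketch of the linked loop, with invented helper names, not the literal DSpace code):

    // Illustrative only -- helper names are invented.
    List<Bitstream> all = findAllBitstreams(context);  // entire table materialized in JVM memory
    for (Bitstream bitstream : all) {
        if (!hasMostRecentChecksumRow(bitstream)) {
            addMostRecentChecksumRow(bitstream);       // one Hibernate insert per bitstream
        }
    }
    // the surrounding transaction commits only after the whole loop finishes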

The proposed fix no longer loads these objects into the JVM / Hibernate,
and instead reverts to performing this action in a single database
query (as it was done in 5.x). Yes, the database query is complex and may still
take a while to run (on large sites), but it should be significantly
faster than loading thousands of objects into the JVM.

You can see the old 5.x behavior here:
https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace-api/src/main/java/org/dspace/checker/BitstreamInfoDAO.java#L63
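
Schematically, the single-query approach boils down to one set-based statement along these lines (an illustrative sketch only: the column list is simplified, and dataSource stands in for however the connection is actually obtained; see the linked BitstreamInfoDAO for the real query):

    // Add a most_recent_checksum row for every bitstream that lacks one,
    // letting the database do the work instead of the JVM.
    String sql =
        "INSERT INTO most_recent_checksum (bitstream_id, to_be_processed) "
      + "SELECT b.bitstream_id, true FROM bitstream b "
      + "WHERE NOT EXISTS (SELECT 1 FROM most_recent_checksum m "
      + "                  WHERE m.bitstream_id = b.bitstream_id)";
    try (java.sql.Connection conn = dataSource.getConnection();
         java.sql.Statement stmt = conn.createStatement()) {
        stmt.executeUpdate(sql);
    }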

I suspect anyone who is seeing significant performance issues in Checksum
Checker after upgrading from 5.x to 6.x will find that this PR brings
performance back to 5.x levels. I'm sure we could improve performance
even further, though it may require some more analysis of the best way
to refactor the database & code to do so. Obviously, we always welcome
improvements & contributions to DSpace in the form of Pull Requests.

- Tim



On Thu, Aug 16, 2018 at 12:48 PM Evgeni Dimitrov wrote:

> Hi Tim,
>
> As I understand it, this fix does not change the main design: first create a
> potentially very large list in memory, then update the DB in one single
> transaction. As I understand it, the fix promises to make that single
> transaction shorter/faster. One needs a big repository to test the fix and
> to compare. My only big repository is the production one; I cannot run
> tests on it.
>
> Best regards
> Evgeni
>
>
>
> On Wednesday, August 15, 2018 at 10:17:57 PM UTC+3, Tim Donohue wrote:
>
>> Just a belated follow-up to this thread.  If you are still hitting issues
>> with the Checksum checker in 6.x, I'd recommend looking at the recently
>> logged bug (and proposed fix): https://jira.duraspace.org/browse/DS-3975
>>
>> Here's the proposed fix (which could use some testers to help us verify):
>> https://github.com/DSpace/DSpace/pull/2169   If you install the fix,
>> please report back on your findings by adding a comment to either the
>> GitHub PR or the JIRA ticket.  Reviews & reports back from the community
>> can help us to approve & merge fixes more quickly.
>>
>> Thanks,
>>
>> Tim
>>
>> On Thu, Jul 5, 2018 at 9:20 AM Evgeni Dimitrov wrote:
>>
>>> Thank you Mark,
>>>
>>> I am afraid that the checker first tries to update the
>>> most_recent_checksum table by comparing it with the bitstream table, and only
>>> after that does it look at the options (in my case -c 1000).
>>>
>>> That means it will always first try to add 500 000 rows to the
>>> most_recent_checksum table, regardless of the options.
>>>
>>>
>>> On Thursday, July 5, 2018 at 4:10:19 PM UTC+3, Mark H. Wood wrote:

 On Thursday, July 5, 2018 at 8:20:28 AM UTC-4, Evgeni Dimitrov wrote:
>
>
> Judging by MostRecentChecksumServiceImpl, first a simple List was
> created in memory containing (in this case) 500 000 bitstreams.
>
> Now, for every element in this list, a row is being added to the
> most_recent_checksum table. Perhaps the transaction will be committed when
> all 500 000 rows are added . . . in two weeks' time . . .
>
>
 I believe that you are correct.  There's a very unfortunate collision
 of Java culture (it's easy to create elastic Collections, don't worry about
 the 500 000 members case), ORM (of course you always want all 500 000 rows
 trapped in one transaction until you are done consuming them sequentially),
 and layers of service/DAO/holder, which makes it difficult to do
 large-scale operations efficiently. We may need to augment the storage
 layer with explicit support for bulk operations, such as the ability to
 pass an arbitrary instance of a bulk-operator interface to be applied
 iteratively to each result of a query, so that code which understands the
 storage model can use it well to avoid memory bloat.  The business logic
 layer does not and should not have access to these storage details.  There
 are a number of places in DSpace which might be made less resource-hungry
 by such means.

 Until this changes, you may wish to look over the checksum checker's
 options for limiting the amount of work that it does.  It can be run for a
 given amount of time or over a specific count of bitstreams, for example,
 and continue where it stopped when run again.
