[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
Bram Luyten (@mire) commented on an issue

Re: Add a way for harvesters to find recently added items (request from Google)

Marking closed since this has been addressed by DS-1188.

DSpace / DS-1482: Add a way for harvesters to find recently added items (request from Google)

This request came out of a discussion I had with Anurag Acharya and Darcy Darpa at Google / Google Scholar. Anurag mentioned that often the Google harvesters seem to need to do a lot of paging / clicking in order to find new items in a DSpace instance. This can cause both a performance hit in DSpace (as the crawler keeps requesting pages), and also ...

This message was sent by Atlassian JIRA (v6.1.1#6155-sha1:7188aee)
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
Bram Luyten (@mire) closed an issue as Fixed

DSpace / DS-1482: Add a way for harvesters to find recently added items (request from Google)

Change By: Bram Luyten (@mire)
Resolution: Fixed
Fix Version/s: 4.0
Status: VolunteerNeeded -> Closed

___
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
Ivan Masár updated DS-1482: Add a way for harvesters to find recently added items (request from Google)

We discussed this in Jira review today and there seem to be three possible ways that could solve this; the question is which one is satisfactory:

a) DS-1188 makes available the full browse index by date accessioned in reverse chronological order
b) it was suggested ResourceSync might address this, but ResourceSync is not yet in DSpace and we don't know if Scholar will be using it
c) make sure to keep sitemaps up-to-date; implement a check in the sitemap servlet that checks for sitemap files and, if they are missing, starts a thread to generate them

Change By: Ivan Masár (25/Sep/13 7:47 PM)
Status: Received -> VolunteerNeeded

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
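Option (c) in Ivan's list could look roughly like the sketch below. It is written in Python purely for illustration and is not DSpace code: the function name `ensure_sitemaps`, the `sitemap*.xml` file pattern, and the `generate` callable are all assumptions made for this example. The lock only guards against two requests starting the generator at the same time.

```python
import threading
from pathlib import Path

_lock = threading.Lock()
_worker = None  # the generator thread started most recently, if any


def ensure_sitemaps(sitemap_dir, generate):
    """Start generate() in a background thread when no sitemap files exist.

    Returns the running thread, or None if sitemap files are already present.
    The "sitemap*.xml" naming pattern is an assumption of this sketch.
    """
    global _worker
    with _lock:
        directory = Path(sitemap_dir)
        if directory.is_dir() and any(directory.glob("sitemap*.xml")):
            return None
        if _worker is not None and _worker.is_alive():
            # Generation is already in progress; do not start a second thread.
            return _worker
        _worker = threading.Thread(
            target=generate, name="sitemap-generator", daemon=True
        )
        _worker.start()
        return _worker
```

A request handler would call `ensure_sitemaps(...)` before serving a sitemap and, if a thread is returned, either wait for it or answer with a temporary error until the files exist.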
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27681#comment-27681 ]

Mark H. Wood commented on DS-1482:

Still more surprising that they haven't corrected this omission. It would be interesting to know why. Many sites have worked hard to be open and transparent to Google. It would be discouraging to know that our effort is partly wasted.

Add a way for harvesters to find recently added items (request from Google)
---
Key: DS-1482
URL: https://jira.duraspace.org/browse/DS-1482
Project: DSpace
Issue Type: New Feature
Reporter: Tim Donohue

This request came out of a discussion I had with Anurag Acharya and Darcy Darpa at Google / Google Scholar.

Anurag mentioned that often the Google harvesters seem to need to do a lot of paging / clicking in order to find new items in a DSpace instance. This can cause both a performance hit in DSpace (as the crawler keeps requesting pages), and also can result in delays where items may not appear in Google for some time (if the crawler gives up or moves on before it ever finds the item).

Anurag mentioned that it'd be much easier (both on DSpace performance and on the Google crawlers), if DSpace provided some way to easily locate recently added items. This could be something like a "Browse Recently Added Items" (i.e. browse by dc.date.accessioned), or similar.

It was noted that EPrints has such a feature (called "Latest Additions"). For example, see their demo site: http://demoprints.eprints.org/cgi/latest

It's also worth noting this could just be as simple as adding a "More" option to our existing "Recently Added" list (of 5 items), so that you can see other recently added items.
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27686#comment-27686 ]

Tim Donohue commented on DS-1482:

Unless I'm mistaken (or the demo site is misconfigured), the lastmod dates in our Sitemaps look to all specify the last date/time that the Sitemap generator ran. I.e. they are all identical.

http://demo.dspace.org/xmlui/sitemap?map=0

However, I'll still ask Anurag whether Sitemaps are an option here for Google Scholar.

I still wonder, though, whether it may be useful (to humans and spiders alike) to have a way to browse items by date added (as it's a different way, besides RSS feeds, to actually view what content has come into the system -- the current recently added list is obviously a limited view).
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27689#comment-27689 ]

Tim Donohue commented on DS-1482:

Response from Anurag:

Hi Tim:

XML/HTML sitemaps would be fine if they were always enabled. As far as I remember, sitemaps need to be explicitly enabled at configuration time in dspace and most instances don't enable them. Also, it would be good if the sitemap when enabled was automatically linked from either the robots.txt (if XML) or the homepage (if HTML). Again, depending on individual instances to do this means it usually doesn't happen.

That said, a by-year browse would be good for human readers as well.

cheers,
anurag
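Anurag's suggestion that an XML sitemap be linked automatically from robots.txt could be handled along these lines. This is a small Python sketch, not anything DSpace ships: the helper name `with_sitemap_directive` and the example.org URL are made up for illustration. It relies only on the `Sitemap:` line that the sitemaps.org protocol defines for robots.txt.

```python
def with_sitemap_directive(robots_txt, sitemap_url):
    """Return robots_txt with a 'Sitemap:' line for sitemap_url, added only if absent."""
    wanted = sitemap_url.strip()
    for line in robots_txt.splitlines():
        # Split "Sitemap: http://..." at the first colon only, so the URL stays whole.
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip() == wanted:
            return robots_txt  # already linked; leave the file untouched
    separator = "" if robots_txt == "" or robots_txt.endswith("\n") else "\n"
    return f"{robots_txt}{separator}Sitemap: {wanted}\n"


# Placeholder robots.txt content and sitemap URL, for illustration only.
ROBOTS = "User-agent: *\nDisallow: /browse"
UPDATED = with_sitemap_directive(ROBOTS, "http://example.org/sitemap")
```

Running the helper a second time on `UPDATED` returns it unchanged, so it could be called on every sitemap generation without piling up duplicate lines.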
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27690#comment-27690 ]

Tim Donohue commented on DS-1482:

A brief followup to Anurag's points (in previous comment). We make recommendations similar to what he states on our wiki at: https://wiki.duraspace.org/display/DSPACE/Ensuring+your+instance+is+indexed (And we do embed an invisible link to HTML sitemaps in JSPUI and our various XMLUI themes)

However, he does make a good point that currently we don't have any way to default sitemaps to be enabled (as they are generated/refreshed by a recommended cron job). So, even though Google Scholar can index the sitemaps, they often are not enabled, so the Scholar crawler cannot really depend on them.

So, there may be a couple options here:

(1) Look into whether we can auto-update sitemaps (perhaps via a new event consumer or similar) so that Google / Google Scholar can use those.

AND/OR

(2) Potentially add a way to browse content by the date it was added (this may even be useful / interesting to repo managers as a sort of report of recently added content)
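Tim's option (2), browsing content by the date it was added, boils down to a reverse-chronological, paged listing. The Python sketch below only illustrates that ordering; it does not use the DSpace browse index, and the `recently_added` function, the dictionary keys, and the sample handles are all hypothetical.

```python
from datetime import datetime


def recently_added(items, page=0, page_size=20):
    """Return one page of items, newest accession date first."""
    ordered = sorted(items, key=lambda item: item["date_accessioned"], reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]


# Made-up items standing in for repository records.
SAMPLE_ITEMS = [
    {"handle": "1/1", "date_accessioned": datetime(2013, 1, 5)},
    {"handle": "1/2", "date_accessioned": datetime(2013, 2, 20)},
    {"handle": "1/3", "date_accessioned": datetime(2013, 2, 1)},
]
FIRST_PAGE = recently_added(SAMPLE_ITEMS, page=0, page_size=2)
```

With such a listing a crawler can stop paging as soon as it reaches an item it has already seen, which is the saving Anurag was asking for.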
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27692#comment-27692 ]

Andrea Schweer commented on DS-1482:

I don't know how the demo site is set up, but in one of my repositories the items certainly don't all have the same date: http://researchcommons.waikato.ac.nz/sitemap?map=0

The sitemap job runs once a day via cron on that box. I added the link to robots.txt and I see the sitemap being requested by Googlebot and other crawlers.

I guess they really need content by the time it was last modified, don't they? I guess they'll want to re-crawl items after they've been edited. So Tim's first option sounds like a good idea to me, even though it's likely to be the one that involves more work...

Do we know what user agent the Scholar crawlers use? Or do they piggyback onto Googlebot?
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27679#comment-27679 ]

Mark H. Wood commented on DS-1482:

What he said. I note that one of the forms we generate is known as Google sitemap, and that we ship DSpace with sample configuration to ping Google whenever a sitemap is generated. Surprising that they didn't mention this.
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27680#comment-27680 ]

Andrea Schweer commented on DS-1482:

I agree that the sitemap is the obvious place for this, but I seem to recall reading somewhere that Scholar doesn't use the sitemap at all. Tim, if you're in contact with Scholar folks, could you try to get confirmation about Scholar's use of sitemaps?
[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)
[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27677#comment-27677 ]

Graham Triggs commented on DS-1482:

I thought this is why we have the sitemaps?

If you look at a sitemap, for every item we give the lastmod date as part of the sitemap entry (but not collection / community, as we don't have that information).

Retrieving those sitemaps is fast (they are pre-generated), and it's pretty trivial to work out either whether the lastmod date is newer than the last indexed date, or even whether it is more recent than the previous time the sitemap was accessed (new item).
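Graham's comparison of lastmod against the last indexed date can be sketched from a harvester's point of view as below. This is an illustration in Python only: the function names and the sample sitemap are made up, and the sketch assumes the standard sitemaps.org namespace and lastmod values that are either a plain date or a full ISO 8601 timestamp.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value (date-only or full timestamp) as a UTC-aware datetime."""
    parsed = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: treat naive values as UTC
    return parsed


def urls_modified_since(sitemap_xml, last_indexed):
    """Return the <loc> of every <url> whose <lastmod> is newer than last_indexed.

    Entries without a <lastmod> are returned too (a policy chosen for this
    sketch), since the harvester cannot tell whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None or parse_lastmod(lastmod) > last_indexed:
            changed.append(loc)
    return changed


# Made-up sitemap: two items with lastmod and one community page without it.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/handle/1/1</loc><lastmod>2013-02-01</lastmod></url>
  <url><loc>http://example.org/handle/1/2</loc><lastmod>2013-02-20T08:30:00Z</lastmod></url>
  <url><loc>http://example.org/handle/1</loc></url>
</urlset>"""
```

Called with a last indexed date of 10 February 2013, the sample yields the second item and the undated community page, so only those would need to be fetched again.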