[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-11-28 Thread Bram Luyten (@mire) (DuraSpace JIRA)

Bram Luyten (@mire) commented on an issue

Re: Add a way for harvesters to find recently added items (request from Google)

Marking closed since this has been addressed by DS-1188.

DS-1482: Add a way for harvesters to find recently added items (request from Google)

 This request came out of a discussion I had with Anurag Acharya and Darcy Darpa at Google / Google Scholar. Anurag mentioned that often the Google harvesters seem to need to do a lot of paging / clicking in order to find new items in a DSpace instance. This can cause both a performance hit in DSpace (as the crawler keeps requesting pages), and also ...

This message was sent by Atlassian JIRA (v6.1.1#6155-sha1:7188aee)

[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-11-28 Thread Bram Luyten (@mire) (DuraSpace JIRA)

Bram Luyten (@mire) closed an issue as Fixed

DS-1482: Add a way for harvesters to find recently added items (request from Google)

Change By: Bram Luyten (@mire)
Resolution: Fixed
Fix Version/s: 4.0
Status: VolunteerNeeded -> Closed

Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel


[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-09-25 Thread DuraSpace JIRA

Ivan Masár updated DS-1482

We discussed this in the JIRA review today. There seem to be three possible ways to solve this; the question is which one is satisfactory:

a) DS-1188 makes the full browse index by date accessioned available in reverse chronological order

b) it was suggested that ResourceSync might address this, but ResourceSync is not yet in DSpace and we don't know whether Scholar will be using it

c) make sure to keep sitemaps up to date: implement a check in the sitemap servlet that looks for the sitemap files and, if they are missing, starts a thread to generate them
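
Option (c) could look roughly like the sketch below. It is written in Python purely for illustration (DSpace itself is Java), and `ensure_sitemap`, the `generate` callback and the module-level lock are hypothetical names, not part of any DSpace API. The idea is only: if the sitemap file is missing, start one background thread to build it, and do not start a second one while the first is still running.

```python
import threading
from pathlib import Path
from typing import Callable, Optional

# Guards against starting two generator threads for concurrent requests.
_lock = threading.Lock()
_running: Optional[threading.Thread] = None


def ensure_sitemap(path: Path,
                   generate: Callable[[Path], None]) -> Optional[threading.Thread]:
    """Return None if the sitemap file already exists; otherwise start
    (or reuse) a background thread that generates it and return it."""
    global _running
    if path.exists():
        return None
    with _lock:
        if _running is not None and _running.is_alive():
            return _running  # generation is already in progress
        _running = threading.Thread(target=generate, args=(path,), daemon=True)
        _running.start()
        return _running
```

A request handler would call `ensure_sitemap(...)` before serving the file and answer with a temporary error (or an empty sitemap) while the returned thread is still alive.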





Change By: Ivan Masár (25/Sep/13 7:47 PM)
Status: Received -> VolunteerNeeded

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira







[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-11 Thread Mark H. Wood (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27681#comment-27681 ]

Mark H. Wood commented on DS-1482:
--

Still more surprising that they haven't corrected this omission.  It would be 
interesting to know why.  Many sites have worked hard to be open and 
transparent to Google.  It would be discouraging to know that our effort is 
partly wasted.

 Add a way for harvesters to find recently added items (request from Google)
 ---

 Key: DS-1482
 URL: https://jira.duraspace.org/browse/DS-1482
 Project: DSpace
  Issue Type: New Feature
Reporter: Tim Donohue

 This request came out of a discussion I had with Anurag Acharya and Darcy 
 Darpa at Google / Google Scholar.
 Anurag mentioned that often the Google harvesters seem to need to do a lot of 
 paging / clicking in order to find new items in a DSpace instance.  This 
 can cause both a performance hit in DSpace (as the crawler keeps requesting 
 pages), and also can result in delays where items may not appear in Google 
 for some time (if the crawler gives up or moves on before it ever finds the 
 item).
 Anurag mentioned that it'd be much easier (both on DSpace performance and on 
 the Google crawlers), if DSpace provided some way to easily locate recently 
 added items.  
 This could be something like a "Browse Recently Added Items" (i.e. browse by 
 dc.date.accessioned), or similar.  It was noted that EPrints has such a 
 feature (called "Latest Additions").  For example, see their demo site:
 http://demoprints.eprints.org/cgi/latest
 It's also worth noting this could just be as simple as adding a "More" 
 option to our existing "Recently Added" list (of 5 items), so that you can 
 see other recently added items.



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-11 Thread Tim Donohue (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27686#comment-27686 ]

Tim Donohue commented on DS-1482:
-

Unless I'm mistaken (or the demo site is misconfigured), the lastmod dates in 
our Sitemaps all appear to specify the last date/time that the Sitemap generator 
ran, i.e. they are all identical.  http://demo.dspace.org/xmlui/sitemap?map=0

However, I'll still ask Anurag whether Sitemaps are an option here for Google 
Scholar.

I still wonder, though, whether it may be useful (to humans and spiders alike) 
to have a way to browse items by date added. Besides RSS feeds, it would be 
another way to actually view what content has come into the system; the 
current "recently added" list is obviously a limited view.
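
The observation about identical lastmod dates can be checked mechanically. The following is a minimal Python sketch, assuming the sitemap is XML in the standard sitemaps.org 0.9 namespace; the function names are made up for this example and nothing here is DSpace code.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the sitemap uses the standard sitemaps.org namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def lastmod_values(sitemap_xml: str) -> List[str]:
    """Collect the text of every <lastmod> element in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "lastmod") if el.text]


def all_lastmod_identical(sitemap_xml: str) -> bool:
    """True when there are at least two entries and they all share one lastmod."""
    values = lastmod_values(sitemap_xml)
    return len(values) > 1 and len(set(values)) == 1
```

If this returns True for a repository with items deposited on different days, the lastmod values are probably the generator's run time rather than per-item modification dates.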



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-11 Thread Tim Donohue (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27689#comment-27689 ]

Tim Donohue commented on DS-1482:
-

Response from Anurag:

Hi Tim: XML/HTML sitemaps would be fine if they were always enabled. As far as 
I remember, sitemaps need to be explicitly enabled at configuration time in 
dspace and most instances don't enable them. Also, it would be good if the 
sitemap when enabled was automatically linked from either the robots.txt (if 
XML) or the homepage (if HTML). Again, depending on individual instances to do 
this means it usually doesn't happen. 

That said, a by-year browse would be good for human readers as well.

cheers,
anurag
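
Whether a given instance actually advertises its XML sitemap from robots.txt, as Anurag suggests, can be checked with a few lines. This is a small Python sketch using the standard library's `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later); the helper name and the example URL are hypothetical.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt: str) -> List[str]:
    """Return the URLs named by "Sitemap:" lines in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []
```

For example, a robots.txt containing the line `Sitemap: https://repo.example.org/sitemap` yields a one-element list, while a file without such a line yields an empty list.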



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-11 Thread Tim Donohue (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27690#comment-27690 ]

Tim Donohue commented on DS-1482:
-

A brief followup to Anurag's points (in previous comment).

We make recommendations similar to what he states on our wiki at:
https://wiki.duraspace.org/display/DSPACE/Ensuring+your+instance+is+indexed
(And we do embed an invisible link to HTML sitemaps in JSPUI and our various 
XMLUI themes)

However, he does make a good point that currently we don't have any way to 
default sitemaps to be enabled (as they are generated/refreshed by a 
recommended cron job).  So, even though Google Scholar can index the sitemaps, 
they often are not enabled, so the Scholar crawler cannot really depend on them.

So, there may be a couple options here:
(1) Look into whether we can auto-update sitemaps (perhaps via a new event 
consumer or similar) so that Google / Google Scholar can use those.
AND/OR
(2) Potentially add a way to browse content by the date it was added (this may 
even be useful / interesting to repo managers as a sort of report of recently 
added content)
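
Option (2) boils down to a newest-first, paged listing. The sketch below shows that idea in Python with an in-memory list; `Item`, `recently_added` and the handle values are hypothetical, and a real implementation would sort in the browse index or database rather than in memory.

```python
from datetime import datetime
from typing import List, NamedTuple


class Item(NamedTuple):
    handle: str
    accessioned: datetime  # stands in for dc.date.accessioned


def recently_added(items: List[Item], page: int = 0, per_page: int = 20) -> List[Item]:
    """Return one page of items, most recently accessioned first."""
    ordered = sorted(items, key=lambda item: item.accessioned, reverse=True)
    start = page * per_page
    return ordered[start:start + per_page]
```

Because page 0 always holds the newest items, a crawler only needs to page forward until it reaches an item it has already seen.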



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-11 Thread Andrea Schweer (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27692#comment-27692 ]

Andrea Schweer commented on DS-1482:


I don't know how the demo site is set up, but in one of my repositories the 
items certainly don't all have the same date: 
http://researchcommons.waikato.ac.nz/sitemap?map=0
The sitemap job runs once a day via cron on that box. I added the link to 
robots.txt and I see the sitemap being requested by Googlebot and other 
crawlers.

I guess they really need content by the time it was last modified, don't they? 
I guess they'll want to re-crawl items after they've been edited. So Tim's 
first option sounds like a good idea to me, even though it's likely to be the 
one that involves more work...

Do we know what user agent the scholar crawlers use? Or do they piggyback onto 
Googlebot?



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-10 Thread Mark H. Wood (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27679#comment-27679 ]

Mark H. Wood commented on DS-1482:
--

What he said.  I note that one of the forms we generate is known as Google 
sitemap, and that we ship DSpace with sample configuration to ping Google 
whenever a sitemap is generated.  Surprising that they didn't mention this.



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-10 Thread Andrea Schweer (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27680#comment-27680 ]

Andrea Schweer commented on DS-1482:


I agree that the sitemap is the obvious place for this, but I seem to recall 
reading somewhere that Scholar doesn't use the sitemap at all. Tim, if you're 
in contact with Scholar folks, could you try to get confirmation about 
Scholar's use of sitemaps?



[Dspace-devel] [DuraSpace JIRA] (DS-1482) Add a way for harvesters to find recently added items (request from Google)

2013-02-09 Thread Graham Triggs (DuraSpace JIRA)

[ https://jira.duraspace.org/browse/DS-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=27677#comment-27677 ]

Graham Triggs commented on DS-1482:
---

I thought this is why we have the sitemaps? If you look at a sitemap, for every 
item we give the lastmod date as part of the sitemap entry (but not collection 
/ community, as we don't have that information).

Retrieving those sitemaps is fast (they are pre-generated), and it's pretty 
trivial to work out either whether the lastmod date is newer than the last 
indexed date, or even whether it is more recent than the previous time the 
sitemap was accessed (new item).
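
From the harvester's side, the comparison described above might look like the following Python sketch. It assumes a standard sitemaps.org XML sitemap and a timezone-aware `last_indexed` value; `changed_since` and `_parse_lastmod` are illustrative names, and treating a lastmod without a timezone as UTC is an assumption of this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List

# Assumption: the sitemap uses the standard sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _parse_lastmod(text: str) -> datetime:
    # Normalise a trailing "Z" so that datetime.fromisoformat accepts it.
    value = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return value


def changed_since(sitemap_xml: str, last_indexed: datetime) -> List[str]:
    """Return the <loc> of every entry whose lastmod is after last_indexed."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc and lastmod and _parse_lastmod(lastmod) > last_indexed:
            urls.append(loc.strip())
    return urls
```

This only works as intended if lastmod reflects each item's own modification time rather than the time the generator ran.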
