Re: [Dspace-tech] Searching PDF-scanned documents: Adobe Capture a solution?

2007-07-04 Thread Cory Snavely
Another way to get experience with the quality of Acrobat OCR is to use Acrobat 
Pro, which can do functionally the same thing, with a less batch-oriented 
interface. We ended up using this at a fairly large scale to meet a similar 
need.

We have documentation on preparing PDFs that we supply for submitters, and that 
you may find useful, at

http://deepblue.lib.umich.edu/html/2027.42/40244/PDF-Best_Practice.html

The section toward the bottom provides instructions on making image PDF files 
searchable.
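For sites doing this outside Acrobat, a batch OCR pass can be sketched with an open-source tool (a hedged illustration only: `ocrmypdf` and the directory names below are my assumptions, not part of the Deep Blue guide):

```shell
# Sketch: make every scanned PDF in scans/ full-text searchable.
# Assumes the open-source ocrmypdf tool is installed; paths are illustrative.
mkdir -p searchable
for f in scans/*.pdf; do
    # --skip-text leaves pages that already contain a text layer untouched
    ocrmypdf --skip-text "$f" "searchable/$(basename "$f")"
done
```

The OCRed output then indexes normally when ingested into DSpace.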

Cory Snavely
University of Michigan Library IT Core Services
  - Original Message - 
  From: Jennifer Ash 
  To: dspace-tech@lists.sourceforge.net 
  Sent: Wednesday, July 04, 2007 6:55 AM
  Subject: [Dspace-tech] Searching PDF-scanned documents: Adobe Capture a solution?


  Dear Community Members



  The Water Research Commission (WRC, South Africa) is currently assessing a 
pilot installation of DSpace.

  We want to use DSpace to store, search and retrieve all our WRC research 
reports and issues of Water SA (a scientific journal, four issues per annum). 
This is the primary goal; other collections will most likely be added over time.

  We are faced with a problem in that most of our older publications are not in 
electronic format and will have to be scanned.

  Scanning and saving as PDF does not provide a full text searchable document 
in DSpace; I've tried it.



  A product, Adobe Capture, is advertised as a 'tool that teams with your 
scanner to convert volumes of paper documents into searchable Adobe Portable 
Document Format (PDF) files'.

  We are keen to investigate this product but there are no trial downloads 
offered by Adobe.

  Do you have any knowledge of this product? Can you advise on a suitable 
technology solution for our problem? Our backlog is vast and spans many years, 
so there are loads of documents that need to be scanned.



  I do hope someone can give me advice.



  Kind regards





  Jennifer Ash 
  ..
  Business Systems Manager
  Water Research Commission 
  Private Bag X03 
  GEZINA (Pretoria) 
  0031 
  Tel: (012) 330-9036 / 330-0340 
  Fax: (012) 330-9010 / 331-2565 
  E-mail: [EMAIL PROTECTED] 




  DISCLAIMER AND CONFIDENTIALITY NOTE: All factual and other information within 
this e-mail, including any attachments relating to the official business of the 
Water Research Commission (WRC), is the property of the WRC. It is 
confidential, legally privileged and protected against unauthorized use. The 
WRC neither owns nor endorses any other content. Views and opinions are those 
of the senders unless clearly stated as being that of the WRC. The addressee in 
the e-mail is the intended recipient. Please notify the sender immediately if 
it has unintentionally reached you and do not read, disclose or use the content 
in any way whatsoever. The WRC cannot assure that the integrity of this 
communication has been maintained nor that it is free of errors, viruses, 
interception or interferences. 

   






-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] [DSpace-Manakin] Manakin over Jetty

2007-09-27 Thread Cory Snavely
Way interested. Can you say what some of the tricky parts were? Maybe
even a wiki entry on this?

On Thu, 2007-09-27 at 15:27 +0300, Mika Stenberg wrote:
> Just briefly reporting that my experiments with DSpace & Manakin over 
> Jetty have been a success. It seems that DSpace runs significantly faster 
> on Jetty v6 than Tomcat v5. We've also gotten rid of nasty crashes that 
> occurred occasionally when the system was under heavy load with lots of 
> user activity.
> 
> If someone is interested, I can provide more information.
> 
> Cheers,
> Mika


Re: [Dspace-tech] Questions about DSpace Features

2007-10-04 Thread Cory Snavely
FYI we are having discussions with Sun about integrating DSpace with
their Honeycomb CAS system. However, the approach I am advocating is to
build an SRB compatibility layer/driver/translator for the product, and
so insulate DSpace from the specifics of the Honeycomb API. Contact me
if interested.

On Wed, 2007-10-03 at 17:26 -0400, MacKenzie Smith wrote:
> Hi Robert,
> > * Does DSpace have service devices (like SOA or SOAP)?
> >   
> Yes, for submission (see 
> http://wiki.dspace.org/index.php/LightweightNetworkInterface).
> > * Is it correct that DSpace does not have an internal storage
> > management, which would mean (e.g.) to compress documents which are not
> > accessed for a given period, or to move them to another storage 
> > location (e.g. a tape server) if the last access is much older?
> >   
> You can implement any storage layer underneath DSpace using the storage 
> API. There are implementations now for the local filesystem (the 
> default), SRB and S3 (in prototype, I believe). I think HP has also 
> implemented it with their HSM, but I don't know if there are other HSM 
> systems implemented now.
> > * And is it possible to bundle / relate different versions of the same 
> > document, e.g. preprint and postprint?
> >   
> This is handled now via metadata. For MIT's method of doing this see 
> http://wiki.dspace.org/static_files/f/fa/DSpace_Versioning_Feature_Summary_(July_2004).pdf
> 
> There are plans to change the DSpace data model in a future version so 
> that it can handle versions directly within an item. This is described 
> on the wiki (http://wiki.dspace.org/index.php/ArchReviewSynthesis). A 
> lot of this work has already started, and the plan is to complete these 
> changes in 2008.
> > * Does DSpace keep track of different versions of the same document to 
> > have a history of minor changes (compared to pre- and postprint)?
> >   
> It is a digital archive rather than an authoring system, so no, minor 
> changes to documents are not normally kept. The idea is to store final 
> versions of documents and keep them forever, and to link different 
> *editions* of documents via metadata (see the last answer) so that users 
> can safely cite a particular version and not worry about it disappearing 
> later.
> 
> MacKenzie
> 



Re: [Dspace-tech] Storing bitstreams using SRB

2007-10-17 Thread Cory Snavely
Presumably you would need an SRB server:
http://www.sdsc.edu/srb/index.php/Main_Page .

On Wed, 2007-10-17 at 06:44 -0700, Shwe Yee Than wrote:
> Hi,
>  
> What else should I need to do other than the normal installation and
> configuration of DSpace if I want to store bitstreams using SRB?
> Anyone can help me?
>  
> regards,
> Shwe


Re: [Dspace-tech] Academic SRB support

2007-10-24 Thread Cory Snavely
...and if it seems odd to anyone following this thread that the
developers of Nirvana SRB would suggest we achieve this integration by
using the filesystem emulation provided by Nirvana SRB, which in turn
uses the Honeycomb API, know that I definitely did point out that irony
to them.

However, according to these developers, the Nirvana and SDSC SRB APIs
differ enough that that is the only way to do this without recoding the
DSpace bitstream storage manager.

Disappointing? Yeah.

So am I understanding correctly that in future versions of DSpace,
support for CAS systems and the like would be done in DSpace? I.e. we
might expect there to be direct Honeycomb, EMC Centera, iRODS, etc.
support right within DSpace? We're trying to see the roadmap here.

c

On Wed, 2007-10-24 at 10:46 -0400, Blanco, Jose wrote:
> We just had a phone conference with Sun and the developer for the
> commercial version of SRB at Nirvana ( Tino ) and were told that the
> commercial version of SRB they have developed is not the same as the
> academic SRB.  One thing they have developed is file system based SRB
> which *should* work, and we are going to try it out.
> 
> Thanks for this information!
> 
> Jose
> 
> -Original Message-
> From: MacKenzie Smith [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 24, 2007 10:37 AM
> To: Blanco, Jose
> Cc: dspace-tech@lists.sourceforge.net
> Subject: Re: [Dspace-tech] Academic SRB support
> 
> Hi Jose,
> 
> I haven't gotten the official story from SDSC, but I do know that their
> attention has shifted to iRODS as the next generation storage
> architecture for long-term data management. iRODS will be 100% open
> source software (no more dual license) which will be easier for the
> community to deal with.
> 
> My understanding is that the commercial (Nirvana) and non-commercial
> (plain SRB) are actually the same thing... they just have dual license
> arrangement for the codebase. So the API that Sun develops *should* also
> work for your plain vanilla SRB instance too. You can verify that with
> the SDSC folks (or I can ask them).
> 
> The DSpace work that we've done at MIT was for the old non-commercial
> SRB, and we recently got the jargon client for iRODS, so those should be
> tested with the 1.4.x and 1.5 releases.
> 
> MacKenzie
> > I wonder if any one has heard if the academic SRB ( non-commercial ) 
> > is going to be discontinued?  We have been discussing using a 
> > Honeycomb server for bit storage, and they have informed us that the 
> > academic SRB is going to be discontinued, so they are not interested 
> > in developing an API for it.  They are working on developing a 
> > commercial Nirvana SRB API.  I'm assuming that the configurable SRB 
> > coming out in a future release of Dspace is the academic?
> >
> > http://wiki.dspace.org/index.php/PluggableStorage ?
> >
> > Thank you!
> > Jose
> 
> 
> --
> MacKenzie Smith
> Associate Director for Technology
> MIT Libraries


Re: [Dspace-tech] Academic SRB support

2007-10-24 Thread Cory Snavely
Thanks. You say

> I've had some discussions with Honeycomb developers

Who are those folks?

On Wed, 2007-10-24 at 11:19 -0400, Richard Rodgers wrote:
> Hi Cory:
> 
> See remarks below...
> 
> Thanks,
> 
> Richard
> On Wed, 2007-10-24 at 14:55 +, Cory Snavely wrote:
> > ...and if it seems odd to anyone following this thread that the
> > developers of Nirvana SRB would suggest we achieve this integration by
> > using the filesystem emulation provided by Nirvana SRB, which in turn
> > uses the Honeycomb API, know that I definitely did point out that irony
> > to them.
> > 
> > However, according to these developers, the Nirvana and SDSC SRB APIs
> > differ enough that that is the only way to do this without recoding the
> > DSpace bitstream storage manager.
> > 
> > Disappointing? Yeah.
> > 
> > So am I understanding correctly that in future versions of DSpace,
> > support for CAS systems and the like would be done in DSpace? I.e. we
> > might expect there to be direct Honeycomb, EMC Centera, iRODS, etc.
> > support right within DSpace? We're trying to see the roadmap here.
> 
> The roadmap certainly embraces selectable back-end storage options.
> There is a prototype pluggable storage abstraction at
> 
> http://wiki.dspace.org/index.php/PluggableStorage
> 
> that I hope will eventually make its way into a release, and its
> precise purpose is to avoid the need to recode the BitstreamStorageManager
> or any other core code. It currently supports legacy DSpace filesystem,
> SRB, and Amazon S3, and I've had some discussions with Honeycomb
> developers and I believe they are attempting to write a DSpace
> implementation against that interface.
> 
> We are just now seeing enough options to generalize & abstract this
> layer, but I am sure we will continue these efforts.

Re: [Dspace-tech] Blocking a malicious user

2007-10-30 Thread Cory Snavely
If they're nasty enough, though, they'll still drown your Apache or Tomcat
server in 403 responses. I've had times when I needed to be
absolutely merciless and block at the firewall level, using iptables;
then they don't even get as far as userspace.
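A minimal iptables rule of that shape (a sketch; the address below is a documentation placeholder, not a real offender):

```shell
# Drop everything from the abusive address before it reaches Apache/Tomcat.
# 192.0.2.45 is an illustrative TEST-NET address.
iptables -I INPUT -s 192.0.2.45 -j DROP
```

Because the packets are dropped in the kernel, the web server never spends a cycle on them.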

On Tue, 2007-10-30 at 14:01 -0500, Tim Donohue wrote:
> George,
> 
> We had a similar problem to this one in the past (a year or so ago).  I 
> just flat out blocked the IP altogether (not even specific to 
> /bitstream/) via this Apache configuration:
> 
> 
>  Order Allow,Deny
> 
>  Deny from {malicious ip}
> 
>  Allow from all
> 
> 
> This looks similar to your config though (except it blocks all access 
> from that IP).
> 
> - Tim
> 
> George Kozak wrote:
> > Hi...
> > 
> > I am having a problem with an IP that keeps sending thousands of "GET 
> > /bitstream/..." requests for the same item.
> > 
> > I have placed the following in my Apache.conf file:
> > 
> > 
> > Options Indexes FollowSymLinks MultiViews
> > AllowOverride All
> > Order allow,deny
> > allow from all
> > deny from {malicious ip}
> > 
> > 
> > I also placed the following in my server.xml in Tomcat:
> > <Valve className="org.apache.catalina.valves.RemoteAddrValve"
> > deny="xxx\.xxx\.xxx\.xx" />
> > 
> > However, this person still seems to be getting through.  My java 
> > process is running from 50%-80% CPU usage.  Does anyone have a good 
> > idea on how to shutout a malicious IP in DSpace?
> > 
> > ***
> > George Kozak
> > Coordinator
> > Web Development and Management
> > Digital Media Group
> > 501 Olin Library
> > Cornell University
> > 607-255-8924
> > ***
> > [EMAIL PROTECTED] 
> > 
> > 


Re: [Dspace-tech] Blocking a malicious user

2007-10-31 Thread Cory Snavely
It's probably worth saying that if you run postgres and dspace on the
same server, you can completely block postgres at the firewall
(iptables) level.
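A sketch of that lockdown, assuming a stock iptables setup (the rule and the postgresql.conf line below are illustrative, not DSpace-specific guidance):

```shell
# Reject remote connections to the default Postgres port; DSpace on the
# same host still connects via 127.0.0.1 or the Unix socket.
iptables -A INPUT -p tcp --dport 5432 ! -s 127.0.0.1 -j DROP
```

Setting listen_addresses = 'localhost' in postgresql.conf achieves much the same thing at the database level.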

On Wed, 2007-10-31 at 12:51 -0500, Thornton, Susan M. (LARC-B702)[NCI
INFORMATION SYSTEMS] wrote:
> You can block ip addresses at the postgreSQL level in the pg_hba.conf
> file.  Here is a person I blocked by ip address who was sending all
> kinds of GET requests to our DSpace server:
> 
> host    all    all    malicious.ip    255.255.255.255    reject
> 
> Sue Walker-Thornton
> NASA Langley Research Center
> ConITS Contract
> 757-224-4074
> [EMAIL PROTECTED]
> 
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Mika
> Stenberg
> Sent: Wednesday, October 31, 2007 6:00 AM
> To: dspace-tech@lists.sourceforge.net
> Subject: Re: [Dspace-tech] Blocking a malicious user
> 
> We've had problems like that as well. Blocking specific IPs works only for
> a while, since many bots and spammers seem to change their IP frequently. We
> didn't come up with a decent solution for this, but blocking an entire
> country of origin for a period of time has been on my mind. Limiting the
> allowed requests per timeslot for a specific IP might also do the trick.
> 
> -Mika

Re: [Dspace-tech] Blocking a malicious user

2007-11-01 Thread Cory Snavely
It has an effect if your Postgres instance isn't blocked at the
firewall, and people are actually trying to access it. Which they will,
unless you block them. As I said, probably much safer to block at the
firewall level--better protection from DoS as well.

On Thu, 2007-11-01 at 08:51 +, Stuart Lewis [sdl] wrote:
> Hi Sue,
> 
> pg_hba.conf only controls who can communicate with Postgres, not who can
> communicate with DSpace.
> 
> Normally it is only 'applications' (e.g. DSpace) that talk to Postgres,
> not users.
> 
> A user talks to DSpace, who in turn talks to Postgres. Postgres has no
> idea or interest in the IP address of the user who is using DSpace, only
> that of the DSpace application.
> 
> Therefore adding a malicious IP address to that config file will sadly
> have no effect. You have to block users higher in the stack, either at
> the application level (apache or tomcat directives), or at the network
> level (firewall changes).
> 
> Thanks,
> 
> 
> Stuart
> _
> 
> Gwasanaethau Gwybodaeth  Information Services
> Prifysgol Aberystwyth  Aberystwyth University
> 
> E-bost / E-mail: [EMAIL PROTECTED] 
>  Ffon / Tel: (01970) 622860
> _

Re: [Dspace-tech] "Too many open files" error

2007-12-20 Thread Cory Snavely
You need to read the full thread. This turned out to be a scalability
problem in how the indexing routine worked. IOW you could bloat the
kernel fd data structure as much as you want, and sooner or later as
your repository grows you will hit the limit, because the number of
open files grew linearly with the size of the repository.

I would suspect the patch is in the main trunk now, but others will
probably know more specifics.

Cory Snavely
University of Michigan Library IT Core Services
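One way to watch for this kind of descriptor leak (a sketch; the pgrep pattern assumes Tomcat is the only matching process on the box):

```shell
# Count descriptors currently held by the Tomcat JVM and compare against
# the per-process limit (commonly 1024).
pid=$(pgrep -f tomcat | head -n 1)
ls "/proc/$pid/fd" | wc -l
# On newer kernels, the process's own limit is visible here:
grep 'open files' "/proc/$pid/limits"
```

A count that climbs steadily toward the limit, as described below, points at unreleased handles rather than legitimate load.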

On Thu, 2007-12-20 at 17:04 +1030, Steve Thomas wrote:
> We're having a lot of trouble currently with tomcat crashing with a
> "Too many open files" error. This is happening roughly twice a day --
> I am restarting tomcat every morning, and usually get a call around
> lunch time that it has crashed and need to restart again.
>  
> Restart is quick and fixes the problem, temporarily, but naturally I
> wish it didn't happen at all.
>  
> I did some digging and found a thread from this list from last year,
> which petered out without apparent resolution, but where Mark Diggory
> suggested tinkering with the fd.files-max value in sysctl.conf.
>  
> [ 
> http://sourceforge.net/mailarchive/message.php?msg_id=E1Grhzs-Nj-JN%40mail.sourceforge.net
>  ]
>  
> Well, I tried that, but it made no difference. So, back to Google,
> where I found (searching for "files-nr") that you can list all the
> open file handles used by a process, using
>  
> # ls -l /proc/PID/fd/
>  
> where PID is the process id.
>  
> So using this with the pid for the DSpace tomcat, I found lots of
> items like this:
>  
> lr-x--  1 uals uals 64 Dec 20 16:30 237
> -> /data/dspace/search/_vzb.cfs (deleted)
>  
> This is a symlink to one of the lucene index "overflow" files, which
> [in my limited understanding] are dynamically created and deleted as
> the index grows. These "deleted" items increase in number over time,
> and I imagine DSpace eventually hits the ulimit for open files (1024)
> and dies.
>  
> So I think the problem may be due to the lucene indexing not releasing
> file descriptors when they are deleted. Certainly, watching the list
> over an hour I've seen the number of "deleted" lines rise steadily. I
> guess we're noticing this as a problem here because of the very large
> amount of editing work we're engaged in currently. Other sites with a
> more "sedate" use of DSpace might never run into it.
>  
> Well, that's how it looks to me right now. Nothing I can do about it,
> but maybe someone expert in the lucene side of DSpace could look into
> it?
>  
>  
> Cheers. :D
>  
> Stephen Thomas,
> Senior Systems Analyst,
> University of Adelaide Library
> UNIVERSITY OF ADELAIDE SA 5005 AUSTRALIA
> Phone: +61 8 830 35190
> Fax: +61 8 830 34369
> Email: [EMAIL PROTECTED]
> URL: http://www.adelaide.edu.au/directory/stephen.thomas
> CRICOS Provider Number 00123M
> 
> ---
> This email message is intended only for the addressee(s) and contains
> information that may be confidential and/or copyright. If you are not
> the intended recipient please notify the sender by reply email and
> immediately delete this email. Use, disclosure or reproduction of this
> email by anyone other than the intended recipient(s) is strictly
> prohibited. No representation is made that this email or any
> attachments are free of viruses. Virus scanning is recommended and is
> the responsibility of the recipient.
> 
> 
>  
> -
> SF.Net email is sponsored by:
> Check out the new SourceForge.net Marketplace.
> It's the best place to buy or sell services
> for just about anything Open Source.
> http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
> ___ DSpace-tech mailing list 
> DSpace-tech@lists.sourceforge.net 
> https://lists.sourceforge.net/lists/listinfo/dspace-tech



[Dspace-tech] tomcat/jetty/resin

2008-03-14 Thread Cory Snavely
We're upgrading our DSpace server and taking another look at what
servlet engine we should use.

Has anyone done research/comparison and ended up particularly passionate
about their choice? I would be interested in objective benefits of one
over another, and I suspect others would too.

Cory Snavely
University of Michigan Library IT Core Services


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] Hardware recommendations

2008-05-22 Thread Cory Snavely
Our experience is markedly different, and I'm particularly struck by the
comment about 15Krpm disk.

We use 15Krpm SCSI disk in RAID 1 for Postgres, but we use SATA RAID 6
for the assetstore and do not see DSpace I/O-bound against it. (FYI
we're working on transitioning our assetstore to a Sun Honeycomb.)

I'm not sure your units are correct for the assetstore--7GB?--but if so,
sure, a little space on your 15Krpm disk is fine. If you mean 7TB, then
you're talking considerable expense over SATA that I am pretty confident
would *not* give any benefit. I/O against the assetstore is low-volume
sequential read and occasional high-volume sequential write, and
that is 100% consistent with "tier 3" type storage products and what
SATA RAID does best. In fact I would think it would be perfectly
reasonable to put the assetstore on NAS.

FWIW we are currently deploying a replacement server for DSpace; it's a
dual quad-core Xeon server w/ 8GB RAM with 15Krpm internal SAS disk in
RAID 1 and 9TB SATA RAID 6. Our experience has shown that we need to
handle heavy multiprocessing Postgres load as well as large memory
allocation and that aside from Postgres storage I/O requirements are
relatively light.

Cory Snavely
University of Michigan Library IT Core Services

On Thu, 2008-05-22 at 14:54 +0200, Bram Luyten wrote:
> Hello Jim,
> 
> we have installed and manage some very large DSpaces, but also a few
> moderate ones. An example of a large installation:
> 
> http://lirias.kuleuven.be holds around 130.000 items, of which only
> around 2500 of them contain full-text. This asset store is around 7GB.
> It's mainly academic research output (papers, conference
> presentations, ...). So currently, the average size of a bitstream is
> around 2.8MB. But this will be very different if your repository is
> oriented towards datasets, audio or video.
> 
> Concerning processing power and memory: the system currently has
> around 1000 unique visitors daily. There are 4000 e-persons. During
> office hours, we experience an average of 4 concurrent logged-in
> users, performing submissions, etc (= intensive on database and
> indexes). The tomcat has been given 2GB of memory, while the whole
> system has 3.5GB of RAM. This doesn't really cover the total load, as
> we use swapping, but it's enough to keep the system running. A
> recommendation: don't cut down on disk speed, you will need 15.000 rpm
> disks.
> 
> The server has 1 physical dual-core processor, with hyper-threading
> (so actually 4 virtual processors). In peak times, this is becoming a
> bottleneck.
> 
> If you could illustrate the purposes of your installation, or the
> estimated number of users, I could provide you with a specific case
> that more closely matches what you're looking at.
> 
> with kindest regards,
> 
> Bram Luyten
> 
> -- 
> @mire NV
> Romeinse Straat 18
> 3001 Heverlee
> Belgium
> +32 2 888 29 56
> 
> http://www.atmire.com - Institutional Repository Solutions
> http://www.togather.eu - Before getting together, get [EMAIL PROTECTED]
> 
> On Wed, May 21, 2008 at 4:33 PM, Jim Price
> <[EMAIL PROTECTED]> wrote:
> Hello,
> 
> We are currently running a test instance on a Sun Enterprise
> 450 running Solaris 10. We are currently looking into new
> hardware for a production environment.
> We would like to find out if anyone is willing to share their
> experiences with hardware?
> 
> What platform would you recommend for a school starting with
> dspace?
> What are you using for hardware?
> What do you store?
> How big is your data storage?
> How much growth did you see in your first year of use?
> What kind of loads are you experiencing?
> 
> 
> Thanks,
> Jim
> 
> 
> 
> 

Re: [Dspace-tech] connections to db seem to be getting stuck

2007-01-19 Thread Cory Snavely
Note that this error is not referring to the Postgres connections
themselves, but the connection pool within DSpace from which the
database connections are allocated. Postgres is blissfully ignorant of
the problem, and I believe we'd see this problem even if we tripled the
number of connections.

At one point we did see the number of Postgres connections being
exhausted because I hadn't "done the math" on how many DSpace instances
we're running and configured Postgres accordingly, but as soon as I
raised the limit to account for them, that problem went away.

What we are observing now is much more like a database connection pool
leak of some kind. Little by little, apparently after aggressive hits,
Postgres connections go into a permanent "idle in transaction" state,
and eventually all of the pool is used up. A restart of Tomcat or
Postgres will free the connections.

Apparently "idle in transaction" is Postgres waiting on the client
mid-transaction. We don't seem to see hangs on database activity
manifested in the web interface, which makes me suspect there is not a
problem with queries completing successfully but rather something more
insidious in how the pool is managed--maybe the "idle in transaction"
state is caused by some sort of race condition when an active
connection in the pool is assigned to another running thread.

For the moment, I have installed a dirty little crontab entry that runs
this on the minute:

/usr/bin/test `/usr/bin/pgrep -f 'idle in transaction' | \
  /usr/bin/wc -l ` -gt 20 && /usr/bin/pkill -o -f 'idle in transaction'

In English: every minute, if there are more than 20 "idle in
transaction" Postgres processes, it kills the oldest one.
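[Editorial note: the one-liner above reads more clearly expanded into a script. The pattern and the threshold of 20 come from the crontab entry; the script form and printed messages are invented for illustration.]

```shell
#!/bin/sh
# Expanded sketch of the crontab one-liner above.
PATTERN='idle in transaction'
THRESHOLD=20
COUNT=$(pgrep -f "$PATTERN" | wc -l)
if [ "$COUNT" -gt "$THRESHOLD" ]; then
    # pkill -o kills only the oldest match; -f matches full command lines
    pkill -o -f "$PATTERN"
    echo "killed oldest matching process ($COUNT were running)"
else
    echo "$COUNT matching processes; below threshold, nothing to do"
fi
```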

Cory Snavely
University of Michigan Library IT Core Services

On Fri, 2007-01-19 at 11:58 -0500, Mark Diggory wrote:
> What about postgres? How many connections is it making available?  
> You'll want to roughly multiply it by the number of webapplications  
> your running, so for instance
> 
> db.maxconnections = 50
> db.maxwait = 5000
> db.maxidle = 5
> 
> running dspace.war, dspace-oai.war and dspace-srw.war, postgres needs  
> about 150 connections in its postgresql.conf.  I usually increment that  
> by one for cron jobs as well:
> 
> 
> for instance in my current config we run two virtual hosts with 3  
> webapps each and 1 set for crons:
> 
> 2 vhosts * (3 webservices + 1 cron) * 50 in pool = 400
> 
> > #- 
> > --
> > # CONNECTIONS AND AUTHENTICATION
> > #- 
> > --
> >
> > max_connections = 400
> > # note: increasing max_connections costs ~400 bytes of shared  
> > memory per
> > # connection slot, plus lock space (see  
> > max_locks_per_transaction).  You
> > # might also need to raise shared_buffers to support more connections.
> 
> It's not a hard-and-fast rule; we never really exhaust that many  
> connections in one instance, but somewhere between that and the  
> default "100" there is a sweet spot.
> 
> -Mark
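[Editorial note: the sizing rule quoted above works out as follows, with the poster's own values. Shell arithmetic sketch for illustration.]

```shell
# max_connections sizing from the rule above:
# vhosts * (webapps + one cron slot) * pool size per webapp
VHOSTS=2; WEBAPPS=3; CRON=1; POOL=50
echo $(( VHOSTS * (WEBAPPS + CRON) * POOL ))   # prints 400
```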
> 
> On Jan 19, 2007, at 11:43 AM, Jose Blanco wrote:
> 
> > Actually I mean, more frequently today.  Sorry about that.
> >
> > -Original Message-
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of  
> > Jose Blanco
> > Sent: Friday, January 19, 2007 11:42 AM
> > To: 'Dorothea Salo'
> > Cc: dspace-tech@lists.sourceforge.net
> > Subject: Re: [Dspace-tech] connections to db seem to be getting stuck
> >
> > It was dying on us a couple of times a week, but for some reason,  
> > it's dying
> > more frequently this week.  Could you share your config db parameters.
> > Right now I have the default settings.
> >
> > -Original Message-
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of  
> > Dorothea
> > Salo
> > Sent: Friday, January 19, 2007 11:28 AM
> > Cc: dspace-tech@lists.sourceforge.net
> > Subject: Re: [Dspace-tech] connections to db seem to be getting stuck
> >
> > Jose Blanco wrote:
> >> So what do you do?  Restart tomcat all day long?  For some reason,  
> >> it is
> >> happening very frequently today.  It's making the system kind of  
> >> unusable
> >> when every 30 minutes to an hour tomcat has to be restarted.
> >
> > That often? Wow. It dies on us a couple of times a week, and not
> > always for
> > this reason as best I can tell.

Re: [Dspace-tech] connections to db seem to be getting stuck (Dorothea Salo)

2007-01-19 Thread Cory Snavely
That's also something that occurred to me, although I've shied away from
an upgrade because I haven't seen a good example of how to dump and
restore a DSpace database. As I recall there are issues about objects
and special instructions that must be given in order to preserve them.

If anyone has a good known-working example of how to do that, it would
be much appreciated.

Cory Snavely
University of Michigan Library IT Core Services

On Fri, 2007-01-19 at 15:40 -0500, Mark H. Wood wrote:
> What version of Postgres are you all running?  We used to have these
> problems all the time.  I think things got a lot better when I moved
> from the dusty old 7.4 recommended in the docs to 8.0.something and a
> recent JDBC driver.  (I'm actually using the 8.1 driver.)
> 
> -
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys - and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> ___ DSpace-tech mailing list 
> DSpace-tech@lists.sourceforge.net 
> https://lists.sourceforge.net/lists/listinfo/dspace-tech




Re: [Dspace-tech] connections to db seem to be getting stuck (Dorothea Salo)

2007-01-19 Thread Cory Snavely
This sounds spot-on to me, John.

Does anybody know where DSpace's version of commons-dbcp.jar comes from?
It's dated September 12, 2004 (!), whereas most of the other JARs in
there look pretty recent.

On Fri, 2007-01-19 at 16:23 -0500, John Preston wrote:
> I've been logging a problem for a while now with no response, where by
> a lot of connection and connection pool exceptions were occuring, and
> I tried every JDBC driver out there with no solution.
> 
> I then replaced the apache pooling package with another database
> connection pooling package (dbpool If I remember correctly) and
> changed some code in dspace and I don't get these messages any more. I
> haven't stress tested the setup so I didn't log this in the mailing
> list but maybe someone can see if maybe there is an issue with the
> apache connection pooling setup in the stock dspace. 
> 
> John
> 
> On 1/19/07, Cory Snavely <[EMAIL PROTECTED]> wrote:
> That's also something that occurred to me, although I've shied
> away from
> an upgrade because I haven't seen a good example of how to
> dump and
> restore a DSpace database. As I recall there are issues about
> objects 
> and special instructions that must be given in order to
> preserve them.
> 
> IF anyone has good known-working example for how to do that,
> that would
> be much appreciated.
> 
> Cory Snavely
> University of Michigan Library IT Core Services 
> 
> On Fri, 2007-01-19 at 15:40 -0500, Mark H. Wood wrote:
> > What version of Postgres are you all running?  We used to
> have these
> > problems all the time.  I think things got a lot better when
> I moved
> > from the dusty old 7.4 recommended in the docs to
> 8.0.something and a
> > recent JDBC driver.  (I'm actually using the 8.1 driver.)
> >
> >
> 


Re: [Dspace-tech] need some suggestions plzzzzz

2007-02-13 Thread Cory Snavely
We looked into using a single naming authority for items in DSpace and
not in DSpace, and it's problematic because DSpace essentially has
naming authority for submitted items. It would be difficult to predict
its naming and work around it.

So we have a main naming authority and a DSpace sub-naming authority off
that. It's no big deal.

If you were really really tied to having one, you could in theory create
handles that were pointers into DSpace either using the DSpace handle
resolution mechanism, or not. Note that you would have to customize the
link generation in DSpace where it provides a bookmarkable URL to the
user. I'm not sure how you would tell DSpace what the externally-created
identifier is, though. It sounds messy.

In my estimation, it's much easier to accept the fact that DSpace is a
relatively self-contained system that creates and resolves its own
identifiers.

Cory Snavely
University of Michigan Library IT Core Services

On Tue, 2007-02-13 at 10:05 -0600, Krishna wrote:
> Hello everyone,
> 
> I need some suggestions. We are trying to integrate DSpace to a system
> which already uses handle system. If we want to use DSpace to store
> the data which also uses internal handle system, how do we do it. we
> would like to use only the handles which we already have and not the
> handles that DSpace uses . Is there any place in DSpace(may be
> metadata) to store the handle identifier generated by our system and
> use these handles to retrieve the data from the DSpace repository.
> Thanking you all'
> 
> Krishna
> 
> -
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier.
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> ___
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech




Re: [Dspace-tech] How to configure Postfix...??

2007-02-15 Thread Cory Snavely
sendmail is one of the most arcane Unix systems known to exist. It is
also extremely popular and ubiquitous. Choose it if you want to impress
your nerdy friends.

postfix is much simpler to configure. Nobody could possibly disagree
with that.

There are others. Debian systems install with exim, for example.

As others have mentioned, the distro you choose should give you a working
MTA configuration out of the box, and you probably don't even need to
know what it is. Your first order of business should be finding that
feature and employing it.

Cory Snavely
University of Michigan Library IT Core Services

On Fri, 2007-02-16 at 00:20 +0530, Sahil Dave wrote:
> Well, I have never configured an MTA before, so I need some good
> info. Which do you think is better supported: sendmail or postfix?
> 
> 
> On 2/15/07, James Rutherford <[EMAIL PROTECTED]> wrote:
> apologies for sending this twice. in future, make sure you
> 'reply-all'
> on the mailing list emails so that your responses go back to
> the list.
> 
> cheers,
> 
> jim.
> 
> On 15/02/07, James Rutherford < [EMAIL PROTECTED]> wrote:
> > On 14/02/07, Sahil Dave <[EMAIL PROTECTED]> wrote:
> > > yes i am running Mandriva 2007.. but i need to deploy
> Dspace on RHEL  4 - ES 
> > > in my Library...
> > > what all changes do i need to make to the postfix & DSpace
> config.
> > > files??
> >
> > RHEL4 will probably have sendmail setup and configured
> already. You 
> > can check to see if it is by running (as root) lsof -i
> tcp:25
> >
> > you should see something like the following if it is
> running:
> >
> > [EMAIL PROTECTED] ~]# lsof -i tcp:25
> > COMMAND   PID USER   FD   TYPE DEVICE SIZE NODE NAME 
> > sendmail 2995 root3u  IPv4   6365   TCP
> > localhost.localdomain:smtp (LISTEN)
> >
> > If this is the case, you just need to configure the mail
> server in
> > your dspace.cfg to be localhost, and add the username and
> password as 
> > required for the sendmail configuration. Note that if you're
> running
> > sendmail purely for your DSpace repository, you should
> configure your
> > firewall to block external connections to port 25 to avoid
> being used 
> > as a relay.
> >
> > There is nothing special about DSpace SMTP requirements, so
> for
> > whichever software you use, you should be able to find ample
> > documentation and sample configuration files. I'm afraid I
> don't 
> > really know much about postfix, but I do know that it is a
> > well-documented project, so you should have no problems
> using it if
> > you really want to.
> >
> > Jim.
> 
> 
> 
> -- 
> Sahil
> MCA(SE)
> USIT 
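[Editorial note: following the advice quoted above, the relevant dspace.cfg entries would look roughly like this. Property names are recalled from DSpace 1.x configuration and the values are placeholders; check your dspace.cfg for the exact keys.]

```
# dspace.cfg mail settings (sketch; values are placeholders)
mail.server = localhost
mail.server.username =
mail.server.password =
mail.from.address = dspace-noreply@example.edu
mail.feedback.recipient = dspace-admin@example.edu
```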


Re: [Dspace-tech] Data integrity/preservation issues and mirroring development-production servers

2007-02-20 Thread Cory Snavely
This illustrates the importance of NOT confusing *replication* for
redundancy, whether that be rsync, LOCKSS, something SAN-based, etc,
with *backups* for version retention, whether that be conventional
weekly-full/daily-incr, snapshots, CDP, etc.

(It also illustrates the importance of validating checksums regularly!)

This is the kind of thing Mark was getting at. SDR guidelines and good
preservation policies should require redundancy for availability and/or
disaster recovery, checksums (and periodic validation!) for integrity
purposes, and backups for protection against human error and/or for
disaster recovery. HOWEVER, implementing those things in a way that
serves their preservation goals requires a sysadmin who understands
those preservation goals. For example, backup or snapshot retention
should ideally be at least twice the checksum-validation interval, so
that if a validation error is detected, you have at least two previous
copies to go back to.
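[Editorial note: a minimal sketch of deposit-time checksum recording and later validation. Paths and messages are invented for illustration; in practice DSpace's own checksum checker would normally handle this.]

```shell
#!/bin/sh
# Record a checksum at deposit time, then re-verify it on a later run.
DEMO=$(mktemp -d)
echo "bitstream data" > "$DEMO/item1.bin"
md5sum "$DEMO/item1.bin" > "$DEMO/manifest.md5"   # deposit time
if md5sum -c --quiet "$DEMO/manifest.md5"; then   # validation run
    echo "integrity OK"
else
    echo "CHECKSUM MISMATCH: restore from a backup older than the corruption"
fi
```

If the mismatch branch fires, the retention rule above guarantees at least two older copies to restore from.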

Ultimately there is a level of detail below which local decisions on
implementation are irrelevant--for example, the architecture of the
backup system--but without some understanding of the preservation goals,
a sysadmin is not guaranteed to make the right decision.

Cory Snavely
University of Michigan Library IT Core Services

On Tue, 2007-02-20 at 09:30 +, Philip Adams wrote:
> Hi,
> 
>  
> 
> Checksums may be reassuring for checking that a file still has
> integrity, but they leave open the question of what to do if the
> checksums do not match. 
> 
>  
> 
> There is a growing movement of people interested in trying to ensure
> that digital preservation techniques exist to overcome this problem.
> One of the most interesting applications to come out of this is LOCKSS
> (Lots of Copies Keeps Stuff Safe) see
> http://www.lockss.org/lockss/Home for details.
> 
>  
> 
> Most of the material archived using LOCKSS so far is from electronic
> journals, with some government papers and the odd blog. LOCKSS acts as
> a store, a proxy and a repairer. If applied to DSpace, it could enable
> a kind of co-operative backup network to develop with copies of
> content from repositories mirrored on a number of LOCKSS boxes. If
> your DSpace was unable to deliver content it could be served up from
> LOCKSS acting as a proxy instead. LOCKSS boxes spend much of their
> time contacting each other to take part in integrity checking polls
> and repairing content where required.
> 
>  
> 
> There is a recent survey of the digital preservation strategies
> available at the moment at
> http://www.clir.org/pubs/reports/pub138/pub138.pdf. De Montfort
> University is taking part in the UK LOCKSS Pilot programme:
> http://www.jisc.ac.uk/whatwedo/programmes/programme_preservation/programme_lockss.aspx.
> 
>  
> 
> Perhaps repository owners could use LOCKSS in either public or private
> networks to look after the digital preservation aspects of managing
> their content.
> 
>  
> 
> Regards,
> 
> Philip Adams
> 
> Senior Assistant Librarian (Electronic Services Development)
> 
> De Montfort University Library
> 
> 0116 250 6397
> 
>  
> 
> 


Re: [Dspace-tech] Load balancing / clustering

2007-03-08 Thread Cory Snavely
I'm not clear on why you would want load-balancing both in *front* of
Apache *and between* Apache and Tomcat. In particular I would think if
you had the former you would not benefit from the latter. I guess you're
concerned about Tomcat failing independently of Apache. In my case, I've
just eliminated Apache from the picture.

At any rate, re: the assetstore, if you want a load-balanced
environment, I am quite sure that real-time synchronization is
necessary. Even with an hourly rsync--problematic at best with a large
repository, BTW--a deposit on one instance and a subsequent attempted
retrieval of it on the other would cause issues. There are a number of
ways to share a file system among several servers but I would think that
the most accessible would be any reasonable NAS storage backend
depending on your existing storage infrastructure.

Make sure you run the indexer on only one instance.

I run two regular handle servers redundantly, not against DSpace, but
against MySQL with bidirectional MySQL replication. The folks at CNRI
helped me work through the issues involved, which mainly involved having
a shared private key between the two and making sure that the two
servers were configured as masters so they did not try to use handle
replication. I would think that redundant handle servers operating
against DSpace (that is, DSpace methods for Postgres or MySQL access)
would be about the same thing--just making sure that the handle server
configurations are identical on each server.

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-03-07 at 11:52 -0800, Ryan Ordway wrote:
> I have been digging around to find information about sites using load
> balancing and/or clustering with their Dspace installations. All I could
> find was mention of load balancing web requests to multiple Tomcat instances
> using mod_jk.
> 
> First some background, and then my question:
> 
> What I am looking to do is put my Dspace web servers behind my load balancer
> to balance the HTTP requests. The web servers then both load balance their
> Tomcat connections via mod_jk to each other, with their own instances being
> weighted heavier so that they will prefer localhost.
> 
> For the database, for now I'm just using a single Postgres instance. I'm
> hoping to get Dspace ported to MySQL to take advantage of my existing MySQL
> cluster. 
> 
> My question is, are there any issues to watch for? Will just rsync'ing the
> assetstore between the two web/app servers suffice? Are there any issues
> with running multiple handle servers?
> 
> Thanks,
> 
> Ryan
> 
> --
> Ryan Ordway  E-mail:   [EMAIL PROTECTED]
> Unix Systems Administrator [EMAIL PROTECTED]
> OSU Libraries, Corvallis, OR 97370Office: Valley Library #4657
> 
> 
> 


Re: [Dspace-tech] Load balancing / clustering

2007-03-08 Thread Cory Snavely
On Thu, 2007-03-08 at 11:54 -0800, Ryan Ordway wrote:
> On 3/8/07 4:54 AM, "Cory Snavely" <[EMAIL PROTECTED]> spake:
> 
> > At any rate, re: the assetstore, if you want a load-balanced
> > environment, I am quite sure that real-time synchronization is
> > necessary. Even with an hourly rsync--problematic at best with a large
> > repository, BTW--a deposit on one instance and a subsequent attempted
> > retrieval of it on the other would cause issues. There are a number of
> > ways to share a file system among several servers but I would think that
> > the most accessible would be any reasonable NAS storage backend
> > depending on your existing storage infrastructure.
> 
> I am also trying to avoid single points of failure. These hosts are both
> connected to a SAN, but want both hosts to have a copy of the data.
> 
> I'm considering some form of on-demand synchronization, in addition to
> scheduled synchronization. For instance, when a new item is added having it
> trigger a synchronization to push the new data to the other node.
> 
> Rsync is quite speedy. :-)

Well, whether your storage backend is a single point of failure depends
largely on its architecture. If you use dual pathing, dual active-active
controllers, etc, and some reasonable RAID level I would not at all
consider it to be a single point of failure.

If you still favor the idea of two separate storage systems, I think you
are heading down the road of bi-directional, real-time replication in
order to really do it right. I am of the opinion that most any system
reliant on crawling across large filesystems on a regular basis is
unacceptable at a large scale. I have also seen rsync require huge
amounts of memory at large scale. Lastly, bidirectionality is an issue
that gets complicated if you allow objects to be removed from your
repository (consider whether you would use the --delete flag, and how
a new submission on one system looks like a deletion to the other).

That said, if you rig up something to trigger a push to the other site,
you'll probably be able to get it to work...but it's really work that
could be achieved at the file system layer.
 
> > Make sure you run the indexer on only one instance.
> 
> Good to know!
>  
> > I run two regular handle servers redundantly, not against DSpace, but
> > against MySQL with bidirectional MySQL replication. The folks at CNRI
> > helped me work through the issues involved, which mainly involved having
> > a shared private key between the two and making sure that the two
> > servers were configured as masters so they did not try to use handle
> > replication. I would think that redundant handle servers operating
> > against DSpace (that is, DSpace methods for Postgres or MySQL access)
> > would be about the same thing--just making sure that the handle server
> > configurations are identical on each server.
> 
> What is the benefit to using the handle server with MySQL? What needs to be
> done to Dspace to get it to use the MySQL data rather than using the Dspace
> methods?

It won't apply here. To resolve handles in DSpace, you have to configure
the handle server to run against the DSpace metadata store through Java
methods.

My point with that was simply to say that handle servers can run in an
active-active load-balancing mode, but they need to both believe they
are masters and they need to use the same private key.

c

> > On Wed, 2007-03-07 at 11:52 -0800, Ryan Ordway wrote:
> >> I have been digging around to find information about sites using load
> >> balancing and/or clustering with their Dspace installations. All I could
> >> find was mention of load balancing web requests to multiple Tomcat 
> >> instances
> >> using mod_jk.
> >> 
> >> First some background, and then my question:
> >> 
> >> What I am looking to do is put my Dspace web servers behind my load 
> >> balancer
> >> to balance the HTTP requests. The web servers then both load balance their
> >> Tomcat connections via mod_jk to each other, with their own instances being
> >> weighted heavier so that they will prefer localhost.
> >> 
> >> For the database, for now I'm just using a single Postgres instance. I'm
> >> hoping to get Dspace ported to MySQL to take advantage of my existing MySQL
> >> cluster. 
> >> 
> >> My question is, are there any issues to watch for? Will just rsync'ing the
> >> assetstore between the two web/app servers suffice? Are there any issues
> >> with running multiple handle servers?
> 
> 


-

Re: [Dspace-tech] MediaFilter clarification

2007-03-29 Thread Cory Snavely
Not all PDF files are created equal. They contain different internals.

To use Adobe's taxonomy of PDF files, you have essentially three types:

* "Formatted Text and Graphics": These are typically created from
applications like Word, are structurally equivalent to PostScript, and
somewhat analogous to vector images. Text is represented internally as
text, and visual markup gives the document its look. You can use the
search tool in Acrobat Reader to search for text in these, and likewise,
Lucene can index them.

* "Image Only": These are typically created from basic scanning
applications, and are essentially bitmap images embedded in a PDF data
structure. These are not searchable or indexable by any means because
they contain no text identifiable as such.

* "Searchable Image": These are typically created from scanning
applications that have built-in OCR functionality, or by post-processing
"Image Only" PDFs. OCR is done on the image component and stored with
coordinate information within the PDF data structure. These are
searchable in Acrobat Reader and indexable by Lucene. Acrobat Reader,
because of the coordinate information, is even able to highlight search
hits in rectangles over the image where the OCRed word was found.

I suspect you have a mix of "Image Only" and "Searchable Image" type PDF
files, given your description of the project.
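A quick way to triage a batch, assuming the xpdf/poppler pdftotext
utility is available (the filename below is a placeholder), is to see
whether any words come out of a file:

```
# "Image Only" PDFs yield little or no text; "Formatted Text and
# Graphics" and "Searchable Image" PDFs yield real words, which is
# essentially what Lucene sees when it indexes them.
pdftotext some-thesis.pdf - | wc -w
```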

We do a large amount of digitization and OCR here. In migrating page
images and OCR from another environment into DSpace, we were not able to
find a good tool to both build PDFs from our existing page images and
embed our existing OCR to create "Searchable Image" PDFs. To get
full-text searching out of the entire body of materials, you are
probably going to have to do what we did, and look at OCR tools that can
operate on "Image Only" PDF files.

Arguably the best tools would be from Adobe. Acrobat Pro can do batch
OCR, essentially conversion of "Image Only" to "Searchable Image".
Acrobat Capture can also do this with greater efficiency and at greater
cost.

You'll want to make sure that whatever process you end up with, you
don't compromise the existing image quality--some tools may try to be
helpful by downsampling the existing images, or applying different
compression levels to them.

Cory Snavely
University of Michigan Library IT Core Services

On Thu, 2007-03-29 at 09:02 -0600, Shawna Sadler wrote:
> A bunch of us in Canada have received theses from Library & Archives 
> Canada (national library) where they created PDFs from microfilmed theses.
> 
> We've loaded them into DSpace and we're noticing very inconsistent 
> behavior with MediaFilter. Some of the theses have extracted text and 
> some have blank .txt files.
> 
> Thesis with successfully extracted text
> https://dspace.ucalgary.ca/handle/1880/25057
> 
> Unsuccessful- blank .txt file
> https://dspace.ucalgary.ca/handle/1880/25028
> 
> Can anyone shed some light on this issue?
> 
> Thanks,
> Shawna
> 
> Shawna Sadler
> Coordinator, Digital Initiatives
> Libraries & Cultural Resources
> University of Calgary
> Phone: (403) 220-3739
> Email: [EMAIL PROTECTED]
> 
> 
> -
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> ___
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech




Re: [Dspace-tech] redirect port 8443 to 80?

2007-04-06 Thread Cory Snavely
For folks listening in with interest, we also use NAT port forwarding to
get around the requirement for mod_jk, but FWIW I haven't determined a
way to close the incoming *actual* Tomcat ports (8080/8443). So that is
a potential downside of this approach, in addition to not having any
real logic like mod_rewrite to apply at that intermediary level.

Mind you, it's not really harmful or vulnerable, it's just a little ugly
to have your actual nonstandard ports all hanging out like that.
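If you'd rather hide those ports at the host itself, an iptables sketch
would be something like the following (rule syntax is mine and untested
here):

```
*filter
# Refuse direct outside connections to Tomcat's native ports. Caution:
# if the NAT device rewrites port 80 to 8080 before delivery, these
# rules will drop that forwarded traffic too, so a source- or
# interface-based match would be needed instead.
-A INPUT -p tcp --dport 8080 -j REJECT
-A INPUT -p tcp --dport 8443 -j REJECT
COMMIT
```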

Cory Snavely
University of Michigan Library IT Core Services

On Fri, 2007-04-06 at 11:56 -0400, Mark Diggory wrote:
> We use Apache, mod_jk and mod_rewrite to deliver the webapplication  
> on port 80 and port 443 as separate VirtualHost entries in Apache  
> httpd. We do not allow direct access to the tomcat server over port  
> 8080 or port 8443.  I can send some more detail of our configuration  
> if you decide to go this route.
> 
> -Mark
> 
> On Apr 6, 2007, at 11:32 AM, James Rutherford wrote:
> 
> > On Thu, Apr 05, 2007 at 09:39:53AM -0600, Zhiwu Xie wrote:
> >> bar, but then when I click the DSpace logo from a secured page  
> >> such as
> >>
> >> https://laii-dspace.unm.edu/password-login
> >>
> >> all the following pages are through https regardless of which the  
> >> page
> >> is, which bothers me.
> >
> > The links used in DSpace are relative, so if you login via https, you
> > will continue with https.
> >
> >> But when I tried to click the dspace logo from the mit dspace page
> >>
> >> https://dspace.mit.edu/password-login
> >>
> >> the request to the https://dspace.mit.edu/ seems to be rerouted to
> >> http://dspace.mit.edu/. So what's the trick?
> >
> > The only reason the MIT site is different is because (I assume) they
> > have some custom configuration elsewhere that redirects https requests
> > to http for normal use. If you try accessing https://dspace.mit.edu  
> > you
> > will be redirected to the unsecured version at http://dspace.mit.edu.
> >
> > cheers,
> >
> > Jim
> >
> > -- 
> > James Rutherford  |  Hewlett-Packard Limited registered  
> > Office:
> > Research Engineer |  Cain Road,
> > HP Labs   |  Bracknell,
> > Bristol, UK   |  Berks
> > +44 117 312 7066  |  RG12 1HN.
> > [EMAIL PROTECTED]   |  Registered No: 690597 England
> >
> 
> ~
> Mark R. Diggory - DSpace Systems Manager
> MIT Libraries, Systems and Technology Services
> Massachusetts Institute of Technology
> Office: E25-131
> Phone: (617) 253-1096
> 
> 
> 




Re: [Dspace-tech] redirect port 8443 to 80?

2007-04-09 Thread Cory Snavely
Right, and that was my initial approach, but it seemed to have the
effect of blocking traffic to port 80.

As I've said, I'm not seeing it as a real problem, but rather just
letting people know that it is an ugliness associated with this (NAT)
approach.

On Sat, 2007-04-07 at 12:26 -0400, Mark Diggory wrote:
> On Apr 7, 2007, at 12:08 PM, Mark H. Wood wrote:
> 
> > On Fri, Apr 06, 2007 at 12:07:44PM -0400, Cory Snavely wrote:
> >> For folks listening in with interest, we also use NAT port  
> >> forwarding to
> >> get around the requirement for mod_jk, but FWIW I haven't  
> >> determined a
> >> way to close the incoming *actual* Tomcat ports (8080/8443).
> >
> > Just don't open them.  In [tomcat]conf/server.xml comment out the
> > Connector with 'port="8080"' and leave commented the one with
> > 'port="8443"'.  You should then only be running AJP 1.3 on 8009 and
> > the shutdown port on localhost:8005.  If you want to limit AJP to the
> > local host, you can add 'address="127.0.0.1"' to the AJP Connector.
> >
> > -- 
> > Mark H. Wood, Lead System Programmer   [EMAIL PROTECTED]
> > Typically when a software vendor says that a product is "intuitive" he
> > means the exact opposite.
> 
> MarkW,
> 
> This would only be the case if they were using mod_jk/Apache. but,  
> they are trying to use NAT/port forwarding and this means those  
> Tomcat ports are what are getting forwarded to. I'd say the quickest  
> solution is to just block those ports from external requests in the  
> NAT/firewall configuration.
> 
> -Mark Diggory
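For reference, the pieces of [tomcat]/conf/server.xml involved look
roughly like this in the Tomcat 5.x era (a from-memory sketch with
attribute lists elided, so check your own file rather than copying
this):

```
<!-- HTTP Connector: leave commented out if Apache is the only front door
<Connector port="8080" ... />
-->

<!-- AJP Connector; address="127.0.0.1" limits it to the local host -->
<Connector port="8009" protocol="AJP/1.3" address="127.0.0.1" />
```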
> 
> ~
> Mark R. Diggory - DSpace Systems Manager
> MIT Libraries, Systems and Technology Services
> Massachusetts Institute of Technology
> 
> 
> 




Re: [Dspace-tech] compilation warning.

2007-04-09 Thread Cory Snavely
Yes, this is 1.6.0, updated after Sun published security bulletins in
January.

On Fri, 2007-04-06 at 21:56 +0100, James Rutherford wrote:
> Hi Jose,
> 
> On Fri, Apr 06, 2007 at 03:11:42PM -0400, Jose Blanco wrote:
> > As of my upgrade to 1.4.1, I get the following warning when I build the war
> > file:
> > 
> > [javac]
> > /l1/dspace/build/prod/dspace/src/org/dspace/app/oai/DIDLCrosswalk.java:55:
> > warning: sun.misc.BASE64Encoder is Sun proprietary API and may be removed in
> > a future release
> 
> Did you update your JDK by any chance? I only see these errors with JDK
> 1.6. As Graham mentioned, they're nothing to worry about, and can be
> resolved with the patch on SourceForge. Cleaning up such warnings is
> definitely on the "todo list" (volunteers?).
> 
> cheers,
> 
> Jim
> 




Re: [Dspace-tech] Assetstore physical storage

2007-04-11 Thread Cory Snavely
There's a whole discussion there about what's the right tool for the
job, but integration with Lucene would be my guess as to the practical
reason. I'd be interested to learn if that, in fact, were not a
constraint.

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-04-11 at 11:30 -0700, Ryan Ordway wrote:
> Is there a reason why only the metadata is stored in the database and not
> the actual assetstore bitstreams? Has anyone considered changing the
> physical storage from the filesystem to the database? I'm working on
> building some redundancy into my infrastructure and it's looking like the
> most efficient way to store the assetstore data in clustered configurations
> would be in the database, especially when your database is already clustered
> across multiple systems. Your database gets much larger, but you don't have
> to worry about keeping your assetstores synchronized, etc.
> 
> Any thoughts? Anyone to blame? ;-)
> 
> Ryan
> 




Re: [Dspace-tech] Large files and DSpace

2007-04-16 Thread Cory Snavely
I'd be interested to know how using SRB addresses the problem, which I 
understand to be the logistics of handling such a large file in both the 
user interface and the back end. Does it?

Cory Snavely
University of Michigan Library IT Core Services

- Original Message - 
From: "Ekaterina Pechekhonova" <[EMAIL PROTECTED]>
To: "Gary Browne" <[EMAIL PROTECTED]>
Cc: 
Sent: Monday, April 16, 2007 8:12 PM
Subject: Re: [Dspace-tech] Large files and DSpace


> Hi Gary,
> you can configure Dspace to use SRB instead of regular assetstore. Some 
> basic information can be found in the docs which come
> with Dspace.Also you can check this link:
> http://wiki.dspace.org/index.php//DspaceSrbIntegration
>
> Kate
>
> Ekaterina Pechekhonova
> Digital Library Programmer/Analyst
> New York University
> Libraries
> email: [EMAIL PROTECTED]
> phone: 212-992-9993
>
> - Original Message -
> From: Gary Browne <[EMAIL PROTECTED]>
> Date: Monday, April 16, 2007 7:41 pm
> Subject: [Dspace-tech] Large files and DSpace
> To: dspace-tech@lists.sourceforge.net
>
>> Hello All
>>
>>
>>
>> I think I posted a question like this last year but I've just become a
>> dad for the first time and have a bit of brain meltdown. I tried
>> searching for answers on the annoying sourceforge list archive (should
>> I
>> start a separate thread about this...?) but didn't find much.
>>
>>
>>
>> My question is a general one in that I'm wondering how people are
>> handling large files in DSpace (getting them onto the server,
>> submissions and publication/access)? Is the SymLink stuff the only
>> option at this point? For example, we have (and will be getting lots
>> more of) a 12GB video file to be used in one of our collections. I'd
>> like to nut out what the possible options are before I try anything.
>>
>>
>>
>> Thanks and kind regards
>>
>> Gary
>>
>>
>>
>>
>>
>> Gary Browne
>> Development Programmer
>> Library IT Services
>> University of Sydney
>> Australia
>> ph: 61-2-9351 5946
>>
>>
>>
>
> 




Re: [Dspace-tech] Large files and DSpace

2007-04-17 Thread Cory Snavely
Interesting thought, but using bittorrent would require the setup of
several peer sites in order to do its thing. Probably a good idea from a
preservation standpoint but I would suspect not practical for many, and
there are of course easier ways to support large transfer demands.

I've thought that if a need emerged for us to handle this type of media,
the DAS we use for the assetstore would be just fine, as of course would
a SAN or NAS arrangement, but submissions would probably need to come on
removable media and be loaded by staff, and distribution would probably
need to be via streaming, which I know has been discussed on this list.
To me those are the indicated approaches for this issue.

c

On Mon, 2007-04-16 at 22:30 -0300, Afonso Comba de Araujo Neto wrote:
> The problem is very intriguing and I felt like giving my 2 cents.
> 
> I don't even think the problem is where you'll put it or how you'll  
> integrate such files to DSpace. The main problem is how a regular user  
> would download such a gigantic file.
> 
> My first try would be to use another technology which is focused on  
> handling such downloads. The best technology I can think of for this  
> kind of thing is  bit torrent. If I had to do that, I would include on  
> DSpace just a .torrent file and instruct the users how to download  
> using the bit torrent protocol (links to free clients, etc.). Not only
> would it be way better than a simple http download, but it could
> alleviate the strain on your server, which certainly would build up  
> with such lengthy downloads.
> 
> 
> Regards,
> Afonso Araujo Neto
> 
> 
> 
> 
> 
> Citando Gary Browne <[EMAIL PROTECTED]>:
> 
> > We have an assetstore residing on a SAN which solves the capacity
> > issues, but as Cory says it is more the logistics of getting items into
> > and out of the assetstore which is the problem.
> >
> > Regards
> > Gary
> >
> >
> > Gary Browne
> > Development Programmer
> > Library IT Services
> > University of Sydney
> > Australia
> > ph: 61-2-9351 5946

Re: [Dspace-tech] Cannot get a connection, pool exhausted

2007-04-18 Thread Cory Snavely
In our experience, this problem appears to be due to a bug somewhere in
freeing connections back to the pool--we tend to see steady linear
growth in the number of 'idle in transaction' connections until we get
this error. These are visible with ps.

Increasing the number of connections in the pool, for us, only delayed
the occurrence of the problem. Ultimately the number of 'idle in
transaction' connections would climb to the max.

We put a workaround in place. This is a root crontab entry:

# kill old 'idle in transaction' postgres processes, leaving up to 10
* * * * * while /usr/bin/test `/usr/bin/pgrep -f 'idle in transaction'
| /usr/bin/wc -l` -gt 10; do /usr/bin/pkill -o -f 'idle in transaction';
done
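Unwrapped into an ordinary script, that one-liner does the following
(the threshold of 10 is our arbitrary choice):

```shell
#!/bin/sh
# Readable equivalent of the crontab entry above.
PATTERN='idle in transaction'

# While more than 10 postgres backends match the pattern, kill the
# oldest matching process (-o) and re-check.
while [ "$(pgrep -f "$PATTERN" | wc -l)" -gt 10 ]; do
    pkill -o -f "$PATTERN"
done

# Number of matching connections left afterward.
remaining=$(pgrep -f "$PATTERN" | wc -l)
```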

At one point I was entertaining a theory that the Apache connection pool
manager delivered with DSpace was a stale version. To date, the
workaround has worked so well that I'm not sure that theory has been
fully explored.

Also, FWIW, there have been lengthy discussions on this list about this
topic already. You would probably find the previous thread useful as I'm
quite sure I'm not retelling everything here.

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-04-18 at 12:13 +0530, Filbert Minj wrote:
> Hi Stuart,
> 
> Thanks very much for the prompt reply.
> Recently we have upgraded it to Dspace 1.4.1 on RHEL 4 using postgres 
> database.
> I made the change in db.maxconnections and I think this should solve the 
> problem.
> 
> I had forgotten, earlier we had the same problem and did exactly what you 
> suggested.
> 
> Cheers,
> 
> --
> Filbert
> 
> - Original Message - 
> From: "Stuart Lewis [sdl]" <[EMAIL PROTECTED]>
> To: "Filbert Minj" <[EMAIL PROTECTED]>; 
> 
> Sent: Wednesday, April 18, 2007 11:32 AM
> Subject: Re: [Dspace-tech] Cannot get a connection, pool exhausted
> 
> 
> > Hi Filbert,
> >
> >> Has any one faced similar problem.
> >>
> >>  WARN  org.dspace.app.webui.servlet.DSpaceServlet @
> >> anonymous:no_context:database_error:org.apache.commons.dbcp.SQLNestedException
> >> :
> >> Cannot get a connection, pool exhausted
> >>
> >> What is solution of this problem.
> >
> > DSpace holds a 'pool' of connections to the database which it reuses. This
> > means it doesn't have the overhead of creating a connection to the 
> > database
> > each time it needs to talk to the database.
> >
> > The error message suggests that all of these connections are in use, and 
> > it
> > has reached the number of connections that you have said it can have. The
> > default set in [connections]/config/dspace.cfg is:
> >
> > db.maxconnections = 30
> >
> > There are two reasons that you might be reaching this limit -
> >
> > 1) Your DSpace is very busy (lots of visitors) and there are not enough
> > connections to cope. If your hardware is large enough to cope with number 
> > of
> > connections, you could think about increasing the number of connections in
> > the pool. (change the number, restart Tomcat).
> >
> > 2) For some reason, DSpace might not be letting go of some old 
> > connections,
> > or they might be stuck in some way. If you are using UNIX and postgres, 
> > you
> > should be able to see the connections, and what they are doing, by running 
> > a
> > 'ps' on them (make sure your screen is wide to see what comes at the
> > end
> > of the line). This might show that the connections are stuck - typical 
> > state
> > might be 'idle in transaction'. This can also happen if connections to the
> > database are not closed properly by DSpace.
> >
> > Which version / operating system / database do you use?
> >
> > I hope this helps,
> >
> >
> > Stuart
> > _
> >
> > Datblygydd Cymwysiadau'r We / Web Applications Developer
> > Gwasanaethau Gwybodaeth / Information Services
> > Prifysgol Cymru Aberystwyth / University of Wales Aberystwyth
> >
> >E-bost / E-mail: [EMAIL PROTECTED]
> > Ffon / Tel: (01970) 622860
> > _
> >
> >
> > -- 
> > This message has been scanned for viruses and
> > dangerous content by MailScanner, and is
> > believed to be clean.
> > 
> 
> 
> 

Re: [Dspace-tech] DSpace a memory hog?

2007-04-18 Thread Cory Snavely
This depends on your definition of a memory hog.

We run a relatively large instance of DSpace and we allocate 512MB to
Tomcat, about 100MB to Postgres, and 256MB for daily indexing runs (via
the dsrun script).

In earlier versions of DSpace the indexing routine needed to be patched
to work around a poor implementation that caused memory allocation to be
linear with repository size. Without that, we were running out of memory
during indexing. I believe that patch is now part of the base.

We run comfortably inside 2G of physical memory. I may have considered
that a memory hog 5 years ago, but today I consider it light.
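Both ceilings are just ordinary JVM heap flags. As a sketch (the file
locations and the exact way the flags are passed are typical
conventions, not taken verbatim from our configuration):

```
# Tomcat heap, e.g. set in the environment before starting Tomcat:
CATALINA_OPTS="-Xmx512m"

# A nightly indexing run with its own, smaller ceiling; whether dsrun
# honors JAVA_OPTS depends on your local copy of the script:
JAVA_OPTS="-Xmx256m" [dspace]/bin/dsrun org.dspace.search.DSIndexer
```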

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-04-18 at 01:01 -0700, Pan Family wrote:
> Hi,
> 
> There is a rumor that says DSpace is a memory hog.
> I don't know where this is from but it may not be that
> important.  What is important is that it makes my
> management nerves.  So I'd like to hear from those
> who know anything about this issue.  Is it really
> a memory hog?  Under what circumstances it
> might become a memory hog?  Or there should
> be no worry about memory usage at all?
> 
> Thanks a lot in advance!
> 
> -Pan 




Re: [Dspace-tech] DSpace a memory hog?

2007-04-18 Thread Cory Snavely
Well, as I said at first, it all depends on your definition of what a
memory hog is. Today's hog fits in tomorrow's pocket. We had better all
be used to that already.

Also, I don't think for a *minute* that the original developers of
DSpace made a casual choice about their development environment--in
fact, I think they made a responsible choice given the alternatives.
Let's give our colleagues credit that's due. Their choice permits
scaling and fits well for an open-source project. Putting the general
problem of memory bloat in their laps seems pretty angsty to me.

Lastly, dedicating a server to DSpace is a choice, not a necessity. We
as implementors have complete freedom to separate out the database and
storage tiers, and mechanisms exist for scaling Tomcat horizontally as
well. In the other direction, I suspect people are running DSpace on
VMware or xen virtual machines, too.

Cory Snavely
University of Michigan Library IT Core Services

On Wed, 2007-04-18 at 13:40 -0500, Brad Teale wrote:
> Pan,
> 
> Dspace is a memory hog considering the functionality the application
> provides.  This is mainly due to the technological choices made by the
> founders of the Dspace project, and not the functional requirements the
> Dspace project fulfills.
> 
> Application and memory bloat are pervasive in the IT industry.  Each
> individual organization should look at their requirements whether they
> are hardware, software or both.  Having to dedicate a machine to an
> application, especially a relatively simple application like Dspace, is
> wasteful for hardware resources and people resources.
> 
> Web applications should _not_ need 2G of memory to "run comfortably".
> 




Re: [Dspace-tech] DSpace a memory hog?

2007-04-19 Thread Cory Snavely
Generally what's going on is that Tomcat, the servlet container, runs a
large Java virtual machine with a substantial amount of memory allocated
to the caching of programs and data for performance.

Depending on your database configuration, there can also be a
substantial amount of allocation to cache in Postgres too.

The indexer is a periodic process that does not run constantly. You
still must account for the amount of memory it consumes while running.
Memory requirements for recent versions of the indexing routine are of
constant order, meaning they do not vary appreciably with repository
size.

On Wed, 2007-04-18 at 18:09 -0700, Pan Family wrote:
> Thank you all for giving your opinion!
> 
> Technically, is it the web application or the indexer that requires 
> most of the memory?  What data is kept in memory all the time
> (even when nobody is searching)?  Is the memory usage proportional
> to the number of concurrent sessions?
> 
> Thanks again,
> 
> Pan
> 
> 
> 
> 
> On 4/18/07, Cory Snavely <[EMAIL PROTECTED]> wrote:
> Well, as I said at first, it all depends on your definition of what a
> memory hog is. Today's hog fits in tomorrow's pocket. We better all
> already be used to that.
> 
> Also, I don't think for a *minute* that the original developers of
> DSpace made a casual choice about their development environment--in
> fact, I think they made a responsible choice given the alternatives.
> Let's give our colleagues credit that's due. Their choice permits
> scaling and fits well for an open-source project. Putting the general
> problem of memory bloat in their laps seems pretty angsty to me.
> 
> Lastly, dedicating a server to DSpace is a choice, not a necessity. We
> as implementors have complete freedom to separate out the database and
> storage tiers, and mechanisms exist for scaling Tomcat horizontally as
> well. In the other direction, I suspect people are running DSpace on
> VMware or Xen virtual machines, too.
> 
> Cory Snavely
> University of Michigan Library IT Core Services 
> 
> On Wed, 2007-04-18 at 13:40 -0500, Brad Teale wrote:
> > Pan,
> >
> > Dspace is a memory hog considering the functionality the application
> > provides.  This is mainly due to the technological choices made by the
> > founders of the Dspace project, and not the functional requirements the
> > Dspace project fulfills.
> >
> > Application and memory bloat are pervasive in the IT industry.  Each
> > individual organization should look at their requirements whether they
> > are hardware, software or both.  Having to dedicate a machine to an
> > application, especially a relatively simple application like Dspace, is
> > wasteful for hardware resources and people resources.
> >
> > Web applications should _not_ need 2G of memory to "run comfortably".
> >
> 




[Dspace-tech] srb/s3/etc and lucene

2007-05-03 Thread Cory Snavely
(Apologies if this has been discussed to resolution; after a few
attempts to search the archives, I concluded they are really broken. 500
errors, bad links, etc.)

For those using, interested in, or knowledgeable about using API-based
storage (SRB, S3) as a backend for DSpace: how does doing so affect
full-text indexing? Can anyone describe how, in such a setup, full text
is stored and indexed?

My uneducated impression is that Lucene would want to work only against
a filesystem.

Thanks,
Cory Snavely
University of Michigan Library IT Core Services





Re: [Dspace-tech] srb/s3/etc and lucene

2007-05-03 Thread Cory Snavely
Well, I'm just wondering, in specific terms, if we use an object-based
storage system as an assetstore rather than a filesystem, where the
files that Lucene indexes actually sit.

It's my understanding that in a filesystem-based assetstore, for
example, text is extracted from PDFs and stored in a separate file
*within the assetstore directory* that Lucene crawls. I just don't know
how that sort of thing is handled when using object-based storage.

On Thu, 2007-05-03 at 13:28 -0400, Richard Rodgers wrote:
> Hi Cory:
> 
> Not sure about the limits of Lucene, but I think the larger point is
> that the back-ends are expected only to hold the real content or assets.
> Everything else (full-text indices and the like) is an *artifact* (it
> can be recreated from the assets) that we don't need to manage in the
> same way. If for performance reasons we want to put artifacts where the
> assets are, we can, but there is really no connection between the two
> that the system imposes. 
> 
> Does this get at your question, or did I miss the point?
> 
> Thanks,
> 
> Richard R
> 
> On Thu, 2007-05-03 at 12:13 -0400, Cory Snavely wrote:
> 


-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] srb/s3/etc and lucene

2007-05-04 Thread Cory Snavely
Thanks, but when you say assetstore, I'm not sure if you are referring
to the object-based storage in all cases. I will assume that you are
because of the parenthetical "(s3)".

So, this is what I believe you are saying: when filter-media runs, it
extracts text for formats such as PDF that Lucene can't directly parse
and, using the object-based storage API, places those text bitstreams
alongside the originals; it then uses the same API to fetch the text
back out and feed it to Lucene.

Consequently, nothing is stored in the filesystem except for the
resulting index?

Thanks,
Cory
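The double round-trip summarized above (filter-media writing extracted text
into the object store, then the indexer reading it back out) can be modeled
minimally as follows. These names are hypothetical for illustration; the
actual DSpace code path is Java and goes through its storage API:

```python
# Minimal model of the flow: extracted text is written back into the
# object store, then read out again when the indexer runs.
object_store = {}  # stands in for SRB/S3; key -> stored content

def store_put(key, data):
    object_store[key] = data

def store_get(key):
    return object_store[key]

def filter_media(item_key, extract_text):
    """Extract text from an original bitstream, store it alongside it."""
    original = store_get(item_key)            # read original from store
    text = extract_text(original)
    store_put(item_key + ".txt", text)        # trip 1: write extracted text

def index_item(item_key, index):
    """Fetch the extracted text back out and hand it to the indexer."""
    text = store_get(item_key + ".txt")       # trip 2: read it back
    index[item_key] = text

# Usage: a fake "PDF" whose text extraction is just decoding bytes.
index = {}
store_put("item1.pdf", b"full text of the report")
filter_media("item1.pdf", lambda data: data.decode())
index_item("item1.pdf", index)
# index["item1.pdf"] == "full text of the report"
```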

On Fri, 2007-05-04 at 00:10 -0400, Mark Diggory wrote:
> > 
> > On 5/4/07, Cory Snavely <[EMAIL PROTECTED]> wrote:
> > Well, I'm just wondering, in specific terms, if we use an
> > object-based 
> > storage system as an assetstore rather than a filesystem,
> > where the
> > files that Lucene indexes actually sit.
> 
> 
> It's tricky; this is what FilterMedia is for: it actually extracts the
> text and places it as a bitstream in the assetstore. Lucene full-text
> indexing is done against the assetstore bitstreams in all cases (well,
> except for the metadata table in the database). So ultimately you're
> pushing the text bitstreams into the assetstore (s3) in FilterMedia
> and pulling them back out on Lucene indexing, a double whammy.
> 
> 
> Cheers,
> Mark
> 

Re: [Dspace-tech] srb/s3/etc and lucene

2007-05-04 Thread Cory Snavely
Right--I am trying to get an understanding of all this in very specific
terms.

On Fri, 2007-05-04 at 09:23 -0400, Mark H. Wood wrote:
> There are two questions here:
> 
> 1)  Does the use of a non-filesystem asset store backend affect Lucene's
> output?  One would guess, no, since it doesn't do output to the
> asset store.
> 
> 2)  Does the use of a non-filesystem asset store backend affect
> Lucene's input?  IOW how does Lucene, as used in DSpace, locate
> and gain access to the files it indexes?  If it doesn't go through
> the DSpace storage layer or something equivalent then indexing is
> screwed.
> 
> Ouch!  I hadn't thought about these at all.
> 




Re: [Dspace-tech] srb/s3/etc and lucene

2007-05-04 Thread Cory Snavely
So you are saying that for a format such as PDF, filter-media, during its
traversal of the assetstore backed by e.g. SRB, reads the PDF from SRB,
extracts text, and stores that as a file back in SRB. Then, once its
crawl of the assetstore is done, it reads the extracted text back in
from SRB and indexes it. The index then lives in the filesystem,
specifically within [dspace]/search.

When I refer to transactions against SRB, I am assuming that those are
generic read and write operations in DSpace methods that are calling eg
SRB methods.

Correct? 

Thanks,
Cory

On Fri, 2007-05-04 at 09:46 -0400, Richard Rodgers wrote:
> See notes:
> 
> Quoting Cory Snavely <[EMAIL PROTECTED]>:
> 
> > Right--I am trying to get an understanding of all this in very specific
> > terms.
> >
> > On Fri, 2007-05-04 at 09:23 -0400, Mark H. Wood wrote:
> >> There are two questions here:
> >>
> >> 1)  Does the use of a non-filesystem asset store backend affect Lucene's
> >> output?  One would guess, no, since it doesn't do output to the
> >> asset store.
> Correct - no. Lucene reads the file for indexing through the storage API - it
> therefore has a BitStream, not a location on a storage device.
>
> >> 2)  Does the use of a non-filesystem asset store backend affect
> >> Lucene's input?  IOW how does Lucene, as used in DSpace, locate
> >> and gain access to the files it indexes?  If it doesn't go through
> >> the DSpace storage layer or something equivalent then indexing is
> >> screwed.
> No - for the same reason. It does not circumvent the storage API or make
> any assumptions about where the files with the text to index live.
> >>
> >> Ouch!  I hadn't thought about these at all.
> >>
> Remember, we already support SRB (a non-local filesystem option), and
> indexing works fine.





Re: [Dspace-tech] srb/s3/etc and lucene

2007-05-04 Thread Cory Snavely
That's what I wanted to know--thanks!

On Fri, 2007-05-04 at 14:28 -0400, Richard Rodgers wrote:
> Hi Cory:
> 
> On Fri, 2007-05-04 at 13:52 -0400, Cory Snavely wrote:
> > So you are saying that for a format of eg PDF, filter-media, during its
> > traversal of the assetstore backended on eg SRB, reads the PDF from SRB,
> > extracts text, and stores that as a file back in SRB. 
> 
> Yes. A little more precisely: MediaFilter does not directly traverse the
> backend - rather it examines each Item in the database, then for each
> bitstream in the ORIGINAL bundle of that item, if (1) the format of the
> bitstream (as recorded in the database) has a filter associated with it
> (as is the case with PDF), and (2) the extracted text file has not
> already been created, then it reads the (e.g. PDF) file, using the
> standard API (which hides the actual location of the file), extracts the
> text, and stores - again using the standard API - the text as a file in
> the TEXT bundle of the item.
> 
> > Then, once its
> > crawl of the assetstore is done, it reads the extracted text back in
> > from SRB and indexes it. The index then lives in the filesystem,
> > specifically within [dspace]/search.
> 
> Yes. A little more precisely: as a convenience, by default the indexer
> is invoked after MediaFilter has run (this can be disabled with a
> command-line argument). But indexing also occurs whenever it is run
> directly (e.g. when 'index-all' is run). The index files do live at
> [dspace]/search, which is conventionally a local filesystem but
> certainly may be an NFS mount point, etc.
> > 
> > When I refer to transactions against SRB, I am assuming that those are
> > generic read and write operations in DSpace methods that are calling eg
> > SRB methods.
> 
> Yes, the 'BitstreamStorageManager' exports methods to read, write, etc.
> These constitute the API to which I was alluding.
> 
> Hope this clarifies,
> 
> Richard
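Richard's conditional pass can be modeled like this. All names here are
illustrative only; the real MediaFilter is Java and operates on Items,
bundles, and BitstreamFormat records through the storage API, never by
filesystem path:

```python
# Model: for each item, look at ORIGINAL-bundle bitstreams; if the format
# has a filter and no extracted-text file exists yet, extract the text
# and store it in the TEXT bundle.
FILTERS = {"PDF": lambda data: data.decode()}  # format -> text extractor

def media_filter(items):
    for item in items:
        for name, (fmt, data) in list(item["ORIGINAL"].items()):
            extract = FILTERS.get(fmt)             # (1) format has a filter?
            already_done = name + ".txt" in item["TEXT"]
            if extract and not already_done:       # (2) not yet extracted?
                item["TEXT"][name + ".txt"] = extract(data)

# Usage: one item with a filterable PDF and a non-filterable image.
items = [{"ORIGINAL": {"report.pdf": ("PDF", b"report text"),
                       "logo.gif": ("GIF", b"\x00")},
          "TEXT": {}}]
media_filter(items)
# items[0]["TEXT"] == {"report.pdf.txt": "report text"}
```

Running the pass a second time is a no-op, mirroring check (2): the
extracted-text file already exists, so nothing is redone.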
> 

