Re: [Dspace-tech] DSpace-tech Digest, Vol 35, Issue 41

2009-04-09 Thread Ruth Anjo
Hello all,
Good day. I am new to DSpace and am trying to install it, and it is giving
me the error below. What can I do to correct the error?

symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[352,46] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[405,31] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[406,46] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[456,29] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[457,33] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[479,35] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[481,12] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[651,32] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[651,19] cannot find symbol
symbol  : class Item
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[652,46] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[674,40] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[674,51] cannot find symbol
symbol  : class Item
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[674,59] cannot find symbol
symbol  : class Collection
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[728,51] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[728,19] cannot find symbol
symbol  : class WorkspaceItem
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[729,46] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[776,39] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[776,18] cannot find symbol
symbol  : class WorkspaceItem
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[777,59] cannot find symbol
symbol  : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[817,36] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[833,28] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[840,42] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[841,12] cannot find symbol
symbol  : class Group
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[921,40] cannot find symbol
symbol  : class Context
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[921,66] cannot find symbol
symbol  : class Email
location: class org.dspace.workflow.WorkflowManager

C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[937,39] cannot find symbol
symbol  

[Dspace-tech] Implicit specialgroup for LDAP users

2009-04-09 Thread Jason Stirnaman
I want to add ldap.login.specialgroup functionality to 1.5.1, as in the Jira
issue:
http://jira.dspace.org/jira/browse/DS-10?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
I have 2 (maybe 3) questions:
1. Am I correct in assuming that I can overlay
http://dspace.svn.sourceforge.net/viewvc/dspace?view=rev&revision=3347
in 1.5.1 and make it work? Does it have any dependencies that don't
exist in SVN tag 1.5.1?
2. This appears to add LDAP special group as well as Password special
group.  Is that correct?

Thanks,
Jason

-- 

Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
jstirna...@kumc.edu 
913-588-7319


--
This SF.net email is sponsored by:
High Quality Requirements in a Collaborative Environment.
Download a free trial of Rational Requirements Composer Now!
http://p.sf.net/sfu/www-ibm-com
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] Java Heap dumps during Filter-Media

2009-04-09 Thread Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
We've also experienced the Java heap errors in filter-media.  What I did was 
create a PostgreSQL table that holds the bitstream_id of each document that 
will not filter.  I modified MediaFilterManager.java to write a row to this 
table whenever it encounters an unfilterable document (via Java heap or other 
error(s)) and to query this table for the bitstream_id it's getting ready to 
try and filter *BEFORE* it attempts to filter it.  If the bitstream_id *is* 
found in this table, the document is skipped.  Essentially we're accomplishing 
the same thing as Tim, only we are also collecting date, time, # of times a 
document has been skipped, and we're also able to report this list of 
unfilterable documents to our users.  Then they can open the problematic .pdf 
file and save it as a .txt file, and we import -update them back into DSpace.
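
The skip-table idea described above could be sketched roughly as follows. This is only an illustration and not the actual MediaFilterManager.java change: it is written in Python rather than Java, uses an in-process SQLite database as a stand-in for PostgreSQL, and the table name `unfilterable`, its columns, and the `filter_one` callback are all hypothetical.

```python
import sqlite3


def open_skip_table(conn):
    """Create the (hypothetical) table of bitstreams that failed to filter."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS unfilterable ("
        " bitstream_id INTEGER PRIMARY KEY,"
        " skip_count INTEGER NOT NULL DEFAULT 0,"
        " last_seen TEXT)"
    )


def should_skip(conn, bitstream_id):
    """Return True if the bitstream failed before; also count the skip."""
    cur = conn.execute(
        "UPDATE unfilterable"
        " SET skip_count = skip_count + 1, last_seen = datetime('now')"
        " WHERE bitstream_id = ?",
        (bitstream_id,),
    )
    return cur.rowcount > 0


def record_failure(conn, bitstream_id):
    """Remember a bitstream that could not be filtered."""
    conn.execute(
        "INSERT OR IGNORE INTO unfilterable (bitstream_id, last_seen)"
        " VALUES (?, datetime('now'))",
        (bitstream_id,),
    )


def filter_all(conn, bitstream_ids, filter_one):
    """Check the table BEFORE filtering; record any bitstream that fails."""
    filtered = []
    for bitstream_id in bitstream_ids:
        if should_skip(conn, bitstream_id):
            continue
        try:
            filter_one(bitstream_id)
            filtered.append(bitstream_id)
        except Exception:
            # MemoryError is the closest Python analogue of a Java heap error.
            record_failure(conn, bitstream_id)
    return filtered
```

Because the skip count and timestamp live in the table, a report of unfilterable documents is a plain SELECT over it.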



Sue



-----Original Message-----
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Wednesday, April 08, 2009 10:37 AM
To: Jeffrey Trimble
Cc: DSpace Tech
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media

Jeffrey,

I've seen this same issue all too many times to count.  From what I've
noticed it seems that the PDFBox software (which DSpace uses)
occasionally has difficulties with larger PDFs (usually 7MB or larger)
which include OCRed, scanned images.  I've never encountered this
problem with PDFs created directly from digital files (like Word, etc.)...

From what I've seen, occasionally recreating the PDF will resolve the
problem...but, more often than not even that doesn't help.  The problem
seems to be more of an issue with how PDFBox loads the content into memory.

Locally, I've only come up with two possible solutions:

(1) Increase the memory available to the 'filter-media' script (by
bumping up the -Xmx value in the '[dspace]/bin/dsrun' script).  This
works for some PDFs, but others will continue to have problems (as
PDFBox seems to use up enormous amounts of memory for some PDFs).

(2) Force those problematic PDFs to be skipped over by the
'filter-media' script (by using the -s flag):

To make this easier on myself, I've started maintaining a
filter-skiplist file which lists all the handles of the problematic
PDFs (so far we've encountered 35 of them), with a separate handle on
each line.  Then, I pass this filter-skiplist file to the cronjob
which runs 'filter-media' like so:

0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`

The above script translates all the newlines (\n) to commas (,) in the
'filter-skiplist' file and passes the result to the 'filter-media' -s
(skip) flag.  So, in the end, filter-media receives a comma-separated
list of handles of PDFs which it should no longer process.  (Obviously
this means any PDFs belonging to items in your 'filter-skiplist' cannot
be full text searched in DSpace)
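
The same newline-to-comma translation can be sketched outside the shell. This is a minimal illustration in Python, not part of DSpace; the function names and the sample handles are hypothetical. One difference from the `tr` pipeline: `tr` also turns the file's final newline into a trailing comma, whereas this sketch drops blank lines so the list has no empty entry. Whether filter-media cares about a trailing comma is not something the thread states.

```python
def skiplist_to_arg(text):
    """Join the non-blank lines of a skip-list file with commas."""
    handles = [line.strip() for line in text.splitlines() if line.strip()]
    return ",".join(handles)


def filter_media_command(skiplist_text, executable="filter-media"):
    """Build the argument list for filter-media, adding -s only when needed."""
    arg = skiplist_to_arg(skiplist_text)
    return [executable, "-s", arg] if arg else [executable]


# Hypothetical handles, one per line, as in the filter-skiplist file.
sample = "123456789/10\n123456789/42\n"
print(" ".join(filter_media_command(sample)))
```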



I'm hoping that in the longer term PDFBox will resolve its memory issues
as it comes out of the incubation stage under Apache.

If anyone else has potential solutions, I'd love to hear them, as I'm in
a similar situation as Jeffrey.

- Tim

Jeffrey Trimble wrote:

> I've run into a funky situation.  After using the distributed PDFBox and
> the associated jars (bouncy castle) the filter media works really,
> really well, until--
>
> We have one pdf that has caused the filter-media to produce a memory dump/
> java heap dump.  The errors are reports first  the IBM flavor of JVM.
> When we removed the offending PDF from the database, the filter-media went
> on its way merrily.
>
> Has anyone seen anything like this?  I have a copy of the heap dump and
> trace.  I can reproduce it on demand by placing this PDF back into the IR.
>
> If you have seen this, and were able to resolve it, please let me know.
> The only thing I can think of doing is to rescan the PDF file from the
> original and see if something resolves itself with the new scan.
>
> Thanks in advance,
>
> Jeffrey Trimble
> System Librarian
> William F.  Maag Library
> Youngstown State University
> 330.941.2483 (Office)
> jtrim...@cc.ysu.edu
> http://www.maag.ysu.edu
> http://digital.maag.ysu.edu

--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648


[Dspace-tech] DSpace 1.5.2 RC2 released

2009-04-09 Thread Andrea Bollini
Dear DSpace Community,

We are pleased to announce the release of DSpace 1.5.2 RC2.

Please refer to the

http://jira.dspace.org/jira/browse/DS/fixforversion/10012

and SVN history for details about the modifications.
http://wiki.dspace.org/index.php/DSpace_Release_1.5.2_Notes#SVN_Repository_History

The final release of 1.5.2 should be out on or after 14th April.
   
We request that community members interested in testing this
release candidate please download it and verify that they can complete an
upgrade and a fresh installation. We also request that the SVN branch be
frozen until we complete the final release. If developers have further
fixes, please request their addition through the developers list
before moving forward with SVN commits. The documentation for this release is
bundled within the package.

DSpace 1.5.2 RC2 can be downloaded from the files area at
https://sourceforge.net/project/showfiles.php?group_id=19984

or with SVN from

http://dspace.svn.sf.net/svnroot/dspace/tags/dspace-1_5_2-rc2/

Please use the mailing lists to provide feedback on this release.

Those wishing to do development work with DSpace are strongly  
encouraged to obtain the source code using SVN. This is very  
straightforward and a guide to doing this is available here:
http://wiki.dspace.org/ContributionGuidelines

We would also like to take this opportunity to invite you all to take  
part in the DSpace development process. Extra developer hands are  
always welcome, but there are other ways you can help:

- Test the system and report bugs
- Provide documentation (for end users and institutions, as well as technical)
- Provide or update language packs
- Share your deployment experiences
- Donate content and metadata for testing and research
- Share your technical experience and ideas

Please visit the DSpace Wiki to see the various resources and  
collaboration tools available to the DSpace community:
http://wiki.dspace.org/DspaceResources

Sincerely,
Andrea Bollini

-- 
Dott. Andrea Bollini
Project Manager, IT Architect & Systems Integrator
Sezione Servizi per le Biblioteche e l'Editoria Elettronica
CILEA, http://www.cilea.it
tel. +39 06-59292853
cel. +39 348-8277525

---

Disclaimer: the content of this email is confidential and may be privileged, 
and it must not be disclosed or copied without the sender's consent. If you 
have received this message in error, please notify the sender and remove it 
from your system. The content of this email does not constitute legal advice, 
nor any responsibility is accepted for loss or damage incurred as a result of 
acting upon its contents or attachments. 
The statements and opinions expressed in this email are those of the author and 
do not necessarily reflect those of the employer.




Re: [Dspace-tech] Java Heap dumps during Filter-Media

2009-04-09 Thread Graham Triggs
Nice work Larry,

I've replaced our PDF text extraction and thumbnail generation with this code.

Thankfully, running on Debian, adding the third party tools was as hard as 
"apt-get install xpdf" ;)

I actually ran into a few more difficulties with the ImageIO libraries - it's a 
pity that you don't get a simple ClassNotFoundException to be able to report 
this more clearly.

But aside from that, my limited tests seem to work quite well.

G 

-----Original Message-----
From: Larry Stone [mailto:l...@mit.edu] 
Sent: 08 April 2009 22:21
To: Tim Donohue
Cc: DSpace Tech; Jeffrey Trimble
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media

The PDFBox library is _always_ going to be a problem because of its 
architecture.  It insists on reading the entire PDF document, images included, 
into memory.  This is not necessary: PDF was explicitly designed to let 
renderers process a page at a time in limited memory.
Perhaps it could gain a lot by adding a mode where it ignores images (e.g. 
for text extraction, it is a complete waste of time to even read them into 
memory since it won't be getting any text out of them).

I took a different approach that may be helpful to sites with a lot of PDF 
content that is pathological to PDFBox.  I wrote a couple of filters that 
invoke the XPDF utilities as external OS-level command processes to do the 
dirty work.  They are a bit more complicated to maintain since they rely on 
outside programs that have to be installed, but I've found the xpdf tools to be 
simple to install and maintain.
The XPDF-based text extractor is about three times as fast as PDFBox and the 
only inputs it failed on were corrupt PDFs.  There were also no issues with 
heap space since it runs outside of the JVM.
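
The general technique, running the text extractor as a separate OS process so its memory use never touches the caller's heap, could be sketched as below. This is not the code from the patch: it is a Python illustration, the function names are hypothetical, and it assumes the xpdf `pdftotext` tool is installed and on the PATH (`-enc UTF-8` selects the output encoding and `-` as the output file writes the text to standard output).

```python
import subprocess


def pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the external command; '-' sends the extracted text to stdout."""
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, command_builder=pdftotext_command, timeout=300):
    """Run the extractor in its own process and return its output as text."""
    result = subprocess.run(
        command_builder(pdf_path),
        capture_output=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # A corrupt PDF shows up here as a failed process, not a heap dump.
        message = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"extraction failed for {pdf_path}: {message}")
    return result.stdout.decode("utf-8", errors="replace")
```

A failing or runaway document then costs one child process (or one timeout) rather than the whole filter run.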

See patch #2745393 for the code:
https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

-- Larry




Re: [Dspace-tech] Java Heap dumps during Filter-Media

2009-04-09 Thread Mark Diggory
Larry, I assume this is a donation to DSpace? If so I'll commit it so it's
available for testing/use in the 1.5.2 release.

Mark


On Thu, Apr 9, 2009 at 10:56 AM, Graham Triggs <gra...@biomedcentral.com> wrote:

> Nice work Larry,
>
> I've replaced our PDF text extraction and thumbnail generation with this
> code.
>
> Thankfully, running on Debian, adding the third party tools was as hard as
> "apt-get install xpdf" ;)
>
> I actually ran into a few more difficulties with the ImageIO libraries -
> it's a pity that you don't get a simple ClassNotFoundException to be able to
> report this more clearly.
>
> But aside from that, my limited tests seem to work quite well.
>
> G




-- 
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther


Re: [Dspace-tech] Java Heap dumps during Filter-Media

2009-04-09 Thread Larry Stone
> Larry, I assume this is a donation to DSpace? If so I'll commit it so it's
> available for testing/use in the 1.5.2 release.

Sure, go ahead, although I won't have time to provide better documentation
(for a while at least, maybe ever).  My time on the FACADE project which
produced this code is ending tomorrow, April 10; that's also the end of
my time at MIT.  I'm working desperately to finish other parts of the
project and do not have any time to spend on this, that's why I just
threw it over the wall because it looked like it could be useful right now.

Eventually all of the code I produced for FACADE will be made available
as open source; keep an eye on http://facade.mit.edu/  .. Not sure when
this will happen, though.

I'm not looking at any of the JIRA stuff (don't even have access yet)
so if there's anything there that needs my attention, please send me
personal mail -- I'm deleting anything with JIRA in the subject.
Thanks, and enjoy..

-- Larry


Re: [Dspace-tech] Java Heap dumps during Filter-Media

2009-04-09 Thread Mark Diggory
Larry,

Thanks, I posted a patch that fits into DSpace 1.5.x cleanly and only
requires imageio if you enable the build profile for using it.

I also altered the documentation... (anyone out there any good at inserting
this stuff into DocBook? We could use a hand).

Note, the JIRA issue is public and viewable ...
http://jira.dspace.org/jira/browse/DS-183

And we are switching away from S.F. and eventually shutting down the
trackers there... I expect you will eventually have to post things in JIRA
instead.  We hope it will be a big improvement over the SF Tracker mess.

Cheers,
Mark

On Thu, Apr 9, 2009 at 12:24 PM, Larry Stone <l...@mit.edu> wrote:

>> Larry, I assume this is a donation to DSpace? If so I'll commit it so it's
>> available for testing/use in the 1.5.2 release.
>
> Sure, go ahead, although I won't have time to provide better documentation
> (for a while at least, maybe ever).  My time on the FACADE project which
> produced this code is ending tomorrow, April 10; that's also the end of
> my time at MIT.  I'm working desperately to finish other parts of the
> project and do not have any time to spend on this, that's why I just
> threw it over the wall because it looked like it could be useful right now.
>
> Eventually all of the code I produced for FACADE will be made available
> as open source; keep an eye on http://facade.mit.edu/  .. Not sure when
> this will happen, though.
>
> I'm not looking at any of the JIRA stuff (don't even have access yet)
> so if there's anything there that needs my attention, please send me
> personal mail -- I'm deleting anything with JIRA in the subject.
> Thanks, and enjoy..
>
> -- Larry


[Dspace-tech] Reminder for those interested in DSpace GSOC 2009 mentoring.

2009-04-09 Thread Mark Diggory
Dear Community,

If you are still interested in mentoring and have not signed up, the
deadline for signing up, voting on applications, and volunteering to
mentor a student is next Wednesday and fast approaching.  If you
are still interested I highly recommend signing up before next Monday.

Signing up starts here:
http://socghop.appspot.com/org/show/google/gsoc2009/dspace

Sincerely,
Mark

-- 
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther