Re: [Dspace-tech] DSpace-tech Digest, Vol 35, Issue 41
Hello all,

Good day. I am new to DSpace and am trying to install it, and it is giving me the errors below. What can I do to correct them?

symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[35 2,46] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[40 5,31] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[40 6,46] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[45 6,29] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[45 7,33] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[47 9,35] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[48 1,12] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[65 1,32] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[65 1,19] cannot find symbol
symbol : class Item
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[65 2,46] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[67 4,40] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[67 4,51] cannot find symbol
symbol : class Item
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[67 4,59] cannot find symbol
symbol : class Collection
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[72 8,51] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[72 8,19] cannot find symbol
symbol : class WorkspaceItem
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[72 9,46] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[77 6,39] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[77 6,18] cannot find symbol
symbol : class WorkspaceItem
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[77 7,59] cannot find symbol
symbol : class AuthorizeException
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[81 7,36] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[83 3,28] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[84 0,42] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[84 1,12] cannot find symbol
symbol : class Group
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[92 1,40] cannot find symbol
symbol : class Context
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[92 1,66] cannot find symbol
symbol : class Email
location: class org.dspace.workflow.WorkflowManager
C:\dspaces\dspace-api\src\main\java\org\dspace\workflow\WorkflowManager.java:[93 7,39] cannot find symbol
symbol
[Dspace-tech] Implicit specialgroup for LDAP users
I want to add ldap.login.specialgroup functionality to 1.5.1, as in this Jira issue:
http://jira.dspace.org/jira/browse/DS-10?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

I have 2 (maybe 3) questions:

1. Am I correct in assuming that I can overlay http://dspace.svn.sourceforge.net/viewvc/dspace?view=rev&revision=3347 onto 1.5.1 and make it work? Does it have any dependencies that don't exist in SVN tag 1.5.1?
2. This appears to add an LDAP special group as well as a Password special group. Is that correct?

Thanks,
Jason

--
Jason Stirnaman
Digital Projects Librarian/School of Medicine Support
A.R. Dykes Library, University of Kansas Medical Center
jstirna...@kumc.edu
913-588-7319

--
This SF.net email is sponsored by: High Quality Requirements in a Collaborative Environment. Download a free trial of Rational Requirements Composer Now! http://p.sf.net/sfu/www-ibm-com
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
Re: [Dspace-tech] Java Heap dumps during Filter-Media
We've also experienced the Java heap errors in filter-media.

What I did was create a PostgreSQL table that holds the bitstream_id of each document that will not filter. I modified MediaFilterManager.java to write a row to this table whenever it encounters an unfilterable document (because of a Java heap error or any other error), and to query this table for the bitstream_id *BEFORE* it attempts to filter that bitstream. If the bitstream_id *is* found in this table, the document is skipped.

Essentially we're accomplishing the same thing as Tim, only we are also collecting the date, time, and number of times a document has been skipped, and we're able to report this list of unfilterable documents to our users. They can then open the problematic .pdf file and save it as a .txt file, and we import -update them back into DSpace.

Sue

-Original Message-
From: Tim Donohue [mailto:tdono...@illinois.edu]
Sent: Wednesday, April 08, 2009 10:37 AM
To: Jeffrey Trimble
Cc: DSpace Tech
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media

Jeffrey,

I've seen this same issue all too many times to count. From what I've noticed, the PDFBox software (which DSpace uses) occasionally has difficulties with larger PDFs (usually 7MB or larger) which include OCRed, scanned images. I've never encountered this problem with PDFs created directly from digital files (like Word, etc.).

From what I've seen, recreating the PDF will occasionally resolve the problem, but more often than not even that doesn't help. The problem seems to be more of an issue with how PDFBox loads the content into memory.

Locally, I've only come up with two possible solutions:

(1) Increase the memory available to the 'filter-media' script (by bumping up the -Xmx value in the '[dspace]/bin/dsrun' script). This works for some PDFs, but others will continue to have problems (as PDFBox seems to use up enormous amounts of memory for some PDFs).

(2) Force those problematic PDFs to be skipped over by the 'filter-media' script (by using the -s flag). To make this easier on myself, I've started maintaining a filter-skiplist file which lists all the handles of the problematic PDFs (so far we've encountered 35 of them), with a separate handle on each line. Then, I pass this filter-skiplist file to the cronjob which runs 'filter-media' like so:

0 2 * * * filter-media -s `less filter-skiplist | tr '\n' ','`

The above command translates all the newlines (\n) to commas (,) in the 'filter-skiplist' file and passes the result to the 'filter-media' -s (skip) flag. So, in the end, filter-media receives a comma-separated list of handles of PDFs which it should no longer process. (Obviously this means any PDFs belonging to items in your 'filter-skiplist' cannot be full-text searched in DSpace.)

I'm hoping that in the longer term PDFBox will resolve its memory issues as it comes out of the incubation stage under Apache.

If anyone else has potential solutions, I'd love to hear them, as I'm in a similar situation to Jeffrey's.

- Tim

Jeffrey Trimble wrote:
> I've run into a funky situation. After using the distributed PDFBox and the associated jars (Bouncy Castle), filter-media works really, really well, until--
>
> We have one PDF that has caused filter-media to produce a memory dump/Java heap dump. The errors are reported first by the IBM flavor of the JVM. We removed the offending PDF from the database, and filter-media went on its way merrily.
>
> Has anyone seen anything like this? I have a copy of the heap dump and trace. I can reproduce it on demand by placing this PDF back into the IR.
>
> If you have seen this, and were able to resolve it, please let me know. The only thing I can think of doing is to rescan the PDF file from the original and see whether the problem resolves itself with the new scan.
>
> Thanks in advance,
>
> Jeffrey Trimble
> System Librarian
> William F. Maag Library
> Youngstown State University
> 330.941.2483 (Office)
> jtrim...@cc.ysu.edu
> http://www.maag.ysu.edu
> http://digital.maag.ysu.edu

--
Tim Donohue
Research Programmer, IDEALS
http://www.ideals.uiuc.edu/
University of Illinois
tdono...@illinois.edu | (217) 333-4648
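As an illustration of the skip-list step Tim describes above (one handle per line, joined with commas for the -s flag), here is a minimal Python sketch. It is not Tim's script and not part of DSpace: the function names read_skiplist and skip_argument are hypothetical. It assumes only what the thread states, namely that the file holds one handle per line and that filter-media accepts a comma-separated list after -s.

```python
from pathlib import Path


def read_skiplist(path):
    """Return the handles listed in a skip-list file, one per line.

    Blank lines and surrounding whitespace are ignored (an assumption of
    this sketch; the thread does not say how such lines should be treated).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def skip_argument(handles):
    """Join handles with commas, the form the thread passes to the -s flag."""
    return ",".join(handles)
```

One difference from the tr pipeline in the cron entry: tr turns every newline into a comma, including the final one if the file ends with a newline, so its output can end with a trailing comma, whereas joining the handles does not produce one. Whether filter-media cares about a trailing comma is not stated in the thread.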
[Dspace-tech] DSpace 1.5.2 RC2 released
Dear DSpace Community,

We are pleased to announce the release of DSpace 1.5.2 RC2. Please refer to http://jira.dspace.org/jira/browse/DS/fixforversion/10012 and to the SVN history for details about the modifications:
http://wiki.dspace.org/index.php/DSpace_Release_1.5.2_Notes#SVN_Repository_History

The final release of 1.5.2 should be out on or after 14th April. We ask community members interested in testing this release candidate to download it and verify that they can complete both an upgrade and a fresh installation.

We request that the SVN branch be frozen until we complete the final release. If developers have further fixes, please request their addition through the developers list before moving forward with SVN commits.

The documentation for this release is bundled within the package.

DSpace 1.5.2 RC2 can be downloaded from the files area at https://sourceforge.net/project/showfiles.php?group_id=19984 or with SVN from http://dspace.svn.sf.net/svnroot/dspace/tags/dspace-1_5_2-rc2/

Please use the mailing lists to provide feedback on this release.

Those wishing to do development work with DSpace are strongly encouraged to obtain the source code using SVN. This is very straightforward, and a guide to doing this is available here: http://wiki.dspace.org/ContributionGuidelines

We would also like to take this opportunity to invite you all to take part in the DSpace development process. Extra developer hands are always welcome, but there are other ways you can help:

- Test the system and report bugs
- Provide documentation (for end users and institutions, as well as technical)
- Provide or update language packs
- Share your deployment experiences
- Donate content and metadata for testing and research
- Share your technical experience and ideas

Please visit the DSpace Wiki to see the various resources and collaboration tools available to the DSpace community: http://wiki.dspace.org/DspaceResources

Sincerely,
Andrea Bollini

--
Dott. Andrea Bollini
Project Manager, IT Architect Systems Integrator
Sezione Servizi per le Biblioteche e l'Editoria Elettronica
CILEA, http://www.cilea.it
tel. +39 06-59292853
cel. +39 348-8277525

---
Disclaimer: the content of this email is confidential and may be privileged, and it must not be disclosed or copied without the sender's consent. If you have received this message in error, please notify the sender and remove it from your system. The content of this email does not constitute legal advice, nor any responsibility is accepted for loss or damage incurred as a result of acting upon its contents or attachments. The statements and opinions expressed in this email are those of the author and do not necessarily reflect those of the employer.
Re: [Dspace-tech] Java Heap dumps during Filter-Media
Nice work Larry,

I've replaced our PDF text extraction and thumbnail generation with this code.

Thankfully, running on Debian, adding the third party tools was as hard as "apt-get install xpdf" ;)

I actually ran into a few more difficulties with the ImageIO libraries - it's a pity that you don't get a simple ClassNotFoundException to be able to report this more clearly.

But aside from that, my limited tests seem to work quite well.

G

-Original Message-
From: Larry Stone [mailto:l...@mit.edu]
Sent: 08 April 2009 22:21
To: Tim Donohue
Cc: DSpace Tech; Jeffrey Trimble
Subject: Re: [Dspace-tech] Java Heap dumps during Filter-Media

The PDFBox library is _always_ going to be a problem because of its architecture. It insists on reading the entire PDF document, images included, into memory. This is not necessary: PDF was explicitly designed to let renderers process a page at a time in limited memory. Perhaps it could gain a lot by adding a mode where it ignores images (e.g. for text extraction, it is a complete waste of time to even read them into memory since it won't be getting any text out of them).

I took a different approach that may be helpful to sites with a lot of PDF content that is pathological to PDFBox. I wrote a couple of filters that invoke the XPDF utilities as external OS-level command processes to do the dirty work. They are a bit more complicated to maintain since they rely on outside programs that have to be installed, but I've found the xpdf tools to be simple to install and maintain.

The XPDF-based text extractor is about three times as fast as PDFBox, and the only inputs it failed on were corrupt PDFs. There were also no issues with heap space since it runs outside of the JVM.

See patch #2745393 for the code:
https://sourceforge.net/tracker/?func=detail&aid=2745393&group_id=19984&atid=319984

-- Larry
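The approach Larry describes, handing the PDF to an external XPDF command so that the extraction memory lives in a separate OS process, can be sketched in a few lines. This is a language-neutral illustration in Python, not the code from patch #2745393 (whose filters plug into DSpace and whose exact command lines are not shown in this thread). The function names are hypothetical. It assumes the pdftotext tool from XPDF is on the PATH, that "-" as the output file name means standard output, and that "-enc UTF-8" selects the output encoding.

```python
import subprocess


def pdftotext_command(pdf_path, executable="pdftotext"):
    """Build an argument list asking pdftotext to write UTF-8 text to stdout."""
    # "-" in the output-file position sends the extracted text to stdout.
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def run_extractor(command, timeout_seconds=120):
    """Run an external extractor and return its standard output as text.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the process runs longer than the timeout,
    so one bad document cannot stall or crash the calling process.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout_seconds,
        check=True,
    )
    return completed.stdout.decode("utf-8", errors="replace")


def extract_text(pdf_path):
    """Extract text from one PDF by delegating to the external tool."""
    return run_extractor(pdftotext_command(pdf_path))
```

The design point is the one made in the message: a failure or runaway memory use in the child process surfaces to the caller as an exit status or a timeout rather than as heap exhaustion in the caller itself. The 120-second timeout is an arbitrary value chosen for the sketch.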
Re: [Dspace-tech] Java Heap dumps during Filter-Media
Larry,

I assume this is a donation to DSpace? If so I'll commit it so it's available for testing/use in the 1.5.2 release.

Mark

--
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther
Re: [Dspace-tech] Java Heap dumps during Filter-Media
> Larry, I assume this is a donation to DSpace? If so I'll commit it so it's available for testing/use in the 1.5.2 release.

Sure, go ahead, although I won't have time to provide better documentation (for a while at least, maybe ever).

My time on the FACADE project, which produced this code, is ending tomorrow, April 10; that's also the end of my time at MIT. I'm working desperately to finish other parts of the project and do not have any time to spend on this. That's why I just threw it over the wall: it looked like it could be useful right now.

Eventually all of the code I produced for FACADE will be made available as open source; keep an eye on http://facade.mit.edu/. Not sure when this will happen, though.

I'm not looking at any of the JIRA stuff (don't even have access yet), so if there's anything there that needs my attention, please send me personal mail -- I'm deleting anything with JIRA in the subject.

Thanks, and enjoy.

-- Larry
Re: [Dspace-tech] Java Heap dumps during Filter-Media
Larry,

Thanks. I posted a patch that fits into DSpace 1.5.x cleanly and only requires imageio if you enable the build profile for using it. I also altered the documentation... (is anyone out there any good at inserting this stuff into DocBook? We could use a hand).

Note, the JIRA issue is public and viewable:
http://jira.dspace.org/jira/browse/DS-183

We are also switching away from SourceForge and will eventually shut down the trackers there, so I expect you will eventually have to post things in JIRA instead. We hope it will be a big improvement over the SF Tracker mess.

Cheers,
Mark

--
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther
[Dspace-tech] Reminder for those interested in DSpace GSOC 2009 mentoring.
Dear Community,

If you are still interested in mentoring and have not signed up, the deadline is fast approaching: by next Wednesday you need to have signed up, voted on applications, and volunteered to mentor a student. If you are still interested, I highly recommend signing up before next Monday.

Signing up starts here:
http://socghop.appspot.com/org/show/google/gsoc2009/dspace

Sincerely,
Mark

--
Mark R. Diggory
http://purl.org/net/mdiggory/homepage - Bio
http://www.atmire.com - Institutional Repository Solutions
http://www.togather.eu - Before getting together, get t...@ther