[ https://issues.apache.org/jira/browse/PDFBOX-5670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler reassigned PDFBOX-5670:
------------------------------------------

    Assignee: Andreas Lehmkühler

> Allow repeatable subcommands in the command line tools
> ------------------------------------------------------
>
>                 Key: PDFBOX-5670
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5670
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>         Environment: Windows 10
> java version "1.8.0_381"
> Java(TM) SE Runtime Environment (build 1.8.0_381-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
>            Reporter: Marcelo Modesto
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>         Attachments: ExtractTextAsRepeatableSubcommand.patch, Runtime comparasion.txt
>
>
> I've been using the *ExtractText* command line tool (versions 2.0.23 and 2.0.29)
> to extract text from multiple PDF files at a time.
> After a few attempts, I decided to change *ExtractText* (2.0.29) so that it
> processes a list of PDFs instead of a single one.
> My main goal was to improve processing time by starting the JVM only once.
> Since version 3.0.0 uses _*picocli*_, I decided to run some tests with it.
> I've attached a patch that allows you to use something like this:
> {code:bash}
> # Remember that you can use "@-file" to avoid a long command line
> java -jar pdfbox-app-3.0.0.jar export:text -console -i file1.pdf ... export:text -console -i fileN.pdf
> {code}
> With this modification I can process about 2500 files in about 3 minutes
> (max. memory usage ~ 1GB).
> Processing one PDF at a time takes about 1h15min (max. memory usage ~ 128MB).
> I would appreciate it if you could evaluate these changes and perhaps
> incorporate them into the command line tools.
> Thank you!
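
The "@-file" hint in the issue can be made concrete with a small helper that writes one `export:text` subcommand per PDF into an argument file. This is only a minimal sketch: the function name `write_argfile` and the file name `args.txt` are hypothetical, it assumes a pdfbox-app jar with the attached patch applied, and picocli's quoting and escaping rules for argument files (in particular for backslashes in Windows paths) should be checked before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: build a picocli argument file ("@-file") with one
# "export:text" subcommand per PDF, as proposed in the issue.
# write_argfile is a hypothetical helper, not part of PDFBox.

# write_argfile OUT FILE...
# Writes one subcommand invocation per line to OUT.
write_argfile() {
  local out=$1
  shift
  : > "$out"                      # create or truncate the argument file
  local pdf
  for pdf in "$@"; do
    # Quote the path so a name containing spaces stays one argument.
    printf 'export:text -console -i "%s"\n' "$pdf" >> "$out"
  done
}

# Usage (assumes the patched jar from this issue):
#   write_argfile args.txt *.pdf
#   java -jar pdfbox-app-3.0.0.jar @args.txt
```

Keeping the subcommands in a file avoids the command-line length limit that a list of roughly 2500 PDFs would otherwise hit, while still starting the JVM only once.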