I can't tell you what's wrong with code you haven't provided, but...

```
IN: scratchpad USING: io.files io.launcher io.encodings.ascii tools.which ;
IN: scratchpad "docsplit" which .
"/usr/local/bin/docsplit"
IN: scratchpad "/tmp/cv.pdf" exists? .
t
IN: scratchpad "/tmp/cv.txt" exists? .
f
IN: scratchpad "docsplit text --no-clean -l eng /tmp/cv.pdf" try-process
IN: scratchpad "/tmp/cv.txt" exists? .
t
IN: scratchpad "/tmp/cv.txt" ascii file-lines first .
"Alex Vondrak"
```
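Since you mention having to cd into each leaf directory first, here is a rough sketch of doing that step from Factor rather than from the shell. It is untested against your tree and rests on two assumptions: that `with-directory` (from io.directories) rebinds the working directory a launched process starts in, and that passing the command as an array of strings avoids any trouble with whitespace in paths. The word name `pdf>text` is made up for illustration; the docsplit flags are the ones from your message.

```
USING: io.directories io.launcher io.pathnames locals ;
IN: scratchpad

! Hypothetical helper: run docsplit on one PDF from inside the
! PDF's own directory, so the .txt file lands next to it.
! Assumes the launched process inherits the directory that
! with-directory sets for the duration of the quotation.
:: pdf>text ( pdf lang -- )
    pdf file-name :> name
    pdf parent-directory [
        ! One string per argument, so nothing has to be re-tokenized.
        { "docsplit" "text" "--no-clean" "-l" lang name } try-process
    ] with-directory ;
```

With that loaded, something like `"/path/to/1_long_gu/long_gu001.pdf" "chi_sim" pdf>text` would stand in for the cd-then-docsplit pair you type in the terminal. My understanding is that a plain command string is split into arguments rather than handed to a shell, so a `cd ... ;` prefix inside the string would not behave the way it does in zsh, but I haven't checked that against your Factor build.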
On Sat, Feb 8, 2014 at 2:32 AM, CW Alston <cwalsto...@gmail.com> wrote:
> Hi folks -
>
> I am thrilled to find a versatile open-source optical character recognition
> engine called docsplit. I've got it installed easily as a ruby gem, & it
> works just great on my Mac as a shell command (it also provides a ruby
> module):
>
> ➜ ~ git:(master) ✗ which docsplit
> /usr/local/opt/ruby/bin/docsplit
> ➜ ~ git:(master) ✗
>
> I need such a tool to extract text from a deep directory tree, with a couple
> thousand folders. Each leaf folder contains 3-6 scanned pdfs (in Chinese &
> English), from which docsplit makes a plaintext (.txt) file with the same
> basename, deposited in the same leaf directory. My Factor vocab can easily
> visit each leaf dir & prepare to pass each pdf there to docsplit in the
> format it happily handles in the terminal (I use oh-my-zsh & iTerm2).
> My Factor code chokes on this intermediate step, trying to call docsplit.
>
> Going to the terminal, I have to first cd to the directory containing the
> pdfs, e.g.,
>
> ➜ ~ git:(master) ✗ cd /path/to/1_long_gu
>
> then call docsplit with the appropriate flags on each pdf:
>
> ➜ 1_long_gu git:(master) ✗ docsplit text --no-clean -l chi_sim long_gu001.pdf
> ➜ 1_long_gu git:(master) ✗ docsplit text --no-clean -l eng long_gu002.pdf
>
> etc., for each pdf, & docsplit gives back a bunch of text files in the dir
> like
>
> /path/to/1_long_gu/long_gu001.txt
>
> In the terminal, even a compound phrase like the following works without a
> hitch:
>
> ➜ ~ git:(master) ✗ cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf ; docsplit text --no-clean -l eng long_gu002.pdf ; docsplit text --no-clean -l eng long_gu003.pdf ;...
> ➜ 1_long_gu git:(master) ✗
>
> So, working from the terminal, I wind up with a series of text files in
> /path/to/1_long_gu that my Factor vocab amalgamates into a single text file
> (with whitespace in filename), e.g., /path/to/1_long_gu/long gu.txt, which I
> can edit for mistakes, and upload to a couchdb database.
> Joy!
>
> But I haven't been able to work out how to accomplish this docsplit call
> from Factor code. I have no problem traversing the directory tree (Factor's
> word each-file & the like come in very handy). I've experimented with
> io.launcher, io.pipes, shell scripts (bash, zsh, factor), & autoload shell
> functions, but flunked out. No errors with io.launcher tries; just no result.
> Need to learn something here. I routinely launch couchdb as a detached
> <process>.
>
> It would be such a boon to use docsplit in Factor. After a couple weeks lost
> at sea with this, I'm broadcasting a Mayday. Any suggestions?
>
> Thanks in advance,
> ~cw
>
> --
> ~ Memento Amori
>
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk