Hi folks -

I am thrilled to have found a versatile open-source optical character
recognition tool called docsplit <http://documentcloud.github.io/docsplit/>.
It installed easily as a Ruby gem, & it works just great on my Mac as a
shell command (it also provides a Ruby module):

➜  ~ git:(master) ✗ which docsplit
/usr/local/opt/ruby/bin/docsplit
➜  ~ git:(master) ✗

I need such a tool to extract text from a deep directory tree with a couple
thousand folders. Each leaf folder contains 3-6 scanned pdfs (in Chinese &
English). From each pdf, docsplit makes a plaintext (.txt) file with the
same basename, deposited in the same leaf directory. My Factor vocab can
easily visit each leaf dir & prepare to pass each pdf there to docsplit in
the format it happily handles in the terminal (I use oh-my-zsh & iTerm2).
Where my Factor code chokes is the next step: actually calling docsplit.

Going to the terminal, I have to first cd to the directory containing the
pdfs, e.g.,

➜  ~ git:(master) ✗ cd /path/to/1_long_gu

then call docsplit with the appropriate flags on each pdf:

➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l chi_sim long_gu001.pdf
➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l eng long_gu002.pdf

etc., for each pdf, & docsplit gives back a bunch of text files in the dir,
like

/path/to/1_long_gu/long_gu001.txt

In the terminal, even a compound command like the following works without a
hitch:

➜  ~ git:(master) ✗ cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf ; docsplit text --no-clean -l eng long_gu002.pdf ; docsplit text --no-clean -l eng long_gu003.pdf ;...
➜  1_long_gu git:(master) ✗
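
The cd-then-docsplit sequence above can be wrapped in one small shell
function, which may be easier to hand to an external launcher than a
compound command line. This is only a sketch: the name ocr_pdf and the
DOCSPLIT override variable are made up for illustration and are not part of
docsplit; the docsplit flags are exactly the ones used above.

```shell
# Hypothetical helper, not part of docsplit: OCR one pdf "in place".
# Usage: ocr_pdf LANG /path/to/leaf/file.pdf
# DOCSPLIT is an illustrative override (e.g. DOCSPLIT=echo for a dry run).
ocr_pdf() {
    lang=$1
    pdf=$2
    (
        # Subshell: the cd does not change the caller's working directory.
        cd "$(dirname "$pdf")" || exit 1
        # Same flags as the terminal session above; docsplit writes the
        # .txt file into the current directory, i.e. next to the pdf.
        "${DOCSPLIT:-docsplit}" text --no-clean -l "$lang" "$(basename "$pdf")"
    )
}
```

Called as `ocr_pdf chi_sim /path/to/1_long_gu/long_gu001.pdf`, this should
behave like the two-step terminal session, assuming docsplit is on the PATH
of whatever process runs the function.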

So, working from the terminal, I wind up with a series of text files in
/path/to/1_long_gu that my Factor vocab amalgamates into a single text file
(with whitespace in the filename), e.g., /path/to/1_long_gu/long gu.txt,
which I can edit for mistakes and upload to a couchdb database. Joy!

But I haven't been able to work out how to make this docsplit call from
Factor code. I have no problem traversing the directory tree (Factor's
each-file word & the like come in very handy). I've experimented with
io.launcher, io.pipes, shell scripts (bash, zsh, factor), & autoloaded shell
functions, but flunked out. My io.launcher attempts raise no errors; they
just produce no result. I need to learn something here. I do routinely
launch couchdb as a detached <process>.

It would be such a boon to use docsplit in Factor. After a couple weeks lost
at sea with this, I'm broadcasting a Mayday. Any suggestions?

Thanks in advance,
~cw

-- 
*~ Memento Amori*