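Well, if you want the process output, you can do something like this (note that -l takes a language code, so "eng" below is just a placeholder for yours):

    { "docsplit" "text" "--no-clean" "-l" "eng" "path" }
    utf8 [ lines ] with-process-reader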

or without output, using a single command string:

    "docsplit text --no-clean -l eng path" run-process drop

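You mentioned seeing no errors, just no result. One possible reason is that run-process does not complain when the command exits with a non-zero status. A minimal sketch using try-process from io.launcher instead, which throws in that case (the language and file name are taken from your example, so adjust as needed):

    USING: io.launcher ;

    { "docsplit" "text" "--no-clean" "-l" "eng" "long_gu002.pdf" }
    try-process

If that throws, the error should at least tell you the exit code docsplit returned.
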
You can docsplit a directory of files:

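    USING: io.directories io.launcher kernel sequences ;

    ! -l needs a language code; "eng" is only a placeholder
    : docsplit ( file -- )
        { "docsplit" "text" "--no-clean" "-l" "eng" }
        swap suffix run-process drop ;

    ! with-directory-files makes path the current directory
    ! while the quotation runs and passes it the bare file names
    : docsplit-all ( path -- )
        [ [ docsplit ] each ] with-directory-files ;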
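
Since you have to cd into each folder and pick a language per pdf, here is a rough sketch of that step. The names docsplit-command and docsplit-in are made up for illustration, and I'm assuming the child process starts in Factor's current-directory, which is what with-directory sets:

    USING: arrays io.directories io.launcher kernel sequences ;

    ! Hypothetical helper: build the docsplit argument list
    : docsplit-command ( lang file -- command )
        2array { "docsplit" "text" "--no-clean" "-l" } prepend ;

    ! Hypothetical helper: run docsplit on one file with dir as the
    ! current directory; try-process throws on a non-zero exit
    : docsplit-in ( dir lang file -- )
        docsplit-command swap [ try-process ] with-directory ;

Usage would be something like:

    "/path/to/1_long_gu" "chi_sim" "long_gu001.pdf" docsplit-in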

And concatenate all the files in a directory:

    # bash
    ls *.txt | sort | xargs -I '{}' cat '{}'

    # factor
    USING: io.directories io.encodings.utf8 io.files sequences sorting ;

    : cat-results ( path -- lines )
        [
            [ ".txt" tail? ] filter natural-sort
            [ utf8 file-lines ] map concat
        ] with-directory-files ;
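
To get the single amalgamated file you describe, the concatenated lines can be written back out. A small sketch, where save-results is a made-up name and utf8 is my assumption about the encoding you want:

    USING: io.encodings.utf8 io.files ;

    ! Hypothetical helper: write a sequence of lines to one file
    : save-results ( lines path -- )
        utf8 set-file-lines ;

For example:

    "/path/to/1_long_gu" cat-results
    "/path/to/1_long_gu/long gu.txt" save-results

Bear in mind that the output also ends in .txt, so if it lives in the same folder a second run would pick it up along with the others.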

Or something like that. Which part are you having problems with?

Best,
John.



On Sat, Feb 8, 2014 at 2:32 AM, CW Alston <cwalsto...@gmail.com> wrote:

> Hi folks -
>
> I am thrilled to find a versatile open-source optical character recognition
> engine called docsplit <http://documentcloud.github.io/docsplit/>. I've
> got it installed easily as a ruby gem, & it works
> just great on my Mac as a shell command (it also provides a ruby module):
>
> ➜  ~ git:(master) ✗ which docsplit
> /usr/local/opt/ruby/bin/docsplit
> ➜  ~ git:(master) ✗
>
> I need such a tool to extract text from a deep directory tree, with a
> couple thousand folders. Each leaf folder contains 3-6 scanned pdfs (in
> Chinese & English), from which docsplit makes a plaintext (.txt) file with
> the same basename, deposited in the same leaf directory. My Factor vocab
> can easily visit each leaf dir & prepare to pass each pdf there to docsplit
> in the format it happily handles in the terminal (I use oh-my-zsh &
> iTerm2). My Factor code chokes on this intermediate step, trying to call
> docsplit.
>
> Going to the terminal, I have to first cd to the directory containing the
> pdfs, e.g.,
>
> ➜  ~ git:(master) ✗ cd /path/to/1_long_gu
>
> then call docsplit with the appropriate flags on each pdf:
>
> ➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l chi_sim long_gu001.pdf
> ➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l eng long_gu002.pdf
>
> etc., for each pdf, & docsplit gives back a bunch of text files in the dir like
>
> /path/to/1_long_gu/long_gu001.txt
>
> In the terminal, even a compound phrase like the following works without a hitch:
>
> ➜  ~ git:(master) ✗ cd /path/to/1_long_gu ; docsplit text --no-clean -l
> chi_sim long_gu001.pdf ; docsplit text --no-clean -l eng long_gu002.pdf ;
> docsplit text --no-clean -l eng long_gu003.pdf ;...
> ➜  1_long_gu git:(master) ✗
>
> So, working from the terminal, I wind up with a series of text files in
> /path/to/1_long_gu that my Factor vocab amalgamates into a single text file
> (with whitespace in filename), e.g., /path/to/1_long_gu/long gu.txt, which
> I can edit for mistakes, and upload to a couchdb database.
> Joy!
>
> But I haven't been able to work out how to accomplish this docsplit call
> from Factor code. I have no problem traversing the directory tree (Factor's
> word each-file & the like come in very handy). I've experimented with
> io.launcher, io.pipes, shell scripts (bash, zsh, factor), & autoload shell
> functions, but flunked out. No errors with io.launcher tries; just no
> result. Need to learn something here. I routinely launch couchdb as a
> detached
> <process>.
>
> It would be such a boon to use docsplit in Factor. After a couple weeks
> lost at sea with this, I'm broadcasting a Mayday. Any suggestions?
>
> Thanks in advance,
> ~cw
>
> --
> *~ Memento Amori*
>
>
> _______________________________________________
> Factor-talk mailing list
> Factor-talk@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/factor-talk
>
>