Thanks for the replies. Maybe a clue here - I get this from "which":

IN: scratchpad USE: tools.which
IN: scratchpad "docsplit" which .
f
IN: scratchpad "couchdb" which .
f
IN: scratchpad "ruby" which  .
f

Whereas in the terminal:

➜  ~ git:(master) ✗ which docsplit
/usr/local/opt/ruby/bin/docsplit

➜  ~ git:(master) ✗ which couchdb
/usr/local/bin/couchdb

➜  ~ git:(master) ✗ which ruby
/usr/local/bin/ruby
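
One possible explanation, not confirmed anywhere in this thread, is that the Factor process inherits a different PATH from the interactive zsh session, for example when it is launched from the Finder rather than from iTerm2. The shell sketch below illustrates only the general mechanism: the same lookup succeeds or fails depending on the PATH it runs under. The tool name `mytool-demo` and the temporary directory are hypothetical stand-ins, not anything from the setup above.

```shell
#!/bin/sh
# Create a throwaway executable in a directory that is not on the default PATH.
# (mytool-demo and tooldir are hypothetical, for illustration only.)
tooldir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$tooldir/mytool-demo"
chmod +x "$tooldir/mytool-demo"

# Lookup under a minimal PATH, such as a GUI-launched program might inherit.
minimal=$(env PATH=/usr/bin:/bin /bin/sh -c 'command -v mytool-demo || echo "not found"')

# Same lookup with the tool's directory added, as shell startup files often do.
extended=$(env PATH="$tooldir:/usr/bin:/bin" /bin/sh -c 'command -v mytool-demo || echo "not found"')

echo "minimal PATH:  $minimal"
echo "extended PATH: $extended"

# Remove the temporary directory again.
rm -rf "$tooldir"
```

If that were the cause here, passing the absolute path reported by the terminal (e.g. /usr/local/opt/ruby/bin/docsplit) to the launcher, or starting Factor from the terminal, would be cheap things to try.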

Let me try moving up to the most recent development release
& see if the problem disappears. I'll get back to you.

Best,
~cw



On Sat, Feb 8, 2014 at 7:42 AM, John Benediktsson <mrj...@gmail.com> wrote:

> Well if you want process output, you can do something like:
>
>     { "docsplit" "text" "--no-clean" "-l" "path" } utf8 [ lines ] with-process-reader
>
> or without output, using a single command string:
>
>     "docsplit text --no-clean -l path" run-process drop
>
> You can docsplit a directory of files:
>
>     ! "eng" is a placeholder; -l takes a language code (e.g. chi_sim)
>     : docsplit ( file -- )
>         { "docsplit" "text" "--no-clean" "-l" "eng" }
>         swap suffix run-process drop ;
>
>     ! with-directory-files makes path current, so the relative names resolve
>     : docsplit-all ( path -- )
>         [ [ docsplit ] each ] with-directory-files ;
>
> And concatenate all the files in a directory:
>
>     # bash
>     ls *.txt | sort | xargs -I '{}' cat '{}'
>
>     # factor
>     : cat-results ( path -- lines )
>         [
>             [ ".txt" tail? ] filter natural-sort
>             [ utf8 file-lines ] map concat
>         ] with-directory-files ;
>
> Or something like that, which part are you having problems with?
>
> Best,
> John.
>
>
>
> On Sat, Feb 8, 2014 at 2:32 AM, CW Alston <cwalsto...@gmail.com> wrote:
>
>> Hi folks -
>>
>> I am thrilled to find a versatile open-source optical character recognition
>> engine called docsplit <http://documentcloud.github.io/docsplit/>. I've got
>> it installed easily as a ruby gem, & it works just great on my Mac as a shell
>> command (it also provides a ruby module):
>>
>> ➜  ~ git:(master) ✗ which docsplit
>> /usr/local/opt/ruby/bin/docsplit
>> ➜  ~ git:(master) ✗
>>
>> I need such a tool to extract text from a deep directory tree, with a couple
>> thousand folders. Each leaf folder contains 3-6 scanned pdfs (in Chinese &
>> English), from which docsplit makes a plaintext (.txt) file with the same
>> basename, deposited in the same leaf directory. My Factor vocab can easily
>> visit each leaf dir & prepare to pass each pdf there to docsplit in the format
>> it happily handles in the terminal (I use oh-my-zsh & iTerm2). My Factor code
>> chokes on this intermediate step, trying to call docsplit.
>>
>> Going to the terminal, I have to first cd to the directory containing the pdfs, e.g.,
>>
>> ➜  ~ git:(master) ✗ cd /path/to/1_long_gu
>>
>> then call docsplit with the appropriate flags on each pdf:
>>
>> ➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l chi_sim long_gu001.pdf
>> ➜  1_long_gu git:(master) ✗ docsplit text --no-clean -l eng long_gu002.pdf
>>
>> etc., for each pdf, & docsplit gives back a bunch of text files in the dir like
>>
>> /path/to/1_long_gu/long_gu001.txt
>>
>> In the terminal, even a compound phrase like the following works without a hitch:
>>
>> ➜  ~ git:(master) ✗ cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf ; docsplit text --no-clean -l eng long_gu002.pdf ; docsplit text --no-clean -l eng long_gu003.pdf ;...
>> ➜  1_long_gu git:(master) ✗
>>
>> So, working from the terminal, I wind up with a series of text files in
>> /path/to/1_long_gu that my Factor vocab amalgamates into a single text file
>> (with whitespace in filename), e.g., /path/to/1_long_gu/long gu.txt, which I
>> can edit for mistakes, and upload to a couchdb database.
>> Joy!
>>
>> But I haven't been able to work out how to accomplish this docsplit call from
>> Factor code. I have no problem traversing the directory tree (Factor's word
>> each-file & the like come in very handy). I've experimented with io.launcher,
>> io.pipes, shell scripts (bash, zsh, factor), & autoload shell functions, but
>> flunked out. No errors with io.launcher tries; just no result. Need to learn
>> something here. I routinely launch couchdb as a detached <process>.
>>
>> It would be such a boon to use docsplit in Factor. After a couple weeks lost
>> at sea with this, I'm broadcasting a Mayday. Any suggestions?
>>
>> Thanks in advance,
>> ~cw
>>
>> --
>> *~ Memento Amori*
>>
>>
>> ------------------------------------------------------------------------------
>> Managing the Performance of Cloud-Based Applications
>> Take advantage of what the Cloud has to offer - Avoid Common Pitfalls.
>> Read the Whitepaper.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=121051231&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Factor-talk mailing list
>> Factor-talk@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/factor-talk
>>
>>
>


-- 
*~ Memento Amori*