Re: [Factor-talk] OCR via docsplit in Factor
If you get lost in path land you can always take a break and use the /full/path/to/docsplit.

On Feb 9, 2014, at 2:03 AM, CW Alston cwalsto...@gmail.com wrote:

> Ah! Thanks, Joe-
>
> Great tip; should clear up the issue with which. I am indeed starting Factor in the Finder. I'll try adjusting the plist. Maybe that even has something to do with my docsplit puzzle. Since I can address commands like couchdb via a process, I should be able to invoke docsplit that way as well, even though htop shows me that docsplit itself spawns sub-processes, like poppler and tesseract, to do its extraction work. Interesting.
>
> I'll go study the Mac dev doc you point to, see what I can glean from there.
>
> Back to the books,
> ~cw
>
> On Sat, Feb 8, 2014 at 10:27 PM, Joe Groff arc...@gmail.com wrote:
>
>> On Sat, Feb 8, 2014 at 7:30 PM, CW Alston cwalsto...@gmail.com wrote:
>>
>>> Hi -
>>> Ok, I've upgraded using factor-macosx-x86-32-2013-07-25-14-21.dmg, still Version 0.97. Same issue with Factor's which:
>>>
>>>     IN: scratchpad USE: tools.which
>>>     IN: scratchpad "couchdb" which .
>>>     f
>>>     IN: scratchpad "python" which .
>>>     "/usr/bin/python"
>>>
>>> - The trouble appears to be with reporting my PATH properly, via getenv:
>>>
>>>     IN: scratchpad USE: environment
>>>     IN: scratchpad "PATH" os-env .
>>>     "/usr/bin:/bin:/usr/sbin:/sbin"
>>>     IN: scratchpad USE: unix.ffi
>>>     IN: scratchpad "PATH" getenv .
>>>     "/usr/bin:/bin:/usr/sbin:/sbin"
>>>     IN: scratchpad \ getenv see
>>>     USING: alien.c-types alien.syntax ;
>>>     IN: unix.ffi
>>>     LIBRARY: libc FUNCTION: c-string getenv ( c-string name ) ; inline
>>>
>>> - Here's my actual PATH, as seen in the terminal:
>>>
>>>     ➜ ~ git:(master) ✗ echo $PATH
>>>     /usr/local/bin:/usr/local/opt/ruby/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/cwalston/factor:/Users/cwalston/bin:/usr/local/go/bin:/usr/local/lib/node_modules:/usr/local/narwhal/bin:/usr/texbin:/usr/X11/bin:/usr/local/sbin:/Users/cwalston/.gem/ruby/1.8/bin:/Applications/Mozart.app/Contents/Resources/bin
>>>
>>> - whereby which correctly finds couchdb:
>>>
>>>     ➜ ~ git:(master) ✗ which couchdb
>>>     /usr/local/bin/couchdb
>>>
>>> So, Factor's which (et al.) doesn't search beyond /usr/bin:/bin:/usr/sbin:/sbin. Reading through man getenv (GETENV(3), on OSX 10.6.8), doesn't give me a clue as to how to rectify this short-sightedness via the libc getenv. This is probably a side issue to my docsplit quandary (but maybe not).
>>>
>>> Anyone see a way to report my actual PATH to which in Factor? My PATH is augmented in my .zshrc. I don't understand why the libc function doesn't read it. Odd, indeed!
>>
>> If you're starting Factor from the Finder, you're not going to get a PATH set from your .profile or other shell dotfiles, since UI apps are launched under the loginwindow session and not under any shell. To set environment variables for UI apps, try setting them in ~/.MacOSX/environment.plist:
>>
>> https://developer.apple.com/library/mac/documentation/MacOSX/Conceptual/BPRuntimeConfig/Articles/EnvironmentVars.html
>>
>> -Joe
>
> --
> ~ Memento Amori

___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk
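Joe's point about inherited environments can be tried out in a terminal. What follows is only a rough sketch, not something posted in the thread: `child_path` is a hypothetical helper name, and `env -i` is used as a stand-in for "launched without a shell's dotfiles", which is an approximation of what the Finder does rather than a reproduction of it.

```shell
# Show the PATH a child process sees.
# With no argument the child inherits this shell's environment; with an
# argument it is started from an empty environment holding only that PATH,
# roughly like an app that was not launched from a shell.
child_path() {
    if [ "$#" -eq 0 ]; then
        /bin/sh -c 'printf "%s\n" "$PATH"'
    else
        env -i PATH="$1" /bin/sh -c 'printf "%s\n" "$PATH"'
    fi
}

# What a dotfile such as .zshrc does: extend PATH for this shell's children.
PATH="$PATH:/Users/cwalston/bin"
export PATH

child_path                                  # inherits the extended PATH
child_path /usr/bin:/bin:/usr/sbin:/sbin    # sees only what it was given
```

The second call never learns about the directory added by the `export`, which matches what `os-env` and the libc `getenv` reported in the listener: they return whatever environment the process was started with, and nothing re-reads the dotfiles.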
Re: [Factor-talk] OCR via docsplit in Factor
Hi John-

Beg pardon, I should have mentioned earlier that since docsplit plants a .txt file in the target pdf's directory on its own, with no other output, I had gone the route you suggested, but to no avail, i.e.,

    "docsplit text --no-clean -l path" run-process drop

In the terminal,

    cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf

works fine. The surprise is that, in the listener, the phrase:

    "cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf" run-process .

- returns with status 0, but leaves no file. Ditto using /full/path/to/docsplit in the command.

The docsplit bin alias (/usr/local/opt/ruby/bin/docsplit) resolves to /usr/local/Cellar/ruby/2.1.0/bin/docsplit (installed w/ homebrew). There I find this ruby script:

    require 'rubygems'

    version = ">= 0"

    if ARGV.first
      str = ARGV.first
      str = str.dup.force_encoding("BINARY") if str.respond_to? :force_encoding
      if str =~ /\A_(.*)_\z/
        version = $1
        ARGV.shift
      end
    end

    gem 'docsplit', version
    load Gem.bin_path('docsplit', 'docsplit', version)

If I manage to decipher this, I'll try to translate it in Factor, and invoke docsplit that way. That should keep me busy for a while. Worth a try, though I know zip about ruby. Once past this boondoggle, I already have Factor code that walks the tree and collates the files.

Thanks!
~cw

On Sun, Feb 9, 2014 at 4:31 AM, John Benediktsson mrj...@gmail.com wrote:

> If you get lost in path land you can always take a break and use the /full/path/to/docsplit.
Re: [Factor-talk] OCR via docsplit in Factor
It's probably easiest to specify the full path to the file, like I did in my previous message. Combined with the full path to the docsplit binary/link (for your particular problem), it should theoretically work fine:

    "/full/path/to/docsplit text --no-clean -l chi_sim /path/to/1_long_gu/long_gu001.pdf" try-process

On Sun, Feb 9, 2014 at 1:00 PM, CW Alston cwalsto...@gmail.com wrote:

> Hi John-
> [...] The surprise is that, in the listener, the phrase [...] returns with status 0, but leaves no file. Ditto using /full/path/to/docsplit in the command. [...]
Re: [Factor-talk] OCR via docsplit in Factor
As a follow-up, from Factor you can use `with-directory-files` (http://docs.factorcode.org/content/word-with-directory-files,io.directories.html) and `absolute-path` (http://docs.factorcode.org/content/word-absolute-path,io.pathnames.html) to get full paths to the files in some directory:

```
IN: scratchpad "/home/alex/factor/core" [ [ absolute-path . ] each ] with-directory-files
"/home/alex/factor/core/generic"
"/home/alex/factor/core/parser"
"/home/alex/factor/core/sorting"
[etc]
```

On Sun, Feb 9, 2014 at 1:53 PM, Alex Vondrak ajvond...@gmail.com wrote:

> It's probably easiest to specify the full path to the file, like I did in my previous message. [...]
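For comparison with the terminal side of this thread, the same idea (turn the entries of a directory into absolute paths before handing them to an external tool) might look like the sketch below in plain shell. `list_absolute` is a hypothetical helper name, not part of docsplit or Factor, and this is not a translation of the Factor words.

```shell
# Print the absolute path of every entry in a directory, one per line.
# $1: directory to list (relative or absolute).
list_absolute() {
    # Resolve the directory once, in a subshell, so the caller's own
    # working directory is left alone.
    dir=$(cd "$1" && pwd) || return 1
    for entry in "$dir"/*; do
        # Skip the unexpanded pattern when the directory is empty.
        [ -e "$entry" ] || continue
        printf '%s\n' "$entry"
    done
}

list_absolute .
```

Each printed line can then be passed to a command without caring which directory that command happens to run in.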
Re: [Factor-talk] OCR via docsplit in Factor
Hi Alex-

Thanks, I did try

    "/full/path/to/docsplit text --no-clean -l chi_sim /path/to/1_long_gu/long_gu001.pdf" try-process

using both the symlink and the resolved executable:

    /usr/local/opt/ruby/bin/docsplit
    /usr/local/Cellar/ruby/2.1.0/bin/docsplit

but still no response, still status 0. A lightbulb went on, and I set a duplicate symlink in /usr/bin/docsplit (where Factor's which can find it) straight to /usr/local/Cellar/ruby/2.1.0/bin/docsplit:

    IN: scratchpad "docsplit" which .
    "/usr/bin/docsplit"

-ok, but still no success with anything in io.launcher. Oy!

I see on the web that this problem calling docsplit isn't confined to Factor. Help calls appear in Plone-Users (http://sourceforge.net/mailarchive/message.php?msg_id=29982797) and stackoverflow re python (http://stackoverflow.com/questions/18237442/execute-shell-commands-in-python-to-use-docsplit).

Let me dig around some more; this sticky wicket must have a workaround...

~cw

On Sun, Feb 9, 2014 at 2:16 PM, Alex Vondrak ajvond...@gmail.com wrote:

> As a follow-up, from Factor you can use `with-directory-files` and `absolute-path` to get full paths to the files in some directory [...]
Re: [Factor-talk] OCR via docsplit in Factor
Strange. Well, not actually strange, since many programs aren't great about return codes...but still!

I decided to re-enact the issue by removing /usr/local/bin (where my docsplit was installed) from my PATH, starting Factor, and trying it out. Looks like docsplit is dumping the txt file in the current working directory:

    IN: scratchpad "docsplit" which .
    f
    IN: scratchpad "docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> .
    255
    IN: scratchpad "/usr/local/bin/docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> .
    0
    IN: scratchpad "/tmp/thesis.txt" exists? .
    f
    IN: scratchpad "thesis.txt" exists? .
    t

Seems as though you need to tell Factor to run in another working directory:

    IN: scratchpad "/tmp" [ "/usr/local/bin/docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> . ] with-directory
    0
    IN: scratchpad "/tmp/thesis.txt" exists? .
    t

By the way, turns out you can set the `environment` slot of an io.launcher process, so I was thinking maybe that would help, but...

    IN: scratchpad <process> "docsplit text --no-clean -l eng /tmp/thesis.pdf" >>command "/tmp/stdout.txt" >>stdout +stdout+ >>stderr { { "PATH" "/usr/local/bin" } } >>environment run-process status>> .
    1
    IN: scratchpad "/tmp/stdout.txt" utf8 file-contents print
    sh: 1: pdftotext: not found

Damn. No dice. Looks like you'll have to fix the PATH issue on the system itself.

Anyway, hope that helps.

(P.S.: Charles, if you're getting this message again, it's because I think GMail might've screwed up the reply behavior and didn't send this to the list, so I'm re-sending it.)

On Sun, Feb 9, 2014 at 3:13 PM, CW Alston cwalsto...@gmail.com wrote:

> Hi Alex-
> Thanks, I did try [...] using both the symlink and the resolved executable [...] but still no response, still status 0. [...]
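One plausible reading of the `pdftotext: not found` result is that handing the child a PATH of just /usr/local/bin replaces the inherited PATH instead of extending it, so docsplit is found but the helpers it spawns are not. The shell sketch below only illustrates that difference; `resolve_with_path` is a hypothetical helper name, and where `pdftotext` actually lives on any given machine is an assumption.

```shell
# Look up a command by name under a given PATH, the way a child process
# started with that PATH would. Prints the resolved path, or nothing.
# $1: PATH value to give the child; $2: command name to resolve.
resolve_with_path() {
    env -i PATH="$1" /bin/sh -c 'command -v "$1"' sh "$2"
}

# Replacing PATH outright: only the one directory is searched.
resolve_with_path /usr/local/bin pdftotext \
    || echo "pdftotext: not found with PATH replaced"

# Extending PATH: the extra directory goes in front, the rest stays.
resolve_with_path "/usr/local/bin:/usr/bin:/bin" pdftotext \
    || echo "pdftotext: not found with PATH extended"
```

If that reading is right, building the child's PATH as the extra directory followed by the existing value would be the thing to try before giving up on the `environment` slot.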
Re: [Factor-talk] OCR via docsplit in Factor
Yeah, Alex-

I would have thought the cd in my compound command string would take care of the current directory issue. There's another thread about this problem (http://www.programmingrelief.com/3213645/Docsplit-Works-Fine-In-Command-Line-But-Ignores-Code-In-Ruby-Script%3F) that finds docsplit returning files in the root directory - on my system no files are winding up there.

Let me see what I can do w/ your path/environment suggestions. Gonna be another long night...

Thanks much,
~cw

On Sun, Feb 9, 2014 at 4:08 PM, Alex Vondrak ajvond...@gmail.com wrote:

> [...] Looks like docsplit is dumping the txt file in the current working directory [...] Seems as though you need to tell Factor to run in another working directory [...]
Re: [Factor-talk] OCR via docsplit in Factor
Thing is, `cd` isn't a binary that Factor can execute in a process. It's just a shell command implemented by bash or zsh or whatever you use. Same with the semicolon syntax, for that matter. You might try to finagle something like

    IN: scratchpad { "sh" "-c" "cd /tmp ; pwd" } utf8 [ contents . ] with-process-reader
    "/tmp\n"

Not sure how the PATH stuff will work out with that, though.

You could also try just using the `-o` flag to docsplit. Again, deliberately messing up my PATH so Factor can't run docsplit directly:

    IN: scratchpad "docsplit" which .
    f
    IN: scratchpad "/tmp/thesis.pdf" exists? .
    t
    IN: scratchpad "/tmp/thesis.txt" exists? .
    f
    IN: scratchpad "/usr/local/bin/docsplit text --no-clean -l eng /tmp/thesis.pdf -o /tmp" try-process
    IN: scratchpad "/tmp/thesis.txt" exists? .
    t

On Sun, Feb 9, 2014 at 5:02 PM, CW Alston cwalsto...@gmail.com wrote:

Yeah, Alex-

I would have thought the cd in my compound command string would take care of the current directory issue. There's another thread about this problem ( http://www.programmingrelief.com/3213645/Docsplit-Works-Fine-In-Command-Line-But-Ignores-Code-In-Ruby-Script%3F ) that finds docsplit returning files in the root directory - on my system no files are winding up there.

Let me see what I can do w/ your path/environment suggestions. Gonna be another long night...

Thanks much,
~cw

On Sun, Feb 9, 2014 at 4:08 PM, Alex Vondrak ajvond...@gmail.com wrote:

Strange. Well, not actually strange, since many programs aren't great about return codes...but still!

I decided to re-enact the issue by removing /usr/local/bin (where my docsplit was installed) from my PATH, starting Factor, and trying it out. Looks like docsplit is dumping the txt file in the current working directory:

    IN: scratchpad "docsplit" which .
    f
    IN: scratchpad "docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> .
    255
    IN: scratchpad "/usr/local/bin/docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> .
    0
    IN: scratchpad "/tmp/thesis.txt" exists? .
    f
    IN: scratchpad "thesis.txt" exists? .
    t

Seems as though you need to tell Factor to run in another working directory:

    IN: scratchpad "/tmp" [ "/usr/local/bin/docsplit text --no-clean -l eng /tmp/thesis.pdf" run-process status>> . ] with-directory
    0
    IN: scratchpad "/tmp/thesis.txt" exists? .
    t

By the way, turns out you can set the `environment` slot of an io.launcher process, so I was thinking maybe that would help, but...

    IN: scratchpad <process> "docsplit text --no-clean -l eng /tmp/thesis.pdf" >>command "/tmp/stdout.txt" >>stdout +stdout+ >>stderr { { "PATH" "/usr/local/bin" } } >>environment run-process status>> .
    1
    IN: scratchpad "/tmp/stdout.txt" utf8 file-contents print
    sh: 1: pdftotext: not found

Damn. No dice. Looks like you'll have to fix the PATH issue on the system itself.

Anyway, hope that helps.
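For readers following along outside Factor, here is a minimal shell sketch of the point made in this message: `cd` only changes the working directory of the shell that runs it, so it has to be wrapped in an `sh -c` child to have any effect on the command that follows. The scratch directory and the variable names are assumptions made for the illustration; nothing here comes from docsplit itself.

```shell
#!/bin/sh
# Hypothetical scratch directory, created only for this illustration.
workdir=$(mktemp -d)
expected_dir=$(cd "$workdir" && pwd -P)
start_dir=$(pwd)

# `cd` runs inside the child shell started by `sh -c`, so only the
# child's working directory changes; `pwd -P` reports where it ended up.
child_dir=$(sh -c 'cd "$1" && pwd -P' sh "$workdir")

# The parent shell is still where it started.
after_dir=$(pwd)

echo "child ran in: $child_dir"
echo "parent still: $after_dir"

rmdir "$workdir"
```

The same reasoning explains why a tool that writes into its current directory drops its output wherever the launching process happens to be, unless the launcher sets the working directory or the tool is given an explicit output directory.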
Re: [Factor-talk] OCR via docsplit in Factor
Lord love a duck, Alex -

I didn't realize that builtins like `cd` are 'existentially' different from utilities like `cat` - (I only speak pidgin unix; bites me often). Thanks for the heads-up.

Okay... I'll try moving|copying my target directory into my home folder, to obviate the need for any cd'ing (I hope), pass docsplit an array of pdfs and flags; or maybe have docsplit iterate over a tmp file containing lines like:

    chi_sim long_gu001.pdf
    eng long_gu002.pdf
    eng long_gu003.pdf
    ...

Probably have to do this in a script. Never a dull moment.

~cw
Re: [Factor-talk] OCR via docsplit in Factor
I can't tell you what's wrong with code you haven't provided, but...

```
IN: scratchpad USING: io.files io.launcher io.encodings.ascii tools.which ;
IN: scratchpad "docsplit" which .
"/usr/local/bin/docsplit"
IN: scratchpad "/tmp/cv.pdf" exists? .
t
IN: scratchpad "/tmp/cv.txt" exists? .
f
IN: scratchpad "docsplit text --no-clean -l eng /tmp/cv.pdf" try-process
IN: scratchpad "/tmp/cv.txt" exists? .
t
IN: scratchpad "/tmp/cv.txt" ascii file-lines first .
"Alex Vondrak"
```

On Sat, Feb 8, 2014 at 2:32 AM, CW Alston cwalsto...@gmail.com wrote:

Hi folks -

I am thrilled to find a versatile open-source optical character recognition engine called docsplit. I've got it installed easily as a ruby gem, and it works just great on my Mac as a shell command (it also provides a ruby module):

    ➜ ~ git:(master) ✗ which docsplit
    /usr/local/opt/ruby/bin/docsplit
    ➜ ~ git:(master) ✗

I need such a tool to extract text from a deep directory tree, with a couple thousand folders. Each leaf folder contains 3-6 scanned pdfs (in Chinese & English), from which docsplit makes a plaintext (.txt) file with the same basename, deposited in the same leaf directory. My Factor vocab can easily visit each leaf dir & prepare to pass each pdf there to docsplit in the format it happily handles in the terminal (I use oh-my-zsh & iTerm2). My Factor code chokes on this intermediate step, trying to call docsplit.

Going to the terminal, I have to first cd to the directory containing the pdfs, e.g.,

    ➜ ~ git:(master) ✗ cd /path/to/1_long_gu

then call docsplit with the appropriate flags on each pdf:

    ➜ 1_long_gu git:(master) ✗ docsplit text --no-clean -l chi_sim long_gu001.pdf
    ➜ 1_long_gu git:(master) ✗ docsplit text --no-clean -l eng long_gu002.pdf

etc., for each pdf. docsplit gives back a bunch of text files in the dir, like /path/to/1_long_gu/long_gu001.txt

In the terminal, even a compound phrase like the following works without a hitch:

    ➜ ~ git:(master) ✗ cd /path/to/1_long_gu ; docsplit text --no-clean -l chi_sim long_gu001.pdf ; docsplit text --no-clean -l eng long_gu002.pdf ; docsplit text --no-clean -l eng long_gu003.pdf ;...
    ➜ 1_long_gu git:(master) ✗

So, working from the terminal, I wind up with a series of text files in /path/to/1_long_gu that my Factor vocab amalgamates into a single text file (with whitespace in filename), e.g., /path/to/1_long_gu/long gu.txt, which I can edit for mistakes, and upload to a couchdb database. Joy!

But I haven't been able to work out how to accomplish this docsplit call from Factor code. I have no problem traversing the directory tree (Factor's word each-file & the like come in very handy). I've experimented with io.launcher, io.pipes, shell scripts (bash, zsh, factor), autoload shell functions, but flunked out. No errors with io.launcher tries; just no result. Need to learn something here. I routinely launch couchdb as a detached process. It would be such a boon to use docsplit in Factor.

After a couple weeks lost at sea with this, I'm broadcasting a Mayday. Any suggestions?

Thanks in advance,
~cw

--
~ Memento Amori

___
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk
Re: [Factor-talk] OCR via docsplit in Factor
Well if you want process output, you can do something like:

    { "docsplit" "text" "--no-clean" "-l" path } utf8 [ lines ] with-process-reader

or without output, using a single command string:

    "docsplit text --no-clean -l path" run-process drop

You can docsplit a directory of files:

    : docsplit ( file -- )
        { "docsplit" "text" "--no-clean" "-l" } swap suffix run-process drop ;

    : docsplit-all ( path -- )
        directory-files [ docsplit ] each ;

And concatenate all the files in a directory:

    # bash
    ls *.factor | sort | xargs -I '{}' cat '{}'

    # factor
    : cat-results ( path -- lines )
        directory-files [ ".txt" tail? ] filter natural-sort [ file-lines ] map concat ;

Or something like that. Which part are you having problems with?

Best,
John.

On Sat, Feb 8, 2014 at 2:32 AM, CW Alston cwalsto...@gmail.com wrote:

Hi folks - I am thrilled to find a versatile open-source optical character recognition engine called docsplit http://documentcloud.github.io/docsplit/. [...]
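As a point of comparison with the Factor words suggested above, the per-directory batch can be sketched in plain shell. This is only a sketch under stated assumptions: `ocr_dir` and the `DOCSPLIT` variable are hypothetical names introduced here, and the docsplit flags are limited to the ones that appear in this thread (`text`, `--no-clean`, `-l`, `-o`). Passing `-o` makes the output location explicit, so the result does not depend on the caller's working directory.

```shell
#!/bin/sh
# DOCSPLIT may be set to a full path when the binary is not on PATH
# (hypothetical variable, introduced for this sketch).
DOCSPLIT=${DOCSPLIT:-docsplit}

# ocr_dir DIR [LANG]: run docsplit on every PDF in DIR, writing the
# .txt files back into DIR. LANG defaults to eng.
ocr_dir() {
    dir=$1
    lang=${2:-eng}
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        "$DOCSPLIT" text --no-clean -l "$lang" "$pdf" -o "$dir" || return 1
    done
}
```

A call such as `ocr_dir /path/to/1_long_gu chi_sim` would then process one leaf directory with a single language; files that need different languages would need a per-file call instead.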
Re: [Factor-talk] OCR via docsplit in Factor
Thanks for the replies. Maybe a clue here - I get this from which:

    IN: scratchpad USE: tools.which
    IN: scratchpad "docsplit" which .
    f
    IN: scratchpad "couchdb" which .
    f
    IN: scratchpad "ruby" which .
    f

Whereas in the terminal:

    ➜ ~ git:(master) ✗ which docsplit
    /usr/local/opt/ruby/bin/docsplit
    ➜ ~ git:(master) ✗ which couchdb
    /usr/local/bin/couchdb
    ➜ ~ git:(master) ✗ which ruby
    /usr/local/bin/ruby

Let me try moving up to the most recent development release & see if the problem disappears. I'll get back to you.

Best,
~cw
Re: [Factor-talk] OCR via docsplit in Factor
That's odd, Factor's which just looks in the $PATH for your executable.

    IN: scratchpad "PATH" os-env

You can read a bit about how it's implemented cross-platform:

http://re-factor.blogspot.com/2013/01/which.html
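To make "just looks in the $PATH" concrete, here is a rough shell sketch of that kind of lookup. It is not Factor's implementation (the blog post linked above covers that); `find_in_path` is a hypothetical helper written for this illustration. It walks the colon-separated PATH entries in order and prints the first executable regular file with the requested name.

```shell
#!/bin/sh
# find_in_path NAME: print the first executable called NAME found in
# the directories listed in $PATH; return 1 if there is none.
find_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # Require an executable that is not a directory.
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

The consequence for this thread is that a lookup of this kind can only ever be as good as the PATH value the running process inherited, which is why checking that value is the natural first diagnostic.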
Re: [Factor-talk] OCR via docsplit in Factor
Hi -

Ok, I've upgraded using factor-macosx-x86-32-2013-07-25-14-21.dmg, still Version 0.97. Same issue with Factor's which:

    IN: scratchpad USE: tools.which
    IN: scratchpad "couchdb" which .
    f
    IN: scratchpad "python" which .
    "/usr/bin/python"

- The trouble appears to be with reporting my PATH properly, via getenv:

    IN: scratchpad USE: environment
    IN: scratchpad "PATH" os-env .
    "/usr/bin:/bin:/usr/sbin:/sbin"
    IN: scratchpad USE: unix.ffi
    IN: scratchpad "PATH" getenv .
    "/usr/bin:/bin:/usr/sbin:/sbin"
    IN: scratchpad \ getenv see
    USING: alien.c-types alien.syntax ;
    IN: unix.ffi
    LIBRARY: libc FUNCTION: c-string getenv ( c-string name ) ; inline

- Here's my actual PATH, as seen in the terminal:

    ➜ ~ git:(master) ✗ echo $PATH
    /usr/local/bin:/usr/local/opt/ruby/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/cwalston/factor:/Users/cwalston/bin:/usr/local/go/bin:/usr/local/lib/node_modules:/usr/local/narwhal/bin:/usr/texbin:/usr/X11/bin:/usr/local/sbin:/Users/cwalston/.gem/ruby/1.8/bin:/Applications/Mozart.app/Contents/Resources/bin

- whereby which correctly finds couchdb:

    ➜ ~ git:(master) ✗ which couchdb
    /usr/local/bin/couchdb

So, Factor's which (et al.) doesn't search beyond /usr/bin:/bin:/usr/sbin:/sbin. Reading through man getenv (GETENV(3), on OSX 10.6.8) doesn't give me a clue as to how to rectify this short-sightedness via the libc getenv.

This is probably a side issue to my docsplit quandary (but maybe not). Anyone see a way to report my actual PATH to which in Factor? My PATH is augmented in my .zshrc. I don't understand why the libc function doesn't read it. Odd, indeed!

~cw
Re: [Factor-talk] OCR via docsplit in Factor
On Sat, Feb 8, 2014 at 7:30 PM, CW Alston cwalsto...@gmail.com wrote:

[...] Anyone see a way to report my actual PATH to which in Factor? My PATH is augmented in my .zshrc. I don't understand why the libc function doesn't read it. Odd, indeed!

If you're starting Factor from the Finder, you're not going to get a PATH set from your .profile or other shell dotfiles, since UI apps are launched under the loginwindow session and not under any shell. To set environment variables for UI apps, try setting them in ~/.MacOSX/environment.plist:

https://developer.apple.com/library/mac/documentation/MacOSX/Conceptual/BPRuntimeConfig/Articles/EnvironmentVars.html

-Joe
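The effect described here can be reproduced from a terminal by starting a child process with a deliberately minimal environment. The sketch below is an illustration only: `demo-tool` and the temporary directory are hypothetical stand-ins for something like docsplit installed outside the default directories, and the minimal PATH value is the one Factor reported earlier in the thread.

```shell
#!/bin/sh
# The PATH that Factor reported in the thread when started from the Finder.
gui_path=/usr/bin:/bin:/usr/sbin:/sbin

# Hypothetical tool installed outside that PATH, created for the demo.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\necho hello\n' > "$demo_dir/demo-tool"
chmod +x "$demo_dir/demo-tool"

# With only the minimal PATH, the lookup fails...
if env -i PATH="$gui_path" sh -c 'command -v demo-tool' >/dev/null 2>&1; then
    found_minimal=yes
else
    found_minimal=no
fi

# ...and it succeeds once the directory is prepended, which is what a
# shell dotfile does for terminal sessions only.
found_augmented=$(env -i PATH="$demo_dir:$gui_path" sh -c 'command -v demo-tool')

echo "minimal PATH finds demo-tool: $found_minimal"
echo "augmented PATH finds: $found_augmented"
```

Starting Factor from a terminal session, rather than from the Finder, would hand it the augmented PATH in the same way.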