Re: ExtractText and docx

John Hardin Thu, 06 May 2021 21:30:37 -0700

On Thu, 6 May 2021, Alex wrote:

Hi,


I'm trying to use the latest ExtractText plugin, but the docx2txt
program the plugin references is no longer available from
http://docx2txt.sourceforge.net

Do you have any recommendations for an alternative...?


Perhaps one of (from Stack Overflow):

 unzip -p some.docx word/document.xml |\
   sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

 unzip -p document.docx word/document.xml |\
   sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

 unzip -p document.docx word/document.xml |\
   sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'

...though html2text might be better than sed for reliably de-XMLizing the document text.


There's also this:

  http://abisource.com/downloads/wv/

There's conflicting information on whether Antiword groks .docx; you may want to try it and see. It may be available from your distro, otherwise:


  http://www.winfield.demon.nl/index.html

It might be worthwhile to use native perl utilities to unzip the file, extract the document.xml content and pass it through XML::XPath to extract the text, but that would probably involve code changes to ExtractText rather than just configuring it to use an external utility.
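
To make the unzip-and-parse idea concrete, here is a rough sketch of the same steps in Python rather than Perl (purely for illustration; this is not how ExtractText works, and the helper names docx_to_text and make_sample_docx are made up for this example). It reads word/document.xml out of the archive with the standard zipfile module, walks the w:p paragraph elements with a real XML parser, and joins the w:t text runs. It assumes the usual WordprocessingML namespace and ignores headers, footers, footnotes and anything nested in text boxes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` is a path or a binary file-like object.
    Hypothetical helper for illustration; not part of ExtractText.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text lives in <w:t> runs.
    for para in root.iter(f"{{{W_NS}}}p"):
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml.

    This is demo input, not a complete .docx that Word would open.
    """
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(docx_to_text(make_sample_docx()))
```

Compared with the sed pipelines, a parser decodes entities such as &amp;amp; and keeps paragraph boundaries without guessing at tag shapes. For attachments arriving in mail you would also want to cap the uncompressed size and harden the XML parsing against malicious input, which this sketch does not attempt.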


Caveat: I have never looked at the ExtractText plugin.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [email protected]                         pgpk -a [email protected]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Are you a mildly tech-literate politico horrified by the level of
  ignorance demonstrated by lawmakers gearing up to regulate online
  technology they don't even begin to grasp? Cool. Now you have a
  tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
-----------------------------------------------------------------------
 2 days until the 76th anniversary of VE day
