On Thu, 6 May 2021, Alex wrote:
> Hi,
> I'm trying to use the latest ExtractText plugin, but the docx2txt
> program the plugin references is no longer available from
> http://docx2txt.sourceforge.net
> Do you have any recommendations for an alternative...?

Perhaps one of (from Stack Overflow):

  unzip -p some.docx word/document.xml |\
    sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

  unzip -p document.docx word/document.xml |\
    sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

  unzip -p document.docx word/document.xml |\
    sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'

...though html2text might be better than sed for reliably de-XMLizing the
document text.
There's also this:
http://abisource.com/downloads/wv/
There's conflicting information on whether Antiword groks .docx; you may
want to try it and see. It may be available from your distro, otherwise:
http://www.winfield.demon.nl/index.html
It might be worthwhile to use native Perl utilities to unzip the file,
extract the document.xml content and pass it through XML::XPath to extract
the text, but that would probably involve code changes to ExtractText
rather than just configuring it to use an external utility.
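
For illustration only, here is a rough sketch of that unzip-then-parse idea.
It is written in Python rather than Perl, so it is not something ExtractText
could use as-is, and the function name docx_to_text is made up for this
example. It assumes the body text lives in word/document.xml and ignores
headers, footers, footnotes and anything else stored in other parts of the
archive.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx, one output line per paragraph.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper for illustration; not part of any plugin.
    """
    # A .docx is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    lines = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    for para in root.iter(W + "p"):
        lines.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(lines)
```

Calling docx_to_text("some.docx") would return the text as a single string.
Unlike the sed one-liners, a real XML parser also decodes entities such as
&amp; and does not depend on how the tags happen to be laid out.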
Caveat: I have never looked at the ExtractText plugin.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/