ExtractText and docx

Alex Thu, 06 May 2021 18:20:48 -0700

Hi,

I'm trying to use the latest ExtractText plugin, but the docx2txt
program the plugin references is no longer available from
http://docx2txt.sourceforge.net


I've located a working replacement at
https://github.com/ankushshah89/python-docx2txt/ (it's written in
Python, and I don't have a distro package for it), but it doesn't
appear to write its output to stdout.

extracttext_external  docx2txt   /usr/local/bin/docx2txt {} -
extracttext_use       docx2txt   .docx application/docx

Do you have any recommendations for an alternative or how to modify
this python script to pipe its text to stdout?

# /usr/local/bin/docx2txt -h
usage: docx2txt [-h] [-i IMG_DIR] docx

A pure python-based utility to extract text and images from docx files.

positional arguments:
  docx                  path of the docx file

optional arguments:
  -h, --help            show this help message and exit
  -i IMG_DIR, --img_dir IMG_DIR
                        path of directory to extract images
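
For reference, here is a rough sketch of the kind of wrapper I have in
mind. It is only an assumption about what would work: it does not use
python-docx2txt at all, but reads word/document.xml from the .docx
archive with the Python standard library and prints the paragraph text
to stdout. The script name "docx2stdout" is hypothetical, and headers,
footers, footnotes and images are ignored.

```python
#!/usr/bin/env python3
"""Print the body text of a .docx file to stdout (standard library only)."""
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx, one line per paragraph."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t" and node.text:
                parts.append(node.text)      # a run of literal text
            elif node.tag == W + "tab":
                parts.append("\t")           # explicit tab
            elif node.tag in (W + "br", W + "cr"):
                parts.append("\n")           # line break inside a paragraph
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: docx2stdout FILE.docx\n")
        return 2
    sys.stdout.write(docx_to_text(argv[1]) + "\n")
    return 0


# Only act as a command-line tool when a file argument was given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

If something like this were installed as /usr/local/bin/docx2stdout
(again, a made-up path), I assume the plugin line would only need the
{} filename placeholder and no trailing "-", since the script always
writes to stdout.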

Also, has anyone written any meta rules for use with ExtractText that
they'd like to share? I'd like to block all PDF files that contain any
type of JavaScript, malicious or otherwise. I'd also like to block
all PDFs that are a single page and contain a single URL, since that
appears to describe the vast majority of malicious PDFs.
