Hi,

I'm trying to use the latest ExtractText plugin, but the docx2txt program the plugin references is no longer available from http://docx2txt.sourceforge.net

I've located a working replacement at https://github.com/ankushshah89/python-docx2txt/ (although it's written in Python and I don't have a distro package for it), but it doesn't appear to output to stdout. This is my current configuration:

    extracttext_external docx2txt /usr/local/bin/docx2txt {} -
    extracttext_use docx2txt .docx application/docx

Do you have any recommendations for an alternative, or for how to modify this Python script to pipe its text to stdout?

    # /usr/local/bin/docx2txt -h
    usage: docx2txt [-h] [-i IMG_DIR] docx

    A pure python-based utility to extract text and images from docx files.

    positional arguments:
      docx                  path of the docx file

    optional arguments:
      -h, --help            show this help message and exit
      -i IMG_DIR, --img_dir IMG_DIR
                            path of directory to extract images

Also, has anyone written any meta rules for use with ExtractText that they'd like to share? I'd like to block all PDF files that contain any type of JavaScript, malicious or otherwise. I'd also like to block all PDFs that are a single page and contain a single URL; that appears to describe the vast majority of malicious PDFs.
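On the stdout question, the kind of wrapper I have in mind is roughly the sketch below. It is not the python-docx2txt code: it uses only the Python standard library, reads just word/document.xml (so headers, footers and footnotes are skipped), and the function name docx_to_text is my own. Treat it as a starting point under those assumptions rather than a finished tool.

```python
#!/usr/bin/env python3
"""Print the body text of a .docx file to stdout (stdlib-only sketch)."""
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of the .docx at `path`, one line per paragraph.

    Hypothetical helper: only word/document.xml is read.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t" and node.text:
                parts.append(node.text)      # a run of literal text
            elif node.tag == W + "tab":
                parts.append("\t")           # explicit tab character
            elif node.tag in (W + "br", W + "cr"):
                parts.append("\n")           # explicit line break
        lines.append("".join(parts))
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Only the first argument is used; anything after it
    # (such as a trailing "-") is ignored.
    text = docx_to_text(sys.argv[1])
    sys.stdout.buffer.write(text.encode("utf-8") + b"\n")
```

Saved somewhere like /usr/local/bin/docx-stdout (a path I made up) and made executable, it could presumably be dropped into the extracttext_external line in place of /usr/local/bin/docx2txt, but I have not tested it against the plugin.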
