Control: severity -1 minor
Control: retitle -1 antiword: give more helpful error for docx
On Sat, Aug 23, 2014 at 10:27:45AM +0200, Vincent Lefevre wrote:
> On a Microsoft Word 2007+ document (according to the "file" utility),
> I get:
>
> $ antiword test.docx
> test.docx is not a Word Document.
>
> which is wrong since the "file" utility correctly recognized this file
> as a Word document:
>
> $ file test.docx
> test.docx: Microsoft Word 2007+
This is the new-style XML-in-a-zip-container format, which is completely
different to the binary Microsoft Word formats, most of which antiword
handles.
There's no realistic likelihood of antiword supporting this - the last
antiword upstream release was 2005-10-21 (which predates this XML
format). The antiword package is still useful for handling the files it
handles, but I don't plan to take on maintaining the upstream code to
the extent of adding support for entirely new formats.
The error message isn't very helpful though - it was correct at the time
of the last upstream release, but arguably isn't now, as you point out.
We can at least improve that.
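
As a rough illustration of the kind of check that would allow a better
message (a sketch in Python, not antiword's actual C code): a docx file
is a ZIP archive containing a word/document.xml member, so it can be
told apart from other input by its leading bytes and its member list.
The names classify() and unsupported_message() are made up for this
example, and the message wording is only a suggestion.

```python
import io
import zipfile

# Leading bytes of an OLE2 compound file, the container used by the
# binary Word 97-2003 .doc format.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# Leading bytes of a ZIP local file header.
ZIP_MAGIC = b"PK\x03\x04"


def classify(data: bytes) -> str:
    """Return a coarse guess at the container type: ole2, docx, zip or unknown."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                # The main document part of a docx lives at this path.
                if "word/document.xml" in zf.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            pass  # starts like a ZIP but is truncated or corrupt
        return "zip"
    return "unknown"


def unsupported_message(name: str, data: bytes) -> str:
    """Hypothetical wording for a more helpful error message."""
    if classify(data) == "docx":
        return (f"{name} is a Word 2007+ (docx) document; "
                "this format is not supported.")
    return f"{name} is not a Word Document."
```

Only the docx case gets the new wording here; everything else falls
back to the existing message.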
Not sure what a good lightweight extractor for docx files is - I see
docx2txt in the archive, but I've never tried it.
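
For what it's worth, the format is simple enough that a crude extractor
fits in a few lines. The following is a minimal sketch using only the
Python standard library, not a description of how docx2txt works. It
assumes the main text is in word/document.xml and only collects the
text runs of each paragraph, so tables, headers, footnotes and so on
are not handled properly. The function name docx_to_text() is made up
for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source) -> str:
    """Return the paragraph text of a docx given a path or file-like object."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return "\n".join(paragraphs)
```

Usage would be along the lines of print(docx_to_text("test.docx")).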
Cheers,
Olly