[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

Nick Burch (JIRA) Tue, 22 Dec 2015 05:02:57 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068077#comment-15068077
 ]


Nick Burch commented on TIKA-1817:
----------------------------------

I've had a go at adding mime subtypes for binary and ascii DXF, as well as 
the related DXB, in r1721390. No unit tests though :( We need some suitable 
sample files before those can be written

With that in place, ascii dxf files should no longer end up routed to the text 
parser. That's probably slightly better, but not ideal... We really need 
someone to volunteer to write a proper parser!

Writing one shouldn't be too bad, especially for strings and metadata, along 
the lines of the DWG one we already have. 
http://www.fileformat.info/format/dxf/egff.htm seems a good overview of the 
file format, and there's also published stuff at 
http://www.autodesk.com/techpubs/autocad/acadr14/dxf/drawing_interchange_file_formats.htm
 that should help
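
To give a rough idea of the "strings" part, here is a minimal sketch, not a 
Tika parser. It assumes the usual ASCII DXF layout of alternating lines (a 
numeric group code, then its value) and keeps only group code 1 (an entity's 
primary text value) and 999 (comments). The class and method names are 
hypothetical and exist only for this illustration. Code 1 is also used for 
some string values outside the ENTITIES section, so a proper parser would 
need to track sections and entity types as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Tika.
public class DxfTextSketch {

    /** Collects the values of group codes 1 (text) and 999 (comment). */
    public static List<String> extractText(BufferedReader reader) throws IOException {
        List<String> out = new ArrayList<>();
        String codeLine;
        // ASCII DXF is read as pairs of lines: group code, then value.
        while ((codeLine = reader.readLine()) != null) {
            String value = reader.readLine();
            if (value == null) {
                break; // dangling group code at end of file
            }
            int code;
            try {
                code = Integer.parseInt(codeLine.trim());
            } catch (NumberFormatException e) {
                continue; // not a numeric group code, skip this pair
            }
            if (code == 1 || code == 999) {
                String text = value.trim();
                if (!text.isEmpty()) {
                    out.add(text);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Tiny hand-written sample: one comment and one TEXT entity.
        String sample = "999\nCreated by hand\n0\nSECTION\n2\nENTITIES\n"
                + "0\nTEXT\n8\n0\n10\n0.0\n20\n0.0\n40\n1.0\n1\nHello DXF\n"
                + "0\nENDSEC\n0\nEOF\n";
        List<String> text = extractText(new BufferedReader(new StringReader(sample)));
        System.out.println(text); // prints [Created by hand, Hello DXF]
    }
}
```

Layer names, coordinates and all the other numeric groups are dropped, which 
is the bulk of what currently ends up in the extracted text.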

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
