[ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067319#comment-15067319
 ] 

Nick Burch commented on TIKA-1817:
----------------------------------

Any chance you could upload a small sample DXF file? Ideally one with the same / 
similar metadata and contents as our other AutoCAD files, but failing that, 
anything with known contents.

The first task will be using that to get detection working properly, so if you 
know the MIME type for these files, that'll help!

Once we have detection, then it's a question of parsing. That should be quite 
quick to do, from the sound of it, and might even be a good starting point for 
one of our new volunteers for the project :) Either way, needs some test files!
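
As a rough illustration of the parsing step only (not Tika code, and written in 
Python purely for brevity), here is a minimal sketch. It assumes the usual ASCII 
DXF layout of alternating group-code and value lines, and that the text worth 
keeping lives in group codes 1 and 3 of TEXT and MTEXT entities plus group code 
999 comments. The function name and the choice of entity types are hypothetical, 
and MTEXT inline formatting codes are left untouched.

```python
import io

# Assumption: only these entity types carry display text in this sketch.
TEXT_ENTITIES = {"TEXT", "MTEXT"}


def extract_dxf_text(stream):
    """Collect human-readable strings from an ASCII DXF text stream.

    Hypothetical helper: reads (group code, value) line pairs and keeps
    comments plus the text values of TEXT / MTEXT entities.
    """
    found = []
    entity = None  # type of the entity currently being read
    while True:
        code_line = stream.readline()
        value_line = stream.readline()
        if not code_line or not value_line:
            break  # end of input (or a dangling group code)
        try:
            code = int(code_line.strip())
        except ValueError:
            break  # not a group code; stop rather than guess
        value = value_line.rstrip("\r\n")
        if code == 0:
            # Group code 0 starts a new entity or section marker.
            entity = value.strip()
        elif code == 999:
            # Group code 999 is a free-form comment.
            found.append(value)
        elif code in (1, 3) and entity in TEXT_ENTITIES:
            # 1 = primary text value, 3 = additional MTEXT text chunks.
            found.append(value)
    return found


# Hand-written sample, not a real AutoCAD export.
sample = (
    "999\nmade by hand\n"
    "0\nSECTION\n2\nENTITIES\n"
    "0\nTEXT\n8\nLayer0\n10\n1.5\n1\nHello DXF\n"
    "0\nENDSEC\n0\nEOF\n"
)
texts = extract_dxf_text(io.StringIO(sample))
```

Everything else in the file (layer names, coordinates, handles) is skipped, 
which is the behaviour the issue asks for. A real parser would still need the 
test files to confirm which entities and group codes matter in practice.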

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
