Cool! Yeah, I had actually been watching that JIRA issue and thinking about 
taking it on. I’m currently traveling with my family, but when I get back in a 
couple of days I would love to start tackling this project.

Happy Holidays,
Joey
 
> On Dec 22, 2015, at 5:02 AM, Nick Burch (JIRA) <j...@apache.org> wrote:
> 
> 
>    [ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068077#comment-15068077 ]
> 
> Nick Burch commented on TIKA-1817:
> ----------------------------------
> 
> I've had a go at adding mime subtypes for binary and ascii for DXF, as well 
> as the related DXB, in r1721390. No unit tests though :( Needs some suitable 
> sample files
> 
> With that in place, ascii dxf files should no longer end up routed to the 
> text parser. That's probably slightly better, but not ideal... We really need 
> someone to volunteer to write a proper parser!
> 
> Writing one shouldn't be too bad, especially for strings and metadata, along 
> the lines of the DWG one we already have. 
> http://www.fileformat.info/format/dxf/egff.htm seems a good overview of the 
> file format, and there's also published stuff at 
> http://www.autodesk.com/techpubs/autocad/acadr14/dxf/drawing_interchange_file_formats.htm
>  that should help
> 
>> Extracts entire file content for ASCII DXF files
>> ------------------------------------------------
>> 
>>                Key: TIKA-1817
>>                URL: https://issues.apache.org/jira/browse/TIKA-1817
>>            Project: Tika
>>         Issue Type: Bug
>>   Affects Versions: 1.11
>>           Reporter: Zoltan Toth
>>        Attachments: jcsample-screendump.jpg, jcsample.dxf
>> 
>> 
>> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
>> majority of their content is not intended to be human readable (see 
>> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
>> Tika simply "extracts" the entire content of the file instead of the 
>> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
>> This results in massive amounts of rubbish data being returned with dire 
>> consequences for applications that rely on this.
>> It would be nice if only the human-readable text fields were extracted.  
>> Failing this, it would still be nice if no text was extracted from these 
>> files at all.  
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
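
For anyone picking this up, here is a rough sketch of what "only the human-readable text fields" could mean for an ASCII DXF file. It is not the Tika parser and uses no Tika API; the class name DxfTextSketch and the method extractText are made up for illustration. It relies on the DXF layout of line pairs (a numeric group code, then its value) and assumes that the interesting strings are group code 999 comments plus group codes 1 and 3 inside TEXT and MTEXT entities. A real parser would need to cover more entity types and deal with encodings and MTEXT formatting codes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; not part of Tika. */
final class DxfTextSketch {

    /**
     * Collects human-readable strings from ASCII DXF content.
     * ASCII DXF is read as pairs of lines: a group code, then its value.
     */
    static List<String> extractText(String dxf) {
        List<String> text = new ArrayList<>();
        String[] lines = dxf.split("\\R");
        String entity = ""; // value of the most recent group code 0

        // Walk the content two lines at a time: group code, then value.
        for (int i = 0; i + 1 < lines.length; i += 2) {
            int code;
            try {
                code = Integer.parseInt(lines[i].trim());
            } catch (NumberFormatException e) {
                break; // not a well-formed pair; stop rather than guess
            }
            String value = lines[i + 1].trim();

            if (code == 0) {
                entity = value; // start of a new entity or section marker
            } else if (code == 999) {
                text.add(value); // file comment
            } else if ((code == 1 || code == 3)
                    && (entity.equals("TEXT") || entity.equals("MTEXT"))) {
                text.add(value); // text string carried by a text entity
            }
        }
        return text;
    }
}
```

Everything else (coordinates, layer names, handles, and so on) is skipped, which is the behaviour the reporter asked for. Whether to keep comments at all is a judgment call.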
