Cool! Yeah, I had actually been watching that JIRA issue and thinking about taking it on. I'm currently traveling with my family, but when I get back in a couple of days I would love to start tackling this project.
Happy Holidays,
Joey

> On Dec 22, 2015, at 5:02 AM, Nick Burch (JIRA) <j...@apache.org> wrote:
>
>
>     [ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068077#comment-15068077 ]
>
> Nick Burch commented on TIKA-1817:
> ----------------------------------
>
> I've had a go at adding mime subtypes for binary and ascii for DXF, as well
> as the related DXB, in r1721390. No unit tests though :( Needs some suitable
> sample files
>
> With that in place, ascii dxf files should no longer end up routed to the
> text parser. That's probably slightly better, but not ideal... We really need
> someone to volunteer to write a proper parser!
>
> Writing one shouldn't be too bad, especially for strings and metadata, along
> the lines of the DWG one we already have.
> http://www.fileformat.info/format/dxf/egff.htm seems a good overview of the
> file format, and there's also published stuff at
> http://www.autodesk.com/techpubs/autocad/acadr14/dxf/drawing_interchange_file_formats.htm
> that should help
>
>> Extracts entire file content for ASCII DXF files
>> ------------------------------------------------
>>
>>                 Key: TIKA-1817
>>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>>             Project: Tika
>>          Issue Type: Bug
>>    Affects Versions: 1.11
>>            Reporter: Zoltan Toth
>>         Attachments: jcsample-screendump.jpg, jcsample.dxf
>>
>>
>> By definition, ASCII DXF files are encoded in plain text. However, the vast
>> majority of their content is not intended to be human readable (see
>> https://en.wikipedia.org/wiki/AutoCAD_DXF). Unfortunately for these files,
>> Tika simply "extracts" the entire content of the file instead of the
>> human-readable portions (i.e. comments etc.) that a CAD tool would render.
>> This results in massive amounts of rubbish data being returned with dire
>> consequences for applications that rely on this.
>> It would be nice if only the human-readable text fields were extracted.
>> Failing this, it would still be nice if no text was extracted from these
>> files at all.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)