[ https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070227#comment-15070227 ]
Hudson commented on TIKA-1817:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #897 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/897/])
TIKA-1817 Test DXF ASCII file, and detection unit test (nick: [http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1721576])
* trunk/LICENSE.txt
* trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* trunk/tika-parsers/src/test/resources/test-documents/testDXF_ascii.dxf


> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: SMA-Controller.dxf, house design.dxf, jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text. However, the vast
> majority of their content is not intended to be human readable (see
> https://en.wikipedia.org/wiki/AutoCAD_DXF). Unfortunately for these files,
> Tika simply "extracts" the entire content of the file instead of the
> human-readable portions (i.e. comments etc.) that a CAD tool would render.
> This results in massive amounts of rubbish data being returned with dire
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.
> Failing this, it would still be nice if no text was extracted from these
> files at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)