[jira] [Commented] (TIKA-1817) Extracts entire file content for ASCII DXF files

Joey Hong (JIRA) Tue, 29 Dec 2015 14:47:39 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074386#comment-15074386
 ]


Joey Hong commented on TIKA-1817:
---------------------------------

I've been working on the DXF parser for the past couple of days. I only have a 
rough ASCII implementation so far, and based on what I could take away from the 
DXF file documentation, there isn't much metadata to be found in the header 
section of the files (e.g. no title, no author). 

What I was able to extract was the date metadata (the file stores it as Julian 
dates; should I implement a function to convert that?) and all human-readable 
text, found by looking for the TEXT headers. 
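
A minimal sketch of what such a conversion function might look like. It assumes 
the date values (for example the $TDCREATE and $TDUPDATE header variables) are 
stored as "<Julian day>.<fraction>", where the integer part is the Julian day 
number of the calendar date and the fraction is the part of that day elapsed 
since midnight. Astronomical Julian dates start at noon instead, so this 
convention should be checked against real files. The class and method names are 
hypothetical and not part of Tika.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper, not an existing Tika class.
final class DxfJulianDate {
    // Julian day number that corresponds to 1970-01-01.
    private static final long JULIAN_DAY_OF_UNIX_EPOCH = 2440588L;
    private static final long SECONDS_PER_DAY = 86400L;

    // Converts a DXF-style "<Julian day>.<fraction>" value to a date-time.
    // Assumption: the fraction counts from midnight, not from noon.
    static LocalDateTime toLocalDateTime(double dxfDate) {
        long julianDay = (long) Math.floor(dxfDate);
        double fractionOfDay = dxfDate - julianDay;
        // Round to whole seconds; a double cannot hold finer detail reliably here.
        long seconds = Math.round(fractionOfDay * SECONDS_PER_DAY);
        return LocalDate.ofEpochDay(julianDay - JULIAN_DAY_OF_UNIX_EPOCH)
                .atStartOfDay()
                .plusSeconds(seconds);
    }
}
```

The result is a LocalDateTime rather than an Instant because the value carries 
no time zone information of its own.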

Am I missing some details about DXF files that the parser should find? I looked 
at the existing parser for DWG files, which should be relatively similar, and 
it seems to be able to find the author name and title.
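
For reference, the TEXT lookup mentioned above amounts to roughly the following 
sketch. It relies on the ASCII DXF layout of alternating lines (a group code, 
then its value), where group code 0 starts a new entity and group code 1 holds 
the text string of a TEXT entity. The class and method names are hypothetical, 
this is not the actual parser code, and other text-bearing entities such as 
MTEXT are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not an existing Tika class.
final class DxfTextSketch {
    // Walks the file as (group code, value) line pairs and collects the
    // group code 1 value of every TEXT entity.
    static List<String> extractText(List<String> lines) {
        List<String> texts = new ArrayList<>();
        boolean inTextEntity = false;
        // Stop before a dangling group code with no value (truncated file).
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            String code = lines.get(i).trim();
            String value = lines.get(i + 1);
            if (code.equals("0")) {
                // Group code 0 begins a new entity (or section marker).
                inTextEntity = value.trim().equals("TEXT");
            } else if (inTextEntity && code.equals("1")) {
                texts.add(value);
            }
        }
        return texts;
    }
}
```

Taking a list of lines keeps the sketch short; a real parser would more likely 
stream the input, since DXF files can be large.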

> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: SMA-Controller.dxf, house design.dxf, 
> jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (i.e. comments etc.) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
