[ 
https://issues.apache.org/jira/browse/TIKA-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069017#comment-15069017
 ] 

Zoltan Toth commented on TIKA-1817:
-----------------------------------

The previously attached sample was obtained from a site that clearly states 
that it's a sample: http://justcad.com/downloads.html. Is that good enough to 
meet your requirements?  There are many other sites that fall into the same 
category.

What about the newly attached "house design.dxf" and "SMA-Controller.dxf"?
These were downloaded from:

  http://cadkit.blogspot.com.au/2012/01/sample-dxf-files.html

The home page for the site (http://cadkit.blogspot.com.au) states:

  "Permission to use, copy, modify, and distribute this software and its 
  documentation for any purpose is hereby granted without fee, provided that
  the above copyright notice, author statement appear in all copies of this
  software and related documentation."

The other option you have is to convert the test files that you must already be 
using to test DWG text extraction.  Just use one of the free converters, such 
as: http://www.autodwg.com/DWG_DXF_Converter

Best of luck with it.



> Extracts entire file content for ASCII DXF files
> ------------------------------------------------
>
>                 Key: TIKA-1817
>                 URL: https://issues.apache.org/jira/browse/TIKA-1817
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Zoltan Toth
>         Attachments: jcsample-screendump.jpg, jcsample.dxf
>
>
> By definition, ASCII DXF files are encoded in plain text.  However, the vast 
> majority of their content is not intended to be human readable (see 
> https://en.wikipedia.org/wiki/AutoCAD_DXF).  Unfortunately for these files, 
> Tika simply "extracts" the entire content of the file instead of the 
> human-readable portions (e.g. comments) that a CAD tool would render.  
> This results in massive amounts of rubbish data being returned with dire 
> consequences for applications that rely on this.
> It would be nice if only the human-readable text fields were extracted.  
> Failing this, it would still be nice if no text was extracted from these 
> files at all.  
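
To illustrate the behaviour requested above, here is a minimal sketch in Python. It is not Tika's parser and not a proposed patch. It assumes the usual ASCII DXF layout of alternating lines (a numeric group code, then its value), and it assumes that the human-readable text lives in TEXT and MTEXT entities under group codes 1 and 3. The names iter_pairs and extract_text are hypothetical helpers made up for this example; other text-bearing entities (attributes, dimensions and so on) are ignored.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: only these entity types are treated as human-readable text.
TEXT_ENTITIES = {"TEXT", "MTEXT"}
# Group code 1 holds the text string; MTEXT may carry extra chunks in code 3.
TEXT_CODES = {1, 3}


def iter_pairs(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (group_code, value) pairs from the lines of an ASCII DXF file."""
    it = iter(lines)
    for code_line in it:
        value_line = next(it, None)
        if value_line is None:
            break  # dangling group code with no value line
        try:
            code = int(code_line.strip())
        except ValueError:
            continue  # skip a malformed pair rather than fail
        yield code, value_line.rstrip("\r\n")


def extract_text(lines: Iterable[str]) -> List[str]:
    """Collect text strings from TEXT and MTEXT entities, in file order."""
    texts: List[str] = []
    current = None  # value of the most recent group code 0
    for code, value in iter_pairs(lines):
        if code == 0:
            current = value.strip()
        elif current in TEXT_ENTITIES and code in TEXT_CODES:
            texts.append(value)
    return texts


if __name__ == "__main__":
    sample = "0\nTEXT\n8\nNOTES\n1\nKitchen\n0\nEOF\n"
    print(extract_text(sample.splitlines()))
```

Everything outside those entities (header variables, layer tables, coordinates) is dropped, which is the bulk of what currently ends up in the extracted output.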



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
