[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529023#comment-17529023
 ] 

Dan Coldrick commented on TIKA-3742:


 
{code:java}
package org.apache.tika.parser.dgn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.commons.compress.utils.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DGN7Parser extends AbstractParser {

    private static final long serialVersionUID = 7609445358323296566L;

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.image("vnd.dgn; version=7"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, TikaException, SAXException {
        // Quick test only: the paths are hard-coded.
        File file = new File("G:/temp/Drawing.dgn");
        try (OutputStream outputStream = new FileOutputStream(file)) {
            IOUtils.copy(stream, outputStream);
        }
        String[] commands = {"C:\\Users\\monkm\\DGN\\dgndump.exe", "-r", "1",
                "G:\\temp\\Drawing.dgn"};
        Process proc = Runtime.getRuntime().exec(commands);

        List<String> ar = new ArrayList<>();
        try (BufferedReader stdInput = new BufferedReader(
                     new InputStreamReader(proc.getInputStream()));
             BufferedReader stdError = new BufferedReader(
                     new InputStreamReader(proc.getErrorStream()))) {
            String s;
            while ((s = stdInput.readLine()) != null) {
                // Text values appear in the dump as:   string = "..."
                if (s.startsWith("  string = \"")) {
                    ar.add(s.substring(12, s.length() - 1).trim());
                }
                System.out.println(s);
            }
            System.out.println(ar);
            while ((s = stdError.readLine()) != null) {
                System.out.println(s);
            }
        }
    }
}
{code}
 

 

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
> Issue Type: Task
> Components: parser
> Reporter: Dan Coldrick
> Priority: Minor
> Attachments: DGN.zip
>
>
> Hi [~tallison] & Whoever else. 
> I managed to compile the C/C++ library ([http://dgnlib.maptools.org/]) for 
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Would you guys think it was worth adding this, or should I just keep it as a 
> custom parser rather than in the main source code? It's under the MIT license. 
> I've attached the exe (zipped), a copy of the output from the dump and my very 
> dirty testing code that calls the exe (I was only interested in the strings, 
> so I'm only pulling those into a string array at the moment to check it's 
> pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529029#comment-17529029
 ] 

Nick Burch commented on TIKA-3742:
--

If it can just be run standalone, then {{ExternalParser}} + 
{{tika-external-parsers.xml}} is probably the way to go - that already handles 
testing if the program is installed, spawning it, cleaning up, grabbing text, etc.

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529033#comment-17529033
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick] 

Apologies, I'm new to all this. Can you point me at some documentation? Do 
external parsers live outside the main Tika Git repo, with a separate repo 
just for that parser that users can add in? Or does it work differently?



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529034#comment-17529034
 ] 

Tim Allison commented on TIKA-3742:
---

Or, now that you know it works, just port it to Java! But seriously, do look at 
the original ExternalParser or the newer one: 
o.a.t.parser.external2.ExternalParser.



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529038#comment-17529038
 ] 

Nick Burch commented on TIKA-3742:
--

In theory you shouldn't need any Java code at all if you don't want it, just an 
XML file with a magic well-known name.

We've a couple already in Tika, mostly focused on metadata:

[https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]

Pop your own one on the classpath and it should be picked up dynamically at 
runtime.



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529037#comment-17529037
 ] 

Dan Coldrick commented on TIKA-3742:


[~tallison]  I struggle to get out of bed in the morning, let alone read C/C++ 
and convert it to Java. I can make out what it's doing, but I have no idea how 
it does the byte reading, which is really how the underlying bits work. I can 
see the element types in the file, but again I have no idea how they are 
mapped to the bytes. I've never had any dealings with C/C++.

I was happy that, after an hour pissing about on Google, I managed to get it to 
compile (on Windows) :D



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529040#comment-17529040
 ] 

Tim Allison commented on TIKA-3742:
---

https://github.com/tballison/file-observatory/blob/main/tika-containers/tika-pdftotext/my-tika-config.xml

That’s an example of the newer external parser, but if you want your name in 
lights, port it to Java! :D



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529042#comment-17529042
 ] 

Dan Coldrick commented on TIKA-3742:


[~tallison]  got a link to that or an example?



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529044#comment-17529044
 ] 

Dan Coldrick commented on TIKA-3742:


lol, you posted before I responded



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529101#comment-17529101
 ] 

Nick Burch commented on TIKA-3742:
--

Assuming we just want type=17 text elements of a DGNv7 file (as per 
[http://dgnlib.maptools.org/dgn.html#type17] ), a quick'n'dirty parser 
wouldn't be too bad. 
[https://gist.github.com/Gagravarr/90d390fec7c5f2c5cf966c0eedccac5c] is a basic 
reader that finds these text elements and prints them.

Couldn't immediately spot any useful metadata elements to pull out, so I think 
a basic parser for DGN7 would just extract the text.

Anyone fancy finishing this off into a "proper" Tika parser? :)



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529105#comment-17529105
 ] 

Tim Allison commented on TIKA-3742:
---

Related, with the new detection, we now know we have a couple of handfuls of 
dgnv7 files in our regression corpus. :D



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529409#comment-17529409
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick]  I can have a go, although I can't get the following line to compile in 
Eclipse:

byte[] str = is.readNBytes(len);

 



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529417#comment-17529417
 ] 

Nick Burch commented on TIKA-3742:
--

I believe {{readNBytes}} only came in with Java 9, and the particular 
{{readNBytes(int)}} in Java 11, so you'll need to use a newer JVM. Should be 
able to replace it with Commons IO calls once we're happy with the general 
logic + approach
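
For anyone stuck on Java 8 in the meantime, here is a minimal sketch of an exact-length read that avoids {{readNBytes(int)}}. It uses the JDK's own {{DataInputStream.readFully}} purely as a stand-in; the names {{Java8Reads}} and {{readExactly}} are made up for illustration and come from neither the gist nor Tika.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; not part of Tika or the gist. */
final class Java8Reads {

    private Java8Reads() {
    }

    /**
     * Reads exactly len bytes, or throws EOFException if the stream ends early.
     * Compiles on Java 8, unlike InputStream.readNBytes(int).
     */
    static byte[] readExactly(InputStream in, int len) throws IOException {
        if (len < 0) {
            throw new IllegalArgumentException("len must be >= 0");
        }
        byte[] buf = new byte[len];
        // DataInputStream.readFully loops until buf is full and throws
        // EOFException when the underlying stream runs out first.
        new DataInputStream(in).readFully(buf);
        return buf;
    }
}
```

With this in place, {{byte[] str = Java8Reads.readExactly(is, len);}} could stand in for the {{readNBytes}} line. Commons IO's {{IOUtils.readFully(InputStream, byte[])}} gives the same fail-on-short-read behaviour.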



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529431#comment-17529431
 ] 

Tim Allison commented on TIKA-3742:
---

IOUtils.readFully()?



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529459#comment-17529459
 ] 

Tim Allison commented on TIKA-3742:
---

[~nick] your gist looks great!  [~monkmachine], I'm passing the baton to you on 
this one.  In general, please use readFully and skipFully and ensure that the 
parse stops if the file is truncated -- check every read for EOF.
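
As a rough illustration of "check every read for EOF", the sketch below walks element headers and refuses to continue once the stream is cut short. It is not the gist's code: {{ElementWalker}} and its helpers are hypothetical names, the local {{skipFully}} only mimics what a library call would do, the "words to follow" count is assumed to be little-endian, and any end-of-file marker handling is left out.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names are illustrative and not from the gist or Tika. */
final class ElementWalker {

    private ElementWalker() {
    }

    /** Reads one byte, failing loudly if the stream is truncated. */
    private static int readByteOrFail(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) {
            throw new EOFException("truncated element");
        }
        return b;
    }

    /** Skips exactly n bytes; InputStream.skip alone may skip fewer. */
    private static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped > 0) {
                n -= skipped;
            } else {
                // skip made no progress: consume (and EOF-check) one byte instead
                readByteOrFail(in);
                n--;
            }
        }
    }

    /**
     * Returns the type of each complete element. A clean end of stream at an
     * element boundary ends the walk; EOF anywhere else throws EOFException.
     */
    static List<Integer> elementTypes(InputStream in) throws IOException {
        List<Integer> types = new ArrayList<>();
        while (in.read() >= 0) {                 // first header byte; EOF here is a normal end
            int b1 = readByteOrFail(in);         // second header byte carries the type
            // "words to follow": assumed to be a little-endian 16-bit count
            int words = readByteOrFail(in) | (readByteOrFail(in) << 8);
            skipFully(in, words * 2L);           // one word = 2 bytes
            types.add(b1 & 0x7f);                // low 7 bits of the second byte
        }
        return types;
    }
}
```

Throwing on a short read, rather than returning whatever was collected, means a truncated file cannot quietly pass for a complete one. A real parser might prefer to catch the exception and keep the text extracted so far.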



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529517#comment-17529517
 ] 

Nick Burch commented on TIKA-3742:
--

I've updated the code in the gist to use Commons IO and skipFully / readFully, 
as well as fetching a few more values so it isn't blindly skipping as much.

I'm not sure if the text element type is always at the top level, or if we need 
to go hunting inside complex elements for them. Are you able to check in some 
of your larger test files [~monkmachine] ?

We might be able to get some useful info out of the tags, at least based on 
[http://dgnlib.maptools.org/dgn.html#type37] - do you have / could you create a 
test file with some, Dan?

Finally, it needs converting to an actual parser. 
[https://tika.apache.org/2.3.0/parser_guide.html] has the steps if you want to 
give it a whirl, Dan!



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-28 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529667#comment-17529667
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick]  I've made a start today, which I can share at some point tomorrow (been 
to the pub tonight lol, so it will have to wait till tomorrow). Are you OK if I 
lean on you two for help? I'd rather write something myself which you can rip 
apart so I can learn something. I've learnt a lot in the last week or so 
already :)

 

I also think there is some metadata in there somewhere which we should be able 
to pull out :)



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-29 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529918#comment-17529918
 ] 

Nick Burch commented on TIKA-3742:
--

Sure! Potentially easiest is if you create your own fork of Tika on GitHub, 
create a branch, and work on that. You can then share that branch with us to 
review, give feedback on, etc. When it's all working, you can then create a pull 
request for us to merge straight into Tika!



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-30 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530448#comment-17530448
 ] 

Dan Coldrick commented on TIKA-3742:


Hi [~nick] 

I'm struggling: I can see there are deleted elements that I want to exclude from 
the parser, but I can't work out how to do it in Java.

I can see them come out in the DGN dump with a deleted attribute:

 
{code:java}
Element:Text         Level:27 id:19707  (DELETED) 
  offset=1959730  size=74 bytes
  graphic_group:0   color:0 weight:0 style:0
  properties=1536,MODIFIED,NEW
  origin=(963453.83000,96730.11000), rotation=272.763292
  font=1, just=2, length_mult=119.99, height_mult=119.99
  string = "HARVARD     RD" {code}
 I can see in the core element structure it should be there:

 

 
{code:java}
The first 18 words of an element in the design file are its fixed header -- 
 containing the element type, level, words to follow, and range 
 information. The C declaration for this header is as follows
 
   typedef struct
      {
      unsigned          level:6     ;   /* level element is on */
      unsigned          :1          ;   /* reserved */
      unsigned          complex:1   ;   /* component of complex elem. */
      unsigned          type:7      ;   /* type of element */
      unsigned          deleted:1   ;   /* set if element is deleted */
      unsigned short    words       ;   /* words to follow in element */
      unsigned long     xlow        ;   /* element range - low */
      unsigned long     ylow        ;
      unsigned long     zlow        ;
      unsigned long     xhigh       ;   /* element range - high */
      unsigned long     yhigh       ;
      unsigned long     zhigh       ;
      } Elm_hdr
{code}
 

 

You get the type out (which I think comes from the same header structure) like this:
{code:java}
// Second byte of the element header; its low 7 bits hold the element type.
int h2 = tstream.read();
int type = h2 & 0x7f; {code}
How do I get the deleted attribute out so that I can exclude deleted elements 
from the parsed content? Also, you mentioned type 37; I don't have any example 
files that contain type 37 elements.
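One possible way to read the flag, sketched under assumptions rather than taken from dgnlib: the header words are stored little-endian and the bitfields of `Elm_hdr` are allocated from the least significant bit upward, which is consistent with `type = h2 & 0x7f` above. `ElmHeaderStart` is a hypothetical class name.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical holder for the first two 16-bit words of a DGN7 element header. */
final class ElmHeaderStart {
    final int level;          // level:6
    final boolean complex;    // complex:1
    final int type;           // type:7
    final boolean deleted;    // deleted:1
    final int wordsToFollow;  // unsigned short words

    // b0..b3 are the first four header bytes as values in 0..255.
    // Assumption: little-endian words, bitfields packed from bit 0 upward.
    ElmHeaderStart(int b0, int b1, int b2, int b3) {
        level = b0 & 0x3f;            // bits 0-5 of the first byte
        complex = (b0 & 0x80) != 0;   // bit 7 of the first byte (bit 6 is reserved)
        type = b1 & 0x7f;             // bits 0-6 of the second byte
        deleted = (b1 & 0x80) != 0;   // bit 7 of the second byte
        wordsToFollow = b2 | (b3 << 8);
    }

    static ElmHeaderStart read(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        // InputStream.read() returns -1 at end of stream, so any negative value means EOF.
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("stream ended inside an element header");
        }
        return new ElmHeaderStart(b0, b1, b2, b3);
    }
}
```

With something like this, the parser could check `header.deleted` before emitting an element's text and simply skip the element when the flag is set.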

 

I've created a fork with some rough test code in:

[https://github.com/monkmachine/tika/tree/TIKA-3742/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/dgn]

 

 



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-30 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530457#comment-17530457
 ] 

Dan Coldrick commented on TIKA-3742:


Is this correct for working out whether an element is deleted? If it is, I might 
actually understand how it's working a bit more!
{code:java}
boolean isdeleted = BigInteger.valueOf(h2).testBit(7); {code}
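A quick way to sanity-check that expression, as a sketch that assumes `h2` is the value returned by `InputStream.read()` for the second header byte: for every byte value 0..255, `testBit(7)` gives the same answer as a plain `0x80` mask. One caveat is that `read()` returns -1 at end of stream, and `BigInteger.valueOf(-1).testBit(7)` is true, so EOF should be checked before testing the bit. `DeletedBitCheck` is a hypothetical name.

```java
import java.math.BigInteger;

/** Hypothetical check comparing the BigInteger form with a plain bit mask. */
final class DeletedBitCheck {

    static boolean viaBigInteger(int h2) {
        return BigInteger.valueOf(h2).testBit(7);
    }

    static boolean viaMask(int h2) {
        return (h2 & 0x80) != 0;
    }

    public static void main(String[] args) {
        // Every possible unsigned byte value must give the same answer in both forms.
        for (int h2 = 0; h2 <= 0xff; h2++) {
            if (viaBigInteger(h2) != viaMask(h2)) {
                throw new AssertionError("mismatch at " + h2);
            }
        }
        System.out.println("both forms agree for 0..255");
        // -1 is what InputStream.read() returns at end of stream.
        System.out.println("EOF value reported as deleted: " + viaBigInteger(-1));
    }
}
```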



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-05-18 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539019#comment-17539019
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick] any advice? I'm stuck on the random chars at the moment with this one, 
so any help would be appreciated :)

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: 1264t.dgn, DGN.zip, ExampleOutput.txt
>


