[RESULT] [VOTE] Release Apache Tika 1.28.2 Candidate #1

2022-04-27 Thread Tim Allison
The vote for 1.28.2-rc1 has failed.

We have a -1 from Tilman, with which I heartily concur.

We need to fix dependency convergence when building with Java versions
newer than 8.  We need to fix the new ppt exception that Tilman identified,
and we need to rerun the regression tests.

Cheers,

  Tim
On Tue, Apr 26, 2022 at 9:52 AM Tim Allison wrote:
>
> A candidate for the Tika 1.28.2 release is available at:
>   https://dist.apache.org/repos/dist/dev/tika/1.28.2
>
> The release candidate is a zip archive of the sources in:
>   https://github.com/apache/tika/tree/tika-1.28.2-rc1/
>
> The SHA-512 checksum of the archive is
>   d0641610f78ae2d08d0694f3dc1868193f5078c33c302898d3dcd2c8d8807c9d3e3566997ba203191d03c03641dda49266179e826957fd0af7873d4d4eab27bd.
>
> In addition, a staged Maven repository is available here:
>   https://repository.apache.org/content/repositories/orgapachetika-1082/org/apache/tika
>
> Please vote on releasing this package as Apache Tika 1.28.2.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.28.2
> [ ] -1 Do not release this package because...
>
> Here's my +1
>
> Cheers!
>
> Best,
>
>  Tim
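
For anyone re-checking a candidate, the SHA-512 comparison mentioned in the
quoted vote mail can be sketched roughly as follows. This is only an
illustration using the JDK's MessageDigest; the class name Sha512Check and the
command-line arguments are made up for the example and are not part of the
Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper, not part of Tika: hashes a file and compares the
// result with the checksum quoted in a vote email.
final class Sha512Check {

    /** Returns the lowercase hex SHA-512 digest of everything read from the stream. */
    static String sha512Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (assumption)
        // args[1]: checksum copied from the vote email (assumption)
        String expected = args[1].trim().toLowerCase(Locale.ROOT);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha512Hex(in);
            System.out.println(actual.equals(expected) ? "checksum matches" : "checksum MISMATCH");
        }
    }
}
```

The trailing period after the checksum in the vote mail is sentence
punctuation, so it has to be dropped before comparing.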


preliminary regression results from 2.4.0

2022-04-27 Thread Tim Allison
The preliminary regression results for 2.4.0 are here:
https://corpora.tika.apache.org/base/reports/tika-2.4.0-reports.tgz

We have some new exceptions caused by the new http parser, many of them on
files that are truncated or malformed.  I view this as a good thing.

We have newly identified dgn7 and dgn8.

We have many more files that were identified as generic tika-ooxml and
tika ole now being identified as the more specific xlsx, docx, etc.,
which is good.

The ppt that Tilman identified is a new exception in 2.4.0 as well,
and we need to fix that.

Once we fix the ppt issue, I'll rerun the regression tests.  Please
let me know if you see anything else.

Best,

Tim


[jira] [Created] (TIKA-3741) Fix new ppt exception

2022-04-27 Thread Tim Allison (Jira)
Tim Allison created TIKA-3741:
-

 Summary: Fix new ppt exception
 Key: TIKA-3741
 URL: https://issues.apache.org/jira/browse/TIKA-3741
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


In the regression tests for both 1.x and 2.x, Tilman identified a new exception
on: bug_trackers/TIKA/TIKA-2215-0.ppt

We're letting a runtime exception from an embedded document percolate and stop
the parse, which means that we're losing content.

Exceptions from embedded documents should be caught and recorded in the
metadata (visible in /rmeta); obviously, we should still let
SecurityExceptions percolate through to the top.
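
The intended behavior can be sketched as below. This is a minimal,
self-contained illustration and not the actual HSLFExtractor change: the
EmbeddedParse interface, the parseContainer method, and the metadata key are
all hypothetical stand-ins, and a plain Map stands in for Tika's Metadata.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: none of these names come from the Tika code base.
final class EmbeddedExceptionSketch {

    /** Hypothetical stand-in for parsing one embedded document. */
    interface EmbeddedParse {
        String parse() throws Exception;
    }

    /** Hypothetical metadata key; the key Tika really uses may differ. */
    static final String EMBEDDED_EXCEPTION_KEY = "embedded_exception";

    /** Parses every embedded document, recording failures instead of aborting. */
    static String parseContainer(List<EmbeddedParse> embedded,
                                 Map<String, List<String>> metadata) {
        StringBuilder content = new StringBuilder();
        for (EmbeddedParse doc : embedded) {
            try {
                content.append(doc.parse());
            } catch (SecurityException e) {
                // Security problems must reach the caller untouched.
                throw e;
            } catch (Exception e) {
                // Anything else, runtime exceptions included, is recorded
                // and the remaining embedded documents are still parsed.
                metadata.computeIfAbsent(EMBEDDED_EXCEPTION_KEY, k -> new ArrayList<>())
                        .add(e.toString());
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        List<EmbeddedParse> docs = Arrays.<EmbeddedParse>asList(
                () -> "slide text ",
                () -> { throw new IllegalStateException("broken embedded file"); },
                () -> "more text");
        System.out.println(parseContainer(docs, metadata));
        System.out.println(metadata);
    }
}
```

The SecurityException clause has to come before the general catch;
otherwise it would be swallowed along with everything else.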




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (TIKA-3741) Fix regression in exception handling for embedded resources in a ppt

2022-04-27 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3741:
--
Summary: Fix regression in exception handling for embedded resources in a 
ppt  (was: Fix new ppt exception)

> Fix regression in exception handling for embedded resources in a ppt
> 
>
> Key: TIKA-3741
> URL: https://issues.apache.org/jira/browse/TIKA-3741
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Trivial
>
> In the regression tests for both 1.x and 2.x, Tilman identified a new
> exception on: bug_trackers/TIKA/TIKA-2215-0.ppt
> We're letting a runtime exception from an embedded document percolate and
> stop the parse, which means that we're losing content.
> Exceptions from embedded documents should be caught and recorded in the
> metadata (visible in /rmeta); obviously, we should still let
> SecurityExceptions percolate through to the top.





[jira] [Commented] (TIKA-3741) Fix regression in exception handling for embedded resources in a ppt

2022-04-27 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528833#comment-17528833
 ] 

Hudson commented on TIKA-3741:
--

UNSTABLE: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #190 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/190/])
TIKA-3741 -- fix regression in handling embedded exceptions in ppt (tallison: 
[https://github.com/apache/tika/commit/88bff551fd05a3d7193291dcd3a98af56f38471a])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java




[jira] [Commented] (TIKA-3741) Fix regression in exception handling for embedded resources in a ppt

2022-04-27 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528871#comment-17528871
 ] 

Hudson commented on TIKA-3741:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #528 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/528/])
TIKA-3741 -- fix regression in handling embedded file exceptions in ppt 
(tallison: 
[https://github.com/apache/tika/commit/fc05adb888320647143d04690bb154ebe7f267b2])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/PowerPointParserTest.java




[jira] [Resolved] (TIKA-3741) Fix regression in exception handling for embedded resources in a ppt

2022-04-27 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3741.
---
Fix Version/s: 1.28.2
   2.4.0
   Resolution: Fixed



[jira] [Resolved] (TIKA-3739) Dependency convergence exception when building 1.x with Java > 8

2022-04-27 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-3739.
---
Resolution: Fixed

> Dependency convergence exception when building 1.x with Java > 8
> 
>
> Key: TIKA-3739
> URL: https://issues.apache.org/jira/browse/TIKA-3739
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






Re: preliminary regression results from 2.4.0

2022-04-27 Thread Tilman Hausherr

On 27.04.2022 at 14:00, Tim Allison wrote:

> Once we fix the ppt issue, I'll rerun the regression tests.  Please
> let me know if you see anything else.


commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM

has lost a bit of extracted text, but that "mail" is broken.

Tilman



Re: preliminary regression results from 2.4.0

2022-04-27 Thread Tim Allison
Yes, I think this is an improvement because it was identified as xhtml
by the earlier version of Tika, and it is now correctly being parsed
by the rfc822 parser...and yes, it is broken.

There were a number of other files that are now correctly identified
as http-response, but we're getting less text because the files are
truncated and the http-response parser is throwing an exception.

On Wed, Apr 27, 2022 at 2:59 PM Tilman Hausherr wrote:
>
> On 27.04.2022 at 14:00, Tim Allison wrote:
> > Once we fix the ppt issue, I'll rerun the regression tests.  Please
> > let me know if you see anything else.
>
> commoncrawl3/5Y/5YX5CR7P7FVPZIMTBBPGQU5FULLMJOXM
>
> has lost a bit of extracted text, but that "mail" is broken.
>
> Tilman
>


[jira] [Created] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)
Dan Coldrick created TIKA-3742:
--

 Summary: Advice around DGN7 parser and whether to add to TIKA
 Key: TIKA-3742
 URL: https://issues.apache.org/jira/browse/TIKA-3742
 Project: Tika
  Issue Type: Task
  Components: parser
Reporter: Dan Coldrick


Hi [~tallison] & whoever else.

I managed to compile the C/C++ library ([http://dgnlib.maptools.org/)] for DGN7,
which produces a dgndump.exe that will dump all the data from the DGN. From
my initial testing it looks pretty good.

Do you think it is worth adding this to Tika, or should I just keep it as a
custom parser outside the main source code? It's under the MIT license. I've
attached the exe (zipped), a copy of the output from the dump, and my very dirty
test code that calls the exe (I was only interested in the strings, so for now
I am only pulling those into a string array to check that it's pulling out
the correct data).





[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Attachment: DGN.zip

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529023#comment-17529023
 ] 

Dan Coldrick commented on TIKA-3742:


 
{code:java}
package org.apache.tika.parser.dgn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.commons.compress.utils.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DGN7Parser extends AbstractParser {

    private static final long serialVersionUID = 7609445358323296566L;

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.image("vnd.dgn; version=7"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, TikaException, SAXException {
        // Rough test only: spool the stream to a hard-coded file for dgndump to read.
        File file = new File("G:/temp/Drawing.dgn");
        try (OutputStream outputStream = new FileOutputStream(file)) {
            IOUtils.copy(stream, outputStream);
        }
        String[] commands = {"C:\\Users\\monkm\\DGN\\dgndump.exe", "-r", "1",
                "G:\\temp\\Drawing.dgn"};
        Process proc = Runtime.getRuntime().exec(commands);

        // Only the string elements are collected for now.
        List<String> strings = new ArrayList<>();
        try (BufferedReader stdInput =
                     new BufferedReader(new InputStreamReader(proc.getInputStream()));
             BufferedReader stdError =
                     new BufferedReader(new InputStreamReader(proc.getErrorStream()))) {
            String s;
            while ((s = stdInput.readLine()) != null) {
                if (s.startsWith("  string = \"")) {
                    // Drop the 12-character prefix and the closing quote.
                    strings.add(s.substring(12, s.length() - 1).trim());
                }
                System.out.println(s);
            }
            System.out.println(strings);
            // stderr is only read once stdout has been drained.
            while ((s = stdError.readLine()) != null) {
                System.out.println(s);
            }
        }
    }
}
  {code}
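
A slightly more defensive take on the same idea might look like the sketch
below. It keeps the `  string = "` prefix that the test code above matches on,
but merges stderr into stdout so the child process cannot block on a full
stderr pipe, and it checks the exit code. The class and method names are
hypothetical, and the UTF-8 decoding of dgndump's output is an assumption,
not something verified against the tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper around the dgndump executable.
final class DgnDumpStrings {

    /** Prefix of the dump lines that carry text, as matched in the test code above. */
    private static final String PREFIX = "  string = \"";

    /** Collects the quoted values of all string lines found in the dump output. */
    static List<String> extractStrings(BufferedReader reader) throws IOException {
        List<String> strings = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // Require a closing quote so that substring() cannot go out of range.
            if (line.startsWith(PREFIX) && line.length() > PREFIX.length()
                    && line.endsWith("\"")) {
                strings.add(line.substring(PREFIX.length(), line.length() - 1).trim());
            }
        }
        return strings;
    }

    /** Runs dgndump on a file; both paths are supplied by the caller. */
    static List<String> dump(String dgndumpPath, String dgnFile)
            throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(dgndumpPath, "-r", "1", dgnFile)
                .redirectErrorStream(true) // merge stderr so neither pipe fills up unread
                .start();
        List<String> strings;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
            strings = extractStrings(reader);
        }
        int exit = proc.waitFor();
        if (exit != 0) {
            throw new IOException("dgndump exited with code " + exit);
        }
        return strings;
    }
}
```

Keeping the line parsing in extractStrings separate from the process handling
means it can be unit tested against ExampleOutput.txt without the exe.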
 

 



[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Attachment: ExampleOutput.txt



[jira] [Commented] (TIKA-3571) Add an interface for rendering engines

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529025#comment-17529025
 ] 

Tim Allison commented on TIKA-3571:
---

I'm now thinking that we need to add "page start" and "page end" parameters to
the interface, as well as a "render it all" option.  I don't like this, but the
PDFParser needs to be able to decide, after trying to extract the text, that it
should run OCR on only that one page.  I don't want to render the full document
if the user doesn't want the rendered images and OCR only needs one page.

The question is: is this too PDF-centric?  I think it isn't awful.  Formats
that are single-paged should be OK, and for PPT/PPTX the page # = slide #.
Thoughts?
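
To make the page-range idea concrete, here is a rough sketch of what such a
request could look like. Everything in it is hypothetical (PageRange,
Renderer, LABEL_RENDERER); it is not a proposal for the real interface, just
an illustration of "start/end page" next to "render it all".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical types, not the TIKA-3571 interface.
final class RenderRequestSketch {

    /** Inclusive, 1-based page range; ALL stands for "render it all". */
    static final class PageRange {
        static final PageRange ALL = new PageRange(1, Integer.MAX_VALUE);
        final int start;
        final int end;

        PageRange(int start, int end) {
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("invalid range: " + start + "-" + end);
            }
            this.start = start;
            this.end = end;
        }

        /** Range covering exactly one page, e.g. the page that needs OCR. */
        static PageRange single(int page) {
            return new PageRange(page, page);
        }
    }

    /** Returns one rendered item per requested page that exists in the document. */
    interface Renderer {
        List<String> render(int pageCount, PageRange range);
    }

    /** Toy renderer that produces labels instead of images, to show the clipping. */
    static final Renderer LABEL_RENDERER = (pageCount, range) -> {
        List<String> rendered = new ArrayList<>();
        int last = Math.min(range.end, pageCount);
        for (int page = range.start; page <= last; page++) {
            rendered.add("page-" + page);
        }
        return rendered;
    };

    public static void main(String[] args) {
        // OCR needs only page 3 of a 5-page document.
        System.out.println(LABEL_RENDERER.render(5, PageRange.single(3)));
        // The user asked for all rendered images.
        System.out.println(LABEL_RENDERER.render(5, PageRange.ALL).size());
    }
}
```

A single-page format would just report a page count of 1, and a PPT/PPTX
renderer would treat the page number as the slide number.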

> Add an interface for rendering engines
> --
>
> Key: TIKA-3571
> URL: https://issues.apache.org/jira/browse/TIKA-3571
> Project: Tika
>  Issue Type: Wish
>Reporter: Tim Allison
>Priority: Major
>
> We've now seen a few requests for extracting text _and_ rendering PDFs, and 
> certainly it might be useful to have alternatives for rendering files (e.g. 
> this [Alfresco 
> study|https://hub.alfresco.com/t5/alfresco-content-services-blog/pdf-rendering-engine-performance-and-fidelity-comparison/ba-p/287618]),
>  including MSOffice or at least PPTx...
> And there are cases where users don't want the rendered images, but they do 
> want OCR to be run against the rendered images.
> I doubt I'll have a chance to work on this for a while, but I wanted to open 
> an issue for discussion.





[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Description: 
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
([http://dgnlib.maptools.org/|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).

  was:
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library ([http://dgnlib.maptools.org/)] for DGN7 
which produces an dgndump.exe which will dump all the data from the DGN. From 
my initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).




[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Description: 
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
[http://dgnlib.maptools.org|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).

  was:
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
([http://dgnlib.maptools.org|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).




[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Description: 
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
([http://dgnlib.maptools.org|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).

  was:
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
([http://dgnlib.maptools.org/|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).




[jira] [Updated] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Coldrick updated TIKA-3742:
---
Description: 
Hi [~tallison] & whoever else.

I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for DGN7,
which produces a dgndump.exe that will dump all the data from the DGN. From
my initial testing it looks pretty good.

Do you think it is worth adding this to Tika, or should I just keep it as a
custom parser outside the main source code? It's under the MIT license. I've
attached the exe (zipped), a copy of the output from the dump, and my very dirty
test code that calls the exe (I was only interested in the strings, so for now
I am only pulling those into a string array to check that it's pulling out
the correct data).

  was:
Hi [~tallison] & Whoever else. 

I managed to compile the C/C++ library 
[http://dgnlib.maptools.org|http://dgnlib.maptools.org/)] for DGN7 which 
produces an dgndump.exe which will dump all the data from the DGN. From my 
initial testing it looks pretty good. 

Would you guys think it was worth adding this or just keep it as a custom 
parser rather than in the main source code? It's under MIT license. I've 
attached the exe (zipped), a copy of the output from the dump and my very dirty 
testing calling the exe (my code I was only interested in the Strings so am 
only pulling those into a string array at the moment to check it's pulling out 
the correct data).




[jira] [Comment Edited] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529023#comment-17529023
 ] 

Dan Coldrick edited comment on TIKA-3742 at 4/27/22 8:09 PM:
-

 
{code:java}
package org.apache.tika.parser.dgn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

import org.apache.commons.compress.utils.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DGN7Parser extends AbstractParser {

    private static final long serialVersionUID = 7609445358323296566L;

    private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.singleton(MediaType.image("vnd.dgn; version=7"));

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        return SUPPORTED_TYPES;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
                      ParseContext context) throws IOException, TikaException, SAXException {
        // Rough test only: spool the stream to a hard-coded file for dgndump to read.
        File file = new File("G:/temp/Drawing.dgn");
        try (OutputStream outputStream = new FileOutputStream(file)) {
            IOUtils.copy(stream, outputStream);
        }
        String[] commands = {"C:\\Users\\monkm\\DGN\\dgndump.exe", "-r", "1",
                "G:\\temp\\Drawing.dgn"};
        Process proc = Runtime.getRuntime().exec(commands);

        // Only the string elements are collected for now.
        List<String> strings = new ArrayList<>();
        try (BufferedReader stdInput =
                     new BufferedReader(new InputStreamReader(proc.getInputStream()));
             BufferedReader stdError =
                     new BufferedReader(new InputStreamReader(proc.getErrorStream()))) {
            String s;
            while ((s = stdInput.readLine()) != null) {
                if (s.startsWith("  string = \"")) {
                    // Drop the 12-character prefix and the closing quote.
                    strings.add(s.substring(12, s.length() - 1).trim());
                }
                System.out.println(s);
            }
            System.out.println(strings);
            // stderr is only read once stdout has been drained.
            while ((s = stdError.readLine()) != null) {
                System.out.println(s);
            }
        }
    }
}
  {code}
 

 


> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues

[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529029#comment-17529029
 ] 

Nick Burch commented on TIKA-3742:
--

If it can just be run standalone, then {{ExternalParser}} + 
{{tika-external-parsers.xml}} is probably the way to go. That already handles 
testing whether the program is installed, spawning it, cleaning up, grabbing the text, etc.
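
For illustration, here is a minimal sketch of the steps listed above (check that the tool is there, spool the input, spawn the process, grab the text, clean up) in plain Java. It is not Tika's {{ExternalParser}} and uses no Tika API. The class {{DgnDumpRunner}} and its method names are hypothetical, the "-r 1" arguments and the "string = " line prefix are taken from the test code posted in this issue, and the UTF-8 output encoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: it only illustrates the steps listed above.
class DgnDumpRunner {

    // Line prefix that the test code in this issue looks for in the dgndump output.
    private static final String STRING_PREFIX = "  string = \"";

    /** Crude "is it installed?" check: can the executable be started at all? */
    static boolean isAvailable(String executable) {
        try {
            Process p = new ProcessBuilder(executable).redirectErrorStream(true).start();
            p.destroy();
            return true;
        } catch (IOException e) {
            return false; // the program could not be started
        }
    }

    /** Pulls the quoted text out of lines such as:   string = "Some label" */
    static List<String> extractStrings(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (line.startsWith(STRING_PREFIX) && line.endsWith("\"")
                    && line.length() > STRING_PREFIX.length()) {
                out.add(line.substring(STRING_PREFIX.length(), line.length() - 1).trim());
            }
        }
        return out;
    }

    /** Spools the stream to a temp file, runs the tool on it, and always deletes the file. */
    static List<String> run(String executable, InputStream stream)
            throws IOException, InterruptedException {
        Path tmp = Files.createTempFile("dgn-", ".dgn");
        try {
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
            Process proc = new ProcessBuilder(executable, "-r", "1", tmp.toString())
                    .redirectErrorStream(true) // merge stderr into stdout: one pipe to drain
                    .start();
            List<String> lines = new ArrayList<>();
            // Assumption: the tool's output can be read as UTF-8.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
            proc.waitFor();
            return extractStrings(lines);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Merging stderr into stdout means only one pipe has to be drained, so the child process cannot block on a full error buffer while the caller is still reading standard output.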

> Advice around DGN7 parser and whether to add to TIKA
> 
>
> Key: TIKA-3742
> URL: https://issues.apache.org/jira/browse/TIKA-3742
> Project: Tika
>  Issue Type: Task
>  Components: parser
>Reporter: Dan Coldrick
>Priority: Minor
> Attachments: DGN.zip, ExampleOutput.txt
>
>
> Hi [~tallison] & whoever else. 
> I managed to compile the C/C++ library [http://dgnlib.maptools.org/] for 
> DGN7, which produces a dgndump.exe that will dump all the data from the DGN. 
> From my initial testing it looks pretty good. 
> Do you think it's worth adding this, or should I just keep it as a custom 
> parser rather than in the main source code? It's under the MIT license. I've 
> attached the exe (zipped), a copy of the output from the dump, and my very 
> dirty test code that calls the exe (I was only interested in the strings, 
> so I am only pulling those into a string array at the moment to check that 
> it's pulling out the correct data).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529033#comment-17529033
 ] 

Dan Coldrick commented on TIKA-3742:


[~nick] 

Apologies, I'm new to all this. Can you point me at some documentation? Do 
external parsers live outside the main Tika Git repository, in a separate repo 
just for that parser that users can add in? Or does it work differently?



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529034#comment-17529034
 ] 

Tim Allison commented on TIKA-3742:
---

Or, now that you know it works, just port it to Java! But seriously, do look at 
the original ExternalParser or the newer one: 
o.a.t.parser.external2.ExternalParser.



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529038#comment-17529038
 ] 

Nick Burch commented on TIKA-3742:
--

In theory you shouldn't need any Java code at all if you don't want it, just an 
XML file with a magic well-known name.

We've a couple already in Tika, mostly focused on metadata:

[https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]

Pop your own one on the classpath and it should be picked up dynamically at 
runtime.



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529037#comment-17529037
 ] 

Dan Coldrick commented on TIKA-3742:


[~tallison]  I struggle to get out of bed in the morning, let alone read C/C++ 
and convert it to Java. I can make out what it's doing, but I have no idea how 
it does the byte reading, which is really how the underlying bits work. I can 
see that the file contains the element types, but again I have no idea how 
they are mapped to the bytes. I've never had any dealings with C/C++.

I was happy that, after an hour pissing about on Google, I managed to get it to 
compile (on Windows) :D



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529040#comment-17529040
 ] 

Tim Allison commented on TIKA-3742:
---

https://github.com/tballison/file-observatory/blob/main/tika-containers/tika-pdftotext/my-tika-config.xml

That’s an example of the newer external parser, but if you want your name in 
lights, port it to Java! :D



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529042#comment-17529042
 ] 

Dan Coldrick commented on TIKA-3742:


[~tallison]  got a link to that or an example?



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Dan Coldrick (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529044#comment-17529044
 ] 

Dan Coldrick commented on TIKA-3742:


lol, you posted before I responded



[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529101#comment-17529101
 ] 

Nick Burch commented on TIKA-3742:
--

Assuming we just want type=17 text elements of a DGNv7 file (as per 
[http://dgnlib.maptools.org/dgn.html#type17] ), a quick'n'dirty parser 
wouldn't be too bad. 
[https://gist.github.com/Gagravarr/90d390fec7c5f2c5cf966c0eedccac5c] is a basic 
reader that finds these text elements and prints them.

I couldn't immediately spot any useful metadata elements to pull out, so I think 
a basic parser would just extract the text for DGN7.

Anyone fancy finishing this off into a "proper" Tika parser? :)
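
As a rough illustration of the kind of reader described above, here is a self-contained sketch that walks the elements of a DGNv7 byte array and collects the type 17 text. It is not the linked gist and not Tika code. The class name {{Dgn7TextScanner}} is hypothetical, and the layout it relies on is an assumption based on my reading of dgnlib and should be checked against the format page linked above: a 4-byte element header (type in the low 7 bits of the second byte, deleted flag in its high bit, little-endian count of 16-bit words to follow), an 0xFFFF end marker, and, for 2D files only, a character count at byte 58 with the characters starting at byte 60. The single-byte ISO-8859-1 decoding is an assumption too, and 3D files are not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; all offsets below are assumptions, see the note above.
class Dgn7TextScanner {

    private static final int TYPE_TEXT = 17;
    // Assumed 2D text-element layout: character count at byte 58, characters from byte 60.
    private static final int NUM_CHARS_OFFSET = 58;
    private static final int TEXT_OFFSET = 60;

    static List<String> extractText(byte[] data) {
        List<String> texts = new ArrayList<>();
        int pos = 0;
        while (pos + 4 <= data.length) {
            int b0 = data[pos] & 0xFF;
            int b1 = data[pos + 1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFF) {
                break; // end-of-file marker
            }
            int type = b1 & 0x7F;
            boolean deleted = (b1 & 0x80) != 0;
            // Little-endian count of 16-bit words that follow the 4-byte header.
            int words = (data[pos + 2] & 0xFF) | ((data[pos + 3] & 0xFF) << 8);
            int size = 4 + 2 * words;
            if (pos + size > data.length) {
                break; // truncated element: stop rather than read past the end
            }
            if (type == TYPE_TEXT && !deleted && size > TEXT_OFFSET) {
                int numChars = data[pos + NUM_CHARS_OFFSET] & 0xFF;
                // Never read beyond the end of this element.
                int end = Math.min(TEXT_OFFSET + numChars, size);
                texts.add(new String(data, pos + TEXT_OFFSET, end - TEXT_OFFSET,
                        StandardCharsets.ISO_8859_1));
            }
            pos += size;
        }
        return texts;
    }
}
```

A proper Tika parser would stream the input rather than hold it in one array, and would write the collected text to the content handler instead of returning a list.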



1.28.2 regression results

2022-04-27 Thread Tim Allison
Are available here:
https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz

I haven't taken a look yet.

Let me know if you find anything.

Best,

  Tim


[jira] [Commented] (TIKA-3742) Advice around DGN7 parser and whether to add to TIKA

2022-04-27 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529105#comment-17529105
 ] 

Tim Allison commented on TIKA-3742:
---

Related, with the new detection, we now know we have a couple of handfuls of 
dgnv7 files in our regression corpus. :D



[GitHub] [tika] grossws commented on pull request #432: [TIKA-3368] Add Bill of Materials (BOM) artifact (Tika 1.x)

2022-04-27 Thread GitBox


grossws commented on PR #432:
URL: https://github.com/apache/tika/pull/432#issuecomment-670632

   Since Tika 1.x will reach EOL in September 2022, I think this PR is no 
longer relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [tika] grossws closed pull request #432: [TIKA-3368] Add Bill of Materials (BOM) artifact (Tika 1.x)

2022-04-27 Thread GitBox


grossws closed pull request #432: [TIKA-3368] Add Bill of Materials (BOM) 
artifact (Tika 1.x)
URL: https://github.com/apache/tika/pull/432





[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2022-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529154#comment-17529154
 ] 

ASF GitHub Bot commented on TIKA-3368:
--

grossws commented on PR #432:
URL: https://github.com/apache/tika/pull/432#issuecomment-670632

   Since Tika 1.x will reach EOL in September 2022, I think this PR is no 
longer relevant.




> Add Bill of Materials (BOM) artifact (Tika 1.x)
> ---
>
> Key: TIKA-3368
> URL: https://issues.apache.org/jira/browse/TIKA-3368
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Konstantin Gribov
>Assignee: Konstantin Gribov
>Priority: Major
> Fix For: 2.0.0-BETA
>
>






[jira] [Commented] (TIKA-3368) Add Bill of Materials (BOM) artifact (Tika 1.x)

2022-04-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529155#comment-17529155
 ] 

ASF GitHub Bot commented on TIKA-3368:
--

grossws closed pull request #432: [TIKA-3368] Add Bill of Materials (BOM) 
artifact (Tika 1.x)
URL: https://github.com/apache/tika/pull/432






Re: 1.28.2 regression results

2022-04-27 Thread Tilman Hausherr

On 28.04.2022 at 00:25, Tim Allison wrote:

> Are available here:
> https://corpora.tika.apache.org/base/reports/tika-1.28.2-reports-20220427.tgz
>
> I haven't taken a look yet.
>
> Let me know if you find anything.



commoncrawl3/OR/ORTIXLZEFH4QC5RJTV3L5XBNOVW42KPH

This is minor and is related to superscript; I don't know whether this is 
wanted or not.


As for the two "file not fully read from stream" exceptions, am I correct to 
assume that these are problems in the batch itself?


Tilman