[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851779#comment-17851779 ]

ASF GitHub Bot commented on TIKA-4252:
--------------------------------------

tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145904427

   Ha, @nddipiazza. I did earlier this morning. I chose your choices over mine in the merge, largely. See https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17851727#comment-17851727

   What we now need to do is figure out how to serialize+deserialize ParseContext with as little work as possible. :D

> PipesClient#process - seems to lose the Fetch input metadata?
> -------------------------------------------------------------
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
> Fix For: 3.0.0
>
> when calling:
>
>     PipesResult pipesResult = pipesClient.process(new FetchEmitTuple(request.getFetchKey(),
>             new FetchKey(fetcher.getName(), request.getFetchKey()),
>             new EmitKey(), tikaMetadata, HandlerConfig.DEFAULT_HANDLER_CONFIG,
>             FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
>
> the tikaMetadata is not present in the fetch data when the fetch method is called.
>
> It's OK through this part:
>
>     UnsynchronizedByteArrayOutputStream bos = UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>
> I verified the bytes have the expected metadata from that point.
>
> UPDATE: found issue
>
> org.apache.tika.pipes.PipesServer#parseFromTuple
>
> is using a new Metadata when it should only use empty metadata if the fetch tuple metadata is null.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
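The behavior described in the issue can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the PipesClient/PipesServer code: `Tuple`, `frame`, `unframe`, and `metadataFor` are hypothetical names, a `HashMap` stands in for Tika's `Metadata`, and the value of the `CALL` byte is made up. It shows the length-prefixed framing from the description and the fallback the UPDATE note asks for: use the tuple's metadata, and create an empty one only when it is null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

final class FetchMetadataSketch {

    // Hypothetical stand-in for FetchEmitTuple; the HashMap stands in for Tika's Metadata.
    static final class Tuple implements Serializable {
        private static final long serialVersionUID = 1L;
        final String fetchKey;
        final HashMap<String, String> metadata; // may be null

        Tuple(String fetchKey, HashMap<String, String> metadata) {
            this.fetchKey = fetchKey;
            this.metadata = metadata;
        }
    }

    // Assumed command byte; the real value lives in the pipes protocol.
    static final byte CALL = 1;

    // Client side: command byte, then a 4-byte length, then the serialized tuple.
    static byte[] frame(Tuple t) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(t);
        }
        byte[] bytes = bos.toByteArray();
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(framed)) {
            out.write(CALL);
            out.writeInt(bytes.length);
            out.write(bytes);
            out.flush();
        }
        return framed.toByteArray();
    }

    // Server side: read the frame back and deserialize the tuple.
    static Tuple unframe(byte[] framed) throws IOException, ClassNotFoundException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
            if (in.readByte() != CALL) {
                throw new IOException("unexpected command byte");
            }
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (Tuple) ois.readObject();
            }
        }
    }

    // The fix in miniature: keep the caller's metadata, fall back to empty only on null.
    static HashMap<String, String> metadataFor(Tuple t) {
        return t.metadata != null ? t.metadata : new HashMap<>();
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> md = new HashMap<>();
        md.put("Authorization", "Bearer example");
        Tuple received = unframe(frame(new Tuple("doc-1", md)));
        System.out.println(metadataFor(received));
        System.out.println(metadataFor(new Tuple("doc-2", null)));
    }
}
```

The point of the sketch is that the metadata survives the round trip, so losing it can only happen after deserialization, which matches where the reporter found the problem.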
Re: [PR] TIKA-4252: switch to using the parse context for additional http headers [tika]
tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145904427

   Ha, @nddipiazza. I did earlier this morning. I chose your choices over mine in the merge, largely. See https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17851727#comment-17851727

   What we now need to do is figure out how to serialize+deserialize ParseContext with as little work as possible. :D

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?
[ https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851777#comment-17851777 ]

ASF GitHub Bot commented on TIKA-4252:
--------------------------------------

nddipiazza commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145900710

   sure will do @tballison sorry didn't see this until now
Re: [PR] TIKA-4252: switch to using the parse context for additional http headers [tika]
nddipiazza commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145900710

   sure will do @tballison sorry didn't see this until now
[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851727#comment-17851727 ]

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:10 PM:
-----------------------------------------------------------

I spent a bit of time trying to serialize ParseContext, and I now remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson annotations or other Jackson-based methods. I suspect that someone who really understands Jackson could do a good job of this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, {{parseContext.set(Parser.class, new EmptyParser())}}, we have to store both the base class name (as the key) and the instantiated class. I think I found out how to do this with Jackson, but it is _messy_ (reference: [https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the CompositeDetectors, etc. where we want to specify a list of detectors. And, we want to be able to cover the cases of setting an object as a parameter – for example, setting some of the slightly more complex classes in the PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map-of-properties kind of thing where we identify the config class and then instantiate it for the ParseContext with the "properties". We're currently doing something like this in tika-server where we have custom serialization classes for each config we support (PDFParserConfig and TesseractOCRParserConfig) based on the HTTP headers. We'd want to extend this to handle inheritance and embedded objects...

Something along these lines in json:

{code:json}
{
  "settings" : {
    "org.apache.tika.parser.pdf.PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true,
      "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory.class": {
        "_class": "com.tika.custom.OurCompanysFactory",
        "speed": "blazing",
        "dpi": 1000
      }
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

*What I don't like about this is that we're back in the game of creating our own serialization framework. :(*
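The "small bit of code" that turns such settings into objects might look roughly like the sketch below. It is an assumption-laden illustration, not a proposal for Tika's API: `SettingsSketch`, `DemoConfig`, `DemoParser`, `EmptyDemoParser`, and `build` are hypothetical names, the settings are assumed to be already parsed into nested maps keyed by fully qualified class name, property values are assumed to arrive with types that match the setters, and embedded objects are left out. It covers the two basic cases from the JSON above: plain properties applied through setters, and a `_class` entry that names the implementation to register under a base type.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

final class SettingsSketch {

    // Hypothetical config class standing in for something like PDFParserConfig.
    static class DemoConfig {
        private int ocrDPI = 72;
        private boolean sortByPosition;

        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getOcrDPI() { return ocrDPI; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    // Hypothetical base type and implementation for the inheritance case.
    interface DemoParser { }
    static class EmptyDemoParser implements DemoParser { }

    // Builds a "parse context"-like map: base class -> configured instance.
    static Map<Class<?>, Object> build(Map<String, Map<String, Object>> settings)
            throws ReflectiveOperationException {
        Map<Class<?>, Object> context = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> entry : settings.entrySet()) {
            Class<?> key = Class.forName(entry.getKey());
            Map<String, Object> props = entry.getValue();
            // "_class" names the implementation; without it the key class itself is instantiated.
            Object implName = props.get("_class");
            Class<?> impl = implName == null ? key : Class.forName(implName.toString());
            if (!key.isAssignableFrom(impl)) {
                throw new IllegalArgumentException(impl + " is not a " + key);
            }
            Object instance = impl.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> prop : props.entrySet()) {
                if (!"_class".equals(prop.getKey())) {
                    applySetter(instance, prop.getKey(), prop.getValue());
                }
            }
            context.put(key, instance);
        }
        return context;
    }

    // Finds a public single-argument setter by name, e.g. "ocrDPI" -> setOcrDPI.
    private static void applySetter(Object target, String property, Object value)
            throws ReflectiveOperationException {
        String name = "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        for (Method m : target.getClass().getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                m.invoke(target, value);
                return;
            }
        }
        throw new NoSuchMethodException(name);
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> pdf = new LinkedHashMap<>();
        pdf.put("ocrDPI", 300);
        pdf.put("sortByPosition", true);
        Map<String, Object> parser = new LinkedHashMap<>();
        parser.put("_class", EmptyDemoParser.class.getName());

        Map<String, Map<String, Object>> settings = new LinkedHashMap<>();
        settings.put(DemoConfig.class.getName(), pdf);
        settings.put(DemoParser.class.getName(), parser);

        Map<Class<?>, Object> context = build(settings);
        DemoConfig config = (DemoConfig) context.get(DemoConfig.class);
        System.out.println(config.getOcrDPI() + " " + config.isSortByPosition());
        System.out.println(context.get(DemoParser.class).getClass().getSimpleName());
    }
}
```

Even this small version has to decide how to match names to setters and how to validate types, which is the "own serialization framework" cost the comment worries about.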
[jira] [Commented] (TIKA-4243) tika configuration overhaul
[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851727#comment-17851727 ]

Tim Allison commented on TIKA-4243:
-----------------------------------

I spent a bit of time trying to serialize ParseContext, and I now remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson annotations or other Jackson-based methods. I suspect that someone who really understands Jackson could do a good job of this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, {{parseContext.set(Parser.class, new EmptyParser())}}, we have to store both the base class name (as the key) and the instantiated class. I think I found out how to do this with Jackson, but it is _messy_ (reference: https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios).

We'd want to deal with embedded objects for the obvious use cases of the CompositeDetectors, etc. where we want to specify a list of detectors. And, we want to be able to cover the cases of setting an object as a parameter -- for example, setting some of the slightly more complex classes in the PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map-of-properties kind of thing where we identify the config class and then instantiate it for the ParseContext with the "properties". We're currently doing something like this in tika-server where we have custom serialization classes for each config we support (PDFParserConfig and TesseractOCRParserConfig) based on the HTTP headers. We'd want to extend this to handle inheritance.

Something along these lines in json:

{code:json}
{
  "settings" : {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)

*What I don't like about this is that we're back in the game of creating our own serialization framework. :(*

> tika configuration overhaul
> ---------------------------
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
> Issue Type: New Feature
> Components: config
> Affects Versions: 3.0.0
> Reporter: Nicholas DiPiazza
> Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a typed configuration schema.
> In 3.x can we remove the old way of doing configs and replace it with JSON Schema?
> JSON Schema can be converted to POJOs using a Maven plugin:
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model we can use for the configs.
> This can allow for the legacy tika-config XML to be read and converted to the new POJOs easily using an XML mapper, so that users don't have to use JSON configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax with the POJO model serialized from the xml/json/yaml.
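To make the "legacy XML read into the new model" step concrete, here is a minimal sketch using only the JDK's DOM parser. It is not how Tika reads its configuration and not what jsonschema2pojo generates: `LegacyParamsSketch` and `readParams` are hypothetical names, and the `<param name="..." type="...">` shape and the `int`/`bool` type names are assumptions modelled on legacy tika-config parameters. The resulting map is the kind of intermediate form that could then be bound to a typed POJO.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class LegacyParamsSketch {

    // Reads <param name="..." type="...">value</param> elements into a typed map.
    static Map<String, Object> readParams(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Map<String, Object> values = new LinkedHashMap<>();
        NodeList params = doc.getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            String text = param.getTextContent().trim();
            Object value;
            // Assumed type names; anything unrecognized is kept as a string.
            switch (param.getAttribute("type")) {
                case "int":
                    value = Integer.valueOf(text);
                    break;
                case "bool":
                    value = Boolean.valueOf(text);
                    break;
                default:
                    value = text;
            }
            values.put(param.getAttribute("name"), value);
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<params>"
                + "<param name=\"ocrDPI\" type=\"int\">300</param>"
                + "<param name=\"sortByPosition\" type=\"bool\">true</param>"
                + "</params>";
        System.out.println(readParams(xml));
    }
}
```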
[jira] [Resolved] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4260. --- Resolution: Duplicate Turns out this is a duplicate. Onwards to TIKA-4243! > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor >
[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Summary: Not getting correct mime type for a few file extensions. example: csv (was: Not getting correct mimet type for few file extensions. example :csv) > Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any.
[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-4267: -- Affects Version/s: 1.28.4 > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any.
[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM: The current version is 2.9.2; please retry with that one. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Media types mapped to the parsers that handle them
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
was (Author: tilman): The current version is 2.9.2, please retry with that one.
> Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any.
[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM: The current version is 2.9.2. Please retry with that one; if it still doesn't work, please attach your CSV file. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Media types mapped to the parsers that handle them
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
was (Author: tilman): The current version is 2.9.2, please retry with that one. Get the list of parsers with this code:
{code:java}
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet()) {
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}
{code}
> Not getting correct mime type for a few file extensions. example: csv > - > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Affects Versions: 1.28.4 >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any.
[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv
[ https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598 ] Tilman Hausherr commented on TIKA-4267: --- The current version is 2.9.2; please retry with that one. > Not getting correct mimet type for few file extensions. example :csv > > > Key: TIKA-4267 > URL: https://issues.apache.org/jira/browse/TIKA-4267 > Project: Tika > Issue Type: Bug >Reporter: niv >Priority: Major > > The mime type for CSV files is always incorrectly detected as text/plain. > Using method > {{detector.detect(stream, metadata);}} > jar file used - Tika 1.28.4 > How can I get the correct mime type in a Java application? > Please point me to a list of the extensions currently supported by the latest > jar, with their mime types, if any.
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851587#comment-17851587 ] ASF GitHub Bot commented on TIKA-4260: -- tballison closed pull request #1776: TIKA-4260 > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor >
[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x
[ https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851588#comment-17851588 ] ASF GitHub Bot commented on TIKA-4260: -- tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2144945579 I merged this into @nddipiazza 's TIKA-4252 PR. > Add parse context to the fetcher interface in 3.x > - > > Key: TIKA-4260 > URL: https://issues.apache.org/jira/browse/TIKA-4260 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Minor >
Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]
tballison closed pull request #1776: TIKA-4260 -- add ParseContext to fetchers and emitters URL: https://github.com/apache/tika/pull/1776 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]
tballison commented on PR #1776: URL: https://github.com/apache/tika/pull/1776#issuecomment-2144945579 I merged this into @nddipiazza 's TIKA-4252 PR.