[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851779#comment-17851779
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145904427

   Ha, @nddipiazza. I did earlier this morning. I chose your choices over mine 
in the merge, largely.
   
   See 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel=17851727#comment-17851727
   
   What we now need to do is figure out how to serialize+deserialize 
ParseContext with as little work as possible. :D
   




> PipesClient#process - seems to lose the Fetch input metadata?
> --------------------------------------------------------------
>
> Key: TIKA-4252
> URL: https://issues.apache.org/jira/browse/TIKA-4252
> Project: Tika
> Issue Type: Bug
> Reporter: Nicholas DiPiazza
> Priority: Major
> Fix For: 3.0.0
>
>
> When calling:
>
>     PipesResult pipesResult = pipesClient.process(
>             new FetchEmitTuple(request.getFetchKey(),
>                     new FetchKey(fetcher.getName(), request.getFetchKey()),
>                     new EmitKey(), tikaMetadata,
>                     HandlerConfig.DEFAULT_HANDLER_CONFIG,
>                     FetchEmitTuple.ON_PARSE_EXCEPTION.SKIP));
>
> the tikaMetadata is not present in the fetch data when the fetch method is
> called.
>
> It's OK through this part:
>
>     UnsynchronizedByteArrayOutputStream bos =
>             UnsynchronizedByteArrayOutputStream.builder().get();
>     try (ObjectOutputStream objectOutputStream = new ObjectOutputStream(bos)) {
>         objectOutputStream.writeObject(t);
>     }
>     byte[] bytes = bos.toByteArray();
>     output.write(CALL.getByte());
>     output.writeInt(bytes.length);
>     output.write(bytes);
>     output.flush();
>
> I verified that the bytes contain the expected metadata at that point.
>
> UPDATE: found the issue.
>
> org.apache.tika.pipes.PipesServer#parseFromTuple
>
> always creates a new Metadata object. It should fall back to empty metadata
> only when the fetch tuple's metadata is null.
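
For readers following along, here is a minimal sketch of the round trip and the fix described in the issue. It is not the Tika code: `Tuple`, `writeFrame`, `readFrame`, `metadataFor` and the `CALL` value are hypothetical stand-ins, and a plain `HashMap` stands in for Tika's `Metadata`. It only illustrates the length-prefixed framing quoted above and the "use empty metadata only if the tuple's metadata is null" rule.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; these are NOT Tika's FetchEmitTuple, Metadata or PipesServer.
final class TupleRoundTrip {

    /** Simplified tuple: a fetch key plus optional caller-supplied metadata. */
    static final class Tuple implements Serializable {
        private static final long serialVersionUID = 1L;
        final String fetchKey;
        final HashMap<String, String> metadata; // may be null

        Tuple(String fetchKey, HashMap<String, String> metadata) {
            this.fetchKey = fetchKey;
            this.metadata = metadata;
        }
    }

    static final byte CALL = 1; // assumed command byte, for illustration only

    /** Client side: command byte, 4-byte length prefix, then the serialized tuple. */
    static byte[] writeFrame(Tuple t) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(t);
        }
        byte[] payload = bos.toByteArray();
        ByteArrayOutputStream frame = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(frame)) {
            out.write(CALL);
            out.writeInt(payload.length);
            out.write(payload);
            out.flush();
        }
        return frame.toByteArray();
    }

    /** Server side: read the frame back into a tuple. */
    static Tuple readFrame(byte[] frame) throws IOException, ClassNotFoundException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            if (in.readByte() != CALL) {
                throw new IOException("unexpected command byte");
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            try (ObjectInputStream ois =
                         new ObjectInputStream(new ByteArrayInputStream(payload))) {
                return (Tuple) ois.readObject();
            }
        }
    }

    /** The fix: reuse the tuple's metadata and fall back to empty only when it is null. */
    static Map<String, String> metadataFor(Tuple t) {
        return t.metadata != null ? t.metadata : new HashMap<>();
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> md = new HashMap<>();
        md.put("Authorization", "token-from-caller");
        Tuple received = readFrame(writeFrame(new Tuple("doc-1", md)));
        System.out.println(metadataFor(received));
        System.out.println(metadataFor(readFrame(writeFrame(new Tuple("doc-2", null)))));
    }
}
```

The bug pattern in the report corresponds to `metadataFor` returning `new HashMap<>()` unconditionally: the bytes on the wire are fine, but the receiver discards what it deserialized.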



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] TIKA-4252: switch to using the parse context for additional http headers [tika]

2024-06-03 Thread via GitHub


tballison commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145904427



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-4252) PipesClient#process - seems to lose the Fetch input metadata?

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851777#comment-17851777
 ] 

ASF GitHub Bot commented on TIKA-4252:
--

nddipiazza commented on PR #1778:
URL: https://github.com/apache/tika/pull/1778#issuecomment-2145900710

   Sure, will do, @tballison. Sorry, I didn't see this until now.







[jira] [Comment Edited] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:10 PM:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store both 
the base class name (as the key) and the class that was instantiated. I think 
I found out how to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).
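
To make the base-class-key point concrete, here is a minimal, Jackson-free sketch. Everything in it is a hypothetical stand-in (`Context`, `Parser`, `EmptyParser`, `toClassNames`, `fromClassNames`); it is not the Tika `ParseContext` and not how Jackson handles subtypes. It assumes each stored object has a public no-arg constructor and carries no state, which is exactly the part that gets harder with real configs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins; not Tika's ParseContext, Parser or EmptyParser.
final class TypedContextSketch {

    interface Parser {
        String name();
    }

    public static final class EmptyParser implements Parser {
        public EmptyParser() { }

        @Override
        public String name() {
            return "empty";
        }
    }

    /** Minimal context keyed by the lookup (base) type. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new LinkedHashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }

        /** Flatten to base-class-name -> concrete-class-name; instance state is not captured. */
        Map<String, String> toClassNames() {
            Map<String, String> out = new LinkedHashMap<>();
            entries.forEach((k, v) -> out.put(k.getName(), v.getClass().getName()));
            return out;
        }

        /** Rebuild by loading both classes and calling the concrete class's no-arg constructor. */
        static Context fromClassNames(Map<String, String> names)
                throws ReflectiveOperationException {
            Context ctx = new Context();
            for (Map.Entry<String, String> e : names.entrySet()) {
                Class<?> base = Class.forName(e.getKey());
                Object instance =
                        Class.forName(e.getValue()).getDeclaredConstructor().newInstance();
                // cast() throws ClassCastException if the concrete class is not a subtype of the key
                ctx.entries.put(base, base.cast(instance));
            }
            return ctx;
        }
    }

    public static void main(String[] args) throws Exception {
        Context original = new Context();
        original.set(Parser.class, new EmptyParser());
        Map<String, String> wire = original.toClassNames();
        Context restored = Context.fromClassNames(wire);
        System.out.println(wire);
        System.out.println(restored.get(Parser.class).name());
    }
}
```

The key and the value have to travel as two separate class names because the lookup type (`Parser`) cannot be recovered from the instance (`EmptyParser`) alone.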

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map-of-properties 
approach, where we identify the config class and then instantiate it for the 
ParseContext with the "properties". We're currently doing something like this 
in tika-server, where we have custom serialization classes for each config we 
support (PDFParserConfig and TesseractOCRParserConfig), based on the HTTP 
headers. We'd want to extend this to handle inheritance and embedded objects...

Something along these lines in JSON:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true,
      "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory.class": {
        "_class": "com.tika.custom.OurCompanysFactory",
        "speed": "blazing",
        "dpi": 1000
      }
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)
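
As a rough illustration of that "small bit of code", here is a sketch that fills a config object from a map by calling matching setters through reflection. `PdfConfig` and `fromProperties` are hypothetical names made up for this example; `PdfConfig` is a stand-in and not the real PDFParserConfig, and the `new PDFParserConfig(Map)` constructor above is a proposal, not an existing API. The sketch covers only flat primitive properties, not inheritance or embedded objects.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in; not the real PDFParserConfig.
final class MapConfigSketch {

    public static final class PdfConfig {
        private int ocrDPI = 72;
        private boolean sortByPosition;

        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getOcrDPI() { return ocrDPI; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    /** Instantiate the config class and apply each property through a one-argument setter. */
    static <T> T fromProperties(Class<T> type, Map<String, Object> props)
            throws ReflectiveOperationException {
        T config = type.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, Object> e : props.entrySet()) {
            String key = e.getKey();
            String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            Method match = null;
            for (Method m : type.getMethods()) {
                if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                    match = m;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException("no setter for property: " + key);
            }
            // Method.invoke unboxes Integer/Boolean when the parameter is a primitive
            match.invoke(config, e.getValue());
        }
        return config;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("ocrDPI", 300);
        props.put("sortByPosition", true);
        PdfConfig cfg = fromProperties(PdfConfig.class, props);
        System.out.println(cfg.getOcrDPI() + " " + cfg.isSortByPosition());
    }
}
```

Unknown keys fail loudly here; a real implementation would also need type conversion (for example, JSON numbers arriving as Long) and a rule for nested `_class` entries.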

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*


[jira] [Commented] (TIKA-4243) tika configuration overhaul

2024-06-03 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851727#comment-17851727
 ] 

Tim Allison commented on TIKA-4243:
---

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class. I think I found out how 
to do this with Jackson, but it is _messy_ (reference: 
https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter -- for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig. 

I'm wondering if it would be simpler to back off to a Map-of-properties 
approach, where we identify the config class and then instantiate it for the 
ParseContext with the "properties". We're currently doing something like this 
in tika-server, where we have custom serialization classes for each config we 
support (PDFParserConfig and TesseractOCRParserConfig), based on the HTTP 
headers. We'd want to extend this to handle inheritance.

Something along these lines in json: 

{code:json}
{
  "settings": {
    "PDFParserConfig.class": {
      "ocrDPI": 300,
      "sortByPosition": true
    }
  }
}
{code}

Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(Map)
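
One way that "small bit of code" might look, as a rough sketch only: a generic helper that instantiates a config class and applies each map entry through a matching setter. {{DemoPdfConfig}} and {{fromProperties}} are hypothetical names invented for this illustration (it is not Tika's PDFParserConfig), and the sketch assumes a public no-arg constructor plus one single-argument {{setXxx}} method per property.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

final class MapConfigSketch {

    // Hypothetical config bean used only for this illustration.
    public static final class DemoPdfConfig {
        private int ocrDPI = 72;
        private boolean sortByPosition = false;

        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getOcrDPI() { return ocrDPI; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    // Creates an instance of 'type' and calls set<Key>(value) for each map entry.
    static <T> T fromProperties(Class<T> type, Map<String, Object> props) {
        try {
            T config = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, Object> e : props.entrySet()) {
                String key = e.getKey();
                String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
                Method match = null;
                // Takes the first single-argument setter with that name;
                // overloaded setters would need extra disambiguation.
                for (Method m : type.getMethods()) {
                    if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                        match = m;
                        break;
                    }
                }
                if (match == null) {
                    throw new IllegalArgumentException("Unknown property: " + key);
                }
                match.invoke(config, e.getValue());
            }
            return config;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException("Cannot build " + type.getName(), ex);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("ocrDPI", 300);
        props.put("sortByPosition", true);

        DemoPdfConfig config = fromProperties(DemoPdfConfig.class, props);
        System.out.println(config.getOcrDPI() + " " + config.isSortByPosition());
    }
}
```

Rejecting unknown keys rather than ignoring them is a deliberate choice in this sketch, so that a typo in a config file surfaces immediately.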

*What I don't like about this is that we're back in the game of creating our 
own serialization framework.* :(


> tika configuration overhaul
> ---
>
> Key: TIKA-4243
> URL: https://issues.apache.org/jira/browse/TIKA-4243
> Project: Tika
>  Issue Type: New Feature
>  Components: config
>Affects Versions: 3.0.0
>Reporter: Nicholas DiPiazza
>Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed 
> Configuration schema. 
> In 3.x can we remove the old way of doing configs and replace with Json 
> Schema?
> Json Schema can be converted to Pojos using a maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new pojos easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, 
> and replace with the Pojo model serialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4260.
---
Resolution: Duplicate

Turns out this is a duplicate. Onwards to TIKA-4243!

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Updated] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Summary: Not getting correct mime type for a few file extensions. example: 
csv  (was: Not getting correct mimet type for few file extensions. example :csv)

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files incorrectly detected as text/plain always.
> Using method 
> {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions, with their mime types, currently 
> supported by the latest jar, if one exists.





[jira] [Updated] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4267:
--
Affects Version/s: 1.28.4

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files incorrectly detected as text/plain always.
> Using method 
> {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions, with their mime types, currently 
> supported by the latest jar, if one exists.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:06 PM:


The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
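// Needs java.util.Map, org.apache.tika.Tika, org.apache.tika.mime.MediaType and
// org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}.
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Each supported media type mapped to the parser that handles it.
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}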
{code}


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files incorrectly detected as text/plain always.
> Using method 
> {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions, with their mime types, currently 
> supported by the latest jar, if one exists.





[jira] [Comment Edited] (TIKA-4267) Not getting correct mime type for a few file extensions. example: csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr edited comment on TIKA-4267 at 6/3/24 12:07 PM:


The current version is 2.9.2, please retry with that one; if it still doesn't 
work, please attach your csv file.

Get the list of parsers with this code:
{code:java}
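// Needs java.util.Map, org.apache.tika.Tika, org.apache.tika.mime.MediaType and
// org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}.
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Each supported media type mapped to the parser that handles it.
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}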
{code}


was (Author: tilman):
The current version is 2.9.2, please retry with that one.

Get the list of parsers with this code:
{code:java}
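// Needs java.util.Map, org.apache.tika.Tika, org.apache.tika.mime.MediaType and
// org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}.
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
// Each supported media type mapped to the parser that handles it.
Map<MediaType, Parser> parsers = parser.getParsers(context);
Tika tika = new Tika();
System.out.println(tika.toString());
System.out.println("List of parsers: ");
int idx = 0;
for (Map.Entry<MediaType, Parser> p : parsers.entrySet())
{
    MediaType t = p.getKey();
    System.out.println((idx + 1) + ".- " + t.getType() + "/" + t.getSubtype());
    ++idx;
}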
{code}

> Not getting correct mime type for a few file extensions. example: csv
> -
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.28.4
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files incorrectly detected as text/plain always.
> Using method 
> {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions, with their mime types, currently 
> supported by the latest jar, if one exists.





[jira] [Commented] (TIKA-4267) Not getting correct mimet type for few file extensions. example :csv

2024-06-03 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851598#comment-17851598
 ] 

Tilman Hausherr commented on TIKA-4267:
---

The current version is 2.9.2, please retry with that one.

> Not getting correct mimet type for few file extensions. example :csv
> 
>
> Key: TIKA-4267
> URL: https://issues.apache.org/jira/browse/TIKA-4267
> Project: Tika
>  Issue Type: Bug
>Reporter: niv
>Priority: Major
>
> Mime type for CSV files incorrectly detected as text/plain always.
> Using method 
> {{detector.detect(stream, metadata);}}
> jar file used - Tika 1.28.4
> How can I get the correct mime type in a Java application?
> Please point me to a list of the extensions, with their mime types, currently 
> supported by the latest jar, if one exists.





[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851587#comment-17851587
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison closed pull request #1776: TIKA-4260 

> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






[jira] [Commented] (TIKA-4260) Add parse context to the fetcher interface in 3.x

2024-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851588#comment-17851588
 ] 

ASF GitHub Bot commented on TIKA-4260:
--

tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2144945579

   I merged this into @nddipiazza 's TIKA-4252 PR.




> Add parse context to the fetcher interface in 3.x
> -
>
> Key: TIKA-4260
> URL: https://issues.apache.org/jira/browse/TIKA-4260
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>






Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]

2024-06-03 Thread via GitHub


tballison closed pull request #1776: TIKA-4260 -- add ParseContext to fetchers 
and emitters
URL: https://github.com/apache/tika/pull/1776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] TIKA-4260 -- add ParseContext to fetchers and emitters [tika]

2024-06-03 Thread via GitHub


tballison commented on PR #1776:
URL: https://github.com/apache/tika/pull/1776#issuecomment-2144945579

   I merged this into @nddipiazza 's TIKA-4252 PR.

