[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-08-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436705#comment-15436705
 ] 

Ray Gauss II commented on TIKA-2056:


My guess is that when Exiftool is available on the command line, the existing 
[external parser is 
enabled|https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]
 as part of the {{CompositeExternalParser}}, which would get included in the 
{{AutoDetectParser}}, and something in that chain is failing serialization.

Perhaps because 
[ExternalParser.LineConsumer|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59]
 is not Serializable?
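
For reference, a minimal sketch (not from the test suite) of the serialization 
step the fork parser relies on; if anything reachable from the 
{{AutoDetectParser}} holds a non-Serializable field, {{writeObject}} fails much 
like the stack trace below:
{code:java}
// Minimal sketch, assuming a default Tika setup on the classpath; it simply
// mirrors the Java serialization that ForkParser performs before forking.
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

import org.apache.tika.parser.AutoDetectParser;

public class ParserSerializationCheck {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            // throws NotSerializableException if any parser in the chain
            // holds a field that cannot be serialized
            out.writeObject(parser);
        }
    }
}
{code}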

> Installing exiftool causes ForkParserIntegration test errors
> 
>
> Key: TIKA-2056
> URL: https://issues.apache.org/jira/browse/TIKA-2056
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Chris A. Mattmann
>
> [~rgauss] maybe you can help me with this. For some reason when I was trying 
> your PR, I got all sorts of weird errors that I thought had to do with your 
> PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] 
> I've seen you've contributed to the Fork parser tests so tagging you on this 
> too. Any reason you guys can think of that exiftool causes the Fork parser 
> integration tests to fail?
> Here's the log msg (that I thought was due to the Sentiment parser, but is in 
> fact not!):
> {noformat}
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 124 source files to 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Some input files use or override a deprecated API.
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Recompile with -Xlint:deprecation for details.
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers ---
> [INFO] Surefire report directory: 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports
> ---
>  T E S T S
> ---
> Running org.apache.tika.parser.fork.ForkParserIntegrationTest
> Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< 
> FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest
> testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest)  
> Time elapsed: 0.185 sec  <<< ERROR!
> org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser 
> to pass to the Forked Parser
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke

[jira] [Commented] (TIKA-774) ExifTool Parser

2016-03-23 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209162#comment-15209162
 ] 

Ray Gauss II commented on TIKA-774:
---

bq. we should add a static check for whether exiftool is available and adjust 
"handled" mimes at that point.

I think we'll find other areas to improve on as well; I just wanted to get the 
ball rolling again on the contribution and review, as we had to close the source 
on the stand-alone project mentioned above.

bq. I should have a chance to look more closely early next week, but I doubt 
there's reason to wait for my feedback.

We'd value your feedback, and since it's been over 4 years, we can wait a few more 
weeks. :)

bq. Is this a replacement for the one I hacked together?

There's the possibility for the two to coexist, perhaps requiring this parser 
to be explicitly called programmatically.

At a high level the biggest differences are:
# As mentioned in TIKA-1639, there's an extensive mapping from ExifTool's 
namespace to proper Tika properties (currently done programmatically; see the 
sketch below)
# It includes the ability to embed, i.e. writing metadata back into binary files 
(TIKA-776)
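
As a rough illustration of the first point (the property name below is made up, 
not from the actual mapping in the contribution):
{code:java}
// Hypothetical sketch only: maps a single ExifTool tag name onto a Tika Property.
// The real contribution defines a much more extensive programmatic mapping.
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Property;

public class ExifToolMappingSketch {
    // illustrative property name, not an existing Tika constant
    private static final Property EXIFTOOL_ISO = Property.internalInteger("exiftool:ISO");

    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.set(EXIFTOOL_ISO, 200); // value as reported by ExifTool for the image
        System.out.println(metadata.get(EXIFTOOL_ISO));
    }
}
{code}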

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires ExifTool be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.13
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-23 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1906:
---
Fix Version/s: 1.13
   2.0

> ExternalParser No Longer Supports Commands in Array Format
> --
>
> Key: TIKA-1906
> URL: https://issues.apache.org/jira/browse/TIKA-1906
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 2.0, 1.13
>
>
> After the changes in TIKA-1638 the ExternalParser now ignores commands 
> specified as a string array and assumes commands will be in a single string 
> with a space delimiter.
> Both formats should be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-23 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1906.

Resolution: Fixed

> ExternalParser No Longer Supports Commands in Array Format
> --
>
> Key: TIKA-1906
> URL: https://issues.apache.org/jira/browse/TIKA-1906
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 2.0, 1.13
>
>
> After the changes in TIKA-1638 the ExternalParser now ignores commands 
> specified as a string array and assumes commands will be in a single string 
> with a space delimiter.
> Both formats should be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206138#comment-15206138
 ] 

Ray Gauss II edited comment on TIKA-1906 at 3/22/16 2:37 PM:
-

bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x build is happy again.


was (Author: rgauss):
bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x is happy again.

> ExternalParser No Longer Supports Commands in Array Format
> --
>
> Key: TIKA-1906
> URL: https://issues.apache.org/jira/browse/TIKA-1906
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> After the changes in TIKA-1638 the ExternalParser now ignores commands 
> specified as a string array and assumes commands will be in a single string 
> with a space delimiter.
> Both formats should be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206138#comment-15206138
 ] 

Ray Gauss II commented on TIKA-1906:


bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x is happy again.

> ExternalParser No Longer Supports Commands in Array Format
> --
>
> Key: TIKA-1906
> URL: https://issues.apache.org/jira/browse/TIKA-1906
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> After the changes in TIKA-1638 the ExternalParser now ignores commands 
> specified as a string array and assumes commands will be in a single string 
> with a space delimiter.
> Both formats should be supported.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-21 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1906:
--

 Summary: ExternalParser No Longer Supports Commands in Array Format
 Key: TIKA-1906
 URL: https://issues.apache.org/jira/browse/TIKA-1906
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Ray Gauss II
Assignee: Ray Gauss II


After the changes in TIKA-1638 the ExternalParser now ignores commands 
specified as a string array and assumes commands will be in a single string 
with a space delimiter.

Both formats should be supported.
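
For illustration, the two formats look roughly like this (a sketch, not a test 
from the fix):
{code:java}
// Sketch of the two command formats ExternalParser should accept.
import org.apache.tika.parser.external.ExternalParser;

public class CommandFormats {
    public static void main(String[] args) {
        // single-string form: the parser splits the command on spaces
        ExternalParser singleString = new ExternalParser();
        singleString.setCommand("exiftool " + ExternalParser.INPUT_FILE_TOKEN);

        // array form: each argument is passed through unchanged
        // (the format this issue restores)
        ExternalParser array = new ExternalParser();
        array.setCommand("exiftool", ExternalParser.INPUT_FILE_TOKEN);
    }
}
{code}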



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196030#comment-15196030
 ] 

Ray Gauss II commented on TIKA-1607:


bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly that alone would still force users to write 'Tika-based' XMP parsers 
rather than allowing them access to the RAW XMP encoded bytes you're referring 
to in the last sentence, which I do agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method 
currently takes a {{Metadata}} object which is only associated with the 
embedded resource (not the same metadata object associated with the 'container' 
file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources or null if no resources
     * were stored during parsing
     *
     * @return the embedded resources
     */
    Map getEmbeddedResources();

}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context (see the wiring sketch below)?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and 
make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some {{EmbeddedResources}} object to be optionally populated along with 
the {{Metadata}} in the {{Parser.parse}} method?

Other options?  Maybe they don't need the RAW XMP?
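
To make Option 2 a bit more concrete, a hedged wiring sketch (the storing 
implementation itself is hypothetical; the registration mechanism is the same 
one the existing extractors use):
{code:java}
// Wiring sketch for Option 2: a StoringEmbeddedDocumentExtractor implementation
// would be set in the ParseContext exactly like ParsingEmbeddedDocumentExtractor is here.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class Option2Wiring {
    public static void main(String[] args) throws Exception {
        ParseContext context = new ParseContext();
        context.set(EmbeddedDocumentExtractor.class,
                new ParsingEmbeddedDocumentExtractor(context));
        try (InputStream stream = Files.newInputStream(Paths.get("container.pdf"))) {
            new AutoDetectParser().parse(stream, new DefaultHandler(), new Metadata(), context);
        }
        // a storing implementation would then be asked for its stored resources here
    }
}
{code}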

I'm also aware that we've strayed a bit from the original issue here of 
structured metadata.  Should we create a separate issue?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM:
-

Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedDocumentExtractor}} 
implementation they could use without resorting to extracting them to files?


was (Author: rgauss):
Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195326#comment-15195326
 ] 

Ray Gauss II commented on TIKA-1607:


Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI 
option which sticks a {{FileEmbeddedDocumentExtractor}} in the context.
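For example (jar name and paths are illustrative):
{noformat}
java -jar tika-app.jar --extract --extract-dir=/tmp/embedded some-container.pdf
{noformat}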

The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when 
enabled, would also save the embedded resources in memory for an advanced user 
to do whatever they need, knowing the risk and resources required for that 
option?

Or provide some other in-memory implementation that advanced users could 
explicitly set in the context?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II commented on TIKA-1607:


Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1894) Add XMPMM metadata extraction to JempboxExtractor

2016-03-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193622#comment-15193622
 ] 

Ray Gauss II commented on TIKA-1894:


The {{tika-xmp}} project deals with converting a populated Tika {{Metadata}} 
object into XMP.

Perhaps that project should be renamed to something more specific at some 
point, but regardless, I don't think it's the right spot for this sort of 
shared parser code.

I'd vote for the simpler shared util jar, but I think it can still live next to 
the modules, something like {{/tika-parsers-modules/tika-parser-xmp-commons}}?

> Add XMPMM metadata extraction to JempboxExtractor
> -
>
> Key: TIKA-1894
> URL: https://issues.apache.org/jira/browse/TIKA-1894
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Priority: Minor
>
> The XMP Media Management (XMPMM) section of xmp carries some useful 
> information.  We currently have keys for many of the important attributes in 
> tika-core's o.a.t.metadata.XMPMM, and JempBox extracts the XMPMM schema, but 
> the wiring between the two has not yet been installed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167135#comment-15167135
 ] 

Ray Gauss II commented on TIKA-1607:


I know there can be multiple XMP packets in a single file, but do we have many 
other examples where we'd need multiple DOMs associated with a single file?

I'm trying to understand if the metadata is really the right place for this.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154205#comment-15154205
 ] 

Ray Gauss II commented on TIKA-1607:


In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know 
(bother to read) what the designated ones were so I just used 'other'".

{{getBytes}} feels like 'other'.

While people could still do really stupid things with {{getDOM}} if they wanted 
to, {{getBytes}} seems to encourage a developer to go ahead and try to use each 
frame of a 120fps 8K video as a 'metadata' value.  An extreme and unlikely 
example of course, but you get the gist.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149231#comment-15149231
 ] 

Ray Gauss II commented on TIKA-1607:


Are we opening a can of worms by encouraging the use of a byte array directly 
with no restrictions on length, etc.?

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130386#comment-15130386
 ] 

Ray Gauss II commented on TIKA-1824:


bq. Thank you, Bob Paulin! Again, this is fantastic.

Indeed, thanks!

bq. Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module

Now that the change is in there it seems a bit redundant to have both parser and 
module in every artifact ID.  {{tika-parser-*}} already follows the least to most 
specific precedence, so perhaps we could just drop the module suffix?

I had some concerns over the apparent duplication of dependencies / versions 
but it looks like that will be addressed in TIKA-1847.

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial break down of parser modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-09-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719
 ] 

Ray Gauss II commented on TIKA-1607:


Hi [~talli...@mitre.org], apologies for the delay on responding here.

1. POJOs
bq. We might have better documentation of POJOs and compile-time guarantees 
about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 
'helper' classes that know how to more easily get and set values for particular 
schemas that we'd like to focus on.

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd 
and maintain it?

I'd vote for sticking as true to a specification's original schema as possible 
when there is one but whether we'd want to build and maintain for those that 
don't is a good question.

2. Passthrough
bq. why couldn't we literally pass that through via the String version of the 
xml?

I think we could, but we'd first have to 'merge' with the metadata being 
modeled by the parsers and could then allow access to the full DOM {{Document}} 
object which clients could easily serialize to a string if need be.
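
That client-side serialization is just standard JAXP, e.g. (a sketch, assuming 
the metadata exposed such a {{Document}}):
{code:java}
// Sketch: serialize an org.w3c.dom.Document to a String using standard JAXP.
import java.io.StringWriter;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;

public class DomToString {
    static String toXmlString(Document doc) throws Exception {
        StringWriter writer = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }
}
{code}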

3. Serialization to JSON
There seem to be several libraries available that can help with XML to JSON, 
though I don't think this would belong in core.

4. Multilingual fields
Great question.  XMP uses RDF and xml:lang:
{noformat}
<dc:title>
  <rdf:Alt>
    <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
    <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
  </rdf:Alt>
</dc:title>
{noformat}
that's one possibility.

bq. I'm wondering if we want to add structure only where structured data 
doesn't exist within the document and let the client parse what they'd like out 
of structured metadata that is in the document?

This also relates to passthrough above but one thing to keep in mind is that 
the metadata we're parsing could be coming from several different parts of the 
binary.  For example, EXIF doesn't necessarily also live in XMP (though most 
apps also write it there these days) and there can be more than one XMP packet 
present in a file.  It would be nice to bring these different sources into a 
unified persistence structure, even if for simpler metadata everything lives at 
the top level.

bq. how do we transfer as much normalized/structured metadata as possible in as 
simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full 
DOM {{Document}} object.

Thanks for keeping the discussion going.  We obviously need to take great care 
in changing such a fundamental area of the code.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706706#comment-14706706
 ] 

Ray Gauss II commented on TIKA-1607:


Yes, by shoehorn I meant that the index is embedded in the key (in this case 
sub-group name) and that all parsers and consuming client apps must know to 
utilize that syntax rather than either a separate, explicit index field or a 
well defined structure like that of the DOM approach.

Perhaps we should flesh out a solid requirements list (possibly using the 
[comment 
above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441]
 as a starting point).

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-20 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704880#comment-14704880
 ] 

Ray Gauss II commented on TIKA-1607:


I did see that, but I was after full URI namespaces, i.e. 
{{http://purl.org/dc/elements/1.1/}}, not just prefixes.

The OODT approach looks like you'd have to shoehorn the "index" into the group 
name, much like the tika-ffmpeg workaround, rather than a more strictly defined 
structure.

OODT might support deeper structures in the inner {{Group}} class, but the 
public methods appear to only support a single level?  For example, how could 
one get to something like the value of the city of the 3rd contact's 2nd 
address, i.e. "p1:contact[2]/p1:address[1]/p1:city"?

We could mimic XPath syntax but the DOM approach allows us to use 
{{javax.xml.xpath.XPath}} processing.  From the [test mentioned 
above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}
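
Under the covers that's plain {{javax.xml.xpath}} processing, roughly (a sketch 
of what such a method could delegate to, not the actual implementation in the 
fork):
{code:java}
// Sketch of evaluating the same expression with the standard javax.xml.xpath API.
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;

public class XPathSketch {
    static String telUri(Document metadataDom, NamespaceContext namespaces) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(namespaces); // resolves the tika: and vcard: prefixes
        return xpath.evaluate("/tika:metadata/vcard:tel[1]/vcard:uri", metadataDom);
    }
}
{code}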

The DOM approach would also allow us to leverage things like attributes to 
further describe a particular metadata value in the future if need be.

We might also be able to "pass through" entire metadata structures that Tika 
hasn't explicitly modeled.

It's certainly a larger change, but I think it gives us a lot more options.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704108#comment-14704108
 ] 

Ray Gauss II commented on TIKA-1607:


[~chrismattmann], I did.

It seemed more similar to the XPath-like workaround I described with the notion 
of groups in the store, rather than the full-fledged DOM store proposed in the 
GitHub fork, i.e. I didn't see where anything was namespaced.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924
 ] 

Ray Gauss II commented on TIKA-1607:


I've put together the start of the DOM metadata store option on [GitHub as 
well|https://github.com/apache/tika/compare/trunk...rgauss:trunk].

The crux of the change is using a {{org.w3c.dom.Document}} object instead of a 
{{Map}} as the metadata store and Property objects based on 
{{QName}}s instead of Strings.

A few things to note:
* This does bring in commons-lang for XML escaping, we could change if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but 
I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed

The [test 
added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]
 demonstrates creating a DOM structure, adding it to the metadata, then pulling 
it out both programmatically and via XPath expression (sticking to the 
telephone number example).

That programmatic creation of the DOM structure is a bit cumbersome and we 
could certainly employ Java classes specific to each standard as a convenience 
(somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the 
generic nature of the DOM store.

The {{toString}} method of the metadata object after building that example is 
properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<tika:metadata xmlns:tika="http://tika.apache.org/"
    xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
  <vcard:tel>
    <vcard:parameters>
      <vcard:type>work</vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}

There's obviously lots of room for improvement and discussion but I wanted to 
put it out there before the momentum on this slows.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441
 ] 

Ray Gauss II commented on TIKA-1607:


To clarify, the work mentioned above that uses an XPath-like syntax is only a 
workaround for mapping structured metadata into the current 'flat' metadata 
model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. 
(maybe that's now?)

This is simply restating some of what's already been said, but there are many 
aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our 
serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM 
({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and 
getting child nodes, etc. rather than hard coding POJOs for each metadata 
standard we want to support.
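
Roughly what that could look like for the phone number example (a sketch of the 
approach only, not a proposed API):
{code:java}
// Sketch of using plain org.w3c.dom classes as a namespaced metadata store.
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomMetadataSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElementNS("http://tika.apache.org/", "tika:metadata");
        doc.appendChild(root);
        // append a namespaced child node for one phone number
        Element tel = doc.createElementNS("urn:ietf:params:xml:ns:vcard-4.0", "vcard:tel");
        tel.setTextContent("tel:+1-800-555-1234");
        root.appendChild(tel);
    }
}
{code}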

I'll try to find some time to put together an example patch for that approach 
in the next few days.

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.10
>
> Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1607) Introduce new HashMap data structure for persistence of Tika Metadata

2015-04-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054
 ] 

Ray Gauss II commented on TIKA-1607:


We've had a few discussions on structured metadata over the years, some of 
which was captured in the [MetadataRoadmap Wiki 
page|http://wiki.apache.org/tika/MetadataRoadmap].

I'd agree that we should strive to maintain backwards compatibility for simple 
values.

I think we should also consider serialization of the metadata store, not just 
in the {{Serializable}} interface sense, but perhaps being able to easily 
marshal the entire metadata store into JSON and XML.

As [~gagravarr] points out, work has been done to express structured metadata 
via the existing metadata store.  In that email thread you'll find reference to 
the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

> Introduce new HashMap data structure for persistence of Tika 
> Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.9
>
>
> I am currently working implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection HashMap<String, HashMap<String, String>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1594) Webp parsing support

2015-04-07 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484463#comment-14484463
 ] 

Ray Gauss II commented on TIKA-1594:


I'd recommend that for now we trim since {{Metadata.IMAGE_*}} properties are 
defined as {{Property.internalInteger}}.

In the future I think we should consider changing to (or perhaps adding) more 
generally useful dimension properties, like {{Dimensions}} from the [additional 
properties of 
XMP|http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart2.pdf]
 (section 1.2.2.2) which includes a {{unit}} field.
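A minimal sketch of the trimming approach, assuming the raw value arrives as a 
string such as "100 pixels" (the exact format reported by the library is an 
assumption here):

{code}
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TIFF;

public class DimensionTrimExample {
    public static void setWidth(Metadata metadata, String rawWidth) {
        // e.g. "100 pixels" -> "100"; keep only the numeric part before setting
        String numeric = rawWidth.trim().split("\\s+")[0];
        metadata.set(TIFF.IMAGE_WIDTH, Integer.parseInt(numeric));
    }
}
{code}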

> Webp parsing support
> 
>
> Key: TIKA-1594
> URL: https://issues.apache.org/jira/browse/TIKA-1594
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7
>Reporter: Jan Kronquist
>
> webp content type is correctly detected, but parsing is not supported. 
> I noticed that metadata-extractor 2.8.0 supports webp:
> https://github.com/drewnoakes/metadata-extractor/issues/85
> However, Tika does not currently work with this version (I tried manually 
> overriding the dependency). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342547#comment-14342547
 ] 

Ray Gauss II commented on TIKA-634:
---

Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

There we recently had to patch {{ExternalParser}} for some stream parsing 
concurrency problems which should be raised in a separate issue here shortly.

> Command Line Parser for Metadata Extraction
> ---
>
> Key: TIKA-634
> URL: https://issues.apache.org/jira/browse/TIKA-634
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 0.9
>Reporter: Nick Burch
>Assignee: Nick Burch
>Priority: Minor
>
> As discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
> This issue is to track improvements in the ExternalParser support to handle 
> metadata extraction, and probably easier configuration of an external parser 
> too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files

2015-01-12 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273520#comment-14273520
 ] 

Ray Gauss II commented on TIKA-1510:


Yes.

The only reason I haven't myself is that I've been trying to find some time to 
refactor the vorbis stuff per the previous 
[conversation|http://mail-archives.apache.org/mod_mbox/tika-dev/201408.mbox/%3calpine.deb.2.02.1408221155450.8...@urchin.earth.li%3E]
 with [~gagravarr].

> FFMpeg installed but not parsing video files
> 
>
> Key: TIKA-1510
> URL: https://issues.apache.org/jira/browse/TIKA-1510
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: FFMPEG, Mac OS X 10.9 with HomeBrew
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
>
> I have FFMPEG installed with homebrew:
> {noformat}
> # brew install ffmpeg
> {noformat}
> I've got some AVI files and have tried to parse them with Tika:
> {noformat}
> [chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
> Content-Length: 334917340
> Content-Type: video/x-msvideo
> X-Parsed-By: org.apache.tika.parser.EmptyParser
> resourceName: SPOT11_01 17.AVI
> {noformat}
> I took a look at the ExternalParser, which is configured for using ffmpeg if 
> it's installed. It seems it only works on:
> {code:xml}
> <mime-types>
>   <mime-type>video/avi</mime-type>
>   <mime-type>video/mpeg</mime-type>
> </mime-types>
> {code}
> I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the 
> work by [~rgauss] at Github - Ray I noticed there is no parser in that work:
> https://github.com/AlfrescoLabs/tika-ffmpeg
> But there seems to be metadata extraction code, etc. Ray should I do 
> something with this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files

2015-01-11 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273049#comment-14273049
 ] 

Ray Gauss II commented on TIKA-1510:


In that project there is a 
[{{TikaIntrinsicAVFfmpegParserFactory}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/main/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserFactory.java]
 which is used to set up an {{ExternalParser}}.

See the 
[{{TikaIntrinsicAVFfmpegParserTest}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/test/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserTest.java]
 for an example of its use.
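For reference, a stripped-down sketch of how Tika's {{ExternalParser}} can be 
wired up for a command-line tool like ffmpeg; the command, supported type, and 
extraction pattern below are illustrative only, not the factory's actual 
configuration:

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.external.ExternalParser;

public class FfmpegExternalParserSketch {
    public static ExternalParser create() {
        ExternalParser parser = new ExternalParser();
        // ${INPUT} is replaced with a temporary file holding the stream
        parser.setCommand("ffmpeg", "-i", ExternalParser.INPUT_FILE_TOKEN);
        parser.setSupportedTypes(Collections.singleton(MediaType.video("x-msvideo")));
        // Map output lines to metadata keys via regex capture groups (illustrative pattern)
        Map<Pattern, String> patterns = new HashMap<Pattern, String>();
        patterns.put(Pattern.compile("\\s*Duration: ([0-9:.]+),.*"), "duration");
        parser.setMetadataExtractionPatterns(patterns);
        return parser;
    }
}
{code}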

> FFMpeg installed but not parsing video files
> 
>
> Key: TIKA-1510
> URL: https://issues.apache.org/jira/browse/TIKA-1510
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: FFMPEG, Mac OS X 10.9 with HomeBrew
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.7
>
>
> I have FFMPEG installed with homebrew:
> {noformat}
> # brew install ffmpeg
> {noformat}
> I've got some AVI files and have tried to parse them with Tika:
> {noformat}
> [chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
> Content-Length: 334917340
> Content-Type: video/x-msvideo
> X-Parsed-By: org.apache.tika.parser.EmptyParser
> resourceName: SPOT11_01 17.AVI
> {noformat}
> I took a look at the ExternalParser, which is configured for using ffmpeg if 
> it's installed. It seems it only works on:
> {code:xml}
> <mime-types>
>   <mime-type>video/avi</mime-type>
>   <mime-type>video/mpeg</mime-type>
> </mime-types>
> {code}
> I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the 
> work by [~rgauss] at Github - Ray I noticed there is no parser in that work:
> https://github.com/AlfrescoLabs/tika-ffmpeg
> But there seems to be metadata extraction code, etc. Ray should I do 
> something with this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-93) OCR support

2014-09-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134822#comment-14134822
 ] 

Ray Gauss II commented on TIKA-93:
--

You could use 
[{{org.junit.Assume}}|http://stackoverflow.com/questions/1689242/conditionally-ignoring-tests-in-junit-4]
 so the tests will be skipped rather than reported as passing.

Perhaps we should consider the Maven Failsafe Plugin as well?
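A small sketch of the {{Assume}} approach; {{isTesseractAvailable()}} is a 
hypothetical helper, and the exact probe for the binary would need tuning:

{code}
import static org.junit.Assume.assumeTrue;

import org.junit.Before;
import org.junit.Test;

public class TesseractOCRParserTestSketch {

    @Before
    public void checkTesseract() {
        // Skips (rather than passes) every test in the class when tesseract is missing
        assumeTrue("tesseract not installed, skipping OCR tests", isTesseractAvailable());
    }

    @Test
    public void testBasicOcr() throws Exception {
        // ... actual OCR assertions would go here ...
    }

    private boolean isTesseractAvailable() {
        try {
            // Crude availability probe; a real check might also inspect the version output
            return Runtime.getRuntime().exec(new String[] {"tesseract", "-v"}).waitFor() == 0;
        } catch (Exception e) {
            return false;
        }
    }
}
{code}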

> OCR support
> ---
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.7
>
> Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
> TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
> TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, 
> testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-93) OCR support

2014-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102193#comment-14102193
 ] 

Ray Gauss II commented on TIKA-93:
--

Apologies, jumped in late and only glanced at the comment thread.

> OCR support
> ---
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
> testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-93) OCR support

2014-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102175#comment-14102175
 ] 

Ray Gauss II commented on TIKA-93:
--

Can you create a config object and pass that in the {{ParseContext}}, similar 
to what 
[{{PDFParser}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java]
 does with a 
[{{PDFParserConfig}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java]
 entry?
{code}
//config from context, or default if not set via context
PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
{code}
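To make the pattern concrete on the caller side, a minimal sketch using the 
existing {{PDFParserConfig}} (the particular setter used is just an example):

{code}
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class ParseContextConfigExample {
    public static String parse(InputStream stream) throws Exception {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setSortByPosition(true);  // example option only

        // The parser picks the config up from the ParseContext, or falls back to its default
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(stream, handler, metadata, context);
        return handler.toString();
    }
}
{code}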

> OCR support
> ---
>
> Key: TIKA-93
> URL: https://issues.apache.org/jira/browse/TIKA-93
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: Jukka Zitting
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
> TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
> TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
> testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1328) Translate Metadata and Content

2014-06-10 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026783#comment-14026783
 ] 

Ray Gauss II commented on TIKA-1328:


Leaning towards the whitelist approach, perhaps we could add an 
{{isTranslatable}} field / method and corresponding constructor to the 
{{Property}} class (with a default of false) and update the properties we want 
to support translation on?
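A rough sketch of how such a flag might be consumed; {{Property.isTranslatable()}} 
does not exist today and the translator instance is hypothetical, so this is only 
the shape of the proposal:

{code}
// Whitelist approach sketch: only translate values whose Property opts in.
// isTranslatable() is the proposed (not existing) flag; 'translator' is a
// hypothetical Translator instance.
for (String name : metadata.names()) {
    Property property = Property.get(name);  // look up the registered Property for this name
    if (property != null && property.isTranslatable()) {
        metadata.set(name, translator.translate(metadata.get(name), "en"));
    }
}
{code}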

> Translate Metadata and Content
> --
>
> Key: TIKA-1328
> URL: https://issues.apache.org/jira/browse/TIKA-1328
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
> Fix For: 1.7
>
>
> Right now, Translation is only done on Strings. Ideally, users would be able 
> to "turn on" translation while parsing. I can think of a couple options:
> - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
> it, then translate the content.
> - Make a Context switch. When true, translate the content regardless of the 
> parser used. I'm not sure the best way to go about this method, but I prefer 
> it over another Parser.
> Regardless, we need a black or white list for translation. I think black list 
> would be the way to go -- which fields should not be translated (dates, 
> versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
> other open source translation libraries? If we were really lucky, it wouldn't 
> depend on an online service.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1319) Translation

2014-06-10 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026703#comment-14026703
 ] 

Ray Gauss II commented on TIKA-1319:


[~gagravarr], that comment seems to be more closely related to TIKA-1328.

Should we combine these issues?

> Translation
> ---
>
> Key: TIKA-1319
> URL: https://issues.apache.org/jira/browse/TIKA-1319
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tyler Palsulich
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.6
>
>
> I just opened up a review on reviews.apache.org -- 
> https://reviews.apache.org/r/22219/. I copied the description below. 
> This patch adds basic language translation functionality to Tika. Translation 
> is provided by a Microsoft API, but accessed through Apache 2 licensed 
> com.memetix.microsoft-translator-java-api 
> (https://code.google.com/p/microsoft-translator-java-api/ ). If a user wants 
> to use the translation feature, they have to add a client id and client 
> secret to the 
> tika-core/src/main/resources/org/apache/tika/language/translator.properties 
> file (see http://msdn.microsoft.com/en-us/library/hh454950.aspx ). I added 
> com.memetix as a dependency in tika-core. I put the Translator class in 
> org.apache.tika.language. There is no integration with the server or CLI, 
> yet. Further, only Strings are translated right now -- if you pass in a full 
> document with xml tags, the structure will be mangled. But, I think that 
> would be a cool feature -- translate the body, title, subtitle, etc, but not 
> the structural elements. 
> There is still more work to do, but I wanted some more eyes on this to make 
> sure I'm heading in the right direction and this is a desired feature. Let me 
> know what you think!
> There are two simple unit tests for now which translate "hello" to French 
> ("salut"). One for inputting the source and target languages, one for 
> inputing just the target language (and detecting the source language 
> automatically).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1320) extract text from jpeg in solr tika

2014-06-04 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017613#comment-14017613
 ] 

Ray Gauss II commented on TIKA-1320:


I'm not sure we have enough context in the description of this issue to help 
much here.

As [~thaichat04] points out, OCR is one way of obtaining text from an image, 
but there are also several forms of embedded metadata that can be extracted.

Is there specific text you're looking to extract?

> extract text from jpeg in solr tika
> ---
>
> Key: TIKA-1320
> URL: https://issues.apache.org/jira/browse/TIKA-1320
> Project: Tika
>  Issue Type: New Feature
>Reporter: muruganv
>  Labels: features
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> How to extract text from jpeg or image format or tiff in solr tika



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-29 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012393#comment-14012393
 ] 

Ray Gauss II commented on TIKA-1294:


Hi [~talli...@apache.org],

The changes look good, thanks!

One minor point on conventions: I think enums are typically uppercase?

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.6
>
> Attachments: TIKA-1294.patch, TIKA-1294v1.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2014-05-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995945#comment-13995945
 ] 

Ray Gauss II commented on TIKA-1295:


+1 for the data model more accurately reflecting the standard and for 
multilingual fields, but with a simple text bag how would you know which value 
corresponds to which language?

I think this is another example that highlights the need for a more structured 
underlying metadata store as mentioned in section IV of the [metadata 
roadmap|http://wiki.apache.org/tika/MetadataRoadmap].
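For illustration, with a plain text bag both values land under the same key and 
the language association is lost:

{code}
import org.apache.tika.metadata.Metadata;

public class TextBagExample {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        // Two language alternatives for the same field end up in one flat bag
        metadata.add("dc:title", "The Title");   // English value
        metadata.add("dc:title", "Le Titre");    // French value
        // Both come back, but nothing records which language each value is in
        for (String title : metadata.getValues("dc:title")) {
            System.out.println(title);
        }
    }
}
{code}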

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.6
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995298#comment-13995298
 ] 

Ray Gauss II commented on TIKA-1278:


Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.
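For anyone who does want to override them, usage would look roughly like the 
following; the setter names are per this issue's patch, and leaving them unset 
keeps PDFBox's own defaults:

{code}
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

public class PdfToleranceConfigSketch {
    public static ParseContext buildContext() {
        PDFParserConfig config = new PDFParserConfig();
        // Example values only; when not set, PDFBox's defaults are used
        config.setAverageCharTolerance(0.3f);
        config.setSpacingTolerance(0.5f);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }
}
{code}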

> Expose PDF Avg Char and Spacing Tolerance Config Params
> ---
>
> Key: TIKA-1278
> URL: https://issues.apache.org/jira/browse/TIKA-1278
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's 
> {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
> comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
> slightly to allow for extension of that config class and its configuration 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997478#comment-13997478
 ] 

Ray Gauss II commented on TIKA-1295:


bq. I see that there is an ALT PropertyType. Are there any plans to implement 
that (or did I miss the implementation somewhere)

Not sure. On first glance I don't see it anywhere, nor any use of 
{{ValueType.LOCALE}}.

I think we'd need a design discussion on how best to implement multilingual 
properties, likely through some suffixing of property keys if we don't change 
the underlying metadata structure, or perhaps that discussion has already taken 
place?
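One possible key-suffixing shape, purely as a sketch of the idea and not an 
agreed or existing convention:

{code}
// Hypothetical suffix scheme for language alternatives on the current flat store
metadata.set("dc:title", "The Title");         // default value
metadata.set("dc:title;lang=fr", "Le Titre");  // French alternative
metadata.set("dc:title;lang=de", "Der Titel"); // German alternative
{code}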

> Make some Dublin Core items multi-valued
> 
>
> Key: TIKA-1295
> URL: https://issues.apache.org/jira/browse/TIKA-1295
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.6
>
>
> According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
> dc:title, dc:description and dc:rights should allow multiple values because 
> of language alternatives.  Unless anyone objects in the next few days, I'll 
> switch those to Property.toInternalTextBag() from Property.toInternalText().  
> I'll also modify PDFParser to extract dc:rights.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997500#comment-13997500
 ] 

Ray Gauss II commented on TIKA-1294:


I saw similar problematic resource consumption as well, which was the reason 
for figuring out how to disable this stuff :)

Perhaps it would be useful to include in the metadata object passed to the 
{{EmbeddedDocumentExtractor}} a generic indication of why the embedded object is 
being parsed, something like an {{EmbeddedObjectContext}} enum with {{INLINE}} 
and {{ATTACHMENT}} options. The {{EmbeddedDocumentExtractor}} (and in most cases 
that means the {{DocumentSelector}}) could then use it to decide whether to parse 
on a per-object basis.
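Sketching that proposal (the enum, the metadata key, and the selector below are 
only the suggested shape, nothing exists in Tika yet):

{code}
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class AttachmentOnlySelector implements DocumentSelector {

    // Proposed marker for why an embedded object is being offered for parsing
    public enum EmbeddedObjectContext { INLINE, ATTACHMENT }

    public boolean select(Metadata metadata) {
        // "embedded-object-context" is a hypothetical metadata key used only in this sketch
        return !EmbeddedObjectContext.INLINE.name()
                .equals(metadata.get("embedded-object-context"));
    }
}
{code}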

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995474#comment-13995474
 ] 

Ray Gauss II commented on TIKA-1294:


We ran into this exact issue recently and there is another method to achieve 
the same result without changing Tika code.

In {{ParsingEmbeddedDocumentExtractor.shouldParseEmbedded}} the 
{{ParseContext}} is checked for a {{DocumentSelector}}.

Since that extractor seems to be the only place that type is checked for 
(perhaps {{EmbeddedDocumentSelector}} would be a more appropriate name?) you 
can create one that suits your needs and set it as the document selector value 
in the {{ParseContext}}.

In our case we created a simple {{MediaTypeDisablingDocumentSelector}} that 
holds a list of {{disabledMediaTypes}}.

See 
[{{TikaGUI}}|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java]
 and its {{ImageDocumentSelector}} as a general example of document selector 
use.
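A minimal sketch of that kind of selector (class and field names as described 
above; our actual implementation may differ slightly):

{code}
import java.util.Set;

import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class MediaTypeDisablingDocumentSelector implements DocumentSelector {

    private final Set<MediaType> disabledMediaTypes;

    public MediaTypeDisablingDocumentSelector(Set<MediaType> disabledMediaTypes) {
        this.disabledMediaTypes = disabledMediaTypes;
    }

    public boolean select(Metadata metadata) {
        String contentType = metadata.get(Metadata.CONTENT_TYPE);
        if (contentType == null) {
            return true;
        }
        // Skip embedded resources whose detected type is in the disabled list
        return !disabledMediaTypes.contains(MediaType.parse(contentType));
    }
}
{code}

It is then registered via {{context.set(DocumentSelector.class, selector)}} so 
that {{ParsingEmbeddedDocumentExtractor}} picks it up.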

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995960#comment-13995960
 ] 

Ray Gauss II commented on TIKA-1294:


bq. Can your MediaTypeDisablingDocumentSelector tell the difference between a 
jpeg that was attached to a PDF (basic attachment) and one that was derived 
from a PDXObjectImage?

If by basic attachment you mean those defined in 
{{PDEmbeddedFilesNameTreeNode}}, then not exactly.

Both {{PDF2XHTML.extractImages}} and {{PDF2XHTML.extractEmbeddedDocuments}} end 
up using the same {{getEmbeddedDocumentExtractor}} (a 
{{ParsingEmbeddedDocumentExtractor}} by default) and use the same 
{{DocumentSelector}} in the calls to 
{{extractor.shouldParseEmbedded(metadata)}}, but neither sets any special 
metadata keys indicating 'attached' vs 'embedded' so document selectors aren't 
able to explicitly distinguish.

However, the {{PDXObjectImage}} resources *only* get the media type set in the 
metadata object while the {{PDEmbeddedFilesNameTreeNode}} resources get media 
type, name, and length set, so you could potentially check for their presence 
to distinguish.
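So, as a workaround, a selector that only wants to skip the page-image case could 
key off the absence of a resource name, bearing in mind this relies on current 
behavior rather than an explicit contract:

{code}
import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class SkipInlineImagesSelector implements DocumentSelector {
    public boolean select(Metadata metadata) {
        // PDXObjectImage resources currently arrive with only a content type set,
        // while embedded file attachments also carry a resource name (and length)
        return metadata.get(Metadata.RESOURCE_NAME_KEY) != null;
    }
}
{code}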

> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-12 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995298#comment-13995298
 ] 

Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM:
-

Hi [~talli...@apache.org],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.


was (Author: rgauss):
Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.

> Expose PDF Avg Char and Spacing Tolerance Config Params
> ---
>
> Key: TIKA-1278
> URL: https://issues.apache.org/jira/browse/TIKA-1278
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's 
> {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
> comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
> slightly to allow for extension of that config class and its configuration 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-1279:


  Assignee: Hong-Thai Nguyen

[~thaichat04], I believe we still have to support Java 6 and 
{{System.lineSeparator()}} appears to have been added in Java 7.

I think {{System.getProperty("line.separator")}} would be equivalent.
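That is:

{code}
// Java 6 compatible; System.lineSeparator() only exists from Java 7 onwards
String newline = System.getProperty("line.separator");
{code}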

> Missing return lines at output of SourceCodeParser
> --
>
> Key: TIKA-1279
> URL: https://issues.apache.org/jira/browse/TIKA-1279
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Assignee: Hong-Thai Nguyen
>Priority: Trivial
> Fix For: 1.6
>
>
> xhtml output is on a single line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979700#comment-13979700
 ] 

Ray Gauss II edited comment on TIKA-1278 at 4/24/14 1:31 PM:
-

Resolved in r1589722.

The setting of {{PDF2XHTML}} params was also moved from {{PDF2XHTML.process}} 
to a new {{PDFParserConfig.configure}} method which should allow developers to 
extend {{PDFParserConfig}} for custom behavior.


was (Author: rgauss):
Resolved in r1589722.

> Expose PDF Avg Char and Spacing Tolerance Config Params
> ---
>
> Key: TIKA-1278
> URL: https://issues.apache.org/jira/browse/TIKA-1278
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's 
> {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
> comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
> slightly to allow for extension of that config class and its configuration 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1278.


Resolution: Fixed

Resolved in r1589722.

> Expose PDF Avg Char and Spacing Tolerance Config Params
> ---
>
> Key: TIKA-1278
> URL: https://issues.apache.org/jira/browse/TIKA-1278
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's 
> {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
> comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
> slightly to allow for extension of that config class and it's configuration 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1278:
---

Description: 
{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and its configuration 
behavior.

  was:
{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and it's configuration 
behavior.


> Expose PDF Avg Char and Spacing Tolerance Config Params
> ---
>
> Key: TIKA-1278
> URL: https://issues.apache.org/jira/browse/TIKA-1278
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> {{PDFParserConfig}} should allow for override of PDFBox's 
> {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
> comment in {{PDF2XHTML}}.
> Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
> slightly to allow for extension of that config class and its configuration 
> behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1278:
--

 Summary: Expose PDF Avg Char and Spacing Tolerance Config Params
 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and it's configuration 
behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-03-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Fix Version/s: 1.6

> Maven Build Should Automatically Produce test-jar Artifacts
> ---
>
> Key: TIKA-1151
> URL: https://issues.apache.org/jira/browse/TIKA-1151
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.6
>
>
> The Maven build should be updated to produce test jar artifacts for 
> appropriate sub-projects (see below) such that developers can extend test 
> classes by adding the {{test-jar}} artifact as a dependency, i.e.:
> {code}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.6-SNAPSHOT</version>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> The following sub-projects contain tests that developers might want to extend 
> and their corresponding {{pom.xml}} should have the [attached 
> tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
> - tika-app
> - tika-core
> - tika-parsers
> - tika-server
> - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-03-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1151.


Resolution: Fixed

Resolved in r1580887.

> Maven Build Should Automatically Produce test-jar Artifacts
> ---
>
> Key: TIKA-1151
> URL: https://issues.apache.org/jira/browse/TIKA-1151
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> The Maven build should be updated to produce test jar artifacts for 
> appropriate sub-projects (see below) such that developers can extend test 
> classes by adding the {{test-jar}} artifact as a dependency, i.e.:
> {code}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.6-SNAPSHOT</version>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> The following sub-projects contain tests that developers might want to extend 
> and their corresponding {{pom.xml}} should have the [attached 
> tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
> - tika-app
> - tika-core
> - tika-parsers
> - tika-server
> - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Description: 
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp



  was:
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp




> Maven Build Should Automatically Produce test-jar Artifacts
> ---
>
> Key: TIKA-1151
> URL: https://issues.apache.org/jira/browse/TIKA-1151
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> The Maven build should be updated to produce test jar artifacts for 
> appropriate sub-projects (see below) such that developers can extend test 
> classes by adding the {{test-jar}} artifact as a dependency, i.e.:
> {code}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.6-SNAPSHOT</version>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> The following sub-projects contain tests that developers might want to extend 
> and their corresponding {{pom.xml}} should have the [attached 
> tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
> - tika-app
> - tika-core
> - tika-parsers
> - tika-server
> - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907100#comment-13907100
 ] 

Ray Gauss II commented on TIKA-1151:


This will create a few artifacts on the larger side, notably:
||Artifact||Size||
|tika-parsers-1.6-SNAPSHOT-tests.jar|33MB|
|tika-server-1.6-SNAPSHOT-tests.jar|6.8MB|

Not huge, but I thought I'd double check that no one has any issues with that 
before committing.

> Maven Build Should Automatically Produce test-jar Artifacts
> ---
>
> Key: TIKA-1151
> URL: https://issues.apache.org/jira/browse/TIKA-1151
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> The Maven build should be updated to produce test jar artifacts for 
> appropriate sub-projects (see below) such that developers can extend test 
> classes by adding the {{test-jar}} artifact as a dependency, i.e.:
> {code}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.6-SNAPSHOT</version>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> The following sub-projects contain tests that developers might want to extend 
> and their corresponding {{pom.xml}} should have the [attached 
> tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
> - tika-app
> - tika-core
> - tika-parsers
> - tika-server
> - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Description: 
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp



  was:
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp




> Maven Build Should Automatically Produce test-jar Artifacts
> ---
>
> Key: TIKA-1151
> URL: https://issues.apache.org/jira/browse/TIKA-1151
> Project: Tika
>  Issue Type: Improvement
>  Components: packaging
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>
> The Maven build should be updated to produce test jar artifacts for 
> appropriate sub-projects (see below) such that developers can extend test 
> classes by adding the {{test-jar}} artifact as a dependency, i.e.:
> {code}
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.5-SNAPSHOT</version>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> The following sub-projects contain tests that developers might want to extend 
> and their corresponding {{pom.xml}} should have the [attached 
> tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
> - tika-app
> - tika-core
> - tika-parsers
> - tika-server
> - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1179.


Resolution: Cannot Reproduce
  Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4; however, the file appears 
to parse just fine in 1.5!

You can verify by downloading a 1.5 snapshot of {{tika-app}} ([current 
link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]),
 running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.

> A corrupt mp3 file can cause an infinite loop in Mp3Parser
> --
>
> Key: TIKA-1179
> URL: https://issues.apache.org/jira/browse/TIKA-1179
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Marius Dumitru Florea
>Assignee: Ray Gauss II
> Fix For: 1.5
>
> Attachments: corrupt.mp3
>
>
> I have a thread that indexes (among other things) files using Apache Sorl. 
> This thread hangs (still running but with no progress) when trying to extract 
> meta data from the mp3 file attached to this issue. Here are a couple of 
> thread dumps taken at various moments:
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4617000]
>java.lang.Thread.State: RUNNABLE
>   at 
> org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
>   at 
> org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
>   at 
> org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.BufferedInputStream.fill(Unknown Source)
>   at java.io.BufferedInputStream.read1(Unknown Source)
>   at java.io.BufferedInputStream.read(Unknown Source)
>   - locked <0xcb7094e8> (a java.io.BufferedInputStream)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.FilterInputStream.read(Unknown Source)
>   at org.apache.tika.io.TailStream.read(TailStream.java:117)
>   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:380)
>   ...
> {noformat}
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4618000]
>java.lang.Thread.State: RUNNABLE
>   at org.apache.tika.io.TailStream.skip(TailStream.java:133)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at org.apache.tika.Tika.parseToString(Tika.java:380)
>   ...
> {noformat}
> {noformat}
> "XWiki Solr index thread" daemon prio=10 tid=0x03b72800 nid=0x64b5 
> runnable [0x7f46f4617000]
>java.lang.Thread.State: RUNNABLE
>   at java.io.BufferedInputStream.read1(Unknown Source)
>   at java.io.BufferedInputStream.read(Unknown Source)
>   - locked <0xcb1be170> (a java.io.BufferedInputStream)
>   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
>   at java.io.FilterInputStream.read(Unknown Source)
>   at org.apache.tika.io.TailStream.read(TailStream.java:117)
>   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
>   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
>   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
>   at 
> org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
>   at 
> org.apache.tika.parser.Co

[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1177.


   Resolution: Fixed
Fix Version/s: 1.5

Unfortunately that magic doesn't seem to be required in all MKV files. I tried 
several utilities to convert various sources to MKV and none of the resulting 
files contained that magic.

A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, 
which extends Matroska.

I've added Matroska mime-types with detection based just on file extension for 
now, and also added the WebM mime-type.

We can open other issues, linked to this one, for data detection of MKV and 
WebM files if need be.

Resolved in r1529260.
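With only the glob patterns in place, detection works by name, along these lines 
(the expected result is assumed to be {{video/x-matroska}}):

{code}
import org.apache.tika.Tika;

public class MkvDetectionExample {
    public static void main(String[] args) {
        Tika tika = new Tika();
        // Name-based detection only; content-based magic for MKV/WebM is not wired up yet
        System.out.println(tika.detect("sample.mkv"));
    }
}
{code}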

> Add Matroska (mkv, mka) format detection
> 
>
> Key: TIKA-1177
> URL: https://issues.apache.org/jira/browse/TIKA-1177
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.4
>Reporter: Boris Naguet
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
>
> There's no mimetype detection for Matroska format, although it's a popular 
> video format.
> Here is some code I added in my custom mimetypes to detect them:
> {code}
>   
>   
>   
>type="string" offset="0" />
>   
>   
>   
>   
>   
> {code}
> I found the signature for the mkv on: 
> http://www.garykessler.net/library/file_sigs.html
> I was not able to find it clearly for mka, but detection by filename is still 
> useful.
> Although, the full spec is available here:
> http://matroska.org/technical/specs/index.html
> Maybe it's a bit more complex than this constant magic, but it works on my 
> tests files.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1177:
--

Assignee: Ray Gauss II

> Add Matroska (mkv, mka) format detection
> 
>
> Key: TIKA-1177
> URL: https://issues.apache.org/jira/browse/TIKA-1177
> Project: Tika
>  Issue Type: Improvement
>  Components: mime
>Affects Versions: 1.4
>Reporter: Boris Naguet
>Assignee: Ray Gauss II
>Priority: Minor
>
> There's no mimetype detection for Matroska format, although it's a popular 
> video format.
> Here is some code I added in my custom mimetypes to detect them:
> {code}
>   
>   
>   
>type="string" offset="0" />
>   
>   
>   
>   
>   
> {code}
> I found the signature for the mkv on: 
> http://www.garykessler.net/library/file_sigs.html
> I was not able to find it clearly for mka, but detection by filename is still 
> useful.
> Although, the full spec is available here:
> http://matroska.org/technical/specs/index.html
> Maybe it's a bit more complex than this constant magic, but it works on my 
> tests files.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757000#comment-13757000
 ] 

Ray Gauss II commented on TIKA-1170:


Yes, but in this particular case I thought it might be better to explicitly 
change the file name so other developers don't "fix" the media type for that 
file in the future.

> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archives files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1170.


Resolution: Fixed

Resolved in r1519792.

SVN did not like the html extension on the problem file.

Thanks again.

> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archives files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-1170:



> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start with 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1375#comment-1375
 ] 

Ray Gauss II commented on TIKA-1170:


My mistake, that's an artifact of me manually applying the git patch.

It does, however, seem to indicate that we should have a unit test for the 
false positives.  Do you have a file which demonstrates that problem?
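As a rough illustration of such a test (hypothetical class, method, and byte values; not committed code), it could feed a short prefix that the old, overly broad magic would have matched and assert it is no longer detected as CGM:

{code}
import static org.junit.Assert.assertFalse;

import org.apache.tika.Tika;
import org.junit.Test;

public class CgmFalsePositiveTest {

    @Test
    public void prefixStartingWith002aIsNotCgm() throws Exception {
        // A made-up prefix starting with 0x00 0x2a, like the false positives above.
        byte[] prefix = new byte[] { 0x00, 0x2a, 0x00, 0x01, 0x02, 0x03, 0x04, 0x05 };
        String detected = new Tika().detect(prefix);
        assertFalse("Should not be misdetected as CGM", "image/cgm".equals(detected));
    }
}
{code}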

> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start with 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1170.


   Resolution: Fixed
Fix Version/s: 1.5

Added in r1519664.

Thanks!

> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.5
>
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start with 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1170:
--

Assignee: Ray Gauss II

> Insufficiently specific magic for binary image/cgm files
> 
>
> Key: TIKA-1170
> URL: https://issues.apache.org/jira/browse/TIKA-1170
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Assignee: Ray Gauss II
>Priority: Minor
> Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   
>   
> {code}
> The issue seems to be that the second magic matcher is not very specific, 
> e.g. matching files that start with 0x002a. To be fair, this is only c.700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * 
> http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>   
>   
> 
> 
> 
> 
>   
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1166) FLVParser NullPointerException

2013-08-28 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1166.


   Resolution: Fixed
Fix Version/s: 1.5

I briefly tried a few methods of trimming the problem file's size but none 
reproduced the issue in the resulting file.

Committed a check for null in r1518318.

> FLVParser NullPointerException
> --
>
> Key: TIKA-1166
> URL: https://issues.apache.org/jira/browse/TIKA-1166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, 1.2, 1.3, 1.4
> Environment: All
>Reporter: david rapin
>Assignee: Ray Gauss II
>  Labels: easyfix
> Fix For: 1.5
>
> Attachments: data.mp4
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> On certain video files, the FLV parser throws an NPE on line 242.
> The piece of code causing this is the following:
> https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   metadata.set(entry.getKey(), entry.getValue().toString());
> 243: }
> {noformat} 
> Which should probably be replaced by something like this:
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   if (entry.getValue() == null) continue;
> 243:   metadata.set(entry.getKey(), entry.getValue().toString());
> 244: }
> {noformat} 
> Exception trace :
> {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.video.FLVParser@58d9660d
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (TIKA-1166) FLVParser NullPointerException

2013-08-28 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1166:
--

Assignee: Ray Gauss II

> FLVParser NullPointerException
> --
>
> Key: TIKA-1166
> URL: https://issues.apache.org/jira/browse/TIKA-1166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, 1.2, 1.3, 1.4
> Environment: All
>Reporter: david rapin
>Assignee: Ray Gauss II
>  Labels: easyfix
> Attachments: data.mp4
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> On certain video files, the FLV parser throws an NPE on line 242.
> The piece of code causing this is the following:
> https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   metadata.set(entry.getKey(), entry.getValue().toString());
> 243: }
> {noformat} 
> Which should probably be replaced by something like this:
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   if (entry.getValue() == null) continue;
> 243:   metadata.set(entry.getKey(), entry.getValue().toString());
> 244: }
> {noformat} 
> Exception trace :
> {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.video.FLVParser@58d9660d
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1166) FLVParser NullPointerException

2013-08-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747529#comment-13747529
 ] 

Ray Gauss II commented on TIKA-1166:


Thanks.  Is there any chance you could get that down to under, say, 50k, while 
still demonstrating the failure so that we can include it in the dist and 
create a unit test against it?

> FLVParser NullPointerException
> --
>
> Key: TIKA-1166
> URL: https://issues.apache.org/jira/browse/TIKA-1166
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, 1.2, 1.3, 1.4
> Environment: All
>Reporter: david rapin
>  Labels: easyfix
> Attachments: data.mp4
>
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> On certain video files, the FLV parser throws an NPE on line 242.
> The piece of code causing this is the following:
> https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   metadata.set(entry.getKey(), entry.getValue().toString());
> 243: }
> {noformat} 
> Which should probably be replaced by something like this:
> {noformat}241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
> 242:   if (entry.getValue() == null) continue;
> 243:   metadata.set(entry.getKey(), entry.getValue().toString());
> 244: }
> {noformat} 
> Exception trace :
> {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
> Exception in thread "main" org.apache.tika.exception.TikaException: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.video.FLVParser@58d9660d
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
> Caused by: java.lang.NullPointerException
> at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
> at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> ... 5 more
> {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-26 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720694#comment-13720694
 ] 

Ray Gauss II commented on TIKA-1154:


I've been pushing the metadata-extractor Maven release through Sonatype thus 
far, but Mr. Noakes has been granted access there [1].

If there's no response to your Google code issue I can push a 2.6.2.1 release 
that upgrades xercesImpl to 2.11.0 which, on first look, compiles and has no 
test failures.


[1] https://issues.sonatype.org/browse/OSSRH-3948

> Tika hangs on format detection of malformed HTML file.
> --
>
> Key: TIKA-1154
> URL: https://issues.apache.org/jira/browse/TIKA-1154
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.4
>Reporter: Andrew Jackson
>Priority: Minor
> Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some 
> malformed files. In particular, we found a HTML file with binary characters 
> in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
> command line, during format detection.
> An example file is attached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2013-07-22 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1151:
--

 Summary: Maven Build Should Automatically Produce test-jar 
Artifacts
 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II


The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
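For example, a downstream project could then extend one of the test helper classes shipped in that artifact. A minimal sketch, assuming the {{org.apache.tika.TikaTest}} helper from the tika-parsers test sources (the class to extend depends on the module):

{code}
import org.junit.Test;

// Downstream test reusing a helper class from the tika-parsers test-jar once
// the dependency above is on the test classpath.
public class MyCustomParserTest extends org.apache.tika.TikaTest {

    @Test
    public void testCustomParser() throws Exception {
        // helper methods inherited from the Tika test class can be reused here
    }
}
{code}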

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed

2013-07-17 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1147.


   Resolution: Fixed
Fix Version/s: 1.5

Resolved in r1504302.
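For context, a minimal sketch of the calling pattern described below ({{embedder}} is an {{ExternalEmbedder}} instance, the file path is made up, and the {{embed}} signature is assumed to match the embedder interface):

{code}
// Hypothetical illustration of the affected call path.
File original = new File("/path/to/original.pdf");        // the caller's own file
TikaInputStream stream = TikaInputStream.get(original);    // file-backed stream
ByteArrayOutputStream output = new ByteArrayOutputStream();
embedder.embed(stream, output, new Metadata(), new ParseContext());
// Before the fix, 'original' could be deleted along with the embedder's temp files.
{code}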

> File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
> -
>
> Key: TIKA-1147
> URL: https://issues.apache.org/jira/browse/TIKA-1147
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.4
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>Priority: Critical
> Fix For: 1.5
>
>
> When an application using Tika passes {{InputStream}} objects to 
> {{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
> which is then deleted after embedding takes place.
> However, if the application passes in a file-based {{TikaInputStream}} the 
> embedder ends up dealing directly with the original source file, which 
> is then deleted after embedding takes place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed

2013-07-17 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1147:
---

  Component/s: metadata
  Description: 
When an application using Tika passes {{InputStream}} objects to 
{{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
which is then deleted after embedding takes place.

However, if the application passes in a file-based {{TikaInputStream}} the 
embedder ends up dealing directly with the original source file, which is 
then deleted after embedding takes place.
 Priority: Critical  (was: Major)
Affects Version/s: 1.4
 Assignee: Ray Gauss II
  Summary: File-Based TikaInputStreams are Deleted by 
ExternalEmbedder.embed  (was: Passing a File-Based TikaInputStream to 
ExternalEmbedder Delete)

> File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
> -
>
> Key: TIKA-1147
> URL: https://issues.apache.org/jira/browse/TIKA-1147
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.4
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>Priority: Critical
>
> When an application using Tika passes {{InputStream}} objects to 
> {{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
> which is then deleted after embedding takes place.
> However, if the application passes in a file-based {{TikaInputStream}} the 
> embedder ends up dealing directly with the original source file, which 
> is then deleted after embedding takes place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1147) Passing a File-Based TikaInputStream to ExternalEmbedder Delete

2013-07-17 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1147:
--

 Summary: Passing a File-Based TikaInputStream to ExternalEmbedder 
Delete
 Key: TIKA-1147
 URL: https://issues.apache.org/jira/browse/TIKA-1147
 Project: Tika
  Issue Type: Bug
Reporter: Ray Gauss II




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682924#comment-13682924
 ] 

Ray Gauss II commented on TIKA-1130:


Test file and method committed in r1492909.

This was just added onto {{OOXMLParserTest}} and named with a {{disabled}} 
prefix rather than using {{@Ignore}}.  I think we should start moving towards 
that for new test classes though.
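For readers unfamiliar with the two conventions, a sketch (hypothetical method names): in the older JUnit 3 style classes test methods are discovered by their {{test}} name prefix, so renaming effectively parks the test, while JUnit 4 style classes can skip a test explicitly.

{code}
// JUnit 3 style (classes extending TestCase): discovered by the "test" prefix,
// so a "disabled" prefix keeps the method from running.
public void disabledTestGrayTextIsExtracted() throws Exception {
    // ... assertions against the stripped-down test file ...
}

// JUnit 4 style alternative: keep the name and skip it explicitly.
@org.junit.Ignore("Pending fix for TIKA-1130")
@org.junit.Test
public void testGrayTextIsExtracted() throws Exception {
    // ... assertions against the stripped-down test file ...
}
{code}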

> .docx text extract leaves out some portions of text
> ---
>
> Key: TIKA-1130
> URL: https://issues.apache.org/jira/browse/TIKA-1130
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.2, 1.3
> Environment: OpenJDK x86_64
>Reporter: Daniel Gibby
>Priority: Critical
> Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx 
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions 
> of text are what are not extracted, while the darker colored text extracts 
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is 
> all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682644#comment-13682644
 ] 

Ray Gauss II commented on TIKA-1130:


I've created a unit test that reproduces the issue with a stripped down version 
of the original file.

Shall I comment out the actual test and commit?

> .docx text extract leaves out some portions of text
> ---
>
> Key: TIKA-1130
> URL: https://issues.apache.org/jira/browse/TIKA-1130
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.2, 1.3
> Environment: OpenJDK x86_64
>Reporter: Daniel Gibby
>Priority: Critical
> Attachments: Resume 6.4.13.docx
>
>
> When parsing a Microsoft Word .docx 
> (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
> certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions 
> of text are what are not extracted, while the darker colored text extracts 
> fine.
> Looking at the document.xml portion of the .docx zip file shows the text is 
> all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition

2013-06-11 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1135:
--

 Summary: Incorrect Cardinality and Case in IPTC Metadata Definition
 Key: TIKA-1135
 URL: https://issues.apache.org/jira/browse/TIKA-1135
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.4


Some of the fields defined in the {{IPTC}} interface have incorrect cardinality 
and metadata key names with incorrect case.

The change of key names should be done through composite properties which 
include deprecated versions of the incorrect names as secondary properties for 
backwards compatibility.
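A minimal sketch of the composite approach using Tika's {{Property.composite}} helper (the property and key names below are placeholders, not the actual IPTC fields being corrected):

{code}
// Placeholder names for illustration only.
Property EXAMPLE_FIELD = Property.composite(
        Property.internalTextBag("Iptc4xmpExt:ExampleField"),
        new Property[] {
            // deprecated, incorrectly cased key kept for backwards compatibility
            Property.internalTextBag("Iptc4xmpExt:exampleField")
        });
{code}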

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition

2013-06-11 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1135.


Resolution: Fixed

Resolved in r1491935.

> Incorrect Cardinality and Case in IPTC Metadata Definition
> --
>
> Key: TIKA-1135
> URL: https://issues.apache.org/jira/browse/TIKA-1135
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.3
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>Priority: Minor
> Fix For: 1.4
>
>
> Some of the fields defined in the {{IPTC}} interface have incorrect 
> cardinality and metadata key names with incorrect case.
> The change of key names should be done through composite properties which 
> include deprecated versions of the incorrect names as secondary properties 
> for backwards compatibility.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1133.


   Resolution: Fixed
Fix Version/s: 1.4

Resolved in r1491680.

> Ability to Allow Empty and Duplicate Tika Values for XML Elements
> -
>
> Key: TIKA-1133
> URL: https://issues.apache.org/jira/browse/TIKA-1133
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
> Fix For: 1.4
>
>
> In some cases it is beneficial to allow empty and duplicate Tika metadata 
> values for multi-valued XML elements like RDF bags.
> Consider an example where the original source metadata is structured 
> something like:
> {code}
> 
>   John
>   Smith
> 
> 
>   Jane
>   Doe
> 
> 
>   Bob
> 
> 
>   Kate
>   Smith
> 
> {code}
> and since Tika stores only flat metadata we transform that before invoking a 
> parser to something like:
> {code}
>  
>   
>John
>Jane
>Bob
>Kate
>   
>  
>  
>   
>Smith
>Doe
>
>Smith
>   
>  
> {code}
> The current behavior ignores empties and duplicates and we don't know if Bob 
> or Kate ever had last names.  Empties or duplicates in other positions result 
> in an incorrect mapping of data.
> We should allow the option to create an {{ElementMetadataHandler}} which 
> allows empty and/or duplicate values.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1133:
--

 Summary: Ability to Allow Empty and Duplicate Tika Values for XML 
Elements
 Key: TIKA-1133
 URL: https://issues.apache.org/jira/browse/TIKA-1133
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II


In some cases it is beneficial to allow empty and duplicate Tika metadata 
values for multi-valued XML elements like RDF bags.

Consider an example where the original source metadata is structured something 
like:
{code}

  John
  Smith


  Jane
  Doe


  Bob


  Kate
  Smith

{code}

and since Tika stores only flat metadata we transform that before invoking a 
parser to something like:
{code}
 
  
   John
   Jane
   Bob
   Kate
  
 
 
  
   Smith
   Doe
   
   Smith
  
 
{code}

The current behavior ignores empties and duplicates and we don't know if Bob or 
Kate ever had last names.  Empties or duplicates in other positions result in 
an incorrect mapping of data.

We should allow the option to create an {{ElementMetadataHandler}} which allows 
empty and/or duplicate values.
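A rough sketch of how that option might look from a parser's perspective, assuming a constructor overload with boolean flags for duplicates and empties (the namespace, key name, and exact signature here are assumptions, not a committed API):

{code}
// Assumed overload and placeholder names; the parameters actually added may differ.
ContentHandler lastNames = new ElementMetadataHandler(
        "http://ns.example.com/people/1.0/",  // hypothetical element namespace
        "lastName",                           // element local name
        metadata,
        "custom:lastName",                    // target Tika metadata key
        true,                                 // allow duplicate values
        true);                                // allow empty values
{code}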

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1115.


   Resolution: Fixed
Fix Version/s: 1.4

Resolved in r1478111
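A sketch of the kind of null guard the report calls for in the second block of {{handleDateTags}} quoted below (the committed change in r1478111 may differ in detail):

{code}
if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
    Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
    if (datetime != null) {  // guard: the tag can be present with a null value
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be the creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
{code}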

> ExifHandler throws NullPointerException
> ---
>
> Key: TIKA-1115
> URL: https://issues.apache.org/jira/browse/TIKA-1115
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.3
> Environment: verified on Mac OSX and Ubuntu 12.04
>Reporter: Lee Graber
>Assignee: Ray Gauss II
>  Labels: ImageMetadataExtractor
> Fix For: 1.4
>
> Attachments: 654000main_transit-hubble-orig_full.jpg
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Notice that in the second if block, there is no check for null on the 
> retrieved datetime. I have hit this with a file which apparently has null for 
> this value. Seems like the fix is trivial
> public void handleDateTags(Directory directory, Metadata metadata)
> throws MetadataException {
> // Date/Time Original overrides value from 
> ExifDirectory.TAG_DATETIME
> Date original = null;
> if 
> (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
> original = 
> directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
> // Unless we have GPS time we don't know the time zone so 
> date must be set
> // as ISO 8601 datetime without timezone suffix (no Z or +/-)
> if (original != null) {
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor 
> uses
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
> }
> }
> if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
> Date datetime = 
> directory.getDate(ExifIFD0Directory.TAG_DATETIME);
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(datetime);
> metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
> // If Date/Time Original does not exist this might be 
> creation date
> if (metadata.get(TikaCoreProperties.CREATED) == null) {
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> }
> }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646709#comment-13646709
 ] 

Ray Gauss II commented on TIKA-1115:


Hi Lee,

Do we have permission to include the problem file at a greatly reduced size, 
say 64px wide, as a test file?

> ExifHandler throws NullPointerException
> ---
>
> Key: TIKA-1115
> URL: https://issues.apache.org/jira/browse/TIKA-1115
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.3
> Environment: verified on Mac OSX and Ubuntu 12.04
>Reporter: Lee Graber
>Assignee: Ray Gauss II
>  Labels: ImageMetadataExtractor
> Attachments: 654000main_transit-hubble-orig_full.jpg
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Notice that in the second if block, there is no check for null on the 
> retrieved datetime. I have hit this with a file which apparently has null for 
> this value. Seems like the fix is trivial
> public void handleDateTags(Directory directory, Metadata metadata)
> throws MetadataException {
> // Date/Time Original overrides value from 
> ExifDirectory.TAG_DATETIME
> Date original = null;
> if 
> (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
> original = 
> directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
> // Unless we have GPS time we don't know the time zone so 
> date must be set
> // as ISO 8601 datetime without timezone suffix (no Z or +/-)
> if (original != null) {
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor 
> uses
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
> }
> }
> if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
> Date datetime = 
> directory.getDate(ExifIFD0Directory.TAG_DATETIME);
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(datetime);
> metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
> // If Date/Time Original does not exist this might be 
> creation date
> if (metadata.get(TikaCoreProperties.CREATED) == null) {
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> }
> }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1115:
--

Assignee: Ray Gauss II

> ExifHandler throws NullPointerException
> ---
>
> Key: TIKA-1115
> URL: https://issues.apache.org/jira/browse/TIKA-1115
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.3
> Environment: verified on Mac OSX and Ubuntu 12.04
>Reporter: Lee Graber
>Assignee: Ray Gauss II
>  Labels: ImageMetadataExtractor
> Attachments: 654000main_transit-hubble-orig_full.jpg
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Notice that in the second if block, there is no check for null on the 
> retrieved datetime. I have hit this with a file which apparently has null for 
> this value. Seems like the fix is trivial
> public void handleDateTags(Directory directory, Metadata metadata)
> throws MetadataException {
> // Date/Time Original overrides value from 
> ExifDirectory.TAG_DATETIME
> Date original = null;
> if 
> (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
> original = 
> directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
> // Unless we have GPS time we don't know the time zone so 
> date must be set
> // as ISO 8601 datetime without timezone suffix (no Z or +/-)
> if (original != null) {
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor 
> uses
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
> }
> }
> if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
> Date datetime = 
> directory.getDate(ExifIFD0Directory.TAG_DATETIME);
> String datetimeNoTimeZone = 
> DATE_UNSPECIFIED_TZ.format(datetime);
> metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
> // If Date/Time Original does not exist this might be 
> creation date
> if (metadata.get(TikaCoreProperties.CREATED) == null) {
> metadata.set(TikaCoreProperties.CREATED, 
> datetimeNoTimeZone);
> }
> }
> }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584194#comment-13584194
 ] 

Ray Gauss II commented on TIKA-1074:


bq. But it's a little weird throw TikaExc in response to an interrupt (ie, code 
above will be trying to catch an IE) ... I think it's cleaner to set the 
interrupt bit and let the next place that waits see the interrupt bit and throw 
IE?

That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

> Extraction should continue if an exception is hit visiting an embedded 
> document
> ---
>
> Key: TIKA-1074
> URL: https://issues.apache.org/jira/browse/TIKA-1074
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 1.4
>
> Attachments: TIKA-1074.patch, TIKA-1074.patch
>
>
> Spinoff from TIKA-1072.
> In that issue, a problematic document (still not sure if document is corrupt, 
> or possible POI bug) caused an exception when visiting the embedded documents.
> If I change Tika to suppress that exception, the rest of the document 
> extracts fine.
> So somehow I think we should be more robust here, and maybe log the 
> exception, or save/record the exception(s) somewhere so after parsing the app 
> could decide what to do about them ...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1068) Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data

2013-01-30 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566693#comment-13566693
 ] 

Ray Gauss II commented on TIKA-1068:


I can't reproduce this using tika-app from either the download distribution or 
compiled from source.

We're using the 2.6.2 metadata-extractor jar from Maven central repository [1].

I'm not sure how your build is structured but perhaps you're including a 2.6.2 
metadata-extractor jar you've downloaded from elsewhere?  If so, can you try 
replacing that with the one on Maven central? 


[1] 
http://search.maven.org/#artifactdetails%7Ccom.drewnoakes%7Cmetadata-extractor%7C2.6.2%7Cjar

> Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
> --
>
> Key: TIKA-1068
> URL: https://issues.apache.org/jira/browse/TIKA-1068
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Magnus Lövgren
>Priority: Critical
> Attachments: vinter080501-66.jpg
>
>
> Using Tika 1.3, parsing of jpg files throws NoSuchMethodError when the jpg 
> contains xmp data. No Error was thrown in Tika 1.2.
> The metadata-extractor was updated in Tika 1.3 (to 
> "com.drewnoakes:metadata-extractor:2.6.2"), See TIKA-811 (duplicated by 
> TIKA-996). That jar is badly compiled (as mentioned by Emmanuel Hugonnet as 
> comment on TIKA-915) and causes the NoSuchMethodError!
> => the metadata-extractor 2.6.2 jar needs to be replaced! Problem seems fixed 
> in metadata-extractor 2.7.0, but that isn't released yet.
> Discussions available at:
> http://code.google.com/p/metadata-extractor/issues/detail?id=39
> http://code.google.com/p/metadata-extractor/issues/detail?id=55
> Code to reproduce problem:
> =
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-core</artifactId>
>   <version>1.3</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-xmp</artifactId>
>   <version>1.3</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>1.3</version>
> </dependency>
> InputStream inputStream = ... // vinter080501-66.jpg file (attached)
> ContentHandler contentHandler = new BodyContentHandler(200);
> Metadata metadata = new Metadata();
> ParseContext context = new ParseContext();
> Parser parser = new AutoDetectParser();
> parser.parse(inputStream, contentHandler, metadata, context); // Throws 
> NoSuchMethodError
> => java.lang.NoSuchMethodError: 
> com.adobe.xmp.properties.XMPPropertyInfo.getValue()Ljava/lang/Object;
>   at com.drew.metadata.xmp.XmpReader.extract(Unknown Source)
>   at 
> com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(Unknown
>  Source)
>   at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
>   at 
> org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
>   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2013-01-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1059:
---

Issue Type: Improvement  (was: Bug)

> Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
> --
>
> Key: TIKA-1059
> URL: https://issues.apache.org/jira/browse/TIKA-1059
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Ray Gauss II
> Fix For: 1.4
>
>
> The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
> {{InterruptedException}} and ignore it.
> The methods should either call {{interrupt()}} on the current thread or 
> re-throw the exception, possibly wrapped in a {{TikaException}}.
> See TIKA-775 for a previous discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2013-01-18 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1059:
--

 Summary: Better Handling of InterruptedException in ExternalParser 
and ExternalEmbedder
 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.4


The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
{{InterruptedException}} and ignore it.

The methods should either call {{interrupt()}} on the current thread or 
re-throw the exception, possibly wrapped in a {{TikaException}}.

See TIKA-775 for a previous discussion.
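A sketch of the two options around a typical {{Process#waitFor()}} call (illustrative only, not the committed change):

{code}
try {
    process.waitFor();  // waiting on the external command
} catch (InterruptedException e) {
    // Option 1: restore the interrupt flag so callers further up can see it
    Thread.currentThread().interrupt();
    // Option 2: surface the condition instead of silently swallowing it
    // throw new TikaException("External process interrupted", e);
}
{code}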

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-775) Embed Capabilities

2013-01-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-775.
---

   Resolution: Fixed
Fix Version/s: (was: 1.4)
   1.3
 Assignee: Ray Gauss II

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.3
>
> Attachments: embed_20121029.diff, embed.diff, 
> tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-775) Embed Capabilities

2013-01-17 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556401#comment-13556401
 ] 

Ray Gauss II commented on TIKA-775:
---

This code is already on trunk.

Can we re-resolve for 1.3 and open new, 'smaller' issues for 1.4 if there are 
still specific concerns?

> Embed Capabilities
> --
>
> Key: TIKA-775
> URL: https://issues.apache.org/jira/browse/TIKA-775
> Project: Tika
>  Issue Type: Improvement
>  Components: general, metadata
>Affects Versions: 1.0
> Environment: The default ExternalEmbedder requires that sed be 
> installed.
>Reporter: Ray Gauss II
>  Labels: embed, patch
> Fix For: 1.4
>
> Attachments: embed_20121029.diff, embed.diff, 
> tika-core-embed-patch.txt, tika-parsers-embed-patch.txt
>
>
> This patch defines and implements the concept of embedding tika metadata into 
> a file stream, the reverse of extraction.
> In the tika-core project an interface defining an Embedder and a generic sed 
> ExternalEmbedder implementation meant to be extended or configured are added. 
>  These classes are essentially a reverse flow of the existing Parser and 
> ExternalParser classes.
> In the tika-parsers project an ExternalEmbedderTest unit test is added which 
> uses the default ExternalEmbedder (calls sed) to embed a value placed in 
> Metadata.DESCRIPTION then verify the operation by parsing the resulting 
> stream.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1056) unify ImageMetadataExtractor interface

2013-01-16 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1056.


   Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1434117.

> unify ImageMetadataExtractor interface
> --
>
> Key: TIKA-1056
> URL: https://issues.apache.org/jira/browse/TIKA-1056
> Project: Tika
>  Issue Type: Wish
>Reporter: Maciej Lizewski
>Assignee: Ray Gauss II
>Priority: Trivial
> Fix For: 1.3
>
>
> there are several methods in this class that are targeted for different image 
> types but with different visibility:
> public void parseJpeg(File file);
> protected void parseTiff(InputStream stream);
> both simply extract all possible metadata from an image file or stream. Would be 
> nice if parseTiff could also be "public" so it will be easier to create 
> custom parsers located in external jars that use this functionality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (TIKA-1056) unify ImageMetadataExtractor interface

2013-01-16 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1056:
--

Assignee: Ray Gauss II

> unify ImageMetadataExtractor interface
> --
>
> Key: TIKA-1056
> URL: https://issues.apache.org/jira/browse/TIKA-1056
> Project: Tika
>  Issue Type: Wish
>Reporter: Maciej Lizewski
>Assignee: Ray Gauss II
>Priority: Trivial
>
> there are several methods in this class that are targeted for different image 
> types but with different visibility:
> public void parseJpeg(File file);
> protected void parseTiff(InputStream stream);
> both simply extract all possible metadata from an image file or stream. Would be 
> nice if parseTiff could also be "public" so it will be easier to create 
> custom parsers located in external jars that use this functionality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-962) Backwards Compatibility for Metadata.LAST_AUTHOR is Broken

2013-01-08 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-962.
---

Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be 
worthy of a fix release.

> Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
> --
>
> Key: TIKA-962
> URL: https://issues.apache.org/jira/browse/TIKA-962
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.2
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>Priority: Critical
> Fix For: 1.3
>
>
> As a result of changes in TIKA-930, support for the deprecated 
> Metadata.LAST_AUTHOR property has been dropped.
> The new TikaCoreProperties.MODIFIED should be a composite property containing 
> Metadata.LAST_AUTHOR.
> Should we consider a fix release for this?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-963) Backwards Compatibility for Metadata.DATE is Incorrect

2013-01-08 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-963.
---

Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be 
worthy of a fix release.

> Backwards Compatibility for Metadata.DATE is Incorrect
> --
>
> Key: TIKA-963
> URL: https://issues.apache.org/jira/browse/TIKA-963
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.2
>Reporter: Ray Gauss II
>Assignee: Ray Gauss II
>Priority: Critical
> Fix For: 1.3
>
>
> Metadata.DATE was always somewhat ambiguous, but during the consolidation in 
> TIKA-930 it was incorrectly assumed that most parsers used it as a creation 
> date.
> Metadata.DATE needs to instead be part of the TikaCoreProperties.MODIFIED 
> composite property.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-895) Empty title element makes Tika-generated HTML documents not open

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-895.
---

   Resolution: Fixed
Fix Version/s: 1.3

When a {{TransformerHandler}} is used the actual writing of the final elements 
is delegated to an XML serializer such as {{ToHTMLStream}} which extends 
{{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title></title>}} or {{<title/>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.
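
For illustration, a rough sketch of that decorator approach follows (hypothetical 
class name, not the committed {{ExpandedTitleContentHandler}} source, and assuming 
the wrapped handler behaves like the JDK serializer described above):

{code}
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

class TitleExpandingHandler extends ContentHandlerDecorator {

    private boolean inTitle = false;

    TitleExpandingHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void startElement(String uri, String localName, String name, Attributes atts)
            throws SAXException {
        inTitle = "title".equalsIgnoreCase(localName);
        super.startElement(uri, localName, name, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inTitle && length == 0) {
            try {
                // Report a length of 1 so the serializer closes the open start tag;
                // the empty array then triggers the expected exception before any
                // characters are actually written.
                super.characters(new char[0], 0, 1);
            } catch (ArrayIndexOutOfBoundsException expected) {
                // swallowed on purpose: nothing was written
            }
        } else {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String name) throws SAXException {
        super.endElement(uri, localName, name);
        inTitle = false;
    }
}
{code}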

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
...
SAXTransformerFactory factory = (SAXTransformerFactory) 
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html

> Empty title element makes Tika-generated HTML documents not open
> 
>
> Key: TIKA-895
> URL: https://issues.apache.org/jira/browse/TIKA-895
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.1
> Environment: Windows 7 
>Reporter: Benoit MAGGI
>Assignee: Ray Gauss II
>Priority: Trivial
>  Labels: newbie
> Fix For: 1.3
>
>
> I try to transform an empty docx to an HTML file.
> Ex: java -jar tika-app-1.1.jar -x example.docx > t.html
> The HTML file can't be opened with Firefox, Internet Explorer or Chrome.
> The main point is that {{<title/>}} seems to be forbidden by the HTML 
> specification (not sure about HTML5):
> bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2
> bq. 7.4.2 The TITLE element
> bq. <!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
> bq. <!ATTLIST TITLE %i18n;> (see http://www.w3.org/TR/html401/sgml/dtd.html#i18n)
> bq. *Start tag: required, End tag: required*
> For information, there was the same bug with xls:
> https://issues.apache.org/jira/browse/TIKA-725
> The simple solution should be to provide an empty title by default.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (TIKA-895) Empty title element makes Tika-generated HTML documents not open

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-895:
---


Reopening to resolve as fixed rather than duplicate.

> Empty title element makes Tika-generated HTML documents not open
> 
>
> Key: TIKA-895
> URL: https://issues.apache.org/jira/browse/TIKA-895
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.1
> Environment: Windows 7 
>Reporter: Benoit MAGGI
>Assignee: Ray Gauss II
>Priority: Trivial
>  Labels: newbie
>
> I try to transform an empty docx to an HTML file.
> Ex: java -jar tika-app-1.1.jar -x example.docx > t.html
> The HTML file can't be opened with Firefox, Internet Explorer or Chrome.
> The main point is that {{<title/>}} seems to be forbidden by the HTML 
> specification (not sure about HTML5):
> bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2
> bq. 7.4.2 The TITLE element
> bq. <!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
> bq. <!ATTLIST TITLE %i18n;> (see http://www.w3.org/TR/html401/sgml/dtd.html#i18n)
> bq. *Start tag: required, End tag: required*
> For information, there was the same bug with xls:
> https://issues.apache.org/jira/browse/TIKA-725
> The simple solution should be to provide an empty title by default.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II closed TIKA-725.
-


> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Issue Comment Deleted] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Comment: was deleted

(was: Sorry, reopening to move comments.)

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-725.
---

Resolution: Fixed

Moving new changes to TIKA-895.

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Issue Comment Deleted] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Comment: was deleted

(was: Confirmed that the problem remains when a {{TransformerHandler}} is used, 
such as those obtained from {{SAXTransformerFactory}} in {{TikaCLI}} and 
{{TikaGUI}}.

I've investigated and developed a workaround.)

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Issue Comment Deleted] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Comment: was deleted

(was: When a {{TransformerHandler}} is used the actual writing of the final 
elements is delegated to an XML serializer such as {{ToHTMLStream}} which 
extends {{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title></title>}} or {{<title/>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
...
SAXTransformerFactory factory = (SAXTransformerFactory) 
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html)

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-725:
---


Sorry, reopening to move comments.

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Closed] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II closed TIKA-725.
-


> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-725:
--

Fix Version/s: (was: 1.3)

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535070#comment-13535070
 ] 

Ray Gauss II edited comment on TIKA-725 at 12/18/12 5:22 PM:
-

When a {{TransformerHandler}} is used the actual writing of the final elements 
is delegated to an XML serializer such as {{ToHTMLStream}} which extends 
{{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title></title>}} or {{<title/>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
...
SAXTransformerFactory factory = (SAXTransformerFactory) 
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html

  was (Author: rgauss):

When a {{TransformerHandler}} is used the actual writing of the final elements 
is delegated to an XML serializer such as {{ToHTMLStream}} which extends 
{{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title></title>}} or {{<title/>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

[jira] [Resolved] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-725.
---

   Resolution: Fixed
Fix Version/s: 1.3


When a {{TransformerHandler}} is used the actual writing of the final elements 
is delegated to an XML serializer such as {{ToHTMLStream}} which extends 
{{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether or not to close as {{<title></title>}} or {{<title/>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue, a call to 
{{characters(new char[0], 0, 0)}} for the title element, and if present changes 
the length to 1 then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
SAXTransformerFactory factory = (SAXTransformerFactory) 
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html

> Empty title element makes Tika-generated HTML documents not open in Chromium
> 
>
> Key: TIKA-725
> URL: https://issues.apache.org/jira/browse/TIKA-725
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.9
> Environment: Chromium 12 on Ubuntu Linux
>Reporter: Henri Bergius
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: html
> Fix For: 1.3, 0.10
>
>
> Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
> empty title element as {{<title/>}} into the document HEAD section. This causes 
> Chromium not to display the document contents.
> Switching it to {{<title></title>}} fixes this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-914) Invalid self-closing title tag when parsing an RTF file

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-914.
---

Resolution: Duplicate
  Assignee: Ray Gauss II

> Invalid self-closing title tag when parsing an RTF file
> ---
>
> Key: TIKA-914
> URL: https://issues.apache.org/jira/browse/TIKA-914
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.1
> Environment: Reproduced on Linux and Windows
>Reporter: Nicolas Guillaumin
>Assignee: Ray Gauss II
>Priority: Minor
>  Labels: rtf
> Attachments: test.rtf
>
>
> When parsing an RTF file with an empty TITLE metadata field, the resulting HTML 
> contains a self-closing title tag:
> {code}
> $ java -jar tika-app-1.1.jar -h test.rtf
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> ...
> <title/>
> ...
> </head>
> [...]
> {code}
> I believe self-closing tags are not valid in XHTML, according to 
> http://www.w3.org/TR/xhtml1/#C_3 (however, there's no XHTML doctype generated 
> here, just a namespace...). Anyway, this causes some browsers, like Chrome, to 
> fail to parse the HTML, resulting in a blank page being displayed.
> The expected output would be a non-self-closing empty tag: {{<title></title>}}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

