[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667789#comment-16667789
 ] 

ASF GitHub Bot commented on TIKA-2599:
--

dameikle opened a new pull request #253: TIKA-2599: Fixed closing of styles 
around Hyperlinks (by Ronan O'Sullivan)
URL: https://github.com/apache/tika/pull/253
 
 
   Contributed by Ronan O'Sullivan.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Priority: Minor
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed, which is not valid XHTML.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667793#comment-16667793
 ] 

ASF GitHub Bot commented on TIKA-2599:
--

dameikle closed pull request #253: TIKA-2599: Fixed closing of styles around 
Hyperlinks (by Ronan O'Sullivan)
URL: https://github.com/apache/tika/pull/253
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/CHANGES.txt b/CHANGES.txt
index 1f793d2f62..187531acf1 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -3,6 +3,9 @@ Release 1.20 - ???
* Use -javaHome or $JAVA_HOME (if they exist) when
  spawning child in tika-server's -spawnChild mode.
 
+   * Fixed closing of styles around Hyperlinks in Word Parser
+ Contributed by Ronan O'Sullivan (TIKA-2599).
+
 Release 1.19.1 - 10/4/2018
 
* Update PDFBox to 2.0.12, jempbox to 1.8.16
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
 url = text.substring(start, end);
 }
 
-xhtml.startElement("a", "href", url);
 closeStyleElements(skipStyling, xhtml);
+xhtml.startElement("a", "href", url);
 for (CharacterRun cr : texts) {
 handleCharacterRun(cr, skipStyling, xhtml);
 }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 31bd8ba293..d7d6daee56 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -570,6 +570,15 @@ public void testBoldHyperlink() throws Exception {
 assertContains("http://tika.apache.org/\";>hyper link; bold" , 
xml);
 }
 
+@Test
+public void testHyperlinkSurroundedByItalics() throws Exception {
+//TIKA-2599
+String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+xml = xml.replaceAll("\\s+", " ");
+assertContains("Italic Test before link http://www.google.com\";>" +
+"hyperlink italics Italic text after 
hyperlink", xml);
+}
+
 @Test
 public void testMacros() throws  Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 00..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ
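
The WordExtractor hunk above only swaps two calls, so a small sketch may help show why the order matters. The class below is a hypothetical stand-in for a SAX-style writer, not Tika's own handler: it assumes that an end tag must match the innermost open element and turns a mismatch into an exception. How the real handler reacts to a mismatch is not shown in this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a SAX-style XHTML writer (assumption, not Tika code).
class NestingWriter {
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    void startElement(String name) {
        open.push(name);
        out.append('<').append(name).append('>');
    }

    void characters(String text) {
        out.append(text);
    }

    // Rejects an end tag that does not match the innermost open element.
    void endElement(String name) {
        if (open.isEmpty() || !open.peek().equals(name)) {
            throw new IllegalStateException(
                    "cannot close <" + name + "> while <" + open.peek() + "> is innermost");
        }
        open.pop();
        out.append("</").append(name).append('>');
    }

    String result() {
        return out.toString();
    }

    // Emits italic text followed by a link, using either call order.
    static String render(boolean closeStylesFirst) {
        NestingWriter w = new NestingWriter();
        w.startElement("i");
        w.characters("before ");
        if (closeStylesFirst) {
            w.endElement("i");    // patched order: close styles, then open the link
            w.startElement("a");
        } else {
            w.startElement("a");  // original order: open the link first
            w.endElement("i");    // mismatch: <a> is now the innermost element
        }
        w.characters("link");
        w.endElement("a");
        return w.result();
    }
}
```

With this sketch, `NestingWriter.render(true)` produces properly nested markup, while `NestingWriter.render(false)` fails as soon as the italic element is closed inside the open link.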


 




[jira] [Updated] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle updated TIKA-2599:
--
Fix Version/s: 1.20

> Hyperlink surrounded by Italics not closed Properly
> ---
>
> Key: TIKA-2599
> URL: https://issues.apache.org/jira/browse/TIKA-2599
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14, 1.15, 1.16, 1.17
> Environment: Any
>Reporter: Ronan O'Sullivan
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.20
>
> Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting xhtml is:
>  
> Italic Test before link  href="http://www.google.com"/>hyperlink italics 
> Italic text after hyperlink
>  
> The opening italics tag is not closed, which is not valid XHTML.





[jira] [Assigned] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2599:
-

Assignee: Dave Meikle



[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667820#comment-16667820
 ] 

ASF GitHub Bot commented on TIKA-2599:
--

dameikle opened a new pull request #254: TIKA-2599: Fixed closing of styles 
around Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254
 
 
   




[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667821#comment-16667821
 ] 

ASF GitHub Bot commented on TIKA-2599:
--

dameikle closed pull request #254: TIKA-2599: Fixed closing of styles around 
Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
 url = text.substring(start, end);
 }
 
-xhtml.startElement("a", "href", url);
 closeStyleElements(skipStyling, xhtml);
+xhtml.startElement("a", "href", url);
 for (CharacterRun cr : texts) {
 handleCharacterRun(cr, skipStyling, xhtml);
 }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 7456ac409e..d2c38a42d5 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -560,6 +560,15 @@ public void testBoldHyperlink() throws Exception {
 assertContains("http://tika.apache.org/\";>hyper link; bold" , 
xml);
 }
 
+@Test
+public void testHyperlinkSurroundedByItalics() throws Exception {
+//TIKA-2599
+String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+xml = xml.replaceAll("\\s+", " ");
+assertContains("Italic Test before link http://www.google.com\";>" +
+"hyperlink italics Italic text after 
hyperlink", xml);
+}
+
 @Test
 public void testMacros() throws  Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 00..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ


 




[jira] [Resolved] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-2599.
---
Resolution: Fixed

Committed to branch_1x in 324cbd2eb4d64f1e34aba9789ee8b06cbf4d991e and master in 
6ccedbadd4f79d7888eabfcd3a74ab85e168.



[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667823#comment-16667823
 ] 

Dave Meikle commented on TIKA-2599:
---

Committed to branch_1x in 324cbd2eb4d64f1e34aba9789ee8b06cbf4d991e and master in 
6ccedbadd4f79d7888eabfcd3a74ab85e168.

Thanks [~ronanos]!



[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667835#comment-16667835
 ] 

ASF GitHub Bot commented on TIKA-2479:
--

dameikle commented on issue #214: TIKA-2479 - Handle empty cells in XLSX
URL: https://github.com/apache/tika/pull/214#issuecomment-434102665
 
 
   It looks like Nick Burch hasn't seen this and has added a different fix for 
this in master.  Does this do what you were expecting?




> Handle empty cells in tables uniformly
> --
>
> Key: TIKA-2479
> URL: https://issues.apache.org/jira/browse/TIKA-2479
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.19
>
> Attachments: patch.diff
>
>
> It looks like we output a  for empty cells in xls, and tables in doc, 
> docx and pptx.  However, we don't retain empty cells in xlsx or tables in 
> ppt.  We should make this handling uniform.





[jira] [Commented] (TIKA-2767) Problem with import xlsx with null cells

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667840#comment-16667840
 ] 

Dave Meikle commented on TIKA-2767:
---

Hi [~iodor] - I've tried to recreate this by building my own Excel but don't 
get the issue with the latest build. Do you have an example file for this?

TIKA-2479 should have fixed this.

> Problem with import xlsx with null cells
> 
>
> Key: TIKA-2767
> URL: https://issues.apache.org/jira/browse/TIKA-2767
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.18
>Reporter: ionut hodor
>Priority: Major
> Attachments: example.png
>
>
> I have a problem with xlsx files when there are cells without a value. The 
> empty cells are skipped and the following cells on the same row are shifted 
> into their place.
>  In the example, the cells with value "value4" end up under header2.
> I'm using Tika 1.18, but I hit the same problem with Tika 1.19.
> I have this problem only with xlsx; with xls everything is OK.
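
The shifting described in this report is what one would expect if only the cells that are present in a row are emitted, one after another. A minimal sketch of the usual remedy follows, assuming each present cell carries an A1-style reference and that gaps are filled with empty placeholders. The class and method names are hypothetical and are not taken from Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: pads a sparse row so that values stay under their headers.
class RowPadding {

    // Converts the column letters of an A1-style reference to a zero-based
    // index, e.g. "A2" -> 0, "C7" -> 2, "AA1" -> 26.
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length() && Character.isLetter(cellRef.charAt(i)); i++) {
            col = col * 26 + (Character.toUpperCase(cellRef.charAt(i)) - 'A' + 1);
        }
        return col - 1;
    }

    // refs and values are parallel arrays holding only the cells that are
    // present in the row, in column order.
    static List<String> pad(String[] refs, String[] values) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < refs.length; i++) {
            int target = columnIndex(refs[i]);
            while (row.size() < target) {
                row.add("");  // placeholder for a skipped (empty) cell
            }
            row.add(values[i]);
        }
        return row;
    }
}
```

For a row that holds only `A2` and `C2`, `RowPadding.pad(new String[] {"A2", "C2"}, new String[] {"value1", "value4"})` yields three entries with an empty string in the middle, so "value4" stays in the third column.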





[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667848#comment-16667848
 ] 

Hudson commented on TIKA-2599:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #338 (See 
[https://builds.apache.org/job/tika-2.x-windows/338/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: 
rev 10a48b7a0077fbe627d3a0111f92910228d05d77)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (add) 
tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc




[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667863#comment-16667863
 ] 

Hudson commented on TIKA-2599:
--

UNSTABLE: Integrated in Jenkins build Tika-trunk #1584 (See 
[https://builds.apache.org/job/Tika-trunk/1584/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: 
[https://github.com/apache/tika/commit/10a48b7a0077fbe627d3a0111f92910228d05d77])
* (add) 
tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java




[jira] [Commented] (TIKA-2599) Hyperlink surrounded by Italics not closed Properly

2018-10-29 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667869#comment-16667869
 ] 

Hudson commented on TIKA-2599:
--

UNSTABLE: Integrated in Jenkins build tika-branch-1x #120 (See 
[https://builds.apache.org/job/tika-branch-1x/120/])
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: 
[https://github.com/apache/tika/commit/eb53077d62ed31795e676b5bcdce01b8ad809c99])
* (add) 
tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by (dmeikle: 
[https://github.com/apache/tika/commit/50a2a8f6391b87fa8f1b766143f2d759c99cae4b])
* (edit) CHANGES.txt




[jira] [Assigned] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle reassigned TIKA-2630:
-

Assignee: Dave Meikle

> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73&zoom=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>     metadata.set(Metadata.IMAGE_WIDTH,
>             trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
> }
> // Guard on the height tag (not the width tag) before reading the height.
> if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)) {
>     metadata.set(Metadata.IMAGE_LENGTH,
>             trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
> }
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels
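
To make the expected outcome concrete, here is a small self-contained sketch that applies the reporter's rule to the sample values listed above. A plain map stands in for the metadata-extractor directories. The `trimPixels` helper here is a hypothetical re-implementation, and the fallback order after the Exif SubIFD tags is an assumption, not something stated in the issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain map stands in for metadata-extractor directories.
class JpegDimensions {

    // Hypothetical helper: strips a trailing " pixels" unit, e.g. "1535 pixels" -> "1535".
    static String trimPixels(String description) {
        String s = description.trim();
        String unit = " pixels";
        return s.endsWith(unit) ? s.substring(0, s.length() - unit.length()) : s;
    }

    // Returns the trimmed value of the first key that is present, or null if none is.
    static String firstPresent(Map<String, String> tags, String... keys) {
        for (String key : keys) {
            String value = tags.get(key);
            if (value != null) {
                return trimPixels(value);
            }
        }
        return null;
    }

    // Values reported for the sample image in the issue, keyed as "[directory] tag name".
    static Map<String, String> sampleTags() {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("[JPEG] Image Height", "367 pixels");
        tags.put("[JPEG] Image Width", "1535 pixels");
        tags.put("[Exif IFD0] Image Width", "2173 pixels");
        tags.put("[Exif IFD0] Image Height", "520 pixels");
        tags.put("[Exif SubIFD] Exif Image Width", "1535 pixels");
        tags.put("[Exif SubIFD] Exif Image Height", "367 pixels");
        return tags;
    }

    // Prefer the Exif SubIFD dimension; the fallback order is an assumption.
    static String width(Map<String, String> tags) {
        return firstPresent(tags, "[Exif SubIFD] Exif Image Width",
                "[JPEG] Image Width", "[Exif IFD0] Image Width");
    }

    static String height(Map<String, String> tags) {
        return firstPresent(tags, "[Exif SubIFD] Exif Image Height",
                "[JPEG] Image Height", "[Exif IFD0] Image Height");
    }
}
```

For the sample image this selects a width of 1535 and a height of 367, rather than the 2173 and 520 that currently end up in tiff:ImageWidth and tiff:ImageLength.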





[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667903#comment-16667903
 ] 

Dave Meikle commented on TIKA-2630:
---

Thanks for raising this one. Short term we can add in the reading from these 
fields for compressed images, knowing it will set tiff:ImageWidth and 
tiff:ImageLength to the correct values.

Longer term we need to address the metadata clashes as part of the 2.x series: 
whilst we could add in the directory name as a key to the metadata (e.g. 
Exif IFD0:Image Height: 520 pixels), I would be worried about the impact on 
downstream code that has got used to what we do. This means we can also build 
on the Metadata proposals for 2.x.

Will this work for you?

 



[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667918#comment-16667918
 ] 

Dave Meikle commented on TIKA-2630:
---

After writing it, I know it really won't, given the clash of metadata keys 
between the Exif directories.

Wondering if we could, short term, just add the directory name in as a key 
qualifier for Exif information only, given that is where this is an issue 
just now.

Will create a proposed pull request and see what others think.
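
A rough sketch of what qualifying only the Exif keys could look like, using the sample values from the issue. The class name, the `:` separator (borrowed from the "Exif IFD0:Image Height" example earlier in the thread) and the rule "qualify when the directory name starts with Exif" are all assumptions for illustration. This is not the proposed pull request.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: compares unqualified keys with directory-qualified Exif keys.
class QualifiedKeys {

    // Each entry is {directory name, tag name, description}, in extraction order.
    static final String[][] TAGS = {
        {"JPEG", "Image Height", "367 pixels"},
        {"JPEG", "Image Width", "1535 pixels"},
        {"Exif IFD0", "Image Width", "2173 pixels"},
        {"Exif IFD0", "Image Height", "520 pixels"},
    };

    // Unqualified keys: a later directory silently overwrites an earlier one.
    static Map<String, String> flat() {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String[] tag : TAGS) {
            metadata.put(tag[1], tag[2]);
        }
        return metadata;
    }

    // Exif directories get a "directory:" prefix; other directories keep the bare tag name.
    static Map<String, String> exifQualified() {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (String[] tag : TAGS) {
            String key = tag[0].startsWith("Exif") ? tag[0] + ":" + tag[1] : tag[1];
            metadata.put(key, tag[2]);
        }
        return metadata;
    }
}
```

With unqualified keys the map ends up with two entries and "Image Height" holds the IFD0 value of 520 pixels. With the Exif keys qualified, all four values survive and "Image Height" keeps the JPEG value of 367 pixels.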

> Wrong height and width metadata for JPEG images
> ---
>
> Key: TIKA-2630
> URL: https://issues.apache.org/jira/browse/TIKA-2630
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.17
>Reporter: Ancuta Morarasu
>Assignee: Dave Meikle
>Priority: Major
> Attachments: Tika-metadata.txt, metadata-exctractor-metadata.txt, 
> sizesampleissue.jpg
>
>
> According to [Exif 
> specs|http://www.exif.org/Exif2-2.PDF#page=73&zoom=auto,-176,103], for 
> compressed images the values for width and height should come from the tags:
> * *PixelXDimension* mapped in metadata-extractor to 
> {{com.drew.metadata.Directory.ExifDirectoryBase.TAG_EXIF_IMAGE_WIDTH}} and
> * *PixelYDimension* mapped to {{ExifDirectoryBase.TAG_EXIF_IMAGE_HEIGHT}}.
> {{ImageMetadataExtractor$ExifHandler.[handlePhotoTags(...)|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java#L487]}}
>  should extract and set these in the metadata:
> {code:java}
>  if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
> metadata.set(Metadata.IMAGE_WIDTH,
>  
> trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)));
>   }
>   if (directory.containsTag(ExifSubIFDDirectory.TAG_EXIF_IMAGE_WIDTH)) {
>   metadata.set(Metadata.IMAGE_LENGTH,
>
> trimPixels(directory.getDescription(ExifSubIFDDirectory.TAG_EXIF_IMAGE_HEIGHT)));
>}
> {code}
> Also the {{CopyUnknownFieldsHandler}} overrides the values for "Image Width" 
> ({{JpegDirectory.TAG_IMAGE_WIDTH}}) and "Image Height" 
> ({{JpegDirectory.TAG_IMAGE_HEIGHT}}) with the values from 
> {{ExifIFD0Descriptor.TAG_IMAGE_WIDTH}} and 
> {{ExifIFD0Descriptor.TAG_IMAGE_HEIGHT}} because they have the same tag name.
> I attached a sample image, these are the metadata values:
> * extracted by metadata-extractor:
> [JPEG] Image Height = 367 pixels
> [JPEG] Image Width = 1535 pixels
> [Exif IFD0] Image Width = 2173 pixels
> [Exif IFD0] Image Height = 520 pixels
> [Exif SubIFD] Exif Image Width = 1535 pixels
> [Exif SubIFD] Exif Image Height = 367 pixels
> * Tika metadata:
> Image Height: 520 pixels
> Image Width: 2173 pixels
> tiff:ImageLength: 520
> tiff:ImageWidth: 2173
> Exif Image Height: 367 pixels
> Exif Image Width: 1535 pixels
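
The precedence described in the report can be illustrated with a small, self-contained sketch. It does not use metadata-extractor or Tika; the class name, the "directory/tag" key format and the helper `firstPresent` are hypothetical and exist only for this illustration. The sample values are the ones listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration; not Tika's ImageMetadataExtractor. */
final class ExifDimensionSketch {

    /** Strips a trailing " pixels" unit, e.g. "1535 pixels" becomes "1535". */
    static String trimPixels(String s) {
        if (s != null && s.endsWith(" pixels")) {
            return s.substring(0, s.length() - " pixels".length());
        }
        return s;
    }

    /** Returns the trimmed value of the first key that is present, or null. */
    static String firstPresent(Map<String, String> tags, String... keys) {
        for (String key : keys) {
            String value = tags.get(key);
            if (value != null) {
                return trimPixels(value);
            }
        }
        return null;
    }

    /** Sample values from the report, keyed by an assumed "directory/tag" name. */
    static Map<String, String> sampleTags() {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("JPEG/Image Height", "367 pixels");
        tags.put("JPEG/Image Width", "1535 pixels");
        tags.put("Exif IFD0/Image Width", "2173 pixels");
        tags.put("Exif IFD0/Image Height", "520 pixels");
        tags.put("Exif SubIFD/Exif Image Width", "1535 pixels");
        tags.put("Exif SubIFD/Exif Image Height", "367 pixels");
        return tags;
    }
}
```

Asking for the SubIFD key first and the IFD0 key second yields the dimensions that match the JPEG directory for the sample image, rather than the IFD0 values Tika currently reports.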





[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667929#comment-16667929
 ] 

ASF GitHub Bot commented on TIKA-2479:
--

glb closed pull request #214: TIKA-2479 - Handle empty cells in XLSX
URL: https://github.com/apache/tika/pull/214
 
 
   

This PR was merged from a forked repository. Because GitHub hides the
original diff once a PR from a fork is merged, the diff is reproduced below
for the sake of provenance:

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
index c3b7285403..2264457f37 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
@@ -16,7 +16,6 @@
  */
 package org.apache.tika.parser.microsoft.ooxml;
 
-import javax.xml.parsers.SAXParser;
 import java.io.IOException;
 import java.io.InputStream;
 import java.util.ArrayList;
@@ -25,7 +24,8 @@
 import java.util.Locale;
 import java.util.Map;
 
-import org.apache.poi.POIXMLDocument;
+import javax.xml.parsers.SAXParser;
+
 import org.apache.poi.POIXMLTextExtractor;
 import org.apache.poi.hssf.extractor.ExcelExtractor;
 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
@@ -39,6 +39,8 @@
 import org.apache.poi.openxml4j.opc.TargetMode;
 import org.apache.poi.ss.usermodel.DataFormatter;
 import org.apache.poi.ss.usermodel.HeaderFooter;
+import org.apache.poi.ss.util.CellAddress;
+import org.apache.poi.ss.util.CellReference;
 import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
 import org.apache.poi.xssf.eventusermodel.XSSFReader;
 import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
@@ -400,6 +402,8 @@ public void processSheet(
 private final boolean includeHeadersFooters;
 protected List headers;
 protected List footers;
+private int currentRow = -1;
+private int currentCol = -1;
 
 protected SheetTextAsHTML(boolean includeHeaderFooters, XHTMLContentHandler xhtml) {
 this.includeHeadersFooters = includeHeaderFooters;
@@ -408,7 +412,24 @@ protected SheetTextAsHTML(boolean includeHeaderFooters, XHTMLContentHandler xhtm
 footers = new ArrayList();
 }
 
+private void outputMissingRows(int number) {
+for (int i=0; i
[...]

> Handle empty cells in tables uniformly
> --
>
> Key: TIKA-2479
> URL: https://issues.apache.org/jira/browse/TIKA-2479
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 2.0, 1.19
>
> Attachments: patch.diff
>
>
> It looks like we output a <td/> for empty cells in xls, and tables in doc, 
> docx and pptx.  However, we don't retain empty cells in xlsx or tables in 
> ppt.  We should make this handling uniform.
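
A minimal sketch of the gap-filling idea follows, assuming that an empty cell is rendered as an empty `<td/>` element and that cells arrive in column order with A1-style references. It is plain Java with no POI dependency; the class and method names are hypothetical, and this is not the code from PR #214, whose diff is truncated above.

```java
/** Hypothetical illustration of padding skipped columns; not the Tika patch. */
final class EmptyCellPaddingSketch {

    /** Converts the column letters of a reference such as "C7" or "AA3" to a zero-based index. */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char c = cellRef.charAt(i);
            if (!Character.isLetter(c)) {
                break; // the row digits start here
            }
            col = col * 26 + (Character.toUpperCase(c) - 'A' + 1);
        }
        return col - 1;
    }

    /** Renders one row, emitting an empty cell for every column skipped between two cells. */
    static String renderRow(String[] refs, String[] values) {
        StringBuilder sb = new StringBuilder("<tr>");
        int currentCol = -1; // last column written, -1 before the first cell
        for (int i = 0; i < refs.length; i++) {
            int col = columnIndex(refs[i]);
            for (int missing = currentCol + 1; missing < col; missing++) {
                sb.append("<td/>");
            }
            sb.append("<td>").append(values[i]).append("</td>");
            currentCol = col;
        }
        return sb.append("</tr>").toString();
    }
}
```

For a row that only stores A1 and C1, the sketch writes one empty cell for column B, so the XLSX output would line up with what the description says the xls parser produces.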





[jira] [Commented] (TIKA-2479) Handle empty cells in tables uniformly

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667928#comment-16667928
 ] 

ASF GitHub Bot commented on TIKA-2479:
--

glb commented on issue #214: TIKA-2479 - Handle empty cells in XLSX
URL: https://github.com/apache/tika/pull/214#issuecomment-434128458
 
 
   @dameikle it looks like it would solve the problem in a similar way, yes. 
Thanks!






[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667942#comment-16667942
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

dameikle opened a new pull request #255: TIKA-2630: Wrong height and width 
metadata for JPEG images
URL: https://github.com/apache/tika/pull/255
 
 
   - Added extraction of image height/width from ExifSubIFDDirectory for 
compressed images
   - Include directory name as key qualifier for Exif directories to avoid 
clashes






[jira] [Commented] (TIKA-2630) Wrong height and width metadata for JPEG images

2018-10-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667943#comment-16667943
 ] 

ASF GitHub Bot commented on TIKA-2630:
--

dameikle commented on issue #255: TIKA-2630: Wrong height and width metadata 
for JPEG images
URL: https://github.com/apache/tika/pull/255#issuecomment-434131434
 
 
   Hey @tballison - given the key name clashes for the Exif metadata, I am 
proposing to add the directory name as the key qualifier, hence the request 
for a review.
   
   I was tempted to do this for all directories to keep it clean, but I worry 
about downstream code that relies on the current values in the 1.x stream.
   
   I also thought about not doing it at all, but unless we do it for at least 
Exif, we will continue to give the wrong value here, short of adding some 
logic for a key hierarchy in CopyUnknownFieldsHandler.
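
To make the clash concrete, here is a rough, self-contained sketch. It is not the code in PR #255: the class name, the `copy` helper and the "directory:tag" key format are assumptions chosen for illustration, and the values are the sample ones from the issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of qualifying Exif keys by directory name. */
final class QualifiedKeySketch {

    /** Each entry is {directory name, tag name, description}. */
    static final String[][] SAMPLE = {
        {"JPEG", "Image Height", "367 pixels"},
        {"JPEG", "Image Width", "1535 pixels"},
        {"Exif IFD0", "Image Width", "2173 pixels"},
        {"Exif IFD0", "Image Height", "520 pixels"},
    };

    /**
     * Copies tags into a map keyed by tag name. When qualifyExif is true,
     * tags from directories whose name starts with "Exif" get the directory
     * name as a prefix, so they no longer overwrite same-named tags.
     */
    static Map<String, String> copy(String[][] tags, boolean qualifyExif) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String[] tag : tags) {
            boolean qualify = qualifyExif && tag[0].startsWith("Exif");
            String key = qualify ? tag[0] + ":" + tag[1] : tag[1];
            out.put(key, tag[2]); // a later put with the same key wins
        }
        return out;
    }
}
```

Without the qualifier the Exif IFD0 width replaces the JPEG width under "Image Width"; with it, both values survive under separate keys, at the cost of changing key names for the qualified directories.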
   






[jira] [Commented] (TIKA-2760) LinkContentHandler does not report hyperlinks

2018-10-29 Thread Dave Meikle (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667966#comment-16667966
 ] 

Dave Meikle commented on TIKA-2760:
---

[~markus17] - is it typically the HTML parser being used in Nutch? Using your 
test with the HTML parser registered gives me 94 links.

> LinkContentHandler does not report hyperlinks
> -
>
> Key: TIKA-2760
> URL: https://issues.apache.org/jira/browse/TIKA-2760
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: TIKA-2760.patch, ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not 
> report any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/, which I'll 
> also attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.
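
The debugging approach mentioned in the description (recording element names in startElement) might look roughly like the sketch below. It uses only the JDK's SAX parser on a well-formed XHTML string, not Tika's HTML parser or LinkContentHandler; the class name `ElementTraceSketch` and its fields are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical tracing handler: records every element name and every a/@href seen. */
final class ElementTraceSketch extends DefaultHandler {
    final List<String> elements = new ArrayList<>();
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName carries the element name.
        elements.add(qName);
        String href = atts.getValue("href");
        if ("a".equals(qName) && href != null) {
            links.add(href);
        }
    }

    /** Parses a well-formed XHTML string and returns the handler with its recorded events. */
    static ElementTraceSketch trace(String xhtml) {
        ElementTraceSketch handler = new ElementTraceSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler;
    }
}
```

Comparing the recorded element list against the input shows quickly whether the anchors never reach the handler or reach it without being collected.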


