[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-05 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352666#comment-16352666
 ] 

NW Brad commented on TIKA-2562:
---

Thanks.  I'll take a look at it.  It definitely looks like the same issue, but 
for the title tag.  It's too bad the SAXTransformer doesn't give you the option 
to prevent the issue.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -
>
> Key: TIKA-2562
> URL: https://issues.apache.org/jira/browse/TIKA-2562
> Project: Tika
>  Issue Type: Bug
>  Components: gui, parser, server
>Affects Versions: 1.17
>Reporter: NW Brad
>Priority: Major
> Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in a HTML document that are parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
> [http://localhost:9998/tika] --header "Accept: text/html"
> sent:
> 
>   href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> received back:
>  href="http://www.google.com;>[http://www.google.com|http://www.google.com/]
>  
> Divs are gone and a shape has been added
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351224#comment-16351224
 ] 

NW Brad commented on TIKA-2562:
---

I was doing some research on this today, and this may not be a function of Tika. 
 I think it is probably the SAXTransformerFactory (javax.xml.transform) that is 
making the change.  At least I could not find any code in Tika that did it 
directly.  But anything I ran through the SAXTransformerFactory converted the 
empty elements in the HTML I provided into self-closing tags, as shown 
below:

http://www.google.com;> *becomes* http://www.google.com*"/>*

and  *becomes* .

From an XML standpoint the converted syntax is correct, but the anchor tag, 
while correct in XML, does not appear to work correctly as HTML in the current 
versions of both Chrome and Firefox.  So, converting HTML via Tika in this 
situation generates bad HTML for the examples I have.

I believe the SAXTransformerFactory is also deleting the div that is around 
the "empty" anchor tag, since a div around nothing may not be considered 
relevant.  At least that is what I speculate...
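
The self-closing behaviour described above can be reproduced outside Tika with 
a minimal sketch like the following.  It is not Tika's code: the class name 
EmptyAnchorDemo and the method serialize are made up for illustration, and the 
elements are deliberately emitted with no namespace URI.  It pushes an empty 
anchor inside a div through a TransformerHandler twice, once with the "xml" 
output method and once with "html".  With the JDK's default transformer the 
"xml" run should collapse the empty anchor into an empty-element tag, while 
the "html" run should keep the explicit end tag.  Note that the XSLT 
specification only applies the html output method's special handling to 
elements with a null namespace URI, so SAX events that carry the XHTML 
namespace may well be serialized differently.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of Tika.
class EmptyAnchorDemo {

    // Emits SAX events for <div><a href="..."></a></div> (no namespace URI)
    // and serializes them with the given output method ("xml" or "html").
    public static String serialize(String method) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        Transformer transformer = handler.getTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, method);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://www.google.com");

        handler.startDocument();
        handler.startElement("", "div", "div", new AttributesImpl());
        handler.startElement("", "a", "a", href);
        // No characters() call here: the anchor has no content.
        handler.endElement("", "a", "a");
        handler.endElement("", "div", "div");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml : " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```

Comparing the two printed lines shows whether the serializer, rather than the 
parser, is what turns the empty anchor into an empty-element tag.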



[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad edited comment on TIKA-2562 at 2/2/18 4:51 PM:
---

Thanks.  I checked it out and tagsoup is definitely adding the shape.  I tried 
parsing the file using tagsoup command line, and tagsoup added the shape.  
However, it appears that the div removal is coming from Tika.

Tagsoup parse results:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

Tika parse results:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup results:

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs?

Thanks.




[jira] [Comment Edited] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad edited comment on TIKA-2562 at 2/2/18 4:50 PM:
---

Thanks.  I checked it out and tagsoup is definitely adding the shape.  I tried 
parsing the file using tagsoup command line, and tagsoup is definitely adding 
the shape.  However, it appears that the div removal is coming from Tika.

Tagsoup parse results:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

Tika parse results:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup results:

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs?

Thanks.




[jira] [Commented] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-02 Thread NW Brad (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350634#comment-16350634
 ] 

NW Brad commented on TIKA-2562:
---

Thanks.  I checked it out, and tagsoup is definitely adding the shape.  I 
tried parsing the file using tagsoup command line, and tagsoup is definitely 
adding the shape.  However, it appears that the div removal is coming from Tika.

Tagsoup parse results:


 http://www.google.com;>http://www.google.com
 

Tika parse results:

http://www.google.com;>http://www.google.com

The div is gone...

I also noted another problem with parsing that is coming from Tika and not 
tagsoup when dealing with hidden anchors/hyperlinks:

original:

http://www.google.com;>

Tagsoup results:

http://www.google.com*;>*

Tika results:

http://www.google.com*"/>*

Tika seems to alter the anchor by removing the end-tag and replacing it with an 
empty-element tag.  This occurs on other tags as well, most common being 
 with .

This may not seem to be a big deal, but with anchors it is causing a problem 
with Chrome and Firefox and the anchor style bleeds into content immediately 
following the anchor.

Is there a way in Tika to turn off this feature?  If not, do you know where in 
the code this occurs?

Thanks.




[jira] [Updated] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-01 Thread NW Brad (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

NW Brad updated TIKA-2562:
--
Description: 
Hyperlinks in a HTML document that are parsed via tika server:

curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
[http://localhost:9998/tika] --header "Accept: text/html"

sent:


 http://www.google.com;>[http://www.google.com|http://www.google.com/]
 

received back:

http://www.google.com;>[http://www.google.com|http://www.google.com/]

 

Divs are gone and a shape has been added

 





[jira] [Created] (TIKA-2562) tika server parse HTML removes DIVs around hyperlink & adds shape

2018-02-01 Thread NW Brad (JIRA)
NW Brad created TIKA-2562:
-

 Summary: tika server parse HTML removes DIVs around hyperlink & 
adds shape
 Key: TIKA-2562
 URL: https://issues.apache.org/jira/browse/TIKA-2562
 Project: Tika
  Issue Type: Bug
  Components: gui, parser, server
Affects Versions: 1.17
Reporter: NW Brad
 Attachments: tika_adds_shape_to_hyperlink.html

parsing an HTML file via server:

curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html 
http://localhost:9998/tika --header "Accept: text/html"

sent:


 http://www.google.com;>http://www.google.com
 

received back:

http://www.google.com;>http://www.google.com

 

Divs are gone and a shape has been added

 


