[jira] [Commented] (TIKA-2653) Allow users to specify a directory of jars for classloading in ForkParser

2018-05-26 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491866#comment-16491866
 ] 

Luis Filipe Nassif commented on TIKA-2653:
--

+1! I will try to take a look, Tim, but unfortunately I will not have much 
time in the next 2 weeks.

Have you thought about injecting an EmbeddedDocumentExtractor to allow 
extracting embedded items to index them as separate documents, e.g. in Solr?

> Allow users to specify a directory of jars for classloading in ForkParser
> -
>
> Key: TIKA-2653
> URL: https://issues.apache.org/jira/browse/TIKA-2653
> Project: Tika
>  Issue Type: New Feature
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Major
>
> The ForkParser now builds the parser in the parent process and serializes it 
> to the child process.  It would be neat to make it easier for users of the 
> ForkParser to depend solely on tika-core and put all of our dependency 
> nastiness in a separate directory that will be used by the fork server 
> (child process) to build the underlying parser.
> This would allow, e.g. Solr, to point to a directory with the tika-app.jar 
> and remove all of our dependencies (except tika-core) from their 
> dependencies. 
> I propose that we allow users to initialize ForkParser with a Path that 
> contains all the jars necessary to build the Parser, and, optionally, a 
> ParserFactory.
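
The core of the proposal above is that the child process builds its classpath from every jar in a user-supplied directory. A minimal sketch of that one step, using only the JDK; the class `JarDirClassLoader` and its method names are hypothetical and are not ForkParser's actual API:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
public class JarDirClassLoader {

    // Collect the URL of every *.jar file directly inside jarDir (non-recursive).
    static List<URL> jarUrls(Path jarDir) throws IOException {
        List<URL> urls = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(jarDir, "*.jar")) {
            for (Path jar : jars) {
                urls.add(jar.toUri().toURL());
            }
        }
        return urls;
    }

    // Build a classloader over those jars; lookups delegate to the parent first.
    static URLClassLoader forDirectory(Path jarDir, ClassLoader parent) throws IOException {
        return new URLClassLoader(jarUrls(jarDir).toArray(new URL[0]), parent);
    }
}
```

With something like this, a caller that depends only on tika-core could hand the directory containing tika-app.jar to the child process and let it resolve the parser classes there.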



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491804#comment-16491804
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

tballison commented on issue #238: TIKA-2100 extract content language from html 
lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392283950
 
 
   So... go forth!
   
   Side note: It’d be fun to see counts for elements w a lang attr in our
   corpus.
   
   On Sat, May 26, 2018 at 3:41 PM Tim Allison  wrote:
   
   > If :lang is special, we should treat it specially :).  If there are other
   > attrs that go on the html entity, we can add them later? Onward!
   >
   > On Sat, May 26, 2018 at 3:26 PM Chris Mattmann wrote:
   >
   >> @tballison I hear you and am open to
   >> alternatives. What is a better way to do this? I think missing the lang
   >> attribute is a pretty bad thing and have seen it in the past. It feels like
   >> HTMLParser as a parser should contribute it (and I don't think you object
   >> to that) perhaps via Metadata.set and then what should we do to propagate
   >> it?
   >
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Html Parser does not keep the html tag attributes
> -
>
> Key: TIKA-2100
> URL: https://issues.apache.org/jira/browse/TIKA-2100
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.13
>Reporter: Gerard Bouchar
>Priority: Major
>
> Parsing a very simple html like 
>  
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1>My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
>  
> you won't be able to access the html tag's attributes (here lang="en") in the 
> ContentHandler : 
> *in the method startElement(String ns, String localName, String name,
>   Attributes atts), atts is empty.
> *Moreover it seems that the html tag's attributes are not passed through the 
> HtmlMapper.mapSafeAttribute method either.
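
For reference, this is the behaviour the report expects from a ContentHandler: the attributes of the html element arrive in startElement. The sketch below is an illustration only. It feeds well-formed XHTML to the JDK's SAX parser rather than to Tika's HtmlParser, and the class and method names are hypothetical:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; uses the JDK SAX parser, not Tika's HtmlParser.
public class HtmlAttrsDemo {

    // Records every attribute seen on the <html> start tag.
    static class HtmlAttrHandler extends DefaultHandler {
        final Map<String, String> htmlAttrs = new LinkedHashMap<>();

        @Override
        public void startElement(String ns, String localName, String name, Attributes atts) {
            // The default parser is not namespace-aware, so match on the qualified name.
            if ("html".equals(name)) {
                for (int i = 0; i < atts.getLength(); i++) {
                    htmlAttrs.put(atts.getQName(i), atts.getValue(i));
                }
            }
        }
    }

    // Parse an XHTML string and return the attributes found on <html>.
    static Map<String, String> htmlAttributes(String xhtml) throws Exception {
        HtmlAttrHandler handler = new HtmlAttrHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.htmlAttrs;
    }
}
```

According to the report, the same handler attached to Tika 1.13's HTML parser would see an empty attribute list for the html element.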





[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491802#comment-16491802
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

tballison commented on issue #238: TIKA-2100 extract content language from html 
lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392283864
 
 
   If :lang is special, we should treat it specially :).  If there are other
   attrs that go on the html entity, we can add them later? Onward!
   
   On Sat, May 26, 2018 at 3:26 PM Chris Mattmann wrote:
   
   > @tballison I hear you and am open to
   > alternatives. What is a better way to do this? I think missing the lang
   > attribute is a pretty bad thing and have seen it in the past. It feels like
   > HTMLParser as a parser should contribute it (and I don't think you object
   > to that) perhaps via Metadata.set and then what should we do to propagate
   > it?
   






[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491795#comment-16491795
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

chrismattmann commented on issue #238: TIKA-2100 extract content language from 
html lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392283049
 
 
   @tballison I hear you and am open to alternatives. What is a better way to 
do this? I think missing the lang attribute is a pretty bad thing and have seen 
it in the past. It feels like HTMLParser as a parser should contribute it (and 
I don't think you object to that) perhaps via Metadata.set and then what should 
we do to propagate it?






[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491794#comment-16491794
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

tballison commented on issue #238: TIKA-2100 extract content language from html 
lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392282862
 
 
   It feels weird to me to allow such special handling of lang, but if you and
   fellow devs don’t mind, go for it.
   
   On Sat, May 26, 2018 at 2:49 PM Chris Mattmann wrote:
   
   > this LGTM - @tballison are we good to
   > commit this?
   






[jira] [Commented] (TIKA-2646) Tika parse["content"] returns jumbled text across cells of a table in a pdf

2018-05-26 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491792#comment-16491792
 ] 

Chris A. Mattmann commented on TIKA-2646:
-

[~adidier] see comment above from [~lfcnassif]

> Tika parse["content"] returns jumbled text across cells of a table in a pdf
> ---
>
> Key: TIKA-2646
> URL: https://issues.apache.org/jira/browse/TIKA-2646
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.18
> Environment: MacOS Sierra 10.12.6
>Reporter: Annie Didier
>Priority: Trivial
>  Labels: performance
>
> When text from a table is extracted, sometimes the order of the cells becomes 
> mixed and the words get concatenated together. For example:
>  
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||DESCRIPTION||
> becomes: Hours Dur Code Sub DescriptionPhase
>  
> In other more serious cases, the text within a cell becomes scrambled with a 
> text from another cell. Such as:
> ||HOURS||DUR
> (hr)||PHASE||CODE||SUB||
> |00:00 - 17:00|17.00|FLOWBK|33 P - FLOWBACK / 
> TESTING|E - RIG OUT
> TESTERS|
> the second row becomes:
> 17.00-00:00 17:00 FLOWBK E - RIG OUT
>  
> TESTERS
>  
> 33 P -
>  
> FLOWBACK /
>  
> TESTING
> Note that the value of the second column has moved to the first column, and 
> the "-" within the first column is misordered. The last two columns have 
> switched places.





[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491780#comment-16491780
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

chrismattmann commented on issue #238: TIKA-2100 extract content language from 
html lang attribute
URL: https://github.com/apache/tika/pull/238#issuecomment-392280901
 
 
   this LGTM - @tballison are we good to commit this?






[jira] [Commented] (TIKA-2100) Html Parser does not keep the html tag attributes

2018-05-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491779#comment-16491779
 ] 

ASF GitHub Bot commented on TIKA-2100:
--

chrismattmann commented on a change in pull request #238: TIKA-2100 extract 
content language from html lang attribute
URL: https://github.com/apache/tika/pull/238#discussion_r191056235
 
 

 ##
 File path: tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
 ##
 @@ -138,7 +138,12 @@ private void lazyStartHead() throws SAXException {
 
 // Call directly, so we don't go through our startElement(), which will
 // ignore these elements.
-super.startElement(XHTML, "html", "html", EMPTY_ATTRIBUTES);
+AttributesImpl htmlAttrs = new AttributesImpl();
 
 Review comment:
   let's include :lang in the HTML attributes as well +1.
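
The diff excerpt above stops right after `htmlAttrs` is created. A minimal sketch of how a lang value could be added to such an attribute list with the standard SAX `AttributesImpl`; the helper class, the method name, and the choice to skip empty values are assumptions, not the contents of pull request #238:

```java
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper for illustration; not the code from pull request #238.
public class HtmlLangAttributes {

    // Build the attribute list for the <html> start tag, adding lang only when it is known.
    static AttributesImpl htmlAttributes(String lang) {
        AttributesImpl htmlAttrs = new AttributesImpl();
        if (lang != null && !lang.isEmpty()) {
            // Arguments: namespace URI, local name, qualified name, type, value.
            htmlAttrs.addAttribute("", "lang", "lang", "CDATA", lang);
        }
        return htmlAttrs;
    }
}
```

The resulting object could then be passed where the removed line passed `EMPTY_ATTRIBUTES`, so a downstream handler would find the value via `atts.getValue("lang")`.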



