[jira] [Commented] (NUTCH-2033) parse-tika skips valid documents.

2017-10-18 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210626#comment-16210626
 ] 

Luis Lopez commented on NUTCH-2033:
---

The original NOAA resource is no longer available (404), but the issue can be
reproduced, with the same result, using:
https://ecowatch.ncddc.noaa.gov/erddap/opensearch1.1/description.xml

> parse-tika skips valid documents.
> -
>
> Key: NUTCH-2033
> URL: https://issues.apache.org/jira/browse/NUTCH-2033
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.10
> Reporter: Luis Lopez
> Assignee: Lewis John McGibbney
> Labels: mime-type, parse-tika, parser, tika
> Fix For: 1.14
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
> {code}
> The same occurs for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both are perfectly valid documents and would parse if they were returned as
> "application/xml" and "text/plain" respectively.
> This happens because parse-tika uses the mime type to retrieve a suitable
> parser, and some composite mime types are not included in its list even though
> the documents are perfectly valid and parsable. On top of that, servers often
> return incorrect mime types for the documents requested.
> We created a helper class as a workaround for this issue. The class uses
> regular expressions to define synonyms. In the first case, any mime type that
> matches "application/(.*)\+xml" is replaced by "application/xml", so
> parse-tika parses the document just fine.
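
The synonym idea described above could be sketched roughly as follows. This is only an illustration in Python, not the helper class from the issue: the names `MIME_SYNONYMS` and `normalize_mime_type` are hypothetical, and stripping parameters such as `; charset=...` before matching is an assumption. Only the `application/(.*)\+xml` rule is taken from the description.

```python
import re

# Hypothetical synonym table: (compiled pattern, replacement mime type).
# Only this first rule comes from the issue description; further rules
# (for example for JSON documents) would be added the same way.
MIME_SYNONYMS = [
    (re.compile(r"application/(.*)\+xml"), "application/xml"),
]


def normalize_mime_type(mime_type):
    """Map a mime type to its synonym, or return it unchanged if none matches."""
    # Assumption: drop parameters such as "; charset=utf-8" before matching.
    base = mime_type.split(";", 1)[0].strip().lower()
    for pattern, replacement in MIME_SYNONYMS:
        if pattern.fullmatch(base):
            return replacement
    return base


print(normalize_mime_type("application/opensearchdescription+xml"))
```

With such a mapping applied before the parser lookup, a composite type like `application/opensearchdescription+xml` would be looked up as plain `application/xml`.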



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209562#comment-16209562
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

sebastian-nagel commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337636717
 
 
   Hi @marconett, please ask for help on the
[Nutch user mailing list](http://nutch.apache.org/mailing_lists.html) or report
the problem at https://issues.apache.org/jira/projects/NUTCH. That's a closed
pull request, and nothing will be fixed here. Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209547#comment-16209547
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

marconett commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337633586
 
 
   I'm running into the same problem and am unable to inject sitemap content 
into the db. Here are the commands I used (output omitted, it's the same as 
above):
   
   ```
   bin/nutch inject crawl/crawldb urls/
   bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
   bin/nutch readdb crawl/crawldb -stats
   ```
   
   where `urls/seed.txt` contains "https://www.linux.com/" and 
`sitemaps/seed.txt` contains "https://www.linux.com/sitemap.xml".
   
   With tcpdump I can see HTTPS connections being established to 
linux.com while `bin/nutch sitemap` is running, but nothing gets injected into 
the crawldb.
   
   Is there any info on this? Should this be fixed?









[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Attachment: NUTCH-2411.patch

Crap, previous version was wrong too!

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
>
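> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}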
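
As a rough illustration of what these two properties describe, the splitting behaviour might look like the sketch below. It is written in Python for brevity and is not the code from the attached patch: the function name `split_metadata_value` is hypothetical, and the rule that a value is split only when the separator is non-empty and the field is listed as multi valued is an assumption drawn from the property descriptions.

```python
def split_metadata_value(field, value, separator="", multivalued_fields=frozenset()):
    """Return the list of values to index for one metadata field.

    Assumption: the value is split only when a separator is configured and
    the field is named in the multi valued fields list; otherwise the whole
    string is treated as a single value.
    """
    if separator and field in multivalued_fields:
        return value.split(separator)
    return [value]


# Hypothetical settings: separator "|" and one multi valued field.
print(split_metadata_value("keywords", "nutch|tika|solr", "|", {"keywords"}))
```

With an empty separator, which is what the description above calls leaving it empty, the same call returns the original string as its only value.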





[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2411:
-
Attachment: NUTCH-2411.patch

The patch was missing support for the index.metadata.multivalued.fields 
parameter.

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
>
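> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}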





[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-10-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209124#comment-16209124
 ] 

Markus Jelsma commented on NUTCH-2411:
--

Will commit shortly unless there are objections.

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.13
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch
>
>
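> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}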


