[jira] [Commented] (NUTCH-2033) parse-tika skips valid documents.
[ https://issues.apache.org/jira/browse/NUTCH-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210626#comment-16210626 ]

Luis Lopez commented on NUTCH-2033:
-----------------------------------

It looks like the original NOAA resource is no longer there (404), but the problem can be tested with https://ecowatch.ncddc.noaa.gov/erddap/opensearch1.1/description.xml, which gives the same result.

> parse-tika skips valid documents.
> ---------------------------------
>
>                 Key: NUTCH-2033
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2033
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>            Assignee: Lewis John McGibbney
>              Labels: mime-type, parse-tika, parser, tika
>             Fix For: 1.14
>
>
> If we run:
> {code}
> bin/nutch parsechecker -dumpText http://ngdc.noaa.gov/geoportal/openSearchDescription
> {code}
> we’ll get:
> {code}
> Status: failed(2,0): Can't retrieve Tika parser for mime-type application/opensearchdescription+xml
> {code}
> The same occurs for:
> {code}
> bin/nutch parsechecker -dumpText http://petstore.swagger.io/v2/swagger.json
> {code}
> Both would be perfectly valid documents if they were returned as "application/xml" and
> "text/plain" respectively.
> This happens because parse-tika uses the mime type to retrieve a suitable parser, and some
> composite mime types are not included in its list even though they describe perfectly valid
> and parsable documents. This does not even take into account that servers often return
> incorrect mime types for the documents requested.
> We created a helper class as a workaround for this issue. The class uses regular expressions
> to define synonyms. In the first case, any mime type that matches "application/(.*)\+xml"
> will be replaced by "application/xml". This way parse-tika will parse the document just fine.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
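The synonym idea described in the issue can be sketched in a few lines. This is only an illustration in Python, not the helper class mentioned above (which is not shown in this thread): the names MIME_SYNONYMS and normalize_mime_type are hypothetical, and stripping parameters such as "; charset=UTF-8" before matching is an added assumption. The single rule in the table is the one quoted in the issue.

```python
import re

# Hypothetical synonym table: (pattern, replacement) pairs, checked in order.
# The only rule listed is the one quoted in the issue description.
MIME_SYNONYMS = [
    (re.compile(r"application/(.*)\+xml"), "application/xml"),
]


def normalize_mime_type(mime_type):
    """Return the synonym for mime_type, or mime_type unchanged if no rule matches."""
    # Assumption: parameters after ';' are ignored and matching is case-insensitive.
    base = mime_type.split(";", 1)[0].strip().lower()
    for pattern, replacement in MIME_SYNONYMS:
        # fullmatch requires the whole base type to match the pattern.
        if pattern.fullmatch(base):
            return replacement
    return mime_type


if __name__ == "__main__":
    print(normalize_mime_type("application/opensearchdescription+xml"))
    print(normalize_mime_type("text/html"))
```

With a mapping like this applied before the parser lookup, a composite type ending in "+xml" would be handed to whatever parser is registered for "application/xml", while unlisted types pass through untouched.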
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209562#comment-16209562 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

sebastian-nagel commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337636717

   Hi @marconett,

   please ask for help on the [Nutch user mailing list](http://nutch.apache.org/mailing_lists.html) or report the problem at https://issues.apache.org/jira/projects/NUTCH. That's a closed pull request, and nothing will be fixed here.

   Thanks!

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
>                      NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch,
>                      NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch,
>                      NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
> appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209547#comment-16209547 ]

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

marconett commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337633586

   I'm running into the same problem and am unable to inject sitemap content into the db.

   Here are the commands I used (output omitted; it's the same as above):

   ```
   bin/nutch inject crawl/crawldb urls/
   bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
   bin/nutch readdb crawl/crawldb -stats
   ```

   where `urls/seed.txt` contains "https://www.linux.com/" and `sitemaps/seed.txt` contains "https://www.linux.com/sitemap.xml".

   I can see (with tcpdump) that HTTPS connections to linux.com are being established while `bin/nutch sitemap` is running, but nothing gets injected into the crawldb.

   Is there any info on this? Should this be fixed?

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
>                      NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch,
>                      NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch,
>                      NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
> appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html
[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2411:
---------------------------------
    Attachment: NUTCH-2411.patch

Crap, previous version was wrong too!

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
>                 Key: NUTCH-2411
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2411
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.14
>
>         Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch,
>                      NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2411:
---------------------------------
    Attachment: NUTCH-2411.patch

The patch was missing support for the index.metadata.multivalued.fields parameter.

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
>                 Key: NUTCH-2411
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2411
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.14
>
>         Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch,
>                      NUTCH-2411.patch, NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
[jira] [Commented] (NUTCH-2411) Index-metadata to support indexing multiple values for a field
[ https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209124#comment-16209124 ]

Markus Jelsma commented on NUTCH-2411:
--------------------------------------

Will commit shortly unless there are objections.

> Index-metadata to support indexing multiple values for a field
> --------------------------------------------------------------
>
>                 Key: NUTCH-2411
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2411
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.14
>
>         Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, NUTCH-2411.patch,
>                      NUTCH-2411.patch
>
>
> {code}
> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field. Leave empty to
>    treat each value as a single value.
>   </description>
> </property>
>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>
> {code}
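To make the intent of the two NUTCH-2411 properties concrete, here is a rough Python sketch of how a separator and a list of multi-valued fields could interact. It is not the code from the attached patches: the function names, the trimming of whitespace, the dropping of empty parts, and the example field names "keywords" and "authors" are all assumptions made for illustration. Only the two behaviours stated in the property descriptions are taken from the issue: an empty separator leaves each value as a single value, and the multi-valued fields are given as a comma-separated list.

```python
def parse_field_list(raw):
    """Parse a comma-separated property value into a set of field names."""
    return {name.strip() for name in raw.split(",") if name.strip()}


def split_metadata_value(field, raw_value, separator, multivalued_fields):
    """Return the list of values to index for one metadata field.

    The value is split only when a separator is configured and the field is
    listed as multi-valued; otherwise it is kept as a single value.
    """
    if separator and field in multivalued_fields:
        # Assumption: surrounding whitespace is trimmed and empty parts dropped.
        return [part.strip() for part in raw_value.split(separator) if part.strip()]
    return [raw_value]


if __name__ == "__main__":
    # Hypothetical settings standing in for index.metadata.multivalued.fields
    # and index.metadata.separator.
    fields = parse_field_list("keywords, authors")
    print(split_metadata_value("keywords", "nutch; tika;solr", ";", fields))
    print(split_metadata_value("title", "Nutch; a crawler", ";", fields))
```

Under these assumptions, a field that is not listed keeps its raw value even if it happens to contain the separator character, which is the reason for having the second property at all.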