[jira] [Closed] (NUTCH-2392) Get same pages multiple times if URL contains relative path
[ https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jayesh Shende closed NUTCH-2392.
    Resolution: Not A Bug

> Get same pages multiple times if URL contains relative path
> ---
>
> Key: NUTCH-2392
> URL: https://issues.apache.org/jira/browse/NUTCH-2392
> Project: Nutch
> Issue Type: Bug
> Components: commoncrawl
> Affects Versions: 1.13
> Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1
> Reporter: Jayesh Shende
> Priority: Critical
> Labels: features
> Fix For: 1.14
>
> Original Estimate: 60h
> Remaining Estimate: 60h
>
> When a website links to the same HTML document through relative URLs on
> different pages, Nutch fetches that document more than once. For example,
> at the first depth I fetched the contents of
> http://example.com/index.html. A few depths later I got a link
> (constructed by Nutch from a relative path in an anchor tag)
> http://example.com/Level1/Level2/../../index.html. Nutch fetches the same
> HTML document twice, treating the two URLs as different although they are
> not.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (NUTCH-2392) Get same pages multiple times if URL contains relative path
[ https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042214#comment-16042214 ]

Jayesh Shende edited comment on NUTCH-2392 at 6/8/17 5:18 AM:

The URLs I was working with were very long URLs that differed only in the case of a single letter. Yes, it is not a bug.

was (Author: jayesh):
The URLs I was worling with were very long URLs with difference of one single letter cased differently. Yes It is not a bug.
[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path
[ https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042214#comment-16042214 ]

Jayesh Shende commented on NUTCH-2392:

The URLs I was working with were very long URLs that differed only in the case of a single letter. Yes, it is not a bug.
[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path
[ https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040881#comment-16040881 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2392:

In this case, Nutch detects a relative URL and does the work needed to make it "fetchable", that is, it resolves it into a full URL. But you'll find the same issue not only with relative URLs: totally different URLs can serve the same content thanks to the "magic" of some CMS. One case that I've found quite often is the presence or absence of {{index.php}} in URLs with exactly the same content. I've also found this issue with OCS (Open Conference Systems) https://pkp.sfu.ca/ocs/.

Can you provide the exact URLs that you've found? Are both URLs being indexed in Solr? Even if both URLs are fetched, they should be deduplicated later on. Even totally different URLs should have the same signature/digest as long as the text extracted from them is identical, see https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextMD5Signature.java. The problem is that you need to actually fetch and parse a URL to know that it is a duplicate, so we need to assume that both URLs are different until proven otherwise :).
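The digest-based deduplication described in this comment can be illustrated with a small sketch. This is not the code of Nutch's TextMD5Signature class; it is a Python approximation of the idea, and the helper names `text_signature` and `group_by_signature` as well as the sample pages are made up for illustration.

```python
import hashlib
from typing import Dict, List


def text_signature(extracted_text: str) -> str:
    """Hypothetical helper: MD5 hex digest of the text extracted from a page."""
    return hashlib.md5(extracted_text.encode("utf-8")).hexdigest()


def group_by_signature(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group URLs whose extracted text produces the same signature."""
    groups: Dict[str, List[str]] = {}
    for url, text in pages.items():
        groups.setdefault(text_signature(text), []).append(url)
    return groups


# Made-up sample data: two URLs serving identical text, one serving other text.
pages = {
    "http://example.com/index.php": "Welcome to the conference site",
    "http://example.com/": "Welcome to the conference site",
    "http://example.com/about": "About us",
}

# Only groups with more than one URL are duplicates.
duplicates = [urls for urls in group_by_signature(pages).values() if len(urls) > 1]
print(duplicates)
```

Note that the signature can only be computed once the text is available, which matches the point above: both URLs have to be fetched and parsed before they can be recognised as duplicates.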
[jira] [Updated] (NUTCH-2392) Get same pages multiple times if URL contains relative path
[ https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jayesh Shende updated NUTCH-2392:
    Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1 (was: Ubuntu, JRE 1.8.131)
[jira] [Created] (NUTCH-2392) Get same pages multiple times if URL contains relative path
Jayesh Shende created NUTCH-2392:

Summary: Get same pages multiple times if URL contains relative path
Key: NUTCH-2392
URL: https://issues.apache.org/jira/browse/NUTCH-2392
Project: Nutch
Issue Type: Bug
Components: commoncrawl
Affects Versions: 1.13
Environment: Ubuntu, JRE 1.8.131
Reporter: Jayesh Shende
Priority: Critical
Fix For: 1.14

When a website links to the same HTML document through relative URLs on different pages, Nutch fetches that document more than once. For example, at the first depth I fetched the contents of http://example.com/index.html. A few depths later I got a link (constructed by Nutch from a relative path in an anchor tag) http://example.com/Level1/Level2/../../index.html. Nutch fetches the same HTML document twice, treating the two URLs as different although they are not.
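To make the two URL forms in this report concrete, here is a minimal Python sketch of dot-segment removal. It is not Nutch's URL normalizer; the function names `remove_dot_segments` and `normalize_url` are hypothetical, and the base page `page.html` is an assumed example of where the relative link might have appeared.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Hypothetical helper: collapse "." and ".." segments in a URL path."""
    output = []
    for segment in path.split("/"):
        if segment == "..":
            # Drop the previous segment, but never the leading root segment.
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            output.append(segment)
    # A path ending in "." or ".." refers to a directory: keep the trailing slash.
    if path.endswith(("/.", "/..")):
        output.append("")
    return "/".join(output)


def normalize_url(url: str) -> str:
    """Return the URL with dot segments removed from its path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


# The link from the report collapses to the URL fetched at the first depth.
normalized = normalize_url("http://example.com/Level1/Level2/../../index.html")
print(normalized)

# Resolving the relative link against an assumed base page gives the same URL.
resolved = urljoin("http://example.com/Level1/Level2/page.html", "../../index.html")
print(resolved)

# URL paths are case-sensitive: a single differently cased letter is a different URL.
print(normalized == "http://example.com/Index.html")
```

The last comparison is the situation that can make two long URLs look identical at a glance while still being distinct: dot-segment removal does not touch letter case in the path.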
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040739#comment-16040739 ]

Kaidul Islam commented on NUTCH-2389:

Hi [~lewismc], I've significantly redesigned the plugin and opened pull request #192. Thank you.

> Precise data parsing using Jsoup CSS selectors
> ---
>
> Key: NUTCH-2389
> URL: https://issues.apache.org/jira/browse/NUTCH-2389
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.3
> Reporter: Kaidul Islam
> Assignee: Kaidul Islam
> Fix For: 2.4
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature to
> extract/parse exact contents for specific websites. I've developed a
> plugin, {{parse-jsoup}}, using Jsoup for my current project. It extracts
> precise content for site-specific crawling using a detailed XML
> configuration (field name, CSS selector, attribute, extraction rules,
> data type, default value, etc.). Please let me know if this feature seems
> relevant and is not already present in Nutch. I also plan to port it to
> Nutch 1.x.
[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors
[ https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040736#comment-16040736 ]

ASF GitHub Bot commented on NUTCH-2389:

kaidul opened a new pull request #192: NUTCH-2389 Precise data extractor implemented for 2.x
URL: https://github.com/apache/nutch/pull/192

A per-webpage precise data extractor based on the jsoup CSS-selector API and configurable using an XML file. A parse filter and a complementary indexing filter plugin are implemented, along with support for defining custom normalizers on specific extracted data. I've successfully tested this module on my large project, and unit tests are added as well.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org