[jira] [Closed] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jayesh Shende (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayesh Shende closed NUTCH-2392.

Resolution: Not A Bug

> Get same pages multiple times if URL contains relative path
> ---
>
> Key: NUTCH-2392
> URL: https://issues.apache.org/jira/browse/NUTCH-2392
> Project: Nutch
> Issue Type: Bug
> Components: commoncrawl
> Affects Versions: 1.13
> Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1
> Reporter: Jayesh Shende
> Priority: Critical
> Labels: features
> Fix For: 1.14
>
> Original Estimate: 60h
> Remaining Estimate: 60h
>
> When a website links to the same HTML document through relative URLs on 
> different pages, Nutch fetches that document more than once. For example, at 
> the first depth I fetched http://example.com/index.html. A few depths later I 
> got the link http://example.com/Level1/Level2/../../index.html (constructed 
> by Nutch from a relative path in an anchor tag). Nutch then fetches the same 
> HTML document twice, treating the two URLs as different although they point 
> to the same page.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jayesh Shende (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042214#comment-16042214
 ] 

Jayesh Shende edited comment on NUTCH-2392 at 6/8/17 5:18 AM:
--

The URLs I was working with were very long and differed only in the case of a 
single letter. Yes, it is not a bug.


was (Author: jayesh):
The URLs I was worling with were very long URLs with difference of one single 
letter cased differently. Yes It is not a bug.



[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jayesh Shende (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042214#comment-16042214
 ] 

Jayesh Shende commented on NUTCH-2392:
--

The URLs I was worling with were very long URLs with difference of one single 
letter cased differently. Yes It is not a bug.



[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040881#comment-16040881
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2392:
---

In this case, Nutch detects a relative URL and does the work to make it 
"fetchable", which means turning it into a full URL. But you'll find the same 
issue not only with relative URLs: you can also find totally different URLs 
with the same content thanks to the "magic" of some CMS. One case that I've 
found quite often is the presence or absence of {{index.php}} in URLs with 
exactly the same content. I've also found this issue with OCS (Open Conference 
Systems), https://pkp.sfu.ca/ocs/.
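
As a rough illustration of what collapsing the `..` segments in such a URL involves, here is a minimal Python sketch. It is not Nutch's normalizer code: the function names `remove_dot_segments` and `normalize_url` are made up for this example, and only the two example.com URLs come from the report. It follows the dot-segment removal idea from RFC 3986 and assumes an absolute path.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments in an absolute URL path (hypothetical helper)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            # '.' refers to the current directory, so it is simply dropped.
            continue
        if segment == "..":
            # '..' removes the previous segment, but never the leading root "".
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A path ending in '.' or '..' names a directory, so keep the trailing slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def normalize_url(url: str) -> str:
    """Return the URL with dot segments removed from its path (hypothetical helper).

    Scheme, host, query and fragment are passed through unchanged,
    and the case of the path is deliberately left alone.
    """
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, parts.netloc, remove_dot_segments(parts.path),
         parts.query, parts.fragment)
    )


first = normalize_url("http://example.com/index.html")
second = normalize_url("http://example.com/Level1/Level2/../../index.html")
same_page = first == second
```

With this kind of normalization the two URLs from the report compare equal before fetching. Path case is not touched, so two URLs that differ only in the case of one letter would still be treated as distinct.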

Can you provide the exact URLs that you've found? Are both URLs being indexed 
in Solr? Even if both URLs are fetched, they should be deduplicated later on: 
however different the URLs are, they should have the same signature/digest, 
which is calculated from the extracted text; see 
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextMD5Signature.java.

The problem is that you need to actually fetch and parse a URL to know that 
it is a duplicate; we need to assume that both URLs are different until 
proven otherwise :).
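
The idea of deduplicating by a digest of the extracted text can be sketched as follows. This is only an illustration in Python under stated assumptions: it hashes the raw text with MD5, whereas the linked TextMD5Signature class may prepare the text differently, and the names `text_signature` and `group_by_signature` as well as the sample page texts are invented for the example.

```python
import hashlib


def text_signature(text: str) -> str:
    """Return an MD5 hex digest of the extracted page text (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def group_by_signature(pages: dict) -> dict:
    """Group URLs by the signature of their extracted text.

    `pages` maps URL -> extracted text. URLs that end up in the same
    group carry identical text and are candidates for deduplication.
    """
    groups = {}
    for url, text in pages.items():
        groups.setdefault(text_signature(text), []).append(url)
    return groups


# Sample data: the texts are made up, the first two URLs come from the report.
pages = {
    "http://example.com/index.html": "Welcome to the example site",
    "http://example.com/Level1/Level2/../../index.html": "Welcome to the example site",
    "http://example.com/other.html": "Some other page",
}
groups = group_by_signature(pages)
duplicates = [urls for urls in groups.values() if len(urls) > 1]
```

Note that the text of every page has to be available before any group can be formed, which is the point made above: the duplicate is only detectable after both URLs have been fetched and parsed.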



[jira] [Updated] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jayesh Shende (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jayesh Shende updated NUTCH-2392:
-
Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1  (was: Ubuntu, JRE 
1.8.131)



[jira] [Created] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jayesh Shende (JIRA)
Jayesh Shende created NUTCH-2392:


Summary: Get same pages multiple times if URL contains relative path
Key: NUTCH-2392
URL: https://issues.apache.org/jira/browse/NUTCH-2392
Project: Nutch
Issue Type: Bug
Components: commoncrawl
Affects Versions: 1.13
Environment: Ubuntu, JRE 1.8.131
Reporter: Jayesh Shende
Priority: Critical
Fix For: 1.14


When a website links to the same HTML document through relative URLs on 
different pages, Nutch fetches that document more than once. For example, at 
the first depth I fetched http://example.com/index.html. A few depths later I 
got the link http://example.com/Level1/Level2/../../index.html (constructed by 
Nutch from a relative path in an anchor tag). Nutch then fetches the same HTML 
document twice, treating the two URLs as different although they point to the 
same page.





[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors

2017-06-07 Thread Kaidul Islam (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040739#comment-16040739
 ] 

Kaidul Islam commented on NUTCH-2389:
-

Hi [~lewismc], I've redesigned it significantly and opened pull request #192. 
Thank you.

> Precise data parsing using Jsoup CSS selectors
> --
>
> Key: NUTCH-2389
> URL: https://issues.apache.org/jira/browse/NUTCH-2389
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 2.3
> Reporter: Kaidul Islam
> Assignee: Kaidul Islam
> Fix For: 2.4
>
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> As far as I know, Nutch 1.x and 2.x currently have no feature to 
> extract/parse exact content from specific websites. I've developed a plugin, 
> {{parse-jsoup}}, using Jsoup for my current project. It extracts precise 
> content for site-specific crawling, driven by a detailed XML configuration 
> (field name, CSS selector, attribute, extraction rules, data type, default 
> value, etc.).
> Please let me know if this feature seems relevant and is not already present 
> in Nutch. I also plan to port it to Nutch 1.x.





[jira] [Commented] (NUTCH-2389) Precise data parsing using Jsoup CSS selectors

2017-06-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040736#comment-16040736
 ] 

ASF GitHub Bot commented on NUTCH-2389:
---

kaidul opened a new pull request #192: NUTCH-2389 Precise data extractor 
implemented for 2.x
URL: https://github.com/apache/nutch/pull/192
 
 
   A per-webpage precise data extractor based on the jsoup CSS-selector API and 
configurable through an XML file. A parse filter and a complementary indexing 
filter plugin are implemented, as is the ability to define custom normalizers 
for specific extracted data. I've successfully tested this module on my large 
project, and unit tests are included as well.
 


