[jira] [Commented] (NUTCH-1764) readdb to show command-line help if no action (-stats, -dump, etc.) given

2014-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982128#comment-13982128
 ] 

Hudson commented on NUTCH-1764:
---

SUCCESS: Integrated in Nutch-trunk #2618 (See 
[https://builds.apache.org/job/Nutch-trunk/2618/])
NUTCH-1764 readdb to show command-line help if no action (-stats, -dump, etc.) 
given (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1590315)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java


> readdb to show command-line help if no action (-stats, -dump, etc.) given
> -
>
> Key: NUTCH-1764
> URL: https://issues.apache.org/jira/browse/NUTCH-1764
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.8, 1.9
>Reporter: Diaa
>Priority: Minor
> Fix For: 1.9
>
> Attachments: CrawlDbReader.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> If you run the command readdb with just one argument nothing happens and no 
> usage warning is issued.
> Example: bin/nutch readdb crawldb
> Actual Result: Nothing happens
> Expected Result: "Usage CrawlDbReader ... "
> The issue is due to "if (args.length < 1) " which should be 2
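
A minimal sketch of the check described in the report, not the committed patch. With only the crawldb path given, args.length is 1, so a "< 1" test never fires, while "< 2" also catches the missing action. The class name ReaddbArgsSketch, the method hasAction and the usage text are hypothetical stand-ins, not code taken from CrawlDbReader.

```java
final class ReaddbArgsSketch {
    // Placeholder usage text (assumption); the real message is elided in the report.
    static final String USAGE = "Usage: CrawlDbReader <crawldb> <action> ...";

    /** Returns true when both a crawldb path and an action are present. */
    static boolean hasAction(String[] args) {
        // args[0] is the crawldb path, args[1] would be the action (-stats, -dump, ...).
        return args.length >= 2;
    }

    public static void main(String[] args) {
        if (!hasAction(args)) {
            // "bin/nutch readdb crawldb" ends up here instead of silently doing nothing.
            System.err.println(USAGE);
            System.exit(-1);
        }
        System.out.println("action: " + args[1]);
    }
}
```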



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (NUTCH-1765) SolrClean to remove redirected URLs from Solr

2014-04-26 Thread Iain Lopata (JIRA)
Iain Lopata created NUTCH-1765:
--

 Summary: SolrClean to remove redirected URLs from Solr
 Key: NUTCH-1765
 URL: https://issues.apache.org/jira/browse/NUTCH-1765
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.6
Reporter: Iain Lopata
Priority: Minor


SolrClean currently only removes urls with a status of STATUS_DB_GONE from the 
Solr Index.  It should also remove urls with a status of  STATUS_DB_REDIR_TEMP 
and  STATUS_DB_REDIR_PERM.





[jira] [Resolved] (NUTCH-1764) readdb to show command-line help if no action (-stats, -dump, etc.) given

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1764.


   Resolution: Fixed
Fix Version/s: (was: 1.8)

+1
Thanks, [~diaa_abdallah]! Committed to trunk r1590315.






[jira] [Updated] (NUTCH-1764) readdb to show command-line help if no action (-stats, -dump, etc.) given

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1764:
---

Summary: readdb to show command-line help if no action (-stats, -dump, 
etc.) given  (was: readdb arguments check bug)

> readdb to show command-line help if no action (-stats, -dump, etc.) given
> -
>
> Key: NUTCH-1764
> URL: https://issues.apache.org/jira/browse/NUTCH-1764
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.8, 1.9
>Reporter: Diaa
>Priority: Minor
> Fix For: 1.8, 1.9
>
> Attachments: CrawlDbReader.java.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> If you run the command readdb with just one argument nothing happens and no 
> usage warning is issued.
> Example: bin/nutch readdb crawldb
> Actual Result: Nothing happens
> Expected Result: "Usage CrawlDbReader ... "
> The issue is due to "if (args.length < 1) " which should be 2





[jira] [Commented] (NUTCH-797) parse-tika is not properly constructing URLs when the target begins with a "?"

2014-04-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982116#comment-13982116
 ] 

Sebastian Nagel commented on NUTCH-797:
---

Hi [~jnioche], is there anything left to do (except patching 2.x)? This has 
long been fixed for 1.x.

> parse-tika is not properly constructing URLs when the target begins with a "?"
> --
>
> Key: NUTCH-797
> URL: https://issues.apache.org/jira/browse/NUTCH-797
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.1, nutchgora
> Environment: Win 7, Java(TM) SE Runtime Environment (build 
> 1.6.0_16-b01)
> Also repro's on RHEL and java 1.4.2
>Reporter: Robert Hohman
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-797-2x.patch, NUTCH-797.patch, 
> pureQueryUrl-2.patch, pureQueryUrl.patch, test_nutch_797.html
>
>
> This is my first bug and patch on nutch, so apologies if I have not provided 
> enough detail.
> In crawling the page at 
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0 there are 
> links in the page that look like this:
> 2 <a href="?co=0&sk=0&p=3&pi=1">3</a>
> in org.apache.nutch.parse.tika.DOMContentUtils rev 916362 (trunk), as 
> getOutlinks looks for links, it comes across this link, and constructs a new 
> url with a base URL class built from 
> "http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0", and a 
> target of "?co=0&sk=0&p=2&pi=1"
> The URL class, per RFC 3986 at 
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#relative-merge, defines 
> how to merge these two, and per the RFC, the URL class merges these to: 
> http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&p=2&pi=1
> because the RFC explicitly states that the rightmost url segment (the 
> Search.aspx in this case) should be ripped off before combining.
> While this is compliant with the RFC, it means the URLs which are created for 
> the next round of fetching are incorrect.  Modern browsers seem to handle 
> this case (I checked IE8 and Firefox 3.5), so I'm guessing this is an obscure 
> exception or handling of what is a poorly formed url on accenture's part.
> I have fixed this by modifying DOMContentUtils to look for the case where a ? 
> begins the target, and then pulling the rightmost component out of the base 
> and inserting it into the target before the ?, so the target in this example 
> becomes:
> Search.aspx?co=0&sk=0&p=2&pi=1
> The URL class then properly constructs the new url as:
> http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0&p=2&pi=1
> If it is agreed that this solution works, I believe the other html parsers in 
> nutch would need to be modified in a similar way.
> Can I get feedback on this proposed solution?  Specifically I'm worried about 
> unforeseen side effects.
> Much thanks
> Here is the patch info:
> Index: 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> ===
> --- 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(revision 916362)
> +++ 
> src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
>(working copy)
> @@ -299,6 +299,50 @@
>  return false;
>}
>
> +  private URL fixURL(URL base, String target) throws MalformedURLException
> +  {
> +   // handle params that are embedded into the base url - move them to 
> target
> +   // so URL class constructs the new url class properly
> +   if  (base.toString().indexOf(';') > 0)  
> +  return fixEmbeddedParams(base, target);
> +   
> +   // handle the case that there is a target that is a pure query.
> +   // Strictly speaking this is a violation of RFC 2396 section 5.2.2 on 
> how to assemble
> +   // URLs but I've seen this in numerous places, for example at
> +   // http://careers3.accenture.com/Careers/ASPX/Search.aspx?co=0&sk=0
> +   // It has urls in the page of the form href="?co=0&sk=0&pg=1", and by 
> default
> +   // URL constructs the base+target combo as 
> +   // http://careers3.accenture.com/Careers/ASPX/?co=0&sk=0&pg=1, 
> incorrectly
> +   // dropping the Search.aspx target
> +   //
> +   // Browsers handle these just fine, they must have an exception 
> similar to this
> +   if (target.startsWith("?"))
> +   {
> +   return fixPureQueryTargets(base, target);
> +   }
> +   
> +   return new URL(base, target);
> +  }
> +  
> +  private URL fixPureQueryTargets(URL base, String target) throws 
> MalformedURLException
> +  {
> + if (!target.startsWith("?"))
> + return new URL(base, target);
> +
> + String

[jira] [Resolved] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-952.
---

Resolution: Fixed

> fix outlink which started with '?' in html parser
> -
>
> Key: NUTCH-952
> URL: https://issues.apache.org/jira/browse/NUTCH-952
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Stondet
> Fix For: 1.9
>
> Attachments: NUTCH-952-v2.patch, test_nutch_952.html
>
>
> ruby on rails(a snippet from 
> http://bbs.soso.com/search?ty=c&sd=0&w=rails)
> outlink parsed from above link: 
> http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
> but expected is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0





[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-566:
--

Fix Version/s: (was: 1.9)

> Sun's URL class has bug in creation of relative query URLs
> --
>
> Key: NUTCH-566
> URL: https://issues.apache.org/jira/browse/NUTCH-566
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8, 0.8.1, 0.9.0
> Environment: MacOS X and Linux (CentOS 4.5) both
>Reporter: Doug Cook
>Priority: Minor
> Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example, 
> with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link 
> of "?id_entrep=111", Nutch will resolve this pair to the link
> "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers 
> I tried will resolve the pair to 
> "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class. 
> According to Sun's spec, they parse the relative URL according to RFC 2396. 
> But the original RFC for relative links was RFC 1808, and the two RFCs differ 
> in how they handle relative links beginning with "?". Most browsers 
> (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for 
> compatibility and also because the behavior makes more sense). Apparently 
> even the people that wrote RFC 2396 recognized that this was a mistake, and 
> the specified behavior was changed in RFC 3986 to match what browsers do. 
> For a discussion of this, see  
> http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC2396, as far as I can 
> tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method 
> to create URLs which behaves exactly as new URL(URL base, String 
> relativePath), and I use it as a drop-in replacement for that in 
> DOMContentUtils, Javascript link extraction, etc. Obviously, it really only 
> matters wherever links are extracted. I haven't included the calling code 
> from DOMContentUtils, etc. because my local versions are largely rewritten, 
> but it should be pretty obvious.
> I put it in the org.apache.nutch.net directory, but obviously feel free to 
> move it to another place if you feel it belongs there!
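
One way to sketch the workaround described above: when the target is a pure query ("?..."), prefix it with the last path segment of the base before handing the pair to java.net.URL. This is an illustration under assumptions, not the attached RelativeURL.java. The class name PureQueryUrlSketch and the method resolve are hypothetical, and malformed input is reported as an IllegalArgumentException only to keep the example short.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class PureQueryUrlSketch {
    /** Resolves target against base, keeping the base's last path segment for "?..." targets. */
    static String resolve(String base, String target) {
        try {
            URL baseUrl = new URL(base);
            String t = target.trim();
            if (t.startsWith("?")) {
                // getPath() excludes the query, e.g. "/entreprise.asp".
                String path = baseUrl.getPath();
                String lastSegment = path.substring(path.lastIndexOf('/') + 1);
                // "?id_entrep=111" becomes "entreprise.asp?id_entrep=111".
                t = lastSegment + t;
            }
            return new URL(baseUrl, t).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.fleurie.org/entreprise.asp?id_entrep=111
        System.out.println(resolve("http://www.fleurie.org/entreprise.asp", "?id_entrep=111"));
        // prints http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
        System.out.println(resolve("http://bbs.soso.com/search?ty=c&sd=0&w=rails",
                "?w=ruby%20on%20rails&ty=c&sd=0"));
    }
}
```

Targets that do not start with "?" go through the ordinary two-argument URL constructor unchanged, so the sketch only alters the one case the report is about.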





[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-952:
--

Fix Version/s: (was: 1.9)

> fix outlink which started with '?' in html parser
> -
>
> Key: NUTCH-952
> URL: https://issues.apache.org/jira/browse/NUTCH-952
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Stondet
> Attachments: NUTCH-952-v2.patch, test_nutch_952.html
>
>
> ruby on rails(a snippet from 
> http://bbs.soso.com/search?ty=c&sd=0&w=rails)
> outlink parsed from above link: 
> http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
> but expected is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0





[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-952:
--

Attachment: test_nutch_952.html

Was fixed by NUTCH-797 for v 1.4 (2.x will follow soon). Example link 
(attached) works now for 1.8 (both with parse-html and parse-tika):
{code}
% nutch parsechecker http://localhost/test_nutch_952.html
...
Outlinks: 1
  outlink: toUrl: http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
{code}






[jira] [Resolved] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2014-04-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-566.
---

Resolution: Fixed

Was fixed by NUTCH-797 in version 1.4 (2.x will be patched soon). The 
problematic example ({{http://www.fleurie.org/entreprise.asp?id_entrep=111}}) 
is included in a unit test (o.a.n.util.TestURLUtil).

> Sun's URL class has bug in creation of relative query URLs
> --
>
> Key: NUTCH-566
> URL: https://issues.apache.org/jira/browse/NUTCH-566
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8, 0.8.1, 0.9.0
> Environment: MacOS X and Linux (CentOS 4.5) both
>Reporter: Doug Cook
>Priority: Minor
> Fix For: 1.9
>
> Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example, 
> with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link 
> of "?id_entrep=111", Nutch will resolve this pair to the link
> "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers 
> I tried will resolve the pair to 
> "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class. 
> According to Sun's spec, they parse the relative URL according to RFC 2396. 
> But the original RFC for relative links was RFC 1808, and the two RFCs differ 
> in how they handle relative links beginning with "?". Most browsers 
> (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for 
> compatibility and also because the behavior makes more sense). Apparently 
> even the people that wrote RFC 2396 recognized that this was a mistake, and 
> the specified behavior was changed in RFC 3986 to match what browsers do. 
> For a discussion of this, see  
> http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC2396, as far as I can 
> tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method 
> to create URLs which behaves exactly as new URL(URL base, String 
> relativePath), and I use it as a drop-in replacement for that in 
> DOMContentUtils, Javascript link extraction, etc. Obviously, it really only 
> matters wherever links are extracted. I haven't included the calling code 
> from DOMContentUtils, etc. because my local versions are largely rewritten, 
> but it should be pretty obvious.
> I put it in the org.apache.nutch.net directory, but obviously feel free to 
> move it to another place if you feel it belongs there!





Re: Why are web urls not assumed to be http

2014-04-26 Thread Sebastian Nagel
Hi Diaa,

> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?

It would not be a big problem to do so. The url normalizer provides scopes
(inject, fetch, etc.): you only have to point the property
"urlnormalizer.regex.file.inject" to a special regex-normalize-inject.xml
(or any other filename you choose). In this file you can define rules
such as the one you describe.
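
To illustrate the kind of rule meant here, a rough sketch in plain Java rather than in the normalizer's rule file: a pattern plus a substitution that prepends "http://" to seeds starting with "www.". The class name SeedSchemeSketch, the method normalize and the rule itself are assumptions made for this example and are not part of Nutch.

```java
import java.util.regex.Pattern;

final class SeedSchemeSketch {
    // Hypothetical rule: a seed that starts with "www." has no protocol yet.
    private static final Pattern LEADING_WWW = Pattern.compile("^www\\.");
    private static final String SUBSTITUTION = "http://www.";

    /** Prepends "http://" when the seed starts with "www."; other seeds pass through. */
    static String normalize(String seed) {
        return LEADING_WWW.matcher(seed.trim()).replaceFirst(SUBSTITUTION);
    }

    public static void main(String[] args) {
        System.out.println(normalize("www.google.com"));        // http://www.google.com
        System.out.println(normalize("http://www.google.com")); // http://www.google.com
        System.out.println(normalize("google.com"));            // google.com (not covered by this rule)
    }
}
```

The last call shows why a commonly accepted rule set is hard to pin down: a seed without "www." stays untouched unless yet another rule is added.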

Why are there no such specific rules for the injector?
- maybe just because no one has written them or wants to maintain the rule set
  (defining a commonly accepted set of rules isn't easy: you could go on
   forever, e.g. what about also adding www. if it's missing?)
- seeds are fully controlled by the crawl administrators, so it's
  comparatively simple to teach them to use fully specified URLs.
  Much simpler than explaining the usage of URL filters.

Sebastian

On 04/25/2014 11:53 AM, Diaa Abdallah wrote:
> Hi,
> I tried injecting www.google.com into my crawldb without prepending
> http:// to it.
> It injected it fine, however when I ran generate on it, it gave the
> following warning:
> "Malformed URL: 'www.google.com', skipping (java.net.MalformedURLException:
> no protocol: www.google.com)"
> 
> Why doesn't nutch assume that web links that have www. at the beginning are
> of the http protocol?
> 
> Thanks,
> Diaa
> 



[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to use GORA_94 branch

2014-04-26 Thread Navid Shekoufa (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981964#comment-13981964
 ] 

Navid Shekoufa commented on NUTCH-1714:
---

Thanks for this very useful patch! I've run into a problem and wanted to ask 
whether it has happened to you too. After applying the patch, all phases read a 
reasonable number of "Map input records" except for the Generator phase, which 
still reads all the rows of the DB as input to the mapper job. Is this expected 
behavior, or did I do something wrong while patching?

> Nutch 2.x upgrade to use GORA_94 branch
> ---
>
> Key: NUTCH-1714
> URL: https://issues.apache.org/jira/browse/NUTCH-1714
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Alparslan Avcı
> Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
> NUTCH-1714v2.patch
>
>
> Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
> details in this issue.





Jenkins build is back to normal : Nutch-trunk #2617

2014-04-26 Thread Apache Jenkins Server
See