[jira] [Commented] (NUTCH-710) Support for rel="canonical" attribute

2012-12-10 Thread zm (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528748#comment-13528748
 ] 

zm commented on NUTCH-710:
--

I have implemented non-canonical page detection by modifying the parse-html 
plugin, though I am not that familiar with the Nutch architecture and am not 
sure whether my implementation is in line with it. I added a utility method, 
boolean isCanonical(Node root, String baseUrl), which returns the status of 
the HTML page currently being parsed: true if it is a proper (canonical) page, 
false if it is non-canonical. Then, in the parse-html plugin's 
HtmlParser.getParse, I call isCanonical and return early from the method with 
a ParseStatus:

if (!utils.isCanonical(root, baseUrl))
  return ParseStatusUtils.getEmptyParse(, "Non canonical page", getConf());

Is this the right way to do it (of course, this needs to be made configurable)? 
If someone more familiar with Nutch could confirm this approach or suggest a 
better one, I'd submit a patch. 
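
For readers following the thread, here is a minimal sketch of what a check like isCanonical(Node root, String baseUrl) might look like. It is not the code from the comment above, which was not posted. The class name CanonicalCheck, the helpers findCanonicalHref and parse, and the comparison rule (resolve the href against the page URL, then compare normalized URIs) are all assumptions made for illustration; only the JDK's org.w3c.dom and java.net.URI classes are used, no Nutch API.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper class; the name and layout are assumptions.
final class CanonicalCheck {

    // Depth-first search for the first <link rel="canonical" href="...">.
    // Returns the raw href, or null if the page declares none.
    static String findCanonicalHref(Node node) {
        if (node instanceof Element) {
            Element e = (Element) node;
            if ("link".equalsIgnoreCase(e.getTagName())
                    && "canonical".equalsIgnoreCase(e.getAttribute("rel").trim())
                    && !e.getAttribute("href").trim().isEmpty()) {
                return e.getAttribute("href").trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String href = findCanonicalHref(children.item(i));
            if (href != null) {
                return href;
            }
        }
        return null;
    }

    // true if the page is its own canonical version (or declares none),
    // false if it points at a different canonical URL.
    static boolean isCanonical(Node root, String baseUrl) {
        String href = findCanonicalHref(root);
        if (href == null) {
            return true;
        }
        try {
            URI base = URI.create(baseUrl);
            URI canonical = base.resolve(href); // handles relative hrefs
            return canonical.normalize().equals(base.normalize());
        } catch (IllegalArgumentException e) {
            return true; // assumption: an unparseable href is ignored
        }
    }

    // Test helper: parses well-formed XHTML into a DOM tree.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xhtml, e);
        }
    }

    public static void main(String[] args) {
        Node page = parse("<html><head><link rel=\"canonical\" href=\"/a\"/></head><body/></html>");
        System.out.println(isCanonical(page, "http://example.com/a?x=1"));
        System.out.println(isCanonical(page, "http://example.com/a"));
    }
}
```

In this sketch a page without a canonical link is treated as canonical, so the early return in getParse would only trigger when the page explicitly names a different URL.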

> Support for rel="canonical" attribute
> -
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>             Fix For: 1.7
>
>
> There is a new rel="canonical" attribute which is
> now supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-12-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1314:
-

Fix Version/s: 2.2
   1.7

> Impose a limit on the length of outlink target urls
> ---
>
>                 Key: NUTCH-1314
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1314
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 1.7, 2.2
>
>         Attachments: NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused tasks to stall. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> It is a configurable limit; the default is 3000. This should be long 
> enough for most uses, but strict enough to make sure regex 
> plugins do not choke on urls that are too long. Please see attached patch for 
> the Nutchgora implementation.
> I'd like to hear what you think about this.
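
The idea in the description can be sketched roughly as follows. This is not the attached NUTCH-1314.patch: the class name OutlinkLengthLimit, the method dropOverlong, and the constant name are hypothetical, and plain strings stand in for Nutch's outlink objects. Only the default of 3000 comes from the issue text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; names are assumptions, not Nutch API.
final class OutlinkLengthLimit {

    // Default taken from the issue description.
    static final int DEFAULT_MAX_LENGTH = 3000;

    // Returns the URLs whose length does not exceed maxLength, in their
    // original order. Over-long URLs are dropped before any regex-based
    // normalizer or filter gets to see them.
    static List<String> dropOverlong(List<String> urls, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Build a URL that is longer than the default limit.
        char[] padding = new char[DEFAULT_MAX_LENGTH];
        Arrays.fill(padding, 'a');
        String tooLong = "http://example.com/" + new String(padding);

        List<String> outlinks = Arrays.asList("http://example.com/ok", tooLong);
        System.out.println(dropOverlong(outlinks, DEFAULT_MAX_LENGTH));
    }
}
```

Checking the length is a constant-time test per URL, which is why applying it before the regex plugins keeps pathological inputs from reaching them at all.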



[jira] [Resolved] (NUTCH-648) debian style autocomplete

2012-12-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-648.
-

Resolution: Won't Fix

see comments above

> debian style autocomplete
> -
>
>                 Key: NUTCH-648
>                 URL: https://issues.apache.org/jira/browse/NUTCH-648
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: debian, and other linux
>            Reporter: Jim
>            Priority: Minor
>
> Here is a suggested improvement: at the end of this description is a 
> Debian-style bash autocomplete script. Just place it into 
> /etc/bash_completion.d/ with the filename nutch, and you can tab-complete at 
> the command prompt, e.g.
> bash> nutch [tab][tab]
>    crawl readdb convdb mergedb readlinkdb inject generate freegen fetch fetch2 parse
>    readseg mergesegs updatedb invertlinks mergelinkdb index merge dedup plugin server
> bash> nutch c[tab][tab]
>    crawl convdb
> etc.
> This also includes optional parameters, and filename completion where it 
> can be used. I really like having this when typing in long nutch commands, 
> and think it would be a great addition to the project.
> The file borrows heavily from the corresponding svn file that does the 
> same thing.
> File begins here:
> shopt -s extglob
> _nutch()
> {
>     local cur cmds cmdOpts optsParam opt
>     local i
>     COMPREPLY=()
>     cur=${COMP_WORDS[COMP_CWORD]}
>     # Possible expansions
>     cmds='crawl readdb convdb mergedb readlinkdb inject generate freegen
>           fetch fetch2 parse readseg mergesegs updatedb invertlinks
>           mergelinkdb index merge dedup plugin server'
>     if [[ $COMP_CWORD -eq 1 ]] ; then
>         COMPREPLY=( $( compgen -W "$cmds" -- "$cur" ) )
>         return 0
>     fi
>     # options that require a parameter
>     # This needs to be filled in better
>     optsParam="-topN|-depth"
>     # if not typing an option, or if the previous option required a
>     # parameter, then fallback on ordinary filename expansion
>     if [[ "$cur" != -* ]] || \
>        [[ ${COMP_WORDS[COMP_CWORD-1]} == @($optsParam) ]] ; then
>         return 0
>     fi
>     # possible options for the command
>     cmdOpts=
>     case ${COMP_WORDS[1]} in
>         crawl)
>             cmdOpts="-dir -threads -depth -topN"
>             ;;
>         readdb)
>             cmdOpts="-stats -dump -topN -url"
>             ;;
>         convdb)
>             cmdOpts="-withMetadata"
>             ;;
>         mergedb)
>             cmdOpts="-normalize -filter"
>             ;;
>         readlinkdb)
>             cmdOpts="-dump -url"
>             ;;
>         inject)
>             cmdOpts=""
>             ;;
>         generate)
>             cmdOpts="-force -topN -numFetchers -adddays -noFilter"
>             ;;
>         freegen)
>             cmdOpts="-filter -normalize"
>             ;;
>         fetch)
>             cmdOpts="-threads -noParsing"
>             ;;
>         fetch2)
>             cmdOpts="-threads -noParsing"
>             ;;
>         parse)
>             cmdOpts=""
>             ;;
>         readseg)
>             cmdOpts="-dump -list -get -nocontent -nofetch -nogenerate \
>                      -noparse -noparsedata -noparsetext -dir"
>             ;;
>         mergesegs)
>             cmdOpts="-dir -filter -slice"
>             ;;
>         updatedb)
>             cmdOpts="-dir -force -normalize -filter -noAdditions"
>             ;;
>         invertlinks)
>             cmdOpts="-dir -force -noNormalize -noFilter"
>             ;;
>         mergelinkdb)
>             cmdOpts="-normalize -filter"
>             ;;
>         index)
>             cmdOpts=""
>             ;;
>         merge)
>             cmdOpts="-workingdir"
>             ;;
>         dedup)
>             cmdOpts=""
>             ;;
>         plugin)
>             cmdOpts=""
>             ;;
>         server)
>             cmdOpts=""
>             ;;
>         *)
>             ;;
>     esac
>     # take out options already given
>     for (( i=2; i<=$COMP_CWORD-1; ++i )) ; do
>         opt=${COMP_WORDS[$i]}
>         cmdOpts=" $cmdOpts "
>         cmdOpts=${cmdOpts/ ${opt} / }
>         # skip next option if this one requires a parameter
>         if [[ $opt == @($optsParam) ]] ; then
>             ((++i))
>         fi
>     done
>     COMPREPLY=( $( compgen -W "$cmdOpts" -- "$cur" ) )
>     return 0
> }
> complete -F _nutch -o default nutch



[jira] [Closed] (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog

2012-12-10 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-412.
---

Resolution: Implemented

6 years later ;-) 
the feed and parse-tika plugins can handle feeds  

> plugin to parse the feed-url (rss/atom) of a blog
> -
>
>                 Key: NUTCH-412
>                 URL: https://issues.apache.org/jira/browse/NUTCH-412
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>         Attachments: FeedUrlFilter.java, plugin_parse-feedUrl2.diff, plugin_parse-feedUrl.diff
>
>
> A plugin that extracts the feed-url (rss/atom) of a blog by retrieving the 
> href from the  element (if found), and stores it in metadata. 
> The meta can be accessed with 
> parse.getData().getMeta("feedUrl");
> You can test this plugin with the main method of HtmlParser.
> Thanks for any feedback.
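
As a rough illustration of the extraction described above, the sketch below walks a DOM tree, looks for a feed link, and stores its href under the "feedUrl" key. It is not the attached plugin code. The element name is missing from the description as archived, so the sketch assumes the usual feed autodiscovery form, a link element with rel="alternate" and an RSS or Atom MIME type. The class name FeedUrlSketch and its methods are hypothetical, and a plain Map stands in for Nutch's parse metadata.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration; names are assumptions, not the plugin's code.
final class FeedUrlSketch {

    // Metadata key mentioned in the issue description.
    static final String FEED_URL_KEY = "feedUrl";

    // Assumption: a feed is advertised as
    // <link rel="alternate" type="application/rss+xml|atom+xml" href="...">.
    static boolean isFeedLink(Element e) {
        String type = e.getAttribute("type").trim().toLowerCase(Locale.ROOT);
        return "link".equalsIgnoreCase(e.getTagName())
                && "alternate".equalsIgnoreCase(e.getAttribute("rel").trim())
                && (type.equals("application/rss+xml") || type.equals("application/atom+xml"))
                && !e.getAttribute("href").trim().isEmpty();
    }

    // Depth-first search; returns the first feed href or null if none is found.
    static String findFeedUrl(Node node) {
        if (node instanceof Element && isFeedLink((Element) node)) {
            return ((Element) node).getAttribute("href").trim();
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String href = findFeedUrl(children.item(i));
            if (href != null) {
                return href;
            }
        }
        return null;
    }

    // Stores the feed URL (if found) under FEED_URL_KEY.
    static Map<String, String> extractMeta(Node root) {
        Map<String, String> meta = new HashMap<>();
        String feedUrl = findFeedUrl(root);
        if (feedUrl != null) {
            meta.put(FEED_URL_KEY, feedUrl);
        }
        return meta;
    }

    // Test helper: parses well-formed XHTML into a DOM tree.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xhtml, e);
        }
    }

    public static void main(String[] args) {
        Node page = parse("<html><head><link rel=\"alternate\" type=\"application/atom+xml\""
                + " href=\"http://blog.example.com/atom.xml\"/></head><body/></html>");
        System.out.println(extractMeta(page).get(FEED_URL_KEY));
    }
}
```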



Re: [ANNOUNCE] Apache Nutch 1.6 Released

2012-12-10 Thread Mattmann, Chris A (388J)
Hear, hear, excellent work!

Cheers,
Chris

From: Julien Nioche <lists.digitalpeb...@gmail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Saturday, December 8, 2012 10:34 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: [ANNOUNCE] Apache Nutch 1.6 Released

Great stuff! Thanks Lewis

On 8 December 2012 21:50, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
Hi All,

The Apache Nutch PMC are extremely pleased to announce the release of
Apache Nutch v1.6. This release includes over 20 bug fixes, as many
improvements, and new functionality including a new
HostNormalizer, the ability to dynamically set fetchInterval by
MIME-type and functional enhancements to the Indexer API including the
normalization of URLs and the deletion of robots noIndex documents.
Other notable improvements include the upgrade of key dependencies to
Tika 1.2 and Automaton 1.11-8.

A full PMC statement can be found here [0]

The release can be found on official Apache mirrors [1] as well as
sources in Maven Central [2]

Thank you

Lewis
On Behalf of the Nutch PMC

[0] http://s.apache.org/NFp
[1] http://www.apache.org/dyn/closer.cgi/nutch/
[2] http://search.maven.org/#artifactdetails|org.apache.nutch|nutch|1.6|jar

--
Lewis



--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble