[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1087:
---
Attachment: NUTCH-1087-2.1.patch

Similar patch for 2.x - NOT TESTED YET

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Julien Nioche
> Priority: Minor
> Fix For: 1.6
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410423#comment-13410423 ]

Julien Nioche commented on NUTCH-1087:
---

Trunk: committed revision 1359720.
2.x: still needs testing.

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Julien Nioche
> Priority: Minor
> Fix For: 1.6
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html
[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1087:
---
Fix Version/s: 2.1

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Julien Nioche
> Priority: Minor
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html
[jira] [Created] (NUTCH-1433) Upgrade to Tika 1.2
Julien Nioche created NUTCH-1433:
---

Summary: Upgrade to Tika 1.2
Key: NUTCH-1433
URL: https://issues.apache.org/jira/browse/NUTCH-1433
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Fix For: 1.6, 2.1
[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1433:
---
Attachment: NUTCH-1433-trunk.patch

Patch for trunk - please test

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk.patch
[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1433:
---
Attachment: NUTCH-1433-trunk-2.patch

Dependency on juniversalchardet needed in root ivy.xml

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419014#comment-13419014 ]

Julien Nioche commented on NUTCH-1433:
---

Markus: I can't reproduce this issue. Are you getting this with trunk?

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
[jira] [Commented] (NUTCH-1341) NotModified time set to now but page not modified
[ https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419084#comment-13419084 ]

Julien Nioche commented on NUTCH-1341:
---

Looks like a reasonable thing to do

> NotModified time set to now but page not modified
> ---
>
> Key: NUTCH-1341
> URL: https://issues.apache.org/jira/browse/NUTCH-1341
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1341-1.6-1.patch
>
> Servers tend to respond with incorrect or no value for LastModified. By
> comparing signatures or when (fetch.getStatus() ==
> CrawlDatum.STATUS_FETCH_NOTMODIFIED) the reducer correctly sets the
> db_notmodified status for the CrawlDatum. The modifiedTime value, however, is
> not set accordingly.
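One plausible reading of the NUTCH-1341 report above is that a page detected as not modified should keep its previously stored modified time rather than have it replaced by the current fetch time. The sketch below only illustrates that reading; the class, method and parameter names are hypothetical, and it does not reproduce the attached patch or the real CrawlDatum API.

```java
/** Illustration only: hypothetical names, not the Nutch reducer code. */
final class NotModifiedTimeSketch {

    /**
     * Chooses the modified time to store for a page.
     * Assumption: for a "not modified" result the previously stored
     * value is kept; otherwise the time of the current fetch is used.
     */
    static long modifiedTime(boolean notModified, long previousModifiedTime, long fetchTime) {
        return notModified ? previousModifiedTime : fetchTime;
    }

    public static void main(String[] args) {
        // Not modified: the old value is kept.
        System.out.println(modifiedTime(true, 1000L, 5000L));
        // Modified: the fetch time is used.
        System.out.println(modifiedTime(false, 1000L, 5000L));
    }
}
```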
[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419083#comment-13419083 ]

Julien Nioche commented on NUTCH-1388:
---

I don't really like the names fixedFetchInterval vs fetchInterval; they are confusing and unclear. What about having a single customFetchInterval instead, which would be used during the injection and would take precedence when using the AdaptiveFetchSchedule? If the default fetch schedule is used then the custom value would obviously be used.

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
> During injection a custom fetch interval can be configured but it is not
> maintained with an AdaptiveFetchSchedule enabled.
[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419108#comment-13419108 ]

Julien Nioche commented on NUTCH-1388:
---

Can't you define the default value in nutch-site.xml?

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
> During injection a custom fetch interval can be configured but it is not
> maintained with an AdaptiveFetchSchedule enabled.
[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419135#comment-13419135 ]

Julien Nioche commented on NUTCH-1388:
---

OK got it, thanks

bq. We have to differentiate between a default interval and an interval that will never change.

Actually, between a default interval (nutch-site.xml), a custom interval that can change and a custom interval that never changes.

What about using 'nutch.fetchInterval.fixed' instead of nutch.fixedFetchInterval? Purely cosmetic of course ;-)

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
> During injection a custom fetch interval can be configured but it is not
> maintained with an AdaptiveFetchSchedule enabled.
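The distinction drawn in the comment above, between an interval that an adaptive schedule may change and a fixed interval that it must leave alone, could be sketched roughly as follows. This is not the AdaptiveFetchSchedule implementation: the class name, the method and the percentage constants are assumptions made for illustration.

```java
/** Illustration only: hypothetical names and rates, not Nutch code. */
final class FetchIntervalSketch {

    // Illustrative adjustment rates in percent (assumed values).
    static final int INCREASE_PERCENT = 40; // page unchanged: wait longer
    static final int DECREASE_PERCENT = 20; // page changed: come back sooner

    /**
     * Returns the interval (in seconds) to use for the next fetch.
     *
     * @param currentInterval interval currently stored for the page; it starts
     *                        as the default or as a custom injected value
     * @param fixedInterval   interval that must never change, or null if the
     *                        page has none
     * @param pageModified    whether the last fetch found a changed page
     */
    static int nextInterval(int currentInterval, Integer fixedInterval, boolean pageModified) {
        if (fixedInterval != null) {
            // A fixed interval takes precedence over any adaptive adjustment.
            return fixedInterval;
        }
        int percent = pageModified ? 100 - DECREASE_PERCENT : 100 + INCREASE_PERCENT;
        return currentInterval * percent / 100;
    }
}
```

With these assumed rates, a page with a 1000 second interval and no fixed value moves to 800 seconds after a change and to 1400 seconds when unchanged, while a page with a fixed interval of 500 seconds stays at 500.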
[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
[ https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419170#comment-13419170 ]

Julien Nioche commented on NUTCH-1388:
---

Looks fine +1

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
> Issue Type: Improvement
> Components: injector
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch, NUTCH-1388-1.6-3.patch
>
> During injection a custom fetch interval can be configured but it is not
> maintained with an AdaptiveFetchSchedule enabled.
[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419175#comment-13419175 ]

Julien Nioche commented on NUTCH-1433:
---

Committed in trunk: revision 1363794.

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1433:
---
Attachment: NUTCH-1433.branch-2.patch

Patch for 2.x - strangely the version of the dependencies is not the same as for trunk. Passes the tests.

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, NUTCH-1433.branch-2.patch
[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419260#comment-13419260 ]

Julien Nioche commented on NUTCH-1433:
---

Anyone to test the patch for 2.x?

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, NUTCH-1433.branch-2.patch
[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419258#comment-13419258 ]

Julien Nioche commented on NUTCH-1433:
---

Hmm, probably had a problem with the ivy cache unless the remote pom for Tika has changed. Anyway, now getting the same deps as 2.x.

Committed the revised plugin.xml in revision 1363842.

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.6, 2.1
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, NUTCH-1433.branch-2.patch
[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch
[ https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429058#comment-13429058 ]

Julien Nioche commented on NUTCH-1445:
---

Ferdy - just to reiterate what was said on a previous issue: please give people time to review your contribs before committing your own stuff. I am sure your code is fine and it does not really affect existing code too much, but I think it is a good practice that we should try and stick to.

Instead of having multiple commands for the indexing backends, can't we have a single job and define the backends (SOLR, ES) via configuration? There is an open issue on 'pluggable indexing backends' [https://issues.apache.org/jira/browse/NUTCH-1047]; can we discuss this there?

> Add ElasticIndexerJob that indexes to elasticsearch
> ---
>
> Key: NUTCH-1445
> URL: https://issues.apache.org/jira/browse/NUTCH-1445
> Project: Nutch
> Issue Type: New Feature
> Reporter: Ferdy Galema
> Fix For: 2.1
> Attachments: NUTCH-1445-addPropsToConfig.patch, NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
> We have created a new indexer job ElasticIndexerJob that indexes to
> elasticsearch. It is originally based upon
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license),
> but we have modified it greatly to make it integrate as well as possible into
> Nutch. The greatest modification is that documents are asynchronously flushed
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome.
> You simply deploy a server by unpacking the tar, configure the clustername,
> start the server and fire away indexing requests. Indices are automatically
> created. Fields are automapped. (Of course it is recommended to create your
> own optimized mapping, but that is beyond the scope of this issue.) Multiple
> servers connect without extra configuration, simply by using the same
> clustername (by means of multicast). There are tons of advanced options, such
> as sharding, replication, disk striping etc.
> To give an example of the performance: with 20+ nodes we are able to index
> over 1M docs (average sized web documents) per minute. The best part is that
> the added documents are almost instantly searchable, so there are no hidden
> commit costs like those Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429101#comment-13429101 ]

Julien Nioche commented on NUTCH-1047:
---

Thanks for your comments Ferdy

bq. What I've changed in Nutch2.x is that IndexerOutputFormat does not extend from FileOutputFormat anymore.

Would be good to do the same for 1.x

bq. "whether we will be able to use implementations of NutchIndexWriter from within a plugin"
bq. What do you mean with this?

I meant that we need to check whether we can have the NutchIndexWriter implementations available in a plugin, which would be nice as we'd have our generic commands + the indexing endpoint implementations in their respective plugins (e.g. indexer-SOLR, indexer-ES) etc...

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Labels: indexing
> Fix For: 1.6
>
> One possible feature would be to add a new endpoint for indexing-backends and
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is
> OK - but as other resources like ElasticSearch are becoming more popular it
> would be better to handle this as plugins. Not sure about the name of the
> endpoint though: we already have indexing-plugins (which are about
> generating fields sent to the backends) and moreover the backends are not
> necessarily for indexing / searching but could be just an external storage
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this
> could be pertaining to the storage in GORA. 'indexing-backend' is the best
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating
> and cleaning and maybe add a Nutch extension point there so we can easily
> hook up indexing, cleaning and deduplicating for various backends.
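The idea discussed in NUTCH-1047 above, a single generic job whose backends are selected by configuration, can be sketched without any Nutch or Hadoop dependency. Everything below is hypothetical: IndexBackendSketch is a stand-in and not the NutchIndexWriter interface, and the comma-separated configuration value is an assumed convention rather than an existing Nutch property.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for an indexing backend; not the real NutchIndexWriter API. */
interface IndexBackendSketch {
    void write(String docId);
    int written();
}

/** Toy backend that only counts the documents it receives. */
final class CountingBackend implements IndexBackendSketch {
    private int count;

    @Override
    public void write(String docId) {
        count++;
    }

    @Override
    public int written() {
        return count;
    }
}

/** Maps backend names to factories, the way a plugin registry might. */
final class BackendRegistrySketch {
    private final Map<String, Supplier<IndexBackendSketch>> factories = new LinkedHashMap<>();

    void register(String name, Supplier<IndexBackendSketch> factory) {
        factories.put(name, factory);
    }

    /** Instantiates the backends named in a comma-separated configuration value. */
    List<IndexBackendSketch> create(String configured) {
        List<IndexBackendSketch> backends = new ArrayList<>();
        for (String name : configured.split(",")) {
            Supplier<IndexBackendSketch> factory = factories.get(name.trim());
            if (factory == null) {
                throw new IllegalArgumentException("Unknown backend: " + name.trim());
            }
            backends.add(factory.get());
        }
        return backends;
    }
}
```

A generic indexing job would then loop over the list returned by create(...) and call write on each backend, so adding a backend would mean registering one more factory instead of adding another command.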
[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex
[ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434893#comment-13434893 ]

Julien Nioche commented on NUTCH-1434:
---

bq. I haven't added the configuration because it's overridden by the command line switch regardless of the nutch-site.xml configuration.

I'd rather do it as it's done in other parts of the code, i.e. take into account any value set in nutch-site.xml if nothing is set on the command line (see for instance fetcher.parse), and include it in nutch-default.xml

> Indexer to delete robots noIndex
> ---
>
> Key: NUTCH-1434
> URL: https://issues.apache.org/jira/browse/NUTCH-1434
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.5.1
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does
> is remove the title and content fields from the parsed data. It does not stop
> those pages from being indexed, nor can it delete existing pages from the
> index if they change.
[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex
[ https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434927#comment-13434927 ]

Julien Nioche commented on NUTCH-1434:
---

Well, let's do configuration only then. After all it can be set on the command line with -D just as well + it means that we don't have to change the code reading the params etc...

> Indexer to delete robots noIndex
> ---
>
> Key: NUTCH-1434
> URL: https://issues.apache.org/jira/browse/NUTCH-1434
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Affects Versions: 1.5.1
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does
> is remove the title and content fields from the parsed data. It does not stop
> those pages from being indexed, nor can it delete existing pages from the
> index if they change.
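The precedence discussed in the two NUTCH-1434 comments above (a default, overridden by nutch-site.xml, overridden in turn by a value passed with -D) can be modelled with plain java.util.Properties. This is a sketch of the layering idea only: it does not use Hadoop's Configuration class, and the property name in the usage note is a placeholder.

```java
import java.util.Properties;

/** Illustration only: models configuration layering with plain Properties. */
final class LayeredConfigSketch {

    /**
     * Resolves a boolean setting from three layers; later layers win.
     *
     * @param defaults    stands in for nutch-default.xml
     * @param site        stands in for nutch-site.xml
     * @param commandLine stands in for values passed with -D
     * @param key         property name to look up
     */
    static boolean resolve(Properties defaults, Properties site, Properties commandLine, String key) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(site);
        merged.putAll(commandLine);
        // Absent everywhere: treat as false (an assumption of this sketch).
        return Boolean.parseBoolean(merged.getProperty(key, "false"));
    }
}
```

For example, with a placeholder key such as "example.delete.noindex" set to false in the defaults and to true in the site layer, resolve returns true unless the command-line layer sets it back to false.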
[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440271#comment-13440271 ]

Julien Nioche commented on NUTCH-1233:
---

Would be good to add some tests to illustrate the difference in behaviour + make sure that we are getting what we want

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, NUTCH-1233-1.6-2.patch
>
> Tika provides outlink extraction features that are not used in Nutch. To be
> able to use it in Nutch we need Tika to return the rel attr value of each
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch
> is included in Tika and we upgraded to that new version this issue can be
> worked on. Here's preliminary code that does both Tika and current outlink
> extraction. This also includes parts of the Boilerpipe code.
[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450481#comment-13450481 ]

Julien Nioche commented on NUTCH-1459:
---

Commit ref please Ferdy, thanks!

> Remove dead code (phase2) from InjectorJob
> ---
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.1
> Attachments: nutch-1459.txt
[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450492#comment-13450492 ]

Julien Nioche commented on NUTCH-1459:
---

The branch reference but even more so the actual commit ref. You can do it for this one, can't you?

> Remove dead code (phase2) from InjectorJob
> ---
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.1
> Attachments: nutch-1459.txt
[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob
[ https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450515#comment-13450515 ]

Julien Nioche commented on NUTCH-1459:
---

Nah, that's perfect! ;-)

> Remove dead code (phase2) from InjectorJob
> ---
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.1
> Attachments: nutch-1459.txt
[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454790#comment-13454790 ]

Julien Nioche commented on NUTCH-1467:
---

bq. I will work on it soon but i am thinking of working on tika parser so that it can get all the attributes by default, index them and send it to solr 'attr_*' dynamic field, so that instead of specifying manually any attributes will be accepted. That would be helpful i think than the parse-metatags.

A big fat -1 from me. Definitely not a good idea to index all the possible attributes by default.

Adding a test illustrating the new behaviour for this issue would have been good.

+1 to being able to store multiple values instead of relying on a separator by convention

Markus - my understanding is that committers mark an issue as resolved but it's up to the author of the issue to confirm that all is done by closing it.

> nutch 1.5.1 not able to parse mutliValued metatags
> ---
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5.1
> Reporter: kiran
> Priority: Minor
> Fix For: 1.6
> Attachments: patch.txt
>
> Hi,
> I have been able to parse metatags in an html page using
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when
> there are two metatags with the same name but two different contents.
> Does anyone encounter this kind of issue?
> Are there any changes that need to be made to the config files to make it
> work?
> When there are two tags with the same name and different content, it takes the
> value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech.
> Many Thanks,
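The behaviour requested in NUTCH-1467 above, keeping every value when two meta tags share a name instead of letting the later one overwrite the earlier, comes down to collecting values into a list per name. The sketch below shows that with standard collections; the class name and the pair-array input format are assumptions and do not reflect the parse-metatags plugin code or the attached patches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustration only: hypothetical helper, not the parse-metatags plugin. */
final class MultiValuedMetaSketch {

    /**
     * Groups meta tag contents by lower-cased tag name.
     *
     * @param nameContentPairs each element is {name, content}, in document order
     * @return every content value seen for each name, in document order
     */
    static Map<String, List<String>> collect(String[][] nameContentPairs) {
        Map<String, List<String>> values = new LinkedHashMap<>();
        for (String[] pair : nameContentPairs) {
            String name = pair[0].toLowerCase(Locale.ROOT);
            // Append rather than overwrite, so repeated names stay multi-valued.
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(pair[1]);
        }
        return values;
    }
}
```

Storing a list per name also avoids joining values with a separator character, which is the convention the comment above argues against.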
[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454856#comment-13454856 ]

Julien Nioche commented on NUTCH-1467:
---

Hi Kiran

Thank you for your comments.

Re "index all attributes": this could be done by adding the option to parse-metatags and allowing values to be set using regular expressions in index-metadata.

Don't worry about being slow, no one's in a hurry and we are all learning from each other

> nutch 1.5.1 not able to parse mutliValued metatags
> ---
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5.1
> Reporter: kiran
> Priority: Minor
> Fix For: 1.6
> Attachments: patch.txt
>
> Hi,
> I have been able to parse metatags in an html page using
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when
> there are two metatags with the same name but two different contents.
> Does anyone encounter this kind of issue?
> Are there any changes that need to be made to the config files to make it
> work?
> When there are two tags with the same name and different content, it takes the
> value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech.
> Many Thanks,
[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags
[ https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468436#comment-13468436 ] Julien Nioche commented on NUTCH-1467: -- Thanks Kiran. See http://wiki.apache.org/nutch/HowToContribute for info on patches > nutch 1.5.1 not able to parse mutliValued metatags > -- > > Key: NUTCH-1467 > URL: https://issues.apache.org/jira/browse/NUTCH-1467 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: kiran >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, > Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, > Patch_MetaTagsParser.patch, patch.txt > > > Hi, > I have been able to parse metatags in an html page using > http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when > there are two metatags with same name but two different contents. > Does anyone encounter this kind of issue ? > Are there any changes that need to be made to the config files to make it > work ? > When there are two tags with same name and different content, it takes the > value of the later tag and saves it rather than creating a multiValue field. > Edit: I have attached the patch for the file and it is provided by DLA > (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. > Many Thanks, -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1475: - Affects Version/s: (was: nutchgora) 1.5.1 This is an issue for the 1.x branch as well > Nutch 2.1 Index-More Plugin -- A better fall back value for date field > -- > > Key: NUTCH-1475 > URL: https://issues.apache.org/jira/browse/NUTCH-1475 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1, 1.5.1 > Environment: All >Reporter: James Sullivan >Priority: Minor > Labels: index-more, plugins > Attachments: index-more-2x.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Among other fields, the more plugin for Nutch 2.x provides a "last modified" > and "date" field for the Solr index. The "last modified" field is the last > modified date from the http headers if available, if not available it is left > empty. Currently, the "date" field is the same as the "last modified" field > unless that field is empty in which case getFetchTime is used as a fall back. > I think getFetchTime is not a good fall back as it is the next fetch time and > often a month or more in the future which doesn't make sense for the date > field. Users do not expect webpages/documents with future dates. A more > sensible fallback would be current date at the time it is indexed. > This is possible by simply changing line 97 of > https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > from > time = page.getFetchTime(); // use fetch time > to > time = new Date().getTime(); > Users interested in the getFetchTime value can still get it from the "tstamp" > field. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
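The fallback proposed in NUTCH-1475 could look roughly like the sketch below. It is not the MoreIndexingFilter code: the class DateFallback and the method resolveDate are hypothetical, and treating a value of 0 or less as "no Last-Modified header" is an assumption made for the example.

```java
// Hypothetical illustration of the fallback proposed in NUTCH-1475;
// not the actual MoreIndexingFilter code.
final class DateFallback {

    // Returns the value for the "date" field: the Last-Modified time when the
    // server sent one (assumed here to mean a value > 0), otherwise the time
    // at which the document is being indexed.
    static long resolveDate(long lastModifiedMillis, long indexTimeMillis) {
        return lastModifiedMillis > 0 ? lastModifiedMillis : indexTimeMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // No Last-Modified header: fall back to the indexing time.
        System.out.println(resolveDate(0L, now) == now);
        // Header present: the header value is kept.
        System.out.println(resolveDate(1349049600000L, now));
    }
}
```

Passing the indexing time in as a parameter, rather than calling new Date() inside the method, is only a choice made here to keep the sketch easy to test.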
[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http
[ https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473066#comment-13473066 ] Julien Nioche commented on NUTCH-1344: -- Good catch Sebastian. Please commit to both trunk and 2.x. > BasicURLNormalizer to normalize https same as http > --- > > Key: NUTCH-1344 > URL: https://issues.apache.org/jira/browse/NUTCH-1344 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.6 >Reporter: Sebastian Nagel > Attachments: NUTCH-1344.patch > > > Most of the normalization done by BasicURLNormalizer (lowercasing host, > removing default port, removal of page anchors, cleaning . and .. in the path) > is not done for URLs with protocol https. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
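The normalization steps listed in NUTCH-1344, applied identically to http and https, can be sketched with java.net.URI as below. This is not the BasicURLNormalizer implementation: the class SchemeAwareNormalizer and its methods are hypothetical, and the sketch assumes absolute URLs with a parseable host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration; not Nutch's BasicURLNormalizer.
final class SchemeAwareNormalizer {

    // Default port per scheme, so that https gets the same treatment as http.
    static int defaultPort(String scheme) {
        if ("http".equals(scheme)) return 80;
        if ("https".equals(scheme)) return 443;
        return -1;
    }

    static String normalize(String url) {
        try {
            URI uri = new URI(url).normalize(); // resolves "." and ".." path segments
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);
            // Drop the port when it is the default one for the scheme.
            int port = uri.getPort() == defaultPort(scheme) ? -1 : uri.getPort();
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            // Passing null as the fragment removes the page anchor.
            return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot normalize: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTPS://Example.COM:443/a/./b/../c.html#top"));
        System.out.println(normalize("http://Example.COM:80/a/./b/../c.html#top"));
    }
}
```

Rebuilding the URI from decoded components may re-encode some characters differently from the input, so a real normalizer would need more care with escaped paths and queries than this sketch takes.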
[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474198#comment-13474198 ] Julien Nioche commented on NUTCH-1475: -- Nope, looks like a reasonable thing to do > Nutch 2.1 Index-More Plugin -- A better fall back value for date field > -- > > Key: NUTCH-1475 > URL: https://issues.apache.org/jira/browse/NUTCH-1475 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1, 1.5.1 > Environment: All >Reporter: James Sullivan >Priority: Minor > Labels: index-more, plugins > Fix For: 1.6, 2.2 > > Attachments: index-more-1xand2x.patch, index-more-2x.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Among other fields, the more plugin for Nutch 2.x provides a "last modified" > and "date" field for the Solr index. The "last modified" field is the last > modified date from the http headers if available, if not available it is left > empty. Currently, the "date" field is the same as the "last modified" field > unless that field is empty in which case getFetchTime is used as a fall back. > I think getFetchTime is not a good fall back as it is the next fetch time and > often a month or more in the future which doesn't make sense for the date > field. Users do not expect webpages/documents with future dates. A more > sensible fallback would be current date at the time it is indexed. > This is possible by simply changing line 97 of > https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > from > time = page.getFetchTime(); // use fetch time > to > time = new Date().getTime(); > Users interested in the getFetchTime value can still get it from the "tstamp" > field. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-710) Support for rel="canonical" attribute
[ https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477716#comment-13477716 ] Julien Nioche commented on NUTCH-710: - Iwan : sure, feel free to send a patch if you want to help it happen > Support for rel="canonical" attribute > - > > Key: NUTCH-710 > URL: https://issues.apache.org/jira/browse/NUTCH-710 > Project: Nutch > Issue Type: New Feature >Affects Versions: 1.1 >Reporter: Frank McCown >Priority: Minor > Fix For: 1.6, 2.2 > > > There is a new rel="canonical" attribute which is > now being supported by Google, Yahoo, and Live: > http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html > Adding support for this attribute value will potentially reduce the number of > URLs crawled and indexed and reduce duplicate page content. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479919#comment-13479919 ] Julien Nioche commented on NUTCH-1477: -- Thanks Mike. I confirm the issue. Did you recompile the Webpage class from the AVRO defs when using the latest version of AVRO? Could be an incompatibility between the versions. Going back to the original problem I don't think the problem comes from AVRO as we would have it with the other backends as well. As for the MemStore I don't think it is used for anything else than tests. > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. 
> java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1087. -- Resolution: Fixed Nutch 2.x : Committed revision 1400390. We can open a new issue if there are any problems with the script. It should be a good starting point. > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma > Assignee: Julien Nioche >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, > NUTCH-1087-2.1.patch, NUTCH-1087.patch > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1433) Upgrade to Tika 1.2
[ https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1433. -- Resolution: Fixed Committed revision 1400397. > Upgrade to Tika 1.2 > --- > > Key: NUTCH-1433 > URL: https://issues.apache.org/jira/browse/NUTCH-1433 > Project: Nutch > Issue Type: Improvement > Components: parser > Reporter: Julien Nioche > Assignee: Julien Nioche > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1433.branch-2.patch, NUTCH-1433-trunk-2.patch, > NUTCH-1433-trunk.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1477: - Fix Version/s: 2.2 Assignee: Julien Nioche > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche > Fix For: 2.2 > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. > java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1477: - Attachment: webpage.avsc Modified avro schema which allows fields to be null > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. > java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484148#comment-13484148 ] Julien Nioche commented on NUTCH-1477: -- I found in http://mail-archives.apache.org/mod_mbox/avro-user/200910.mbox/%3c4ae78503.50...@apache.org%3E that we probably need to explicitly allow for null values in the schema (see attachment). I tried recompiling the schemas with {{ant compile-avro-schema}} but the classes generated do not compile and are nowhere near as complete as the original ones. More worryingly the same is true with the original schema. I assumed that the code in org.apache.nutch.storage could be generated from the schemas. Any idea? > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. 
> java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1477: - Priority: Critical (was: Major) > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche >Priority: Critical > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. > java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at 
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484169#comment-13484169 ] Julien Nioche commented on NUTCH-1477: -- Found a clue in https://issues.apache.org/jira/browse/NUTCH-842. Not sure what the point of compile-avro-schema is but we need to compile the schemas with gora and not just avro. The generated classes now compile fine. Using the modified schema fails at compilation as the generated objects don't have accessors e.g. getContentType() > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche >Priority: Critical > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. 
> java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485172#comment-13485172 ] Julien Nioche commented on NUTCH-1477: -- Hi Lewis bq. Do you suggest we update the patch in NUTCH-842 with the correct package name for Gora in the Nutch build.xml file and remove the ant compile-avro-schema target? yes, until someone can explain what that target is useful for? bq. If no accessors are generated then is this not a problem with the Gora compiler? If so we should open a ticket over there and link the issues. it is indeed. Looks like the gora compiler can't deal with the ["string", "null"] union. Will create an issue in GORA land > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 > Reporter: Mike Baranczak >Assignee: Julien Nioche >Priority: Critical > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. 
> java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1482) Rename HTMLParseFilter
Julien Nioche created NUTCH-1482: Summary: Rename HTMLParseFilter Key: NUTCH-1482 URL: https://issues.apache.org/jira/browse/NUTCH-1482 Project: Nutch Issue Type: Task Components: parser Affects Versions: 1.5.1 Reporter: Julien Nioche See NUTCH-861 for a background discussion. We have changed the name in 2.x to better reflect what it does and I think we should do the same for 1.x. any objections? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487624#comment-13487624 ] Julien Nioche commented on NUTCH-1482: -- Having 2 extension points would be a bit of an overkill IMHO - there aren't any changes in the methods and people just need to do a minor change to the core and xml config which I don't think is unreasonable when moving from one version to the next as long as it is mentioned in the Wiki. BTW maybe we should organize the CHANGES.txt a bit differently and organise it by type of change (optimisation - bug fix - incompatible change) as done in other projects instead of simply listing the JIRAs > Rename HTMLParseFilter > -- > > Key: NUTCH-1482 > URL: https://issues.apache.org/jira/browse/NUTCH-1482 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.5.1 >Reporter: Julien Nioche > > See NUTCH-861 for a background discussion. We have changed the name in 2.x to > better reflect what it does and I think we should do the same for 1.x. > any objections? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488728#comment-13488728 ] Julien Nioche commented on NUTCH-1480: -- Hi Lewis bq. Can I run multiple Solr servers in pseudo distributed mode? SOLR is completely separated from Hadoop and has nothing to do with local vs distrib. You can run several instances of SOLR on the same machine if that is your question. Just specify a different port when starting each one from the command line, with a separate SOLR home. Markus, Just to make sure I understand - this sends ALL the documents to ALL the SOLR instances specified, right? > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1480-1.6.1.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URLs using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
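Reading the Solr URL setting as a comma-delimited list, as the issue description proposes, might be sketched as follows. The class SolrUrlList and the method parse are hypothetical and do not come from the attached patch; the example URLs and ports are made up to show two instances running on one machine.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the code from NUTCH-1480-1.6.1.patch.
final class SolrUrlList {

    // Splits a comma-delimited configuration value into individual URLs,
    // trimming whitespace and skipping empty entries.
    static List<String> parse(String value) {
        List<String> urls = new ArrayList<>();
        if (value == null) {
            return urls;
        }
        for (String part : value.split(",")) {
            String url = part.trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Two instances on the same machine, each on its own port (made-up values).
        for (String url : parse("http://localhost:8983/solr, http://localhost:8984/solr")) {
            // In the replicate-to-all reading of the issue, every document
            // would be sent to each of these URLs.
            System.out.println(url);
        }
    }
}
```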
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488738#comment-13488738 ] Julien Nioche commented on NUTCH-1480: -- OK thanks. What about having a mechanism for specifying a way of distributing the docs with the replicate-to-all being one of the options? Could do consistent hashing maybe? I expect that most people would want to shard. off topic re-deduplication : I think we've hit the limits of the current mechanism which I assume was based on the one we had when Nutch was managing its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into Hadoop to dedup and I'd rather have map reduce jobs to find the duplicates based on the crawldb and send the deletion commands to SOLR. And this would work for ElasticSearch as well. Am pretty sure there is a JIRA for this somewhere > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1480-1.6.1.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URL's using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
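The two distribution options discussed in this thread (replicate every document to all servers, or shard with consistent hashing) could look roughly like the sketch below. It is only an illustration under stated assumptions: `SolrTargetSelector` is a hypothetical class that is part of neither Nutch nor SolrJ, it works on plain URL strings rather than SolrServer objects, and CRC32 with a handful of virtual nodes per server is an arbitrary choice of hash ring.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration; not part of Nutch or SolrJ. */
class SolrTargetSelector {
    private final List<String> urls = new ArrayList<>();
    // Hash ring: position on the ring -> Solr URL owning that position.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Parses a comma-delimited URL list and places each URL on the ring. */
    SolrTargetSelector(String commaDelimitedUrls, int virtualNodes) {
        for (String part : commaDelimitedUrls.split(",")) {
            String url = part.trim();
            if (url.isEmpty()) {
                continue;
            }
            urls.add(url);
            for (int i = 0; i < virtualNodes; i++) {
                ring.put(hash(url + "#" + i), url);
            }
        }
        if (ring.isEmpty()) {
            throw new IllegalArgumentException("no Solr URL given");
        }
    }

    /** Replicate-to-all option: every document goes to every server. */
    List<String> replicateToAll() {
        return new ArrayList<>(urls);
    }

    /** Sharding option: the first ring position at or after the document's hash wins. */
    String shardFor(String docId) {
        Long key = ring.ceilingKey(hash(docId));
        return ring.get(key != null ? key : ring.firstKey());
    }

    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }
}
```

With a ring like this, adding or removing one server only remaps the documents whose hashes fall next to that server's positions, which is the usual argument for consistent hashing over a plain modulo.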
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488786#comment-13488786 ] Julien Nioche commented on NUTCH-1480: -- nope. I meant implementing the distribution to the shards on the Nutch side without relying on the CloudSolrServer. Having said that we want to move to SOLR4 and if we get that from SOLR for cheap then that's even better > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1480-1.6.1.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URL's using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse
[ https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1487: - Component/s: storage parser > Nutch parse fails first time for PDF files and works on reparse > --- > > Key: NUTCH-1487 > URL: https://issues.apache.org/jira/browse/NUTCH-1487 > Project: Nutch > Issue Type: Bug > Components: parser, storage >Affects Versions: 2.1 >Reporter: kiran > Labels: mysql > > The parser is failing to parse pdf files at one go and working on re-parsing > command the number of times the total number of PDF files as discussed in the > mailing list here > (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse
[ https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1487: - Labels: mysql (was: ) > Nutch parse fails first time for PDF files and works on reparse > --- > > Key: NUTCH-1487 > URL: https://issues.apache.org/jira/browse/NUTCH-1487 > Project: Nutch > Issue Type: Bug > Components: parser, storage >Affects Versions: 2.1 >Reporter: kiran > Labels: mysql > > The parser is failing to parse pdf files at one go and working on re-parsing > command the number of times the total number of PDF files as discussed in the > mailing list here > (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls
[ https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-747. - Resolution: Implemented This has since been made possible thanks to: - Metadata injection (https://issues.apache.org/jira/browse/NUTCH-655) - urlmeta plugin - index-metadata plugin > inject&Index metadatas and inherit these metadatas to all matching suburls > -- > > Key: NUTCH-747 > URL: https://issues.apache.org/jira/browse/NUTCH-747 > Project: Nutch > Issue Type: Improvement > Components: indexer, injector >Reporter: Marko Bauhardt > Attachments: index-metadata.patch, metadata.patch > > > Hi. > The following two patches support: > + injecting metadata for urls into a metadatadb > url.com : : > ... > ... > + updating the parse_data metadata of a shard and writing the metadata to all > fetched urls that start with a url from the metadatadb > + inheritance of metadata by all matching suburls > The second patch implements an index-metadata plugin. > + this plugin extracts all metadata from the parse_data of a shard and indexes > it. Which metadata are indexed can be configured in plugin.properties. > + to index, for example, the lang, you have to configure plugin.properties: > lang=STORE,UNTOKENIZED > + that means that the index plugin extracts metadata values with key "lang". 
If > they exist, all values are indexed, stored and untokenized > Example > create start urls in "/tmp/urls/start/urls.txt" > http://lucene.apache.org/nutch/apidocs-1.0/index.html > http://lucene.apache.org/nutch/apidocs-0.9/index.html > create metadata urls in "/tmp/urls/metadata/urls.txt" > http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0 > http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9 > Inject Urls > bin/nutch inject crawldb /tmp/urls/start/ > bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb > /tmp/urls/metadata/ > Fetch & Parse & Update > bin/nutch generate crawldb segments > bin/nutch fetch segments/20090806105717/ > bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb > segments/20090806105717 > bin/nutch updatedb crawldb/ segments/20090806105717/ > Fetch & Parse & Update Again > ... > Index > bin/nutch invertlinks linkdb -dir segments/ > bin/nutch index index crawldb/ linkdb/ segments/20090806105717 > segments/20090806110127 > Check your Index > All urls starting with "http://lucene.apache.org/nutch/apidocs-1.0/ " are > indexed with "version:1.0". > All urls starting with "http://lucene.apache.org/nutch/apidocs-0.9/ " are > indexed with "version:0.9". > This issue is somewhat related to NUTCH-655 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
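The inheritance rule in the example above (every fetched URL that starts with a URL from the metadatadb receives that entry's metadata) can be pictured with a short sketch. This is not the code from the attached patches: `PrefixMetadataSketch` and its method are hypothetical names, and an in-memory map stands in for the metadatadb.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not the code from the attached patches. */
class PrefixMetadataSketch {

    /**
     * Collects the metadata of every registered prefix that the given URL
     * starts with. The map argument stands in for the metadatadb.
     */
    static Map<String, String> inheritedMetadata(
            String url, Map<String, Map<String, String>> metadataByPrefix) {
        Map<String, String> inherited = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> entry : metadataByPrefix.entrySet()) {
            if (url.startsWith(entry.getKey())) {
                inherited.putAll(entry.getValue());
            }
        }
        return inherited;
    }
}
```

Looking up `http://lucene.apache.org/nutch/apidocs-1.0/index.html` against the two metadata entries from the example would give a map holding `version` set to `1.0`, while a URL outside both prefixes gets an empty map.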
[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore
[ https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526256#comment-13526256 ] Julien Nioche commented on NUTCH-1477: -- Hi Alfonso. That's right. I must have missed it when writing the modified schema. > NPE when injecting with DataFileAvroStore > - > > Key: NUTCH-1477 > URL: https://issues.apache.org/jira/browse/NUTCH-1477 > Project: Nutch > Issue Type: Bug > Components: storage >Affects Versions: 2.1 > Environment: Java 1.6.0_35 >Reporter: Mike Baranczak >Assignee: Julien Nioche >Priority: Critical > Fix For: 2.2 > > Attachments: webpage.avsc > > > Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. > Injection job throws NullPointerException, see below. No error when I switch > to MemStore. > java.lang.NullPointerException > at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176) > at > org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72) > at > org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245) > at > org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54) > at > org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60) > at > org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639) > at > org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) > at > org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185) > at 
org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-840: Attachment: NUTCH-840-trunk.patch Modified version of the patch to fix the tests post NUTCH-797 > Port tests from parse-html to parse-tika > > > Key: NUTCH-840 > URL: https://issues.apache.org/jira/browse/NUTCH-840 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.1 > Reporter: Julien Nioche > Assignee: Julien Nioche > Fix For: 2.2 > > Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, > NUTCH-840v2.patch > > > We don't have test for HTML in parse-tika so I'll copy them from the old > parse-html plugin -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527362#comment-13527362 ] Julien Nioche commented on NUTCH-840: - The tests now run OK with the patch I just attached. bq. There is a problem here where the new tests (for parse-tika) also seem to be executed against (within?) other plugin testing scenarios can you give more detail on this please Lewis? > Port tests from parse-html to parse-tika > > > Key: NUTCH-840 > URL: https://issues.apache.org/jira/browse/NUTCH-840 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.1 > Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.2 > > Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, > NUTCH-840v2.patch > > > We don't have test for HTML in parse-tika so I'll copy them from the old > parse-html plugin -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-840: Affects Version/s: 1.6 Fix Version/s: 1.7 > Port tests from parse-html to parse-tika > > > Key: NUTCH-840 > URL: https://issues.apache.org/jira/browse/NUTCH-840 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.1, 1.6 > Reporter: Julien Nioche > Assignee: Julien Nioche > Fix For: 1.7, 2.2 > > Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, > NUTCH-840v2.patch > > > We don't have test for HTML in parse-tika so I'll copy them from the old > parse-html plugin -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-891: Affects Version/s: 2.1 Probably not an issue anymore. marking it as 2.x to triage unversioned issues, will check later > Nutch build should not depend on unversioned local deps > --- > > Key: NUTCH-891 > URL: https://issues.apache.org/jira/browse/NUTCH-891 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.1 >Reporter: Andrzej Bialecki > Attachments: gora-49_v1.patch, gora.build.patch > > > The fix in NUTCH-873 introduces an unknown variable to the build process. > Since local ivy artifacts are unversioned, different people that install Gora > jars at different points in time will use the same artifact id but in fact > the artifacts (jars) will differ because they will come from different > revisions of Gora sources. Therefore Nutch builds based on the same svn rev. > won't be repeatable across different environments. > As much as it pains the ivy purists ;) until Gora publishes versioned > artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars > built from a known external rev. We can add a README that contains commit id > from Gora. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-807) JSParseFilter produces malformed URL
[ https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-807. --- Resolution: Won't Fix Closing old issues. The JSParseFilter is known to generate noisy URLS and is not used by default anymore. This won't get fixed > JSParseFilter produces malformed URL > > > Key: NUTCH-807 > URL: https://issues.apache.org/jira/browse/NUTCH-807 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.0.0 > Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux >Reporter: Minyao Zhu > > This is found when crawling site: http://zhidao.baidu.com/( a Chinese > language site ) > It appears this page contains javascripts which confused JSParseFilter, which > produced URL like this: > http://zhidao.baidu.com/){if(A===46){baidu.hide( > Not sure the impact/scope of this issue in general. The observation for this > specific site is, much less pages got crawled. > Thanks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-62) Add html META tag information into metaData in index-more plugin
[ https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-62. Resolution: Implemented This can be done in a more flexible way using index-metadata https://issues.apache.org/jira/browse/NUTCH-1264 > Add html META tag information into metaData in index-more plugin > > > Key: NUTCH-62 > URL: https://issues.apache.org/jira/browse/NUTCH-62 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Jack Tang >Priority: Trivial > Attachments: index-more.patch.zip > > > Now(version dev-0.7), only some metaData in http response such as type, > date, content-length are available int the index-more plugin. And we cannot > index/sotre the meta data in html header ( exactly) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1267: - Assignee: Julien Nioche > urlmeta to delegate indexing to index-metadata > -- > > Key: NUTCH-1267 > URL: https://issues.apache.org/jira/browse/NUTCH-1267 > Project: Nutch > Issue Type: Sub-task > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1267: - Description: Ideally we should get rid of urlmeta altogether and add the transmission of the meta to the outlinks in the core classes - not as a plugin. URLMeta is also a terrible name :-( Affects Version/s: 1.6 > urlmeta to delegate indexing to index-metadata > -- > > Key: NUTCH-1267 > URL: https://issues.apache.org/jira/browse/NUTCH-1267 > Project: Nutch > Issue Type: Sub-task > Components: indexer >Affects Versions: 1.6 > Reporter: Julien Nioche > Assignee: Julien Nioche > > Ideally we should get rid of urlmeta altogether and add the transmission of > the meta to the outlinks in the core classes - not as a plugin. URLMeta is > also a terrible name :-( -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog
[ https://issues.apache.org/jira/browse/NUTCH-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-412. --- Resolution: Implemented 6 years later ;-) the feed and parse-tika plugins can handle feeds > plugin to parse the feed-url (rss/atom) of a blog > - > > Key: NUTCH-412 > URL: https://issues.apache.org/jira/browse/NUTCH-412 > Project: Nutch > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Renaud Richardet >Priority: Minor > Attachments: FeedUrlFilter.java, plugin_parse-feedUrl2.diff, > plugin_parse-feedUrl.diff > > > A plugin that extracts the feed-url (rss/atom) of a blog by retrieving the > href from the element (if found), and stores it in metadata. > The meta can be accessed with > parse.getData().getMeta("feedUrl"); > you can test this plugin with the main method of HtmlParser. > Thanks for a feedback. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-648) debian style autocomplete
[ https://issues.apache.org/jira/browse/NUTCH-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-648. - Resolution: Won't Fix see comments above > debian style autocomplete > - > > Key: NUTCH-648 > URL: https://issues.apache.org/jira/browse/NUTCH-648 > Project: Nutch > Issue Type: Improvement > Environment: debian, and other linux >Reporter: Jim >Priority: Minor > > Here is a suggested improvement: At the end of this file is a debian > style bash autocomplete script, just place into /etc/bash_completion.d/ with > filename nutch, and you can tab complete at the command prompt, ie > bash> nutch [tab][tab] >crawl readdb convdb mergedb readlinkdb inject generate freegen fetch > fetch2 parse >readseg mergesegs updatedb invertlinks mergelinkdb index merge dedup > plugin server > bash> nutch c[tab][tab] >crawl convdb > etc. >This also includes optional parameters, and filename completion where it > can be used. I really like having this when typing in long nutch commands, > and think it would be a great addition to the project. >The file is heavily taken from the corresponding svn file that does the > same thing.
> File begins here:
> shopt -s extglob
> _nutch()
> {
>     local cur cmds cmdOpts optsParam opt
>     local i
>     COMPREPLY=()
>     cur=${COMP_WORDS[COMP_CWORD]}
>     # Possible expansions
>     cmds='crawl readdb convdb mergedb readlinkdb inject generate freegen fetch fetch2 parse readseg mergesegs updatedb invertlinks mergelinkdb index merge dedup plugin server'
>     if [[ $COMP_CWORD -eq 1 ]] ; then
>         COMPREPLY=( $( compgen -W "$cmds" -- "$cur" ) )
>         return 0
>     fi
>     # options that require a parameter
>     # This needs to be filled in better
>     optsParam="-topN|-depth"
>     # if not typing an option, or if the previous option required a
>     # parameter, then fallback on ordinary filename expansion
>     if [[ "$cur" != -* ]] || \
>        [[ ${COMP_WORDS[COMP_CWORD-1]} == @($optsParam) ]] ; then
>         return 0
>     fi
>     # possible options for the command
>     cmdOpts=
>     case ${COMP_WORDS[1]} in
>     crawl)
>         cmdOpts="-dir -threads -depth -topN"
>         ;;
>     readdb)
>         cmdOpts="-stats -dump -topN -url"
>         ;;
>     convdb)
>         cmdOpts="-withMetadata"
>         ;;
>     mergedb)
>         cmdOpts="-normalize -filter"
>         ;;
>     readlinkdb)
>         cmdOpts="-dump -url"
>         ;;
>     inject)
>         cmdOpts=""
>         ;;
>     generate)
>         cmdOpts="-force -topN -numFetchers -adddays -noFilter"
>         ;;
>     freegen)
>         cmdOpts="-filter -normalize"
>         ;;
>     fetch)
>         cmdOpts="-threads -noParsing"
>         ;;
>     fetch2)
>         cmdOpts="-threads -noParsing"
>         ;;
>     parse)
>         cmdOpts=""
>         ;;
>     readseg)
>         cmdOpts="-dump -list -get -nocontent -nofetch -nogenerate -noparse -noparsedata -noparsetext -dir"
>         ;;
>     mergesegs)
>         cmdOpts="-dir -filter -slice"
>         ;;
>     updatedb)
>         cmdOpts="-dir -force -normalize -filter -noAdditions"
>         ;;
>     invertlinks)
>         cmdOpts="-dir -force -noNormalize -noFilter"
>         ;;
>     mergelinkdb)
>         cmdOpts="-normalize -filter"
>         ;;
>     index)
>         cmdOpts=""
>         ;;
>     merge)
>         cmdOpts="-workingdir"
>         ;;
>     dedup)
>         cmdOpts=""
>         ;;
>     plugin)
>         cmdOpts=""
>         ;;
>     server)
>         cmdOpts=""
>         ;;
>     *)
>         ;;
>     esac
>     # take out options already given
>     for (( i=2; i<=$COMP_CWORD-1; ++i )) ; do
>         opt=${COMP_WORDS[$i]}
>         cmdOpts=" $cmdOpts "
>         cmdOpts=${cmdOpts/ ${opt} / }
>         # skip next option if this one requires a parameter
>         if [[ $opt == @($optsParam) ]] ; then
>             ((++i))
>         fi
>     done
>     COMPREPLY=( $( compgen -W "$cmdOpts" -- "$cur" ) )
>     return 0
> }
> complete -F _nutch -o default nutch
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1314: - Fix Version/s: 2.2 1.7 > Impose a limit on the length of outlink target urls > --- > > Key: NUTCH-1314 > URL: https://issues.apache.org/jira/browse/NUTCH-1314 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1314.patch > > > In the past we have encountered situations where crawling specific broken > sites resulted in ridiculously long urls that caused tasks to stall. > The regex plugins (normalizing/filtering) processed single urls for hours, if > not hanging indefinitely. > My suggestion is to limit the outlink target url length as soon as possible. It > is a configurable limit; the default is 3000. This should be long > enough for most uses, but strict enough to make sure regex > plugins do not choke on urls that are too long. Please see attached patch for > the Nutchgora implementation. > I'd like to hear what you think about this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
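As a rough picture of the proposal, the length check would run on outlink targets before any regex normalizer or filter sees them. The sketch below is not the attached patch: the class and method names are hypothetical, and treating a URL of exactly the limit as acceptable is an assumption. Only the default of 3000 comes from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the code from NUTCH-1314.patch. */
class OutlinkLengthLimitSketch {
    /** Default limit quoted in the issue description. */
    static final int DEFAULT_MAX_LENGTH = 3000;

    /** Keeps only the outlink targets whose length does not exceed maxLength. */
    static List<String> dropOverlongTargets(List<String> outlinkUrls, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinkUrls) {
            // Assumption: a URL of exactly maxLength characters is still kept.
            if (url.length() <= maxLength) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Because the check is a plain length comparison, its cost stays linear in the number of outlinks no matter how pathological an individual URL is.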
[jira] [Resolved] (NUTCH-1347) fetcher politeness related to map-reduce
[ https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1347. -- Resolution: Not A Problem > fetcher politeness related to map-reduce > > > Key: NUTCH-1347 > URL: https://issues.apache.org/jira/browse/NUTCH-1347 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.4 >Reporter: behnam nikbakht > Labels: fetch > > When Nutch is running on Hadoop, following the map-reduce model, each map task > works only on its own data, so each fetcher map task works with its own > queues and knows nothing about the queues of other tasks. It therefore enforces the delay between > successive requests and the maximum concurrent requests policy only on its own queues. > But with a simple test we found that this is not a good politeness mechanism when > we have multiple map tasks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1331) limit crawler to defined depth
[ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535970#comment-13535970 ] Julien Nioche commented on NUTCH-1331: -- Any objections or shall I commit this new plugin? > limit crawler to defined depth > -- > > Key: NUTCH-1331 > URL: https://issues.apache.org/jira/browse/NUTCH-1331 > Project: Nutch > Issue Type: New Feature > Components: generator, parser, storage >Affects Versions: 1.4 >Reporter: behnam nikbakht > Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch > > > there is a need to limit crawler to some defined depth, and importance of > this option is to avoid crawling of infinite loops, with dynamic generated > urls, that occur in some sites, and to optimize crawler to select important > urls. > an option is define a iteration limit on generate,fetch,parse,updatedb cycle, > but it works only if in each cycle, all of unfetched urls become fetched, > (without recrawling them and with some other considerations) > we can define a new parameter in CrawlDatum, named depth, and like score-opic > algorithm, compute depth of a link after parse, and in generate, only select > urls with valid depth. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1508: - Summary: Port limit crawler to defined depth to 2.x (was: Port limit crawler to defined depth to 23) > Port limit crawler to defined depth to 2.x > -- > > Key: NUTCH-1508 > URL: https://issues.apache.org/jira/browse/NUTCH-1508 > Project: Nutch > Issue Type: Improvement > Reporter: Julien Nioche > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1508) Port limit crawler to defined depth to 23
Julien Nioche created NUTCH-1508: Summary: Port limit crawler to defined depth to 23 Key: NUTCH-1508 URL: https://issues.apache.org/jira/browse/NUTCH-1508 Project: Nutch Issue Type: Improvement Reporter: Julien Nioche -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537804#comment-13537804 ] Julien Nioche commented on NUTCH-1508: -- Need to port the scoring-depth plugin to Nutch 2.x > Port limit crawler to defined depth to 2.x > -- > > Key: NUTCH-1508 > URL: https://issues.apache.org/jira/browse/NUTCH-1508 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 >Reporter: Julien Nioche > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1508: - Affects Version/s: 2.2 > Port limit crawler to defined depth to 2.x > -- > > Key: NUTCH-1508 > URL: https://issues.apache.org/jira/browse/NUTCH-1508 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.2 > Reporter: Julien Nioche > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1331) limit crawler to defined depth
[ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1331. -- Resolution: Fixed Fix Version/s: 1.7 Thanks Markus Committed in revision 1424875 for trunk and opened a separate issue for porting to 2.x and documented in nutch-default.xml {quote} scoring.depth.max 1000 Max depth value from seed allowed by default. Can be overriden on a per-seed basis by specifying "_maxdepth_=VALUE" as a seed metadata. This plugin adds a "_depth_" metadatum to the pages to track the distance from the seed it was found from. The depth is used to prioritise URLs in the generation step so that shallower pages are fetched first. {quote} > limit crawler to defined depth > -- > > Key: NUTCH-1331 > URL: https://issues.apache.org/jira/browse/NUTCH-1331 > Project: Nutch > Issue Type: New Feature > Components: generator, parser, storage >Affects Versions: 1.4 >Reporter: behnam nikbakht > Fix For: 1.7 > > Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch > > > there is a need to limit crawler to some defined depth, and importance of > this option is to avoid crawling of infinite loops, with dynamic generated > urls, that occur in some sites, and to optimize crawler to select important > urls. > an option is define a iteration limit on generate,fetch,parse,updatedb cycle, > but it works only if in each cycle, all of unfetched urls become fetched, > (without recrawling them and with some other considerations) > we can define a new parameter in CrawlDatum, named depth, and like score-opic > algorithm, compute depth of a link after parse, and in generate, only select > urls with valid depth. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
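The depth bookkeeping described in the quoted documentation can be pictured roughly as follows. This is a minimal sketch and not the scoring-depth plugin's code: the class and method names are hypothetical, plain string maps stand in for Nutch's metadata, and treating a seed without a `_depth_` entry as depth 1 is an assumption. Only the keys `_depth_` and `_maxdepth_` and the default of 1000 come from the text above; the prioritisation of shallower pages in the generation step is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration; not the scoring-depth plugin itself. */
class DepthLimitSketch {
    static final String DEPTH_KEY = "_depth_";
    static final String MAX_DEPTH_KEY = "_maxdepth_";
    /** Default quoted for scoring.depth.max. */
    static final int DEFAULT_MAX_DEPTH = 1000;

    /**
     * Builds the metadata an outlink would inherit from its parent page,
     * or returns empty if the outlink would lie beyond the allowed depth.
     */
    static Optional<Map<String, String>> childMetadata(Map<String, String> parent) {
        // Assumption: a page without a depth entry is a seed at depth 1.
        int parentDepth = Integer.parseInt(parent.getOrDefault(DEPTH_KEY, "1"));
        int maxDepth = Integer.parseInt(
                parent.getOrDefault(MAX_DEPTH_KEY, String.valueOf(DEFAULT_MAX_DEPTH)));
        int childDepth = parentDepth + 1;
        if (childDepth > maxDepth) {
            return Optional.empty();
        }
        Map<String, String> child = new HashMap<>();
        child.put(DEPTH_KEY, String.valueOf(childDepth));
        // A per-seed override travels with the links discovered from that seed.
        if (parent.containsKey(MAX_DEPTH_KEY)) {
            child.put(MAX_DEPTH_KEY, parent.get(MAX_DEPTH_KEY));
        }
        return Optional.of(child);
    }
}
```

Under these assumptions, a seed injected with `_maxdepth_=2` yields outlinks at depth 2, and the outlinks of those pages are dropped.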
[jira] [Comment Edited] (NUTCH-1331) limit crawler to defined depth
[ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537811#comment-13537811 ] Julien Nioche edited comment on NUTCH-1331 at 12/21/12 11:37 AM: - Thanks Markus. Committed in revision 1424875 for trunk, opened a separate issue for porting to 2.x, and documented in nutch-default.xml: {noformat} scoring.depth.max 1000 Max depth value from seed allowed by default. Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE" as a seed metadata. This plugin adds a "_depth_" metadatum to the pages to track the distance from the seed it was found from. The depth is used to prioritise URLs in the generation step so that shallower pages are fetched first. {noformat} was (Author: jnioche): Thanks Markus. Committed in revision 1424875 for trunk, opened a separate issue for porting to 2.x, and documented in nutch-default.xml: {quote} scoring.depth.max 1000 Max depth value from seed allowed by default. Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE" as a seed metadata. This plugin adds a "_depth_" metadatum to the pages to track the distance from the seed it was found from. The depth is used to prioritise URLs in the generation step so that shallower pages are fetched first. {quote} > limit crawler to defined depth > -- > > Key: NUTCH-1331 > URL: https://issues.apache.org/jira/browse/NUTCH-1331 > Project: Nutch > Issue Type: New Feature > Components: generator, parser, storage > Affects Versions: 1.4 > Reporter: behnam nikbakht > Fix For: 1.7 > > Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch > > > There is a need to limit the crawler to a defined depth. The importance of > this option is to avoid crawling infinite loops of dynamically generated > URLs, which occur on some sites, and to optimize the crawler to select important > URLs. > One option is to define an iteration limit on the generate, fetch, parse, updatedb cycle, > but it works only if, in each cycle, all unfetched URLs become fetched > (without recrawling them and with some other considerations). > We can define a new parameter in CrawlDatum, named depth, and, like the score-opic > algorithm, compute the depth of a link after parse and, in generate, select only > URLs with a valid depth.
[jira] [Commented] (NUTCH-1510) Upgrade to Hadoop 1.1.1
[ https://issues.apache.org/jira/browse/NUTCH-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538134#comment-13538134 ] Julien Nioche commented on NUTCH-1510: -- Can you test for 2.x as well? It should work straight out of the box. > Upgrade to Hadoop 1.1.1 > --- > > Key: NUTCH-1510 > URL: https://issues.apache.org/jira/browse/NUTCH-1510 > Project: Nutch > Issue Type: Improvement > Components: build > Affects Versions: 1.6 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Fix For: 1.7 > > Attachments: NUTCH-1510-1.7-1.patch
[jira] [Commented] (NUTCH-1507) Remove FetcherOutput
[ https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538137#comment-13538137 ] Julien Nioche commented on NUTCH-1507: -- Wouldn't that break the compatibility when trying to read from an existing crawlDB? > Remove FetcherOutput > > > Key: NUTCH-1507 > URL: https://issues.apache.org/jira/browse/NUTCH-1507 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.6 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1507-1.7-1.patch > > > The FetcherOutput class is not used anywhere and it and its references should > be removed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1507) Remove FetcherOutput
[ https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538151#comment-13538151 ] Julien Nioche commented on NUTCH-1507: -- bq. This code is used nowhere and only had two references from MapWritable and NutchWritable which means nothing by itself. What I was wondering was whether the change to these two classes would change their signature and break things, as they are used in the crawldb (correct me if I am wrong). > Remove FetcherOutput > > > Key: NUTCH-1507 > URL: https://issues.apache.org/jira/browse/NUTCH-1507 > Project: Nutch > Issue Type: Improvement > Components: fetcher > Affects Versions: 1.6 > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1507-1.7-1.patch > > > The FetcherOutput class is not used anywhere and it and its references should > be removed.
[jira] [Commented] (NUTCH-1507) Remove FetcherOutput
[ https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538158#comment-13538158 ] Julien Nioche commented on NUTCH-1507: -- Ok. Not entirely clear to me how this stuff works but it should be fairly easy to test anyway. Thanks! > Remove FetcherOutput > > > Key: NUTCH-1507 > URL: https://issues.apache.org/jira/browse/NUTCH-1507 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.6 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1507-1.7-1.patch > > > The FetcherOutput class is not used anywhere and it and its references should > be removed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545757#comment-13545757 ] Julien Nioche commented on NUTCH-1508: -- Hi Ferdy, I did not see NUTCH-1431 at all :-( NUTCH-1331 does the same but in a less intrusive way in terms of code changes, and also allows specifying a max distance per seed as well as a global one. Does NUTCH-1431 do that as well? Not sure what the best course of action is. I'd rather we kept the same approach in both branches. WDYT? > Port limit crawler to defined depth to 2.x > -- > > Key: NUTCH-1508 > URL: https://issues.apache.org/jira/browse/NUTCH-1508 > Project: Nutch > Issue Type: Improvement > Affects Versions: 2.2 > Reporter: Julien Nioche
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545958#comment-13545958 ] Julien Nioche commented on NUTCH-1031: -- Well, we have two separate params: http.agent.name, which is a single value sent to the servers when fetching, and http.robots.agents, which can have multiple values and is used for parsing robots.txt. The value of the latter SHOULD be split on commas. I don't think CC supports multiple values for http.robots.agents, but I'll ask Ken to be sure. > Delegate parsing of robots.txt to crawler-commons > - > > Key: NUTCH-1031 > URL: https://issues.apache.org/jira/browse/NUTCH-1031 > Project: Nutch > Issue Type: Task > Reporter: Julien Nioche > Assignee: Julien Nioche > Priority: Minor > Labels: robots.txt > Fix For: 1.7 > > Attachments: NUTCH-1031.v1.patch > > > We're about to release the first version of Crawler-Commons > [http://code.google.com/p/crawler-commons/] which contains a parser for > robots.txt files. This parser should also be better than the one we currently > have in Nutch. I will delegate this functionality to CC as soon as it is > available publicly.
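As an illustration of the comma splitting mentioned in the comment above, here is a small Python sketch. It only shows the string handling; the function name `parse_robot_agents` is hypothetical, and nothing here reflects how Nutch or crawler-commons actually implement it.

```python
def parse_robot_agents(raw_value):
    """Split a comma-separated http.robots.agents value into a clean list.

    Whitespace around names is trimmed, empty entries are dropped and
    duplicates are removed while keeping the original order.
    """
    agents = []
    for name in raw_value.split(","):
        name = name.strip()
        if name and name not in agents:
            agents.append(name)
    return agents
```

For example, `parse_robot_agents("mybot, otherbot,,*")` yields the three names `mybot`, `otherbot` and `*`, which could then be matched against the User-agent lines of a robots.txt file one by one.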
[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547783#comment-13547783 ] Julien Nioche commented on NUTCH-840: - Thanks Lewis. Will commit shortly unless someone has any objections. > Port tests from parse-html to parse-tika > > > Key: NUTCH-840 > URL: https://issues.apache.org/jira/browse/NUTCH-840 > Project: Nutch > Issue Type: Task > Components: parser > Affects Versions: 1.1, 1.6 > Reporter: Julien Nioche > Assignee: Julien Nioche > Fix For: 1.7, 2.2 > > Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, > NUTCH-840v2.patch > > > We don't have tests for HTML in parse-tika so I'll copy them from the old > parse-html plugin.
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Attachment: NUTCH-1047-1.x-v1.patch This is work in progress. This patch creates a new endpoint (IndexWriter) that plugins can implement. It comes with one such plugin (indexer-solr) and generic code for replacing the index and delete jobs. I haven't tested it very much. The main difference is that the SOLR URL must be passed as a Hadoop param, e.g. -D solr.server.url. It could also be put in nutch-site.xml once and for all. There will be some cleaning to do once this is stable, to remove the SOLR stuff from the core code etc. Please have a look and let me know your thoughts on this. > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Attachment: NUTCH-1047-1.x-v2.patch New version of the patch, which removes all SOLR-related stuff from the core. The crawl class assumes that SOLR is used (but this can be changed) and does not do the SOLR dedup anymore. We'll need a better mechanism for the dedup, as the existing one is SOLR-centric and not very scalable. Quite a drastic modification of the code, but it should be for the best. Please give it a try and let me know your thoughts. PS: you might need to delete the index.solr package by hand. > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
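The pluggable-backend idea discussed in this thread can be pictured with a short Python sketch. This is only an illustration of the pattern, not the NUTCH-1047 patch: the method names (`write`, `delete`, `commit`), the `InMemoryWriter` class and `run_index_job` are all hypothetical, and the real endpoint is a Java interface whose exact signature is not shown in these messages.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical backend interface; the method names are illustrative only."""

    @abstractmethod
    def write(self, doc):
        """Queue a document (a dict with an "id" key) for indexing."""

    @abstractmethod
    def delete(self, key):
        """Queue the removal of the document with the given id."""

    @abstractmethod
    def commit(self):
        """Make all queued changes visible in the backend."""


class InMemoryWriter(IndexWriter):
    """Toy backend standing in for a Solr, ElasticSearch or CouchDB plugin."""

    def __init__(self):
        self.pending = {}    # id -> doc, or None for a queued delete
        self.committed = {}  # id -> doc

    def write(self, doc):
        self.pending[doc["id"]] = doc

    def delete(self, key):
        self.pending[key] = None

    def commit(self):
        for key, doc in self.pending.items():
            if doc is None:
                self.committed.pop(key, None)
            else:
                self.committed[key] = doc
        self.pending.clear()


def run_index_job(writers, docs, deleted_keys):
    """Generic job: the core code only talks to the IndexWriter interface."""
    for writer in writers:
        for doc in docs:
            writer.write(doc)
        for key in deleted_keys:
            writer.delete(key)
        writer.commit()
```

The point of the design is that `run_index_job` never mentions a concrete backend, so swapping SOLR for something else would mean adding another `IndexWriter` implementation rather than touching the core job.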
[jira] [Created] (NUTCH-1517) CloudSearch indexer
Julien Nioche created NUTCH-1517: Summary: CloudSearch indexer Key: NUTCH-1517 URL: https://issues.apache.org/jira/browse/NUTCH-1517 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.7 Once we have made the indexers pluggable, we should add a plugin for Amazon CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a JSON-based representation, Search Data Format (SDF), which we could reuse for a file-based indexer.
[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552513#comment-13552513 ] Julien Nioche commented on NUTCH-1371: -- Hi Lewis. Yep the plugins need to be managed in the same way + cleanup the ivy stuff etc... > Replace Ivy with Maven Ant tasks > > > Key: NUTCH-1371 > URL: https://issues.apache.org/jira/browse/NUTCH-1371 > Project: Nutch > Issue Type: Improvement > Components: build >Reporter: Julien Nioche >Assignee: Lewis John McGibbney > Fix For: 1.7, 2.2 > > Attachments: NUTCH-1371.patch > > > We might move to Maven altogether but a good intermediate step could be to > rely on the maven ant tasks for managing the dependencies. Ivy does a good > job but we need to have a pom file anyway for publishing the artefacts which > means keeping the pom.xml and ivy.xml contents in sync. Most devs are also > more familiar with Maven, and it is well integrated in IDEs. Going the > ANT+MVN way also means that we don't have to rewrite the whole building > process and can rely on our existing script -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Attachment: NUTCH-1047-1.x-v3.patch Cleaner version of the patch, which removes the content from the solr package, adds the dependencies to the indexer-solr plugin in the plugin.xml definition, and changes the nutch script so that the SOLR-related commands work in the same way but use the plugin under the bonnet. A few more things to do, e.g. management of the commits when indexing, but we are getting there. > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554369#comment-13554369 ] Julien Nioche commented on NUTCH-1087: -- Hi Sebastian bq. SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1` is not a good option as it won't work in deploy mode, only in local whereas using 'hadoop fs -ls' works in both cases. Julien > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Julien Nioche >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, > NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862 ] Julien Nioche commented on NUTCH-1087: -- Apologies Seb, I should (a) not read emails late in the evening after a long day (b) check the code before commenting ;-) > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma > Assignee: Julien Nioche >Priority: Minor > Fix For: 1.6, 2.2 > > Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, > NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556026#comment-13556026 ] Julien Nioche commented on NUTCH-1047: -- Good point Markus, thanks. The main issue I am struggling with at the moment is what to do with the SOLR deduplication. I don't think we can run a MapReduce job from a plugin, so it's not going to work. One (temporary) option would be to leave it as is, so that the crawl command, the crawl script and the nutch command all work as expected, and then get rid of it when we have a generic deduplication job. > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556041#comment-13556041 ] Julien Nioche commented on NUTCH-1047: -- We definitely need a better mechanism for deduplication. +1 to leave as is for now until we have a better option. What is slightly annoying for this issue is that it means adding it back to the main classes, as well as SOLR as a dependency; not a big deal though. > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556054#comment-13556054 ] Julien Nioche commented on NUTCH-1047: -- Tried, failed. Re other issues: wouldn't it make sense to do NUTCH-1047 first, before you improve the SOLR backends? > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556079#comment-13556079 ] Julien Nioche commented on NUTCH-1047: -- Should not be a big deal, as the classes affected by NUTCH-1480 are not modified that much by NUTCH-1047. It also means that you'll get to look at the code for this issue, which is a good way of reviewing it :-) > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556090#comment-13556090 ] Julien Nioche commented on NUTCH-1480: -- I'd rather it was implemented as an extension of NUTCH-945, where we'd have a partitioner that sends to all SOLR instances, which is, I believe, what NUTCH-1480 is about. There are many cases where we'd want to shard according to other criteria, and NUTCH-945 would provide a more generic framework. Does this make sense? > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer > Reporter: Markus Jelsma > Assignee: Markus Jelsma > Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1480-1.6.1.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URLs using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs.
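The two routing strategies contrasted in the comment above (send every document to all servers versus shard by some criterion) can be sketched in a few lines of Python. This is a hedged illustration only: the function names are hypothetical, the CRC32-based sharding is just one possible criterion chosen for the example, and neither NUTCH-1480 nor NUTCH-945 is claimed to work this way.

```python
import zlib


def parse_server_urls(raw_value):
    """Read a comma-delimited list of server URLs, as the issue proposes."""
    return [url.strip() for url in raw_value.split(",") if url.strip()]


def route_to_all(doc_id, servers):
    """Replication-style routing: every server receives every document."""
    return list(servers)


def route_by_hash(doc_id, servers):
    """Sharding-style routing: one server per document, chosen by a stable hash.

    zlib.crc32 is used (rather than the built-in hash) so that the same
    document id maps to the same server across runs.
    """
    index = zlib.crc32(doc_id.encode("utf-8")) % len(servers)
    return [servers[index]]
```

Treating both as interchangeable routing functions is what makes the more generic framework attractive: the writer only needs to ask "which servers get this document?" and the answer can come from either strategy.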
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556091#comment-13556091 ] Julien Nioche commented on NUTCH-1047: -- My suggestion was that you give NUTCH-1047 a try, wait until it is committed, then commit your changes to it, not that I'd patch it to include your changes. BTW, I have commented on NUTCH-1480. Thanks, Julien > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing pluggable. At the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though: we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could pertain to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends.
[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.
[ https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556100#comment-13556100 ] Julien Nioche commented on NUTCH-1480: -- probably depends on whether we want to support both SOLR 3.x and SOLR 4.x. Got your point about indexing to multiple clouds, thanks! > SolrIndexer to write to multiple servers. > - > > Key: NUTCH-1480 > URL: https://issues.apache.org/jira/browse/NUTCH-1480 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.7 > > Attachments: NUTCH-1480-1.6.1.patch > > > SolrUtils should return an array of SolrServers and read the SolrUrl as a > comma delimited list of URL's using Configuration.getString(). SolrWriter > should be able to handle this list of SolrServers. > This is useful if you want to send documents to multiple servers if no > replication is available or if you want to send documents to multiple NOCs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
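The NUTCH-1480 description above proposes reading the Solr URL as a comma-delimited list. A minimal sketch of that splitting step follows, assuming nothing about the actual patch: the class `SolrUrlList`, its `parse` method and the host names are hypothetical, and plain strings stand in for SolrServer instances. Hadoop's `Configuration.getStrings(name)` performs a similar comma split and is probably what the real code would rely on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: splits a comma-delimited
// property value into individual server URLs.
class SolrUrlList {

  // Returns the trimmed, non-empty entries of a comma-delimited string.
  static List<String> parse(String commaDelimited) {
    List<String> urls = new ArrayList<>();
    if (commaDelimited == null) {
      return urls;
    }
    for (String part : commaDelimited.split(",")) {
      String url = part.trim();
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    // Placeholder host names, used only for illustration.
    List<String> urls = parse("http://solr1:8983/solr, http://solr2:8983/solr");
    for (String url : urls) {
      // A real writer would hold one server object per URL and
      // send every document to each of them.
      System.out.println("would index to " + url);
    }
  }
}
```

A writer built this way would loop over the list for every add, delete and commit, which is the behaviour the issue asks of SolrWriter.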
[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557163#comment-13557163 ] Julien Nioche commented on NUTCH-840: - Trunk => Committed revision 1435101. Anyone to port to 2x? > Port tests from parse-html to parse-tika > > > Key: NUTCH-840 > URL: https://issues.apache.org/jira/browse/NUTCH-840 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.1, 1.6 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.7, 2.2 > > Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, > NUTCH-840v2.patch > > > We don't have test for HTML in parse-tika so I'll copy them from the old > parse-html plugin -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Attachment: NUTCH-1047-1.x-v4.patch First working patch! Added the SOLRDedup back into the core classes as it does not seem to be possible to run a MapReduce class from within a plugin. Added 2 new methods to the IndexWriter interface (commit, update) + fixed CleaningJob and nutch script. Tried on a small crawl with the crawl script and it worked as expected > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558195#comment-13558195 ] Julien Nioche commented on NUTCH-1031: -- bq. 1. Continue to have the legacy code for parsing robots file. bq. 2. As an add-in, crawler-commons can be employed for the parsing. User can pick based on a config parameter with a note indicating that #2 wont work with multiple HTTP agents. Option 2 is overkill IMHO. The existing code works fine, and the point of moving to CC was to get rid of some of our code, not to make it bigger with yet another configuration option. Lewis: donating our code is a good idea, but in the case of robots parsing it's more about modifying the existing parser in CC. I haven't had time to look at robots parsing in CC and am not familiar with it, but it would be a good thing to improve it. In the meantime let's go for option 1. Thanks! > Delegate parsing of robots.txt to crawler-commons > - > > Key: NUTCH-1031 > URL: https://issues.apache.org/jira/browse/NUTCH-1031 > Project: Nutch > Issue Type: Task >Reporter: Julien Nioche >Assignee: Julien Nioche >Priority: Minor > Labels: robots.txt > Fix For: 1.7 > > Attachments: NUTCH-1031.v1.patch > > > We're about to release the first version of Crawler-Commons > [http://code.google.com/p/crawler-commons/] which contains a parser for > robots.txt files. This parser should also be better than the one we currently > have in Nutch. I will delegate this functionality to CC as soon as it is > available publicly
[jira] [Created] (NUTCH-1522) Upgrade to Tika 1.3
Julien Nioche created NUTCH-1522: Summary: Upgrade to Tika 1.3 Key: NUTCH-1522 URL: https://issues.apache.org/jira/browse/NUTCH-1522 Project: Nutch Issue Type: Task Components: parser Reporter: Julien Nioche Priority: Minor Fix For: 1.7, 2.2 http://www.apache.org/dist/tika/CHANGES-1.3.txt -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1482) Rename HTMLParseFilter
[ https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1482: - Fix Version/s: 1.7 > Rename HTMLParseFilter > -- > > Key: NUTCH-1482 > URL: https://issues.apache.org/jira/browse/NUTCH-1482 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.5.1 > Reporter: Julien Nioche > Fix For: 1.7 > > > See NUTCH-861 for a background discussion. We have changed the name in 2.x to > better reflect what it does and I think we should do the same for 1.x. > any objections? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562558#comment-13562558 ] Julien Nioche commented on NUTCH-1047: -- Hi Lufeng. The solrindex command in the nutch script works just as before. You can also invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter e.g. {{-D solr.server.url=xx}} SolrUtils is duplicated indeed because of DeleteDuplicates, which is a SOLR-specific implementation. We need to build a generic deduplicator at some point and it will use the pluggable backends. I decided to leave the SOLR-based one in for now, but if most people don't use it then we should probably shelve it. This is a separate issue though. Thanks for your comments > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Julien Nioche >Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. 
> We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
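The comment above mentions passing the SOLR URL as a Hadoop parameter with `-D solr.server.url=xx`. As a rough illustration of what such a parameter amounts to, the sketch below collects `-D key=value` pairs from an argument list into a map. It is not Hadoop's option parser, which does considerably more; the class name `DashDProperties` and the example URL are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "-D key=value" handling; Hadoop's own
// generic option parsing is more involved than this.
class DashDProperties {

  // Collects "-D key=value" and "-Dkey=value" pairs into a map,
  // ignoring every other argument.
  static Map<String, String> parse(String[] args) {
    Map<String, String> props = new LinkedHashMap<>();
    for (int i = 0; i < args.length; i++) {
      String pair = null;
      if (args[i].equals("-D") && i + 1 < args.length) {
        pair = args[++i];               // value is the next argument
      } else if (args[i].startsWith("-D") && args[i].length() > 2) {
        pair = args[i].substring(2);    // value is glued to the flag
      }
      if (pair != null) {
        int eq = pair.indexOf('=');
        if (eq > 0) {
          props.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
      }
    }
    return props;
  }

  public static void main(String[] args) {
    // Example arguments; the URL is a placeholder.
    String[] example = {"-D", "solr.server.url=http://localhost:8983/solr/", "crawldb"};
    System.out.println(parse(example));
  }
}
```

Once the property sits in the configuration, a backend can look it up by name, which is why a value set in nutch-site.xml and one passed with -D should behave the same way.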
[jira] [Commented] (NUTCH-1250) parse-html does not parse links with empty anchor
[ https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562577#comment-13562577 ] Julien Nioche commented on NUTCH-1250: -- See comment in DOMContentUtils {quote} * Links without inner structure (tags, text, etc) are discarded, as * are links which contain only single nested links and empty text * nodes (this is a common DOM-fixup artifact, at least with * nekohtml). {quote} The solution you suggested would probably generate quite a lot of noise by not filtering the links added by Neko. I agree that outlinks without anchors should not be filtered. What about testing that they have an href attribute instead of testing for the presence of a child node? > parse-html does not parse links with empty anchor > - > > Key: NUTCH-1250 > URL: https://issues.apache.org/jira/browse/NUTCH-1250 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.4 >Reporter: Andreas Janning > Fix For: 1.7, 2.2 > > Attachments: DOMContentUtils_v1.patch > > > The parse-html plugin does not generate an outlink if the link has no anchor > For example the following HTML-Code does not create an Outlink: > {code:html} > > {code} > The JUnit-Test TestDOMContentUtils tries to test this but fails since there > is a comment inside the <a> tag. > {code:title=TestDOMContentUtils.java|borderStyle=solid} > new String(" title " > + "" > + "" > + " " > + " " > + ""), > {code} > When you remove the comment the test fails. > {code:title=TestDOMContentUtils.java Test fails|borderStyle=solid} > new String(" title " > + "" > + "" // no anchor > + " " > + " " > + ""), > {code}
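The suggestion in the comment above, keeping a link when it has an href attribute rather than when it has a child node, can be sketched with the JDK's own DOM API. This is only an illustration under stated assumptions: it parses a small well-formed XHTML string instead of the NekoHTML output that parse-html works on, and the class `HrefOutlinkSketch` and its `outlinks` method are hypothetical, not the DOMContentUtils code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the Nutch implementation.
class HrefOutlinkSketch {

  // Returns the href of every <a> element that carries a non-empty
  // href attribute, whether or not the element has child nodes.
  static List<String> outlinks(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
      NodeList anchors = doc.getElementsByTagName("a");
      List<String> hrefs = new ArrayList<>();
      for (int i = 0; i < anchors.getLength(); i++) {
        Element a = (Element) anchors.item(i);
        String href = a.getAttribute("href"); // empty string when absent
        if (!href.isEmpty()) {
          hrefs.add(href);
        }
      }
      return hrefs;
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input", e);
    }
  }

  public static void main(String[] args) {
    // First link has no anchor text, second only whitespace, third no href.
    String page = "<html><body><a href=\"g\"></a><a href=\"g1\"> </a>"
        + "<a name=\"x\">text</a></body></html>";
    System.out.println(outlinks(page));
  }
}
```

With this test, an `<a>` element inserted by DOM fix-up without an href would still be dropped, while an empty link that does point somewhere would be kept.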
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173 ] Julien Nioche commented on NUTCH-1047: -- @tejasp I can reproduce the issue and am looking into it, thanks. Somehow the configuration does not get passed on properly when using the crawl command. Lufeng {quote} But i don't know why not add an option to set IndexerUrl such as bin/nutch solrindex -indexurl http://localhost:8983/solr/. {quote} Whether it is passed as a parameter or via configuration should not make much of a difference. Your suggestion also assumes that the indexing backend can be reached via a single URL, which is not necessarily the case: a backend might not need a URL at all or, on the contrary, might need several. Better to leave that logic in the configuration and assume that the backends will find whatever they need there. {quote} the corrent command to invoke the IndexingJob command is "bin/nutch solrindex http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter". {quote} As explained above, we want to keep compatibility with the existing solrindex command and not change its syntax. Underneath it uses the new plugin-based code but sets the value of the solr config. There is no shortcut for the generic indexing job command in the nutch script yet, but we could add one. For now it has to be called in full e.g. bin/nutch org.apache.nutch.indexer.IndexingJob ... which will make sense when we have other indexing backends and not just SOLR. Think of 'nutch solrindex' as a shortcut for the generic command. 
> Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer > Reporter: Julien Nioche >Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196 ] Julien Nioche commented on NUTCH-1047: -- Hi Tejas It will work every time you set it in nutch-site.xml. As for setting it with -D in the crawl command - you definitely should not have to do that and this is where the bug is. The problem is that for some reason the value we take from the crawl command is correctly set in the configuration object, but the latter is reloaded or overridden during the call to JobClient.runJob(job) (IndexingJob line 120). BTW the crawl command is deprecated and should be removed at some point as we have the crawl script. Could you try using the SOLRIndex command as well as the crawl script while I try and solve the problem with the crawl command? Thanks Julien > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 
'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263 ] Julien Nioche commented on NUTCH-1047: -- Tejas The crawl script and the solr index should work without setting "solr.server.url" in nutch-site.xml or using -D as this is handled for you in the nutch script. Can you please test without specifying "solr.server.url" in nutch-site.xml? Thanks > Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Julien Nioche >Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566291#comment-13566291 ] Julien Nioche commented on NUTCH-1047: -- [~wastl-nagel] a text-based indexer is a good idea. Having one that generates data in the format used by CloudSearch (see [NUTCH-1517]) would be cool as well. As for your concerns: most people currently use the SOLR indexer, which will still be the one activated by default. I expect only a minority of people will try to use something else, and if they do, checking which one is activated is no big deal, either via the config file or from the logs. Passing the options via the config with -D is not very different from using a standard parameter, with the added benefit that we can set things in nutch-site.xml once and for all and hence make the commands much simpler. As for the list of properties, they would vary from backend to backend anyway. Each plugin could have a README describing its options; compared to having everything in nutch-default.xml, at least the descriptions would be contained within the related plugin. [~tejasp] good catch on the number of args, will fix it. Re the usage message: we could add a getUsage() method to each backend that the generic command would call for all the active indexing plugins. I think the solrindex shortcut is just a temporary measure though, until the documentation is up to scratch and the user base has got used to the generic commands. Thanks for taking the time to share your thoughts, guys. 
> Pluggable indexing backends > --- > > Key: NUTCH-1047 > URL: https://issues.apache.org/jira/browse/NUTCH-1047 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Julien Nioche > Assignee: Julien Nioche > Labels: indexing > Fix For: 1.7 > > Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, > NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch > > > One possible feature would be to add a new endpoint for indexing-backends and > make the indexing plugable. at the moment we are hardwired to SOLR - which is > OK - but as other resources like ElasticSearch are becoming more popular it > would be better to handle this as plugins. Not sure about the name of the > endpoint though : we already have indexing-plugins (which are about > generating fields sent to the backends) and moreover the backends are not > necessarily for indexing / searching but could be just an external storage > e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this > could be pertaining to the storage in GORA. 'indexing-backend' is the best > name that came to my mind so far - please suggest better ones. > We should come up with generic map/reduce jobs for indexing, deduplicating > and cleaning and maybe add a Nutch extension point there so we can easily > hook up indexing, cleaning and deduplicating for various backends. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
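The getUsage() idea floated in the comment above could look roughly like the sketch below, in which a generic command asks each active backend for its own usage text. Everything here is an assumption for illustration: the `UsageAware` interface, the `UsageSketch` class and the option strings are made up, and the real Nutch indexing interface may expose this differently or not at all.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an indexing backend plugin; the real
// Nutch interface may differ.
interface UsageAware {
  String getUsage();
}

class UsageSketch {

  // Builds one usage message from all active backends, one line each.
  static String usage(List<UsageAware> activeBackends) {
    StringBuilder sb = new StringBuilder("Active indexing backends:\n");
    for (UsageAware backend : activeBackends) {
      sb.append("  ").append(backend.getUsage()).append('\n');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Both option descriptions are invented for the example.
    UsageAware solr = () -> "solr: -D solr.server.url=<url>";
    UsageAware text = () -> "text: -D text.output.dir=<dir>";
    System.out.print(usage(Arrays.asList(solr, text)));
  }
}
```

Keeping the usage text inside each backend matches the point made in the comment about per-plugin READMEs: the description of an option lives next to the code that reads it.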