Re: Choosing an efficient family configuration for GORA HBase
Ok thanks. I was just wondering whether there were any developments on this.

I'm not sure yet what would be fastest in the case of Nutch. All I know from our own experience is that it is best practice to group frequently accessed columns together, but to store large columns in a separate family. Joining multiple families in a Scan should be no problem performance-wise. (People reporting problems with too many families probably have a problem with their HBase deployment in general.)

In short, if the Parser needs Content but most other jobs don't (jobs that do need other columns from the Fetch family, for example Generator or DbUpdater), it might be beneficial to optimize the family configuration to reflect this. This could make Parser jobs slightly slower but increase the throughput of the other jobs, so that total throughput may be better.

For now we will use the default configuration, but we will report back on this when we have tried some alternatives.

On 10/01/2011 10:23 PM, Alexis wrote:

Dear Ferdy,

This mapping is user defined. It specifies where the Avro fields required by Nutch jobs are stored in HBase. You can tweak the schema according to these kinds of considerations by editing the config file.

Content is populated by the Fetcher job (writes), which downloads the web page. It is parsed by the Parser job (reads), which extracts the links and the metadata.

For example, these are the fields that might need to be grouped in the same column family (but they are not) because they are all required for the parse step:

From http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup

  static {
    FIELDS.add(WebPage.Field.STATUS);
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.CONTENT_TYPE);
    FIELDS.add(WebPage.Field.SIGNATURE);
    FIELDS.add(WebPage.Field.MARKERS);
    FIELDS.add(WebPage.Field.PARSE_STATUS);
    FIELDS.add(WebPage.Field.OUTLINKS);
    FIELDS.add(WebPage.Field.METADATA);
  }

It looks tricky. I've heard that, on the contrary, people usually don't use more than 3 column families, to avoid slowing down the scans as you mentioned. Not sure though.

If you manage to optimize the config with big improvements in the processing times, don't hesitate to edit the wiki page...

On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

Hi,

About the example GORA HBase mapping at: http://wiki.apache.org/nutch/GORA_HBase

Are there any current developments on improving the configuration for the column mappings? For example, at first glance it seems that it would be more efficient to put the fairly big column 'content' in a completely separate family. This way, scans over the smaller columns that do not need the 'content' column run much faster, because the scan will completely skip 'content' at the regionserver level. (All columns in each family are stored in the same file per region.)

Any thoughts on this?

Ferdy.
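One way to reason about the grouping question in this thread is to list the fields each job reads and check which fields are read by only one job, since those are candidates for a family of their own. The following is a minimal sketch, not Nutch or Gora code: the class `FamilyPlanner`, the method `exclusiveFields`, and the field sets given for `generator` and `dbupdater` are assumptions made up for illustration. Only the `parser` set is taken from the ParserJob listing quoted in the thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch or Gora.
final class FamilyPlanner {

    /** Returns the fields that only the given job reads, sorted by name. */
    static Set<String> exclusiveFields(Map<String, Set<String>> fieldsByJob, String job) {
        Set<String> own = fieldsByJob.get(job);
        if (own == null) {
            throw new IllegalArgumentException("Unknown job: " + job);
        }
        Set<String> result = new TreeSet<>(own);
        for (Map.Entry<String, Set<String>> entry : fieldsByJob.entrySet()) {
            if (!entry.getKey().equals(job)) {
                // Drop every field that some other job also reads.
                result.removeAll(entry.getValue());
            }
        }
        return result;
    }

    private static Set<String> fields(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> fieldsByJob = new LinkedHashMap<>();
        // Taken from the ParserJob static block quoted in the thread.
        fieldsByJob.put("parser", fields("STATUS", "CONTENT", "CONTENT_TYPE", "SIGNATURE",
                "MARKERS", "PARSE_STATUS", "OUTLINKS", "METADATA"));
        // ASSUMPTION: illustrative field sets only, not read from the Nutch source.
        fieldsByJob.put("generator", fields("STATUS", "MARKERS", "FETCH_TIME", "SCORE"));
        fieldsByJob.put("dbupdater", fields("STATUS", "MARKERS", "OUTLINKS", "SCORE"));

        // Fields read only by the parser are candidates for a separate family.
        System.out.println(exclusiveFields(fieldsByJob, "parser"));
    }
}
```

With real per-job field sets taken from the Nutch source, a field such as CONTENT showing up as parser-only would support Ferdy's suggestion of a separate family. The made-up inputs above prove nothing about actual Nutch behaviour.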
[jira] [Resolved] (NUTCH-1137) LinkDb / invertlinks: command line arguments ignored
[ https://issues.apache.org/jira/browse/NUTCH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1137.
Resolution: Fixed

Committed for 1.4 in rev. 1178376. Reused crawldb code instead. Thanks for opening this issue Sebastian.

LinkDb / invertlinks: command line arguments ignored
Key: NUTCH-1137
URL: https://issues.apache.org/jira/browse/NUTCH-1137
Project: Nutch
Issue Type: Bug
Components: linkdb
Affects Versions: 1.3
Reporter: Sebastian Nagel
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.4
Attachments: NUTCH-1137-1.5.patch

If the tool invertlinks is called with option -dir segmentsDir all remaining arguments are ignored:
{noformat}
% $NUTCH_HOME/bin/nutch invertlinks linkdb -dir segments -noNormalize -noFilter
LinkDb: starting at 2011-09-28 23:24:07
LinkDb: linkdb: linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
{noformat}
(URLs are normalized and filtered despite -noNormalize/-noFilter)

The patch also restricts the ordering of arguments according to the help text:
Usage: LinkDb linkdb (-dir segmentsDir | seg1 seg2 ...) [-force] [-noNormalize] [-noFilter]
(segments must be given before the optional flags)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
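The bug described in this issue is a common shape: handling of one option (`-dir`) ends argument processing early, so later flags are never seen. The sketch below shows one way to parse the documented usage so that flags after `-dir segmentsDir` are still honoured. It is not the committed patch and does not reproduce the real `LinkDb` code: the class `LinkDbArgs` and its fields are hypothetical, and unlike the patch it does not enforce that segments come before the optional flags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical argument holder; not the actual org.apache.nutch.crawl.LinkDb code.
final class LinkDbArgs {
    final String linkDb;
    final List<String> segments = new ArrayList<>();
    String segmentsDir;        // set only when -dir is given
    boolean force = false;
    boolean normalize = true;  // defaults match the log output in the report
    boolean filter = true;

    private LinkDbArgs(String linkDb) {
        this.linkDb = linkDb;
    }

    static LinkDbArgs parse(String[] args) {
        if (args.length < 2) {
            throw new IllegalArgumentException(
                "Usage: LinkDb linkdb (-dir segmentsDir | seg1 seg2 ...) [-force] [-noNormalize] [-noFilter]");
        }
        LinkDbArgs parsed = new LinkDbArgs(args[0]);
        // Walk every argument; -dir consumes its value and the loop keeps going.
        for (int i = 1; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-dir")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-dir requires a directory");
                }
                parsed.segmentsDir = args[++i];
            } else if (arg.equals("-force")) {
                parsed.force = true;
            } else if (arg.equals("-noNormalize")) {
                parsed.normalize = false;
            } else if (arg.equals("-noFilter")) {
                parsed.filter = false;
            } else if (arg.startsWith("-")) {
                throw new IllegalArgumentException("Unknown option: " + arg);
            } else {
                parsed.segments.add(arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        LinkDbArgs parsed = parse(new String[] {
            "linkdb", "-dir", "segments", "-noNormalize", "-noFilter"});
        System.out.println("URL normalize: " + parsed.normalize);
        System.out.println("URL filter: " + parsed.filter);
    }
}
```

Run against the command line from the report, this sketch leaves both `normalize` and `filter` set to false, which is the behaviour the reporter expected.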
Re: Providing a list of FAQ's with every new subscribe request
Hi Sami,

At the moment I am not in a position to take on the role of mailing list moderator. But I've found out that the list moderators should be able to configure the nature of documentation on a per-list basis by emailing ${list}-help@ from their moderator address and following the instructions.

Would it be possible to send out a list of our official FAQ's when a new user confirms their subscription to both the user@ and dev@ lists? What are your thoughts on this?

Thanks
Lewis

On Tue, Sep 27, 2011 at 9:24 PM, Sami Siren ssi...@gmail.com wrote:

I think moderators can be changed by filing a jira issue (by one of the PMC members) to the infra project, for example see https://issues.apache.org/jira/browse/INFRA-3511

Moderation is a simple task: you just let good messages (usually|only coming from non-subscribed senders) through and forget about the rest.

Julien: I am pretty sure I am still a moderator at dev and user - I just tried some of the moderator commands and they were successful.

--
Sami Siren

On Tue, Sep 27, 2011 at 9:32 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote:

Hi Sami,

Who is it that we are supposed to speak to regarding moderation? I tried to contact the infra@ team but am still awaiting a reply. What is included in moderation? I'm completely foreign to all of this, and as Julien stated I was not aware that there was anyone directly linked to Nutch list moderation. The info on the apache developers area is pretty vague and I haven't been able to get much further with this.

Thanks

On Tue, Sep 27, 2011 at 6:33 PM, Sami Siren ssi...@gmail.com wrote:

I am getting moderation emails and I think that there's somebody else doing moderation too, since the messages get sent to the list without me accepting them.

I would like to step down from the moderator status and have someone else do moderation instead, because frankly I have not been doing a great job with it. Any volunteers?

--
Sami Siren

On Tue, Sep 27, 2011 at 12:09 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

We don't have moderators for the user and dev lists

On 26 September 2011 20:09, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote:

Thanks Markus,

Who is mailing list moderator? If I can get this info before trying to contact infra it would be great.

On Mon, Sep 26, 2011 at 7:37 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Sounds like a good idea. I think you need to be ML moderator to make changes:
http://www.apache.org/dev/committers.html#mail-moderate

Hi,

I just signed up to the JUnit users lists and received a really well documented FAQ accompaniment when I subscribed. I think this would be a great resource for new Nutch users. Does anyone agree/disagree? How do we go about configuring this? Is this a request for the infra team?

Thank you

--
*Lewis*

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
[jira] [Updated] (NUTCH-1144) Filtering optional in WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1144:
Fix Version/s: (was: 1.5)

Filtering optional in WebGraph
Key: NUTCH-1144
URL: https://issues.apache.org/jira/browse/NUTCH-1144
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor

There is no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
[jira] [Resolved] (NUTCH-1144) Filtering optional in WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1144.
Resolution: Won't Fix

Decided to do filtering and normalizing in one issue.

Filtering optional in WebGraph
Key: NUTCH-1144
URL: https://issues.apache.org/jira/browse/NUTCH-1144
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor

There is no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
[jira] [Updated] (NUTCH-1142) Normalization and filtering in WebGraph
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1142:
Description: The WebGraph program performs URL normalization. Since normalization of outlinks is already performed during the parse, it should become optional. There is also no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
(was: The WebGraph program performs URL normalization. Since normalization of outlinks is already performed during the parse, it should become optional.)
Patch Info: Patch Available
Summary: Normalization and filtering in WebGraph (was: Normalization optional in WebGraph)

Normalization and filtering in WebGraph
Key: NUTCH-1142
URL: https://issues.apache.org/jira/browse/NUTCH-1142
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5

The WebGraph program performs URL normalization. Since normalization of outlinks is already performed during the parse, it should become optional. There is also no URL filtering mechanism in the web graph program. When a CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible to remove it from the web graph as well.
[jira] [Commented] (NUTCH-1143) Omit anchor in webgraph's LinkDatum
[ https://issues.apache.org/jira/browse/NUTCH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119281#comment-13119281 ]

Markus Jelsma commented on NUTCH-1143:

It seems the anchor field was once used for indexing the best ranking anchor for a given URL, but the indexing code is legacy. With the current version users must invert links, pass the linkdb and enable index-anchor to index anchors, so having an anchor in LinkDatum is obsolete for now.

Instead of completely removing the anchor code we should make it optional. By doing that we can write indexing code later and pass the webgraph to the indexer instead of a linkdb. I opt for defaulting the setting to false (i.e. do not store anchors) since they are unusable at the moment.

Omit anchor in webgraph's LinkDatum
Key: NUTCH-1143
URL: https://issues.apache.org/jira/browse/NUTCH-1143
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.5

Anchors are stored unchecked in the webgraph, apparently for cosmetic reasons only. When dealing with hundreds of millions of records this takes up significant space and I/O time. This issue should add an option to omit the anchor.
[jira] [Resolved] (NUTCH-1058) Upgrade Solr schema to version 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1058.
Resolution: Fixed
Assignee: Markus Jelsma

Committed for 1.4 in rev. 1178409 and for nutchgora in rev. 1178410.

Upgrade Solr schema to version 1.4
Key: NUTCH-1058
URL: https://issues.apache.org/jira/browse/NUTCH-1058
Project: Nutch
Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.4, nutchgora

The version of our Solr schema should be updated from 1.3 to the current version. I propose to commit the change prior to 1.4 and 2.0 RC's, the Solr schema version may have incremented more than once at the time of an RC.
[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-717:
Fix Version/s: (was: 1.4) 1.5

Make Nutch Solr integration easier
Key: NUTCH-717
URL: https://issues.apache.org/jira/browse/NUTCH-717
Project: Nutch
Issue Type: New Feature
Reporter: Sami Siren
Priority: Critical
Fix For: nutchgora, 1.5

Erik Hatcher proposed we should provide a full solr config dir to be used with Nutch-Solr. Now we only provide the index schema. It would be considerably easier to set up nutch-solr if we provided the whole conf dir that you could use with solr like:
java -Dsolr.solr.home=Nutch's Solr Home -jar start.jar
[jira] [Commented] (NUTCH-1136) Ant pmd target is broken
[ https://issues.apache.org/jira/browse/NUTCH-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119365#comment-13119365 ]

Lewis John McGibbney commented on NUTCH-1136:

Would like to commit before RC for 1.4 if possible.

Ant pmd target is broken
Key: NUTCH-1136
URL: https://issues.apache.org/jira/browse/NUTCH-1136
Project: Nutch
Issue Type: Bug
Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.4, nutchgora
Attachments: NUTCH-1136-nutchgora-20110930.patch, NUTCH-1136-trunk-1.4-20110930.patch

Issuing an 'ant pmd' command results in a failure as follows
{code}
BUILD FAILED
/home/lewis/ASF/trunk/build.xml:327: taskdef class net.sourceforge.pmd.ant.PMDTask cannot be found using the classloader AntClassLoader[]
{code}
The resulting fix should address this.
[jira] [Commented] (NUTCH-1109) Add Sonar targets to Ant build.xml
[ https://issues.apache.org/jira/browse/NUTCH-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119367#comment-13119367 ]

Lewis John McGibbney commented on NUTCH-1109:

Would like to commit before RC for 1.4 if possible.

Add Sonar targets to Ant build.xml
Key: NUTCH-1109
URL: https://issues.apache.org/jira/browse/NUTCH-1109
Project: Nutch
Issue Type: Improvement
Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
Labels: build
Fix For: 1.4, nutchgora
Attachments: NUTCH-1109-branch-1.4-20110910.patch, NUTCH-1109-trunk-1.4-20110927.patch, sonar-ant-task-1.1.jar

Sonar [1] is an open platform to manage code quality. I was experimenting today with what kind of analysis it allows us to do on a given codebase and was pleasantly surprised with the results. For details on the documentation please see here [2]. It can be easily integrated into our ant build.xml and is an easy way to explicitly identify latent areas of code which we could possibly improve upon.

At this stage I wish to highlight some of the statistics from my findings. Running Sonar via the attached patch identifies (based upon the analysis rules from Sonar) that the Branch-1.4 codebase contains issues as follows
{code}
Critical 28
Major 1,231
Minor 356
Info 119
{code}
These range from a catch statement in o.a.n.crawl.Generator which shouldn't be catching Throwable, since that includes errors, through to trivial issues such as nested statements in the same class which could be combined.

Although on the face of it this seems an excellent way to make code more consistent across the board, which may in turn lead to 'better' code, I am in no way saying that this is a step we should move towards without thinking it through and discussing it at length. I also think that a good deal of our own judgement is needed to decide whether any issues flagged up by Sonar should be marked as false positives.

To conclude, I would like to add that I only decided to open this issue in an attempt to gauge people's views on the direction it takes us in.

[1] http://www.sonarsource.org/
[2] http://docs.codehaus.org/display/SONAR/Documentation
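To illustrate the kind of finding mentioned above (a catch block that catches Throwable and therefore also swallows errors), here is a minimal, self-contained sketch. It is not the code from o.a.n.crawl.Generator: the class `CatchScope` and the method `runGuarded` are hypothetical names used only to show the difference between catching `Exception` and catching `Throwable`.

```java
// Hypothetical illustration; not taken from the Nutch codebase.
final class CatchScope {

    /**
     * Runs the task and reports whether it completed.
     * Only Exception is caught, so java.lang.Error subclasses
     * (OutOfMemoryError, StackOverflowError, ...) still propagate.
     */
    static boolean runGuarded(Runnable task) {
        try {
            task.run();
            return true;
        } catch (Exception e) { // deliberately not Throwable
            System.err.println("task failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // An ordinary runtime exception is handled and reported as a failure.
        System.out.println(runGuarded(() -> {
            throw new IllegalStateException("bad record");
        }));

        // A simulated Error is not swallowed; the caller gets to see it.
        try {
            runGuarded(() -> {
                throw new OutOfMemoryError("simulated");
            });
        } catch (OutOfMemoryError e) {
            System.out.println("error propagated: " + e.getMessage());
        }
    }
}
```

Had `runGuarded` caught `Throwable` instead, the simulated `OutOfMemoryError` would have been logged and turned into a `false` return value, hiding a condition the JVM generally cannot recover from. Whether a given Sonar finding of this kind is a real problem or a false positive still needs the case-by-case judgement described in the comment.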
Re: Providing a list of FAQ's with every new subscribe request
On Mon, Oct 3, 2011 at 3:48 PM, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote:

Would it be possible to send out a list of our official FAQ's when a new user confirms their subscription to both user@ and dev@ lists.

It seems this is possible. Can you craft a piece of text you would like to be sent out on successful subscribe and I'll try to set it up.

This is the full list of files ezmlm lists as editable, just in case someone comes up with something else to customize:

File               Use
bottom             bottom of all responses. General command info.
digest             'administrivia' section of digests.
faq                frequently asked questions specific to this list.
get_bad            in place of messages not found in the archive.
help               general help (between 'top' and 'bottom').
info               list info. First line should be meaningful on its own.
mod_help           specific help for list moderators.
mod_reject         to sender of rejected post.
mod_request        to message moderators together with post.
mod_sub            to subscriber after moderator confirmed subscribe.
mod_sub_confirm    to subscription mod to request subscribe confirm.
mod_timeout        to sender of timed-out post.
mod_unsub_confirm  to remote admin to request unsubscribe confirm.
sub_bad            to subscriber if confirm was bad.
sub_confirm        to subscriber to request subscribe confirm.
sub_nop            to subscriber after re-subscription.
sub_ok             to subscriber after successful subscription.
top                top of all responses.
unsub_bad          to subscriber if unsubscribe confirm was bad.
unsub_confirm      to subscriber to request unsubscribe confirm.
unsub_nop          to non-subscriber after unsubscribe.
unsub_ok           to ex-subscriber after successful unsubscribe.

--
Sami Siren
Build failed in Jenkins: Nutch-trunk #1623
See https://builds.apache.org/job/Nutch-trunk/1623/changes Changes: [markus] NUTCH-1058 Upgrade Solr schema to version 1.4 [markus] NUTCH-1137 LinkDB other options ignored with -dir -- [...truncated 937 lines...] A src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/sv.test A src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/nl.test A src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/it.test AU src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/test-referencial.txt A src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/fi.test AU src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/el.test A src/plugin/language-identifier/src/java A src/plugin/language-identifier/src/java/org A src/plugin/language-identifier/src/java/org/apache A src/plugin/language-identifier/src/java/org/apache/nutch A src/plugin/language-identifier/src/java/org/apache/nutch/analysis A src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/langmappings.properties AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/package.html AUsrc/plugin/language-identifier/plugin.xml AUsrc/plugin/language-identifier/build.xml A src/plugin/feed A src/plugin/feed/sample A src/plugin/feed/sample/rsstest.rss A src/plugin/feed/ivy.xml A src/plugin/feed/src A src/plugin/feed/src/test A src/plugin/feed/src/test/org A src/plugin/feed/src/test/org/apache A src/plugin/feed/src/test/org/apache/nutch A src/plugin/feed/src/test/org/apache/nutch/parse A src/plugin/feed/src/test/org/apache/nutch/parse/feed AU src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java A 
src/plugin/feed/src/java A src/plugin/feed/src/java/org A src/plugin/feed/src/java/org/apache A src/plugin/feed/src/java/org/apache/nutch A src/plugin/feed/src/java/org/apache/nutch/parse A src/plugin/feed/src/java/org/apache/nutch/parse/feed AUsrc/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java A src/plugin/feed/src/java/org/apache/nutch/indexer A src/plugin/feed/src/java/org/apache/nutch/indexer/feed AU src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java AUsrc/plugin/feed/plugin.xml AUsrc/plugin/feed/build.xml A src/plugin/subcollection A src/plugin/subcollection/ivy.xml A src/plugin/subcollection/src A src/plugin/subcollection/src/test A src/plugin/subcollection/src/test/org A src/plugin/subcollection/src/test/org/apache A src/plugin/subcollection/src/test/org/apache/nutch A src/plugin/subcollection/src/test/org/apache/nutch/collection AU src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java A src/plugin/subcollection/src/java A src/plugin/subcollection/src/java/org A src/plugin/subcollection/src/java/org/apache A src/plugin/subcollection/src/java/org/apache/nutch A src/plugin/subcollection/src/java/org/apache/nutch/collection AU src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java AU src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java AU src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html A src/plugin/subcollection/src/java/org/apache/nutch/indexer A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection AU src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java AUsrc/plugin/subcollection/README.txt AUsrc/plugin/subcollection/plugin.xml AUsrc/plugin/subcollection/build.xml A src/plugin/index-more A src/plugin/index-more/ivy.xml A src/plugin/index-more/src A src/plugin/index-more/src/test A src/plugin/index-more/src/test/org A 
src/plugin/index-more/src/test/org/apache A src/plugin/index-more/src/test/org/apache/nutch A src/plugin/index-more/src/test/org/apache/nutch/indexer A src/plugin/index-more/src/test/org/apache/nutch/indexer/more AU src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java A src/plugin/index-more/src/java A
[jira] [Commented] (NUTCH-1058) Upgrade Solr schema to version 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119888#comment-13119888 ]

Hudson commented on NUTCH-1058:

Integrated in Nutch-trunk #1623 (See [https://builds.apache.org/job/Nutch-trunk/1623/])
NUTCH-1058 Upgrade Solr schema to version 1.4

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178409
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema.xml

Upgrade Solr schema to version 1.4
Key: NUTCH-1058
URL: https://issues.apache.org/jira/browse/NUTCH-1058
Project: Nutch
Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.4, nutchgora

The version of our Solr schema should be updated from 1.3 to the current version. I propose to commit the change prior to 1.4 and 2.0 RC's, the Solr schema version may have incremented more than once at the time of an RC.
[jira] [Commented] (NUTCH-1137) LinkDb / invertlinks: command line arguments ignored
[ https://issues.apache.org/jira/browse/NUTCH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119889#comment-13119889 ]

Hudson commented on NUTCH-1137:

Integrated in Nutch-trunk #1623 (See [https://builds.apache.org/job/Nutch-trunk/1623/])
NUTCH-1137 LinkDB other options ignored with -dir

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178376
Files :
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java

LinkDb / invertlinks: command line arguments ignored
Key: NUTCH-1137
URL: https://issues.apache.org/jira/browse/NUTCH-1137
Project: Nutch
Issue Type: Bug
Components: linkdb
Affects Versions: 1.3
Reporter: Sebastian Nagel
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.4
Attachments: NUTCH-1137-1.5.patch

If the tool invertlinks is called with option -dir segmentsDir all remaining arguments are ignored:
{noformat}
% $NUTCH_HOME/bin/nutch invertlinks linkdb -dir segments -noNormalize -noFilter
LinkDb: starting at 2011-09-28 23:24:07
LinkDb: linkdb: linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
{noformat}
(URLs are normalized and filtered despite -noNormalize/-noFilter)

The patch also restricts the ordering of arguments according to the help text:
Usage: LinkDb linkdb (-dir segmentsDir | seg1 seg2 ...) [-force] [-noNormalize] [-noFilter]
(segments must be given before the optional flags)
[jira] [Commented] (NUTCH-1058) Upgrade Solr schema to version 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119890#comment-13119890 ]

Hudson commented on NUTCH-1058:

Integrated in Nutch-nutchgora #25 (See [https://builds.apache.org/job/Nutch-nutchgora/25/])
NUTCH-1058 Upgrade Solr schema to version 1.4

markus : http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=revroot=revision=1178410
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/schema.xml

Upgrade Solr schema to version 1.4
Key: NUTCH-1058
URL: https://issues.apache.org/jira/browse/NUTCH-1058
Project: Nutch
Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
Fix For: 1.4, nutchgora

The version of our Solr schema should be updated from 1.3 to the current version. I propose to commit the change prior to 1.4 and 2.0 RC's, the Solr schema version may have incremented more than once at the time of an RC.
Build failed in Jenkins: Nutch-nutchgora #25
See https://builds.apache.org/job/Nutch-nutchgora/25/changes Changes: [markus] NUTCH-1058 Upgrade Solr schema to version 1.4 -- [...truncated 2491 lines...] [ivy:resolve] :: loading settings :: file = /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlfilter-suffix [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/classes [javac] Note: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. jar: [jar] Building jar: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar deps-test: deploy: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix copy-generated-lib: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix init: [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml compile: 
[echo] Compiling plugin: urlfilter-validator [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes jar: [jar] Building jar: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar deps-test: deploy: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator init: [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic [javac] /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes jar: [jar] Building jar: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar 
deps-test: deploy: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic copy-generated-lib: [copy] Copying 1 file to /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic init: [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/classes [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/test [mkdir] Created dir: /x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-pass init-plugin: