[jira] Updated: (NUTCH-245) DTD Schemas for plugin.xml configuration files in conf directory
[ http://issues.apache.org/jira/browse/NUTCH-245?page=all ] Chris A. Mattmann updated NUTCH-245: Attachment: NUTCH-245.Mattmann.patch.txt

Here's the patch for the plugin DTD file. I got a lot of info from http://help.eclipse.org/help31/index.jsp?topic=/org.eclipse.platform.doc.isv/reference/misc/plugin_manifest.html, i.e., the eclipse manifest file. It turns out, though, from examining the plugin manifest parser code, that many of the elements eclipse uses are not currently used in Nutch. Additionally, I noticed that the "implementation" element can carry essentially any attribute name/value pair besides "id" and "class". I wasn't sure how to represent this in DTD terminology other than going through all the plugin.xml files for the nutch plugins and declaring an #IMPLIED attribute on "implementation" for each optional attribute used by the different extension point implementations. Maybe there's a more elegant way, but for now this works (I ran all the plugin.xml files through my XML validator and they check out). Okay, thanks!

> DTD Schemas for plugin.xml configuration files in conf directory
>
> Key: NUTCH-245
> URL: http://issues.apache.org/jira/browse/NUTCH-245
> Project: Nutch
> Type: New Feature
> Components: fetcher, indexer, ndfs, searcher, web gui
> Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev
> Environment: Power PC Dual Processor 2.0 GHz, Mac OS X 10.4, although improvement is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Attachments: NUTCH-245.Mattmann.patch.txt
>
> Currently, the plugin.xml file does not have a DTD or XML Schema associated with it, and most people just look at an existing plugin's plugin.xml file to determine the allowable elements, etc. There should be an explicit plugin DTD file that describes the plugin.xml file. I'll look at the code and attach a plugin.dtd file for the Nutch conf directory later today. This way, people can use the DTD file to automatically (using tools such as XMLSpy) generate plugin.xml files that can then be validated.

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
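As a rough illustration of the approach Chris describes (the attribute names below are hypothetical examples, not taken from the committed plugin.dtd): every attribute on <implementation> beyond "id" and "class" is declared #IMPLIED, so extension-point-specific attributes validate as optional.

```dtd
<!-- Hypothetical excerpt in the spirit of NUTCH-245, not the actual file.
     "id" and "class" are mandatory; any other attribute an extension point
     implementation uses is declared optional via #IMPLIED. -->
<!ELEMENT implementation (#PCDATA)>
<!ATTLIST implementation
  id          CDATA #REQUIRED
  class       CDATA #REQUIRED
  contentType CDATA #IMPLIED
  pathSuffix  CDATA #IMPLIED
>
```

The cost of this scheme is that the DTD must be regenerated whenever a plugin introduces a new optional attribute, which is presumably why Chris calls it workable rather than elegant.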
[jira] Updated: (NUTCH-245) DTD Schemas for plugin.xml configuration files in conf directory
[ http://issues.apache.org/jira/browse/NUTCH-245?page=all ] Chris A. Mattmann updated NUTCH-245: Description: Currently, the plugin.xml file does not have a DTD or XML Schema associated with it, and most people just look at an existing plugin's plugin.xml file to determine the allowable elements, etc. There should be an explicit plugin DTD file that describes the plugin.xml file. I'll look at the code and attach a plugin.dtd file for the Nutch conf directory later today. This way, people can use the DTD file to automatically (using tools such as XMLSpy) generate plugin.xml files that can then be validated. (was: Currently, the plugin.xml file does not have a DTD or XML Schema associated with it, and most people just look at an existing plugin's plugin.xml file to determine the allowable elements, etc. There should be an explicit plugin DTD file that describes the plugin.xml file. I'll look at the code and attach a plugin.dtd file for the Nutch conf directory later today. This way, people can use the DTD file to automatically (using tools such as XMLSpy) generate plugin.xml files that can then be validated. I'm also going to post another issue regarding an addition to the ant target that builds the Nutch website. The addition would copy the existing DTD files in $NUTCH_HOME/conf to the Nutch website ROOT. That way, we could then reference the DTD file in all the XML instance files with something like http://lucene.apache.org/nutch/dtd/parse-plugins.dtd within the parse-plugins.xml, or similarly for the nutch-site.xml or mime-types.xml file.)
Updated the issue to just be a single issue; I may post the one about copying the DTDs to the website at a later point.

> DTD Schemas for plugin.xml configuration files in conf directory
>
> Key: NUTCH-245
> URL: http://issues.apache.org/jira/browse/NUTCH-245
[jira] Created: (NUTCH-247) robot parser to restrict.
robot parser to restrict.

Key: NUTCH-247
URL: http://issues.apache.org/jira/browse/NUTCH-247
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev

If the agent name and the robots agents are not properly configured, the robot rules parser uses LOG.severe to log the problem, but it also fixes it. Later on, the fetcher thread checks for severe errors and stops if there is one.

RobotRulesParser:

    if (agents.size() == 0) {
        agents.add(agentName);
        LOG.severe("No agents listed in 'http.robots.agents' property!");
    } else if (!((String) agents.get(0)).equalsIgnoreCase(agentName)) {
        agents.add(0, agentName);
        LOG.severe("Agent we advertise (" + agentName
            + ") not listed first in 'http.robots.agents' property!");
    }

Fetcher.FetcherThread:

    if (LogFormatter.hasLoggedSevere()) // something bad happened
        break;

I suggest using warn or something similar instead of severe to log this problem.
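Stefan's suggestion amounts to downgrading the log level, so that the self-healing path no longer trips the fetcher's severe-error check. A minimal sketch of the proposed change using java.util.logging (class and method names here are illustrative, not the actual Nutch patch):

```java
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch of the proposed fix: since the parser already repairs
// the agent-list misconfiguration itself, log at WARNING instead of SEVERE
// so Fetcher.FetcherThread's hasLoggedSevere() check does not abort fetching.
public class RobotAgentsCheck {
    static final Logger LOG = Logger.getLogger("RobotRulesParser");

    static List<String> normalizeAgents(List<String> agents, String agentName) {
        if (agents.isEmpty()) {
            // Repair: advertise the configured agent name anyway.
            agents.add(agentName);
            LOG.warning("No agents listed in 'http.robots.agents' property!");
        } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
            // Repair: move the advertised agent to the front of the list.
            agents.add(0, agentName);
            LOG.warning("Agent we advertise (" + agentName
                + ") not listed first in 'http.robots.agents' property!");
        }
        return agents;
    }
}
```

Either way the list ends up in a usable state; the only question is whether the fetcher should die over it.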
RE: Swap with Nutch
You can go even further and load all of the index into RAM using a RAM disk. How big an index are you talking about?

-Ledio

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 11, 2006 3:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Swap with Nutch

larryp wrote:
> Hi, I'm trying to get Nutch to load its index into swap, as I believe it will give better performance than having it as a file on the hard drive, since it will be mapped as virtual memory. Has anyone ever attempted this? Any suggestion as to how one might force the index into swap?
>
> Thanks in advance
> larry
> --
> View this message in context: http://www.nabble.com/Swap-with-Nutch-t1434922.html#a3871982
> Sent from the Nutch - Dev forum at Nabble.com.

The FSDirectory in Lucene uses org.apache.lucene.store.MMapDirectory underneath, which already uses memory mapping (essentially the same as virtual memory).

Dennis
Re: Swap with Nutch
larryp wrote:
> Hi, I'm trying to get Nutch to load its index into swap, as I believe it will give better performance than having it as a file on the hard drive, since it will be mapped as virtual memory. Has anyone ever attempted this? Any suggestion as to how one might force the index into swap?
>
> Thanks in advance
> larry
> --
> View this message in context: http://www.nabble.com/Swap-with-Nutch-t1434922.html#a3871982
> Sent from the Nutch - Dev forum at Nabble.com.

The FSDirectory in Lucene uses org.apache.lucene.store.MMapDirectory underneath, which already uses memory mapping (essentially the same as virtual memory).

Dennis
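As Dennis notes, memory mapping already lets the OS keep hot index pages in RAM, which is what the swap idea was after. A self-contained sketch of that mechanism using plain java.nio (not Lucene or Nutch code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: memory-map a file the way a memory-mapped Directory
// implementation does. Reads become plain memory accesses backed by the
// page cache, so the OS decides what stays resident in RAM.
public class MmapSketch {
    static int firstByte(Path p) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0); // no read() syscall; the kernel pages it in
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("index-segment", ".bin");
        Files.write(p, new byte[] {42, 1, 2, 3});
        System.out.println(firstByte(p)); // prints 42
        Files.delete(p);
    }
}
```

Forcing such pages into swap by hand would only move them to slower storage; the page cache already keeps the frequently touched ones in memory.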
Swap with Nutch
Hi, I'm trying to get Nutch to load its index into swap, as I believe it will give better performance than having it as a file on the hard drive, since it will be mapped as virtual memory. Has anyone ever attempted this? Any suggestion as to how one might force the index into swap?

Thanks in advance
larry
--
View this message in context: http://www.nabble.com/Swap-with-Nutch-t1434922.html#a3871982
Sent from the Nutch - Dev forum at Nabble.com.
Re: Microformats Support - HReview
Thanks. I'll go through your rel-tag plugin in version 0.8 and use it as a basis for adding my hreview code. -- View this message in context: http://www.nabble.com/Microformats-Support---HReview-t1433896.html#a3869485 Sent from the Nutch - Dev forum at Nabble.com.
Re: Microformats Support - HReview
> I have noticed that there are the beginnings of microformats support (rel-tag) in nutch version 0.8.

Hi Mike,

I created this plugin to play around a little with microformats. It can serve as a kind of "tutorial" for people who want to add support for further microformats.

> Is anyone still working on adding other microformats (hreview, hcard)?

I don't remember anybody speaking about this on the lists.

> If so, I would be interested in helping and/or collaborating. I already created a simple hreview parser using nutch version 0.7.

You could, for instance, adapt it for nutch 0.8 and then attach the patch to a JIRA issue. (I would be interested in committing it to nutch.)

Regards
--
http://motrech.free.fr/
http://www.frutch.org/
Microformats Support - HReview
I have noticed that there are the beginnings of microformats support (rel-tag) in nutch version 0.8. Is anyone still working on adding other microformats (hreview, hcard)? If so, I would be interested in helping and/or collaborating. I already created a simple hreview parser using nutch version 0.7. -Mike -- View this message in context: http://www.nabble.com/Microformats-Support---HReview-t1433896.html#a3868806 Sent from the Nutch - Dev forum at Nabble.com.
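For readers unfamiliar with the format: microformats such as hReview are ordinary HTML with conventional class attribute values, so a first-pass detector is small. A rough sketch (hypothetical code, not Mike's parser or the rel-tag plugin); a real implementation would walk the parsed DOM rather than run regexes over raw HTML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hReview sniffer: collect class attributes that contain the
// "hreview" token, the root marker of the hReview microformat.
public class HReviewSniffer {
    private static final Pattern HREVIEW =
        Pattern.compile("class=\"([^\"]*\\bhreview\\b[^\"]*)\"");

    static List<String> findHReviewClasses(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = HREVIEW.matcher(html);
        while (m.find()) hits.add(m.group(1));
        return hits;
    }
}
```

Once a root element is found, the per-field work (item, rating, reviewer, and so on) is the same pattern applied to the conventional child class names.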
Re: PMD integration
> > Piotr, please keep oro-2.0.8 in pmd-ext
> I do not agree here - we are going to make a new release next week and releasing with two versions of oro does not look nice. oro is quite a stable product and the changes are in fact minimal: http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES

OK for me. But we cannot make a release without minimal tests. (I will do some tests on removing oro from nutch's regex after the 0.8 release.)

Jérôme
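For context, the oro removal Jérôme mentions is largely mechanical: jakarta-oro's compile-and-match calls map directly onto java.util.regex. A hedged sketch of the JDK side (the method below is illustrative, not Nutch's actual filter code):

```java
import java.util.regex.Pattern;

// Illustrative only: what dropping jakarta-oro for java.util.regex looks
// like. oro's Perl5Compiler.compile(...) corresponds to Pattern.compile(...),
// and Perl5Matcher's contains/matches correspond to Matcher.find()/matches().
public class OroToJdkRegex {
    static boolean urlMatches(String url, String regex) {
        // find() mirrors oro's "contains" semantics: the pattern may match
        // anywhere in the input, which is what a URL filter typically wants.
        return Pattern.compile(regex).matcher(url).find();
    }
}
```

The tests Jérôme asks for matter because oro and java.util.regex differ in a few corner cases (e.g. anchoring semantics), so a behavior check on the existing filter rules is the safe path.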
[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement
[ http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ] Chris Schneider commented on NUTCH-246: ---

A few more details: Stefan and I were able to reproduce this problem using either an injection set of 4500 URLs or a larger set of DMOZ URLs. With the 4500 URL injection, only 653 URLs were generated for the first segment, despite the fact that topN was set to 500K. I confirmed that nearly all of the 4500 injected URLs passed our URL filter and were actually injected into the crawldb.

To eliminate the possibility that the bug had been fixed recently, or was due to a code modification we'd made ourselves, we deployed yesterday's sandbox version of nutch (2006-04-10), including hadoop-0.1.1.jar (though I believe Stefan had to build it himself because the nutch-0.8-dev.jar didn't match the source). We made the absolute minimum changes to nutch-site.xml, hadoop-site.xml, and hadoop-env.sh in order to deploy this version properly in our cluster (1 jobtracker/namenode machine, 10 tasktracker/datanode machines). However, we got the same results (i.e., very few URLs actually generated).

This bug has apparently been present since at least change 382948, but I suspect it may have been present for the entire history of the mapreduce implementation of Nutch. It may also be the root cause of NUTCH-136, the explanation for which has always left me somewhat dissatisfied. Just because a nutch-site.xml containing default properties may override the desired mapred properties (incorrectly) specified in one of the *-default.xml files, and may therefore set mapred.map.tasks and mapred.reduce.tasks back to the defaults (2 and 1, respectively), it's not clear to me exactly how/why you'd get only a fraction of topN URLs fetched. As Stefan has suggested, it would actually seem more plausible if each tasktracker tried to fetch the entire set of URLs in this case.
I would suggest that someone with a good understanding of the hadoop implementation investigate the first generation job in fine detail, both for the case where the mapred properties are specified in an appropriate manner and for the case where nutch-site.xml overrides the desired properties, setting them back to the defaults.

> segment size is never as big as topN or crawlDB size in a distributed deployement
>
> Key: NUTCH-246
> URL: http://issues.apache.org/jira/browse/NUTCH-246
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.8-dev
>
> I didn't reopen NUTCH-136 since it may be related to the hadoop split. I tested this on two different deployments (10 tasktrackers + 1 jobtracker, and 9 tasktrackers + 1 jobtracker). Defining map and reduce task numbers in a mapred-default.xml (in nutch/conf on all boxes) does not solve the problem. We verified that it is not a problem of maximum urls per host and also not a problem of the url filter. It looks like the first job of the Generator (Selector) already gets too few entries to process. Maybe this is somehow related to split generation or configuration inside the distributed jobtracker, since it runs in a different jvm than the jobclient. However, we were not able to find the source of this problem. I think it should be fixed before publishing a nutch 0.8.
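For reference, the invariant the bug violates: generation should emit min(topN, number of eligible crawldb entries) URLs, keeping the highest-scoring ones. A hypothetical stand-alone sketch of that selection (not Nutch's Selector, which runs as a mapreduce job over the crawldb):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration: selecting a fetch list should yield
// min(topN, urls.size()) entries with the highest scores. The bug report
// says far fewer than topN come out even when the crawldb holds plenty.
public class TopNSelect {
    static List<String> selectTopN(List<String> urls, List<Float> scores, int topN) {
        // Min-heap on score: the root is the weakest current candidate,
        // so once the heap exceeds topN we evict it.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.<Integer>comparingDouble(scores::get));
        for (int i = 0; i < urls.size(); i++) {
            heap.add(i);
            if (heap.size() > topN) heap.poll();
        }
        List<String> out = new ArrayList<>();
        for (int i : heap) out.add(urls.get(i));
        return out;
    }
}
```

In a distributed run this per-partition selection has to be combined across map tasks, which is exactly where a split or task-number misconfiguration could silently drop candidates.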
[jira] Created: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement
segment size is never as big as topN or crawlDB size in a distributed deployement

Key: NUTCH-246
URL: http://issues.apache.org/jira/browse/NUTCH-246
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev

I didn't reopen NUTCH-136 since it may be related to the hadoop split. I tested this on two different deployments (10 tasktrackers + 1 jobtracker, and 9 tasktrackers + 1 jobtracker). Defining map and reduce task numbers in a mapred-default.xml (in nutch/conf on all boxes) does not solve the problem. We verified that it is not a problem of maximum urls per host and also not a problem of the url filter. It looks like the first job of the Generator (Selector) already gets too few entries to process. Maybe this is somehow related to split generation or configuration inside the distributed jobtracker, since it runs in a different jvm than the jobclient. However, we were not able to find the source of this problem. I think it should be fixed before publishing a nutch 0.8.
Re: nightly build broken?
I didn't even think about that. Trying it out now :)

thanks,
-byron

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Byron,
>
> This sounds like the url filter problem. Please try to remove the "-.*(/.+?)/.*?\1/.*?\1/" from regex-urlfilter.txt, just as a test, and tell us if that solves the problem.
> Thanks.
> Stefan
>
> Am 11.04.2006 um 14:43 schrieb Byron Miller:
>> i get nightly to run, but it never completes anything. always gets stuck at 98% here and there.. i'll try today's build and see what happens.
>>
>> --- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> looks like the latest nightly build is broken. Looks like the jar that comes with the nightly build contains some patches that are not yet in the svn sources. Is someone able to get the latest nutch nightly to run?
>>>
>>> Thanks.
>>> Stefan
>
> ---
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
Re: nightly build broken?
Hi Byron,

This sounds like the url filter problem. Please try to remove the "-.*(/.+?)/.*?\1/.*?\1/" from regex-urlfilter.txt, just as a test, and tell us if that solves the problem.
Thanks.
Stefan

Am 11.04.2006 um 14:43 schrieb Byron Miller:
> i get nightly to run, but it never completes anything. always gets stuck at 98% here and there.. i'll try today's build and see what happens.
>
> --- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> looks like the latest nightly build is broken. Looks like the jar that comes with the nightly build contains some patches that are not yet in the svn sources. Is someone able to get the latest nutch nightly to run?
>>
>> Thanks.
>> Stefan

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
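The line Stefan suggests removing is the crawler-trap detector in regex-urlfilter.txt: a "-" prefix rejects any URL the pattern is found in, and the backreferences require the same path segment to recur several times. A stand-alone check of the pattern's behavior (the pattern is copied from the message; the test URLs are made up):

```java
import java.util.regex.Pattern;

// The filter rule is "-PATTERN": deny any URL in which the pattern is found.
// Backreference \1 forces the same "/segment" to appear repeatedly, which
// catches crawler-trap loops like /a/a/a/a/a/ but can be expensive due to
// backtracking, and a bug in it can silently filter out legitimate URLs.
public class LoopFilterCheck {
    private static final Pattern LOOP =
        Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    static boolean rejected(String url) {
        return LOOP.matcher(url).find();
    }
}
```

Disabling the rule, as Stefan proposes, is a quick way to test whether it is the reason the crawl stalls; if it is, the fix is to tighten the pattern rather than drop loop detection entirely.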
Re: nightly build broken?
i get nightly to run, but it never completes anything. always gets stuck at 98% here and there.. i'll try today's build and see what happens.

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi,
>
> looks like the latest nightly build is broken. Looks like the jar that comes with the nightly build contains some patches that are not yet in the svn sources. Is someone able to get the latest nutch nightly to run?
>
> Thanks.
> Stefan