[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project
[ http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365051 ]

John Xing commented on NUTCH-193:
---------------------------------

What's in the name Hadoop? Because "had oops"?

> move NDFS and MapReduce to a separate project
> ---------------------------------------------
>
>          Key: NUTCH-193
>          URL: http://issues.apache.org/jira/browse/NUTCH-193
>      Project: Nutch
>         Type: Task
>   Components: ndfs
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>     Assignee: Doug Cutting
>      Fix For: 0.8-dev
>
> The NDFS and MapReduce code should move from Nutch to a new Lucene
> sub-project named Hadoop.
>
> My plan is to do this as follows:
>
> 1. Move all code in the following packages from Nutch to Hadoop:
>
>      org.apache.nutch.fs
>      org.apache.nutch.io
>      org.apache.nutch.ipc
>      org.apache.nutch.mapred
>      org.apache.nutch.ndfs
>
>    These packages will all be renamed to org.apache.hadoop, and Nutch code
>    will be updated to reflect this.
>
> 2. Move selected classes from Nutch to Hadoop, as follows:
>
>      org.apache.nutch.util.NutchConf         -> org.apache.hadoop.conf.Configuration
>      org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable
>      org.apache.nutch.util.NutchConfigured   -> org.apache.hadoop.Configured
>      org.apache.nutch.util.Progress          -> org.apache.hadoop.util.Progress
>      org.apache.nutch.util.LogFormatter      -> org.apache.hadoop.util.LogFormatter
>      org.apache.nutch.util.Daemon            -> org.apache.hadoop.util.Daemon
>
> 3. Add a jar containing all of the above to Nutch's lib directory.
>
> Does this plan sound reasonable?

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
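Mechanically, steps 1 and 2 of the plan amount to a name mapping. A minimal, hypothetical sketch (the `RenameSketch` class and its `rename` helper are invented for illustration; only the package and class names come from the plan above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RenameSketch {
    // Step 2: explicit class-level renames from the plan.
    static final Map<String, String> CLASS_RENAMES = new LinkedHashMap<>();
    // Step 1: whole packages whose org.apache.nutch prefix becomes org.apache.hadoop.
    static final String[] MOVED_PACKAGES = {
        "org.apache.nutch.fs.", "org.apache.nutch.io.", "org.apache.nutch.ipc.",
        "org.apache.nutch.mapred.", "org.apache.nutch.ndfs."
    };
    static {
        CLASS_RENAMES.put("org.apache.nutch.util.NutchConf", "org.apache.hadoop.conf.Configuration");
        CLASS_RENAMES.put("org.apache.nutch.util.NutchConfigurable", "org.apache.hadoop.Configurable");
        CLASS_RENAMES.put("org.apache.nutch.util.NutchConfigured", "org.apache.hadoop.Configured");
        CLASS_RENAMES.put("org.apache.nutch.util.Progress", "org.apache.hadoop.util.Progress");
        CLASS_RENAMES.put("org.apache.nutch.util.LogFormatter", "org.apache.hadoop.util.LogFormatter");
        CLASS_RENAMES.put("org.apache.nutch.util.Daemon", "org.apache.hadoop.util.Daemon");
    }

    /** Map an old fully-qualified class name to its post-move name. */
    static String rename(String className) {
        String explicit = CLASS_RENAMES.get(className);
        if (explicit != null) return explicit;
        for (String pkg : MOVED_PACKAGES) {
            if (className.startsWith(pkg)) {
                return "org.apache.hadoop." + className.substring("org.apache.nutch.".length());
            }
        }
        return className; // everything else stays in Nutch
    }

    public static void main(String[] args) {
        System.out.println(rename("org.apache.nutch.util.NutchConf"));   // org.apache.hadoop.conf.Configuration
        System.out.println(rename("org.apache.nutch.searcher.Query"));   // unchanged
    }
}
```

Code in the five moved packages gets a pure prefix swap, while the six utility classes are relocated individually, which is why the explicit map is consulted first.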
Re: tool to mount nutch filesystem
On Sat, Jan 21, 2006 at 09:23:01AM -0800, John X wrote:
> Hi, Sami,
>
> On Sat, Jan 21, 2006 at 05:32:37PM +0200, Sami Siren wrote:
> > > I have created a simple tool to mount the nutch filesystem on linux.
> > > http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz
> > > Check the README inside for how to set it up.
> > > It is very barebones: browse only, no read/write yet.
> > >
> > > Doug and Mike: any plan to make the ndfs code into a separate package?
> >
> > John,
> >
> > I didn't check out your version yet, but I have also written
> > a version which is read/write capable; should we combine our efforts here?
>
> Sure, why not ;-) Let me know where your stuff is.

The tarball, now called fuse-hadoop-0.1.0.tar.gz, is attached at
https://issues.apache.org/jira/browse/NUTCH-199. It is the result of Sami's
and my combined work, and it does read/write.

John
[jira] Updated: (NUTCH-199) tool to mount ndfs on linux
[ http://issues.apache.org/jira/browse/NUTCH-199?page=all ]

John Xing updated NUTCH-199:
----------------------------

    Attachment: fuse-hadoop-0.1.0.tar.gz

It works with the current nutch-0.8-dev. It will be ported to hadoop after ndfs is moved.

> tool to mount ndfs on linux
> ---------------------------
>
>          Key: NUTCH-199
>          URL: http://issues.apache.org/jira/browse/NUTCH-199
>      Project: Nutch
>         Type: New Feature
>   Components: ndfs
>  Environment: linux only
>     Reporter: John Xing
>     Assignee: John Xing
>  Attachments: fuse-hadoop-0.1.0.tar.gz
>
> tool to mount ndfs on linux. It depends on fuse and fuse-j.
[jira] Created: (NUTCH-199) tool to mount ndfs on linux
tool to mount ndfs on linux
---------------------------

         Key: NUTCH-199
         URL: http://issues.apache.org/jira/browse/NUTCH-199
     Project: Nutch
        Type: New Feature
  Components: ndfs
 Environment: linux only
    Reporter: John Xing
 Assigned to: John Xing

tool to mount ndfs on linux. It depends on fuse and fuse-j.
[jira] Commented: (NUTCH-59) meta data support in webdb
[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365012 ]

James Jonas commented on NUTCH-59:
----------------------------------

Thanks, I have been tracking NUTCH-139 and NUTCH-192 and look forward to these
patches being committed to the 0.8 trunk.

James

> meta data support in webdb
> --------------------------
>
>          Key: NUTCH-59
>          URL: http://issues.apache.org/jira/browse/NUTCH-59
>      Project: Nutch
>         Type: New Feature
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of nutch
> features that need long-lived meta data. Currently, page meta data must be
> regenerated or looked up each time a page is re-fetched (every 30 days);
> keeping long-lived meta data in the web db would bring a dramatic
> performance improvement for such tasks. Furthermore, storing meta data in
> the webdb would make a new generation of linklist generation filters possible.
[jira] Commented: (NUTCH-59) meta data support in webdb
[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365009 ]

Stefan Groschupf commented on NUTCH-59:
---------------------------------------

Please let's move this discussion to the user mailing list, since this is not
a 'real' issue comment. Also please note that meta data support for nutch 0.8
is under development and will hopefully land in the sources soon, so it may be
a better idea to wait for the nutch 0.8 meta data support.

> meta data support in webdb
> --------------------------
>
>          Key: NUTCH-59
>          URL: http://issues.apache.org/jira/browse/NUTCH-59
>      Project: Nutch
>         Type: New Feature
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of nutch
> features that need long-lived meta data. Currently, page meta data must be
> regenerated or looked up each time a page is re-fetched (every 30 days);
> keeping long-lived meta data in the web db would bring a dramatic
> performance improvement for such tasks. Furthermore, storing meta data in
> the webdb would make a new generation of linklist generation filters possible.
[jira] Commented: (NUTCH-59) meta data support in webdb
[ http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365008 ]

James Jonas commented on NUTCH-59:
----------------------------------

I deployed this patch into a Nutch 0.7.1 sandbox and performed a test run. The
'topic' metadata has been captured. Congrats! How do I display this information
inside the 'more' section of my query result page? How do I use this metadata
to filter a standard query?

Thanks,
James

> meta data support in webdb
> --------------------------
>
>          Key: NUTCH-59
>          URL: http://issues.apache.org/jira/browse/NUTCH-59
>      Project: Nutch
>         Type: New Feature
>     Reporter: Stefan Groschupf
>     Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of nutch
> features that need long-lived meta data. Currently, page meta data must be
> regenerated or looked up each time a page is re-fetched (every 30 days);
> keeping long-lived meta data in the web db would bring a dramatic
> performance improvement for such tasks. Furthermore, storing meta data in
> the webdb would make a new generation of linklist generation filters possible.
Some bugs I'm trying to characterize....
I've been putting mapreduce through its paces with some custom jobs lately.
Here are a few issues I've been running into; please help me clarify if I
misunderstand, or offer insight on whether they're worth filing bugs against.

1) If you fill up the space of a datanode, it appears to fail with the wrong
exception and reload. Combined with the currently simple (random)
block-allocation method, this means that one "full" node can cause a big
dropoff in NDFS write performance, as clients end up timing out some
percentage of the time when asked to talk to the "full" node, while the full
node is busy reloading.

2) Running out of disk space also kills the tasktracker. This has an
especially devilish side effect: if you've finished all map tasks and, during
reduce, a tasktracker runs out of space, it disappears, taking all of its
completed maps with it. As far as I can tell, once mapping has completed,
lost map outputs never get rescheduled. In effect, this means that running out
of space on any tasktracker node any time after the map phase has completed
prevents the job from finishing. If a node runs out of space, shouldn't it
stop taking tasks and error out the *current* task, but still be available to
distribute completed map results to the reduce tasks? Likewise, when map
outputs are lost, shouldn't those map tasks be rescheduled on
still-available nodes?

3) It should be possible to improve on the problems of 2, even given severe
space restrictions. Why can't a node be queried for how much space it has
available before accepting a reduce task? It should be straightforward to
figure out exactly how much space is needed to complete the reduce: 2x the sum
of the appropriate partitions (room for the appended inputs, and room for a
sorted output of them). If space is low, you don't allocate a reduce to a
tasktracker that doesn't have enough room. Likewise, once a reduce task has
entirely completed, the space from the original map/partition files it reduced
can be freed. Thus, so long as at least one node in the cluster has room for
at least the smallest reduce task, it should be possible to make progress
rather than fail out.

4) Finally, performance-wise, why don't we mimic the Google MapReduce
technique of starting duplicates of tasks as the overall job nears completion?
I have a few slow machines in my cluster which can usefully complete work on
large runs, but which are, unsurprisingly, extending the average completion
time of my runs somewhat needlessly.

--
Bryan A. Pendleton
Ph: (877) geek-1-bp
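The admission check proposed in point 3 is simple to state precisely. A
minimal sketch, assuming a hypothetical helper (neither the class nor the
method exists in the Nutch code; sizes are bytes of map output per partition):

```java
public class ReduceSpaceCheck {
    // A tracker should take a reduce only if it has room for the appended
    // map outputs plus a sorted copy of them: 2x the sum of the partitions.
    static boolean canAcceptReduce(long[] partitionSizes, long freeBytes) {
        long total = 0;
        for (long size : partitionSizes) {
            total += size;
        }
        return freeBytes >= 2 * total;
    }

    public static void main(String[] args) {
        long[] partitions = {100, 250, 150}; // bytes of map output per partition
        System.out.println(canAcceptReduce(partitions, 1000)); // 2 * 500 fits exactly
        System.out.println(canAcceptReduce(partitions, 999));  // one byte short
    }
}
```

With a check like this on the jobtracker side, a nearly full node is skipped
at scheduling time instead of dying mid-reduce and taking its map outputs
with it.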
[jira] Commented: (NUTCH-198) SWF parser
[ http://issues.apache.org/jira/browse/NUTCH-198?page=comments#action_12364983 ]

Doug Cutting commented on NUTCH-198:
------------------------------------

+1

> SWF parser
> ----------
>
>          Key: NUTCH-198
>          URL: http://issues.apache.org/jira/browse/NUTCH-198
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>  Attachments: parse-swf.zip
>
> This is a parser for Flash SWF files. It uses the JavaSWF2 library (BSD
> license) and some heuristics to extract as much text as possible (including
> potential links) from ActionScript sections.
> If there are no objections, I'd like to add it soon.
Re: Cmd line for running plugins
Andrzej Bialecki wrote:
> It works rather nicely. If other people find it useful, I can add this to
> PluginRepository.

Committed.

--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com  Contact: info at sigram dot com
[jira] Created: (NUTCH-198) SWF parser
SWF parser
----------

         Key: NUTCH-198
         URL: http://issues.apache.org/jira/browse/NUTCH-198
     Project: Nutch
        Type: New Feature
  Components: fetcher
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
 Assigned to: Andrzej Bialecki
 Attachments: parse-swf.zip

This is a parser for Flash SWF files. It uses the JavaSWF2 library (BSD
license) and some heuristics to extract as much text as possible (including
potential links) from ActionScript sections.
If there are no objections, I'd like to add it soon.
[jira] Updated: (NUTCH-198) SWF parser
[ http://issues.apache.org/jira/browse/NUTCH-198?page=all ]

Andrzej Bialecki updated NUTCH-198:
-----------------------------------

    Attachment: parse-swf.zip

> SWF parser
> ----------
>
>          Key: NUTCH-198
>          URL: http://issues.apache.org/jira/browse/NUTCH-198
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>  Attachments: parse-swf.zip
>
> This is a parser for Flash SWF files. It uses the JavaSWF2 library (BSD
> license) and some heuristics to extract as much text as possible (including
> potential links) from ActionScript sections.
> If there are no objections, I'd like to add it soon.
Re: Cmd line for running plugins
+1

On 2/1/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> +1
>
> Am 01.02.2006 um 22:35 schrieb Andrzej Bialecki:
>
> > Hi,
> >
> > I just found out that it's not possible to invoke the main() methods of
> > plugins through the bin/nutch script. Sometimes this is useful for
> > testing and debugging. I can do it from within Eclipse, because I
> > have all plugins on the classpath, but from the command line it's
> > not possible, since in the code plugins are accessed through
> > PluginRepository. So I added this:
> >
> >   public static void main(String[] args) throws Exception {
> >     NutchConf conf = new NutchConf();
> >     PluginRepository repo = new PluginRepository(conf);
> >     // args[0] - plugin ID
> >     PluginDescriptor d = repo.getPluginDescriptor(args[0]);
> >     if (d == null) {
> >       System.err.println("Plugin '" + args[0] + "' not present or inactive.");
> >       return;
> >     }
> >     ClassLoader cl = d.getClassLoader();
> >     // args[1] - class name, loaded through the plugin's own classloader
> >     Class clazz = Class.forName(args[1], true, cl);
> >     Method m = clazz.getMethod("main", new Class[]{args.getClass()});
> >     // remaining args are passed through to the plugin's main()
> >     String[] subargs = new String[args.length - 2];
> >     System.arraycopy(args, 2, subargs, 0, subargs.length);
> >     m.invoke(null, new Object[]{subargs});
> >   }
> >
> > It works rather nicely. If other people find it useful, I can add
> > this to PluginRepository.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > http://www.sigram.com  Contact: info at sigram dot com
>
> ---------------------------------
> company:  http://www.media-style.com
> forum:    http://www.text-mining.org
> blog:     http://www.find23.net

--
http://motrech.free.fr/
http://www.frutch.org/