[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-02 Thread John Xing (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365051 ] 

John Xing commented on NUTCH-193:
-

what's in the name hadoop? Because "had oops"?

> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?
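
As a rough illustration of the "Nutch code will be updated" step, the renames listed in the plan could be applied to source lines mechanically. The sketch below is not part of the proposal: the class NutchToHadoopRename and its rewrite helper are hypothetical, and it assumes that each package in step 1 keeps its last segment under org.apache.hadoop (for example org.apache.nutch.fs becoming org.apache.hadoop.fs), which the plan does not spell out. It only rewrites fully qualified names, so simple names such as NutchConf inside method bodies would still need manual edits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NutchToHadoopRename {
    // Explicit class moves from step 2 of the plan (old name -> new name).
    // Longer names come first so that NutchConf does not clobber the
    // NutchConfigurable / NutchConfigured entries during replacement.
    static final Map<String, String> CLASS_MOVES = new LinkedHashMap<>();
    static {
        CLASS_MOVES.put("org.apache.nutch.util.NutchConfigurable", "org.apache.hadoop.Configurable");
        CLASS_MOVES.put("org.apache.nutch.util.NutchConfigured", "org.apache.hadoop.Configured");
        CLASS_MOVES.put("org.apache.nutch.util.NutchConf", "org.apache.hadoop.conf.Configuration");
        CLASS_MOVES.put("org.apache.nutch.util.Progress", "org.apache.hadoop.util.Progress");
        CLASS_MOVES.put("org.apache.nutch.util.LogFormatter", "org.apache.hadoop.util.LogFormatter");
        CLASS_MOVES.put("org.apache.nutch.util.Daemon", "org.apache.hadoop.util.Daemon");
    }

    // Packages from step 1. ASSUMPTION: each keeps its last segment
    // under org.apache.hadoop; the plan only says "renamed to org.apache.hadoop".
    static final String[] MOVED_PACKAGES = {"fs", "io", "ipc", "mapred", "ndfs"};

    /** Rewrites fully qualified Nutch names in one source line. */
    static String rewrite(String line) {
        String out = line;
        for (Map.Entry<String, String> e : CLASS_MOVES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        for (String pkg : MOVED_PACKAGES) {
            out = out.replace("org.apache.nutch." + pkg + ".", "org.apache.hadoop." + pkg + ".");
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input lines only.
        System.out.println(rewrite("import org.apache.nutch.util.NutchConf;"));
        // prints "import org.apache.hadoop.conf.Configuration;"
        System.out.println(rewrite("import org.apache.nutch.mapred.JobConf;"));
        // prints "import org.apache.hadoop.mapred.JobConf;"
        System.out.println(rewrite("import org.apache.nutch.parse.Parser;"));
        // unchanged: the parse package is not part of the move
    }
}
```

Ordering the class table longest-name-first is the one design point worth noting: a plain string replace of NutchConf would otherwise corrupt the two longer names that share its prefix.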

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: tool to mount nutch filesystem

2006-02-02 Thread John X
On Sat, Jan 21, 2006 at 09:23:01AM -0800, John X wrote:
> Hi, Sami,
> 
> On Sat, Jan 21, 2006 at 05:32:37PM +0200, Sami Siren wrote:
> > >I have created a simple tool to mount nutch filesystem on linux.
> > >http://nutch.neasys.com/ndfs/fuse-nutchfs-0.1.0.tar.gz
> > >Check README inside for how to set up.
> > >It is very barebone, only browse, no read/write yet.
> > >
> > >Doug and Mike: any plan to make ndfs codes into a separate package?
> > 
> > John,
> > 
> > I didn't check out your version yet, but I have also written
> > > a version which is read/write capable, should we combine our efforts here?
> 
> Sure, why not ;-) Let me know where your stuff is.

The tarball, now called fuse-hadoop-0.1.0.tar.gz, is attached at
https://issues.apache.org/jira/browse/NUTCH-199.
It is the combined work of Sami and me, and it supports read/write.

John


[jira] Updated: (NUTCH-199) tool to mount ndfs on linux

2006-02-02 Thread John Xing (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-199?page=all ]

John Xing updated NUTCH-199:


Attachment: fuse-hadoop-0.1.0.tar.gz

It works with current nutch-0.8-dev. Will be ported to hadoop after ndfs is 
moved.

> tool to mount ndfs on linux
> ---
>
>  Key: NUTCH-199
>  URL: http://issues.apache.org/jira/browse/NUTCH-199
>  Project: Nutch
> Type: New Feature
>   Components: ndfs
>  Environment: linux only
> Reporter: John Xing
> Assignee: John Xing
>  Attachments: fuse-hadoop-0.1.0.tar.gz
>
> tool to mount ndfs on linux. It depends on fuse and fuse-j.




[jira] Created: (NUTCH-199) tool to mount ndfs on linux

2006-02-02 Thread John Xing (JIRA)
tool to mount ndfs on linux
---

 Key: NUTCH-199
 URL: http://issues.apache.org/jira/browse/NUTCH-199
 Project: Nutch
Type: New Feature
  Components: ndfs  
 Environment: linux only
Reporter: John Xing
 Assigned to: John Xing 


tool to mount ndfs on linux. It depends on fuse and fuse-j.




[jira] Commented: (NUTCH-59) meta data support in webdb

2006-02-02 Thread James Jonas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365012 ] 

James Jonas commented on NUTCH-59:
--

Thanks,

I have been tracking NUTCH-139 and NUTCH-192 and look forward to these patches 
being committed into the 0.8 trunk.

James

> meta data support in webdb
> --
>
>  Key: NUTCH-59
>  URL: http://issues.apache.org/jira/browse/NUTCH-59
>  Project: Nutch
> Type: New Feature
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of Nutch 
> features that need long-lived meta data. 
> Currently, page meta data needs to be regenerated or looked up every 30 days, 
> when a page is re-fetched; over the long term, web db meta data would bring a 
> dramatic performance improvement for such tasks.
> Furthermore, storing meta data in the webdb would make a new generation of 
> linklist generation filters possible.  




[jira] Commented: (NUTCH-59) meta data support in webdb

2006-02-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365009 ] 

Stefan Groschupf commented on NUTCH-59:
---

Please let's move this discussion to the user mailing list, since this is not a 
'real' issue comment.
Also, please note that meta data support for Nutch 0.8 is under development and 
will hopefully land in the sources soon. So a better idea may be to wait for 
Nutch 0.8 meta data support.

> meta data support in webdb
> --
>
>  Key: NUTCH-59
>  URL: http://issues.apache.org/jira/browse/NUTCH-59
>  Project: Nutch
> Type: New Feature
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of Nutch 
> features that need long-lived meta data. 
> Currently, page meta data needs to be regenerated or looked up every 30 days, 
> when a page is re-fetched; over the long term, web db meta data would bring a 
> dramatic performance improvement for such tasks.
> Furthermore, storing meta data in the webdb would make a new generation of 
> linklist generation filters possible.  




[jira] Commented: (NUTCH-59) meta data support in webdb

2006-02-02 Thread James Jonas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12365008 ] 

James Jonas commented on NUTCH-59:
--

I deployed this patch into a Nutch 0.7.1 sandbox and performed a test run. The 
'topic' metadata has been captured. Congrats!

How do I display this information inside the 'more' section of my query result 
page?
How do I use this metadata to filter a standard query?

Thanks,
James

> meta data support in webdb
> --
>
>  Key: NUTCH-59
>  URL: http://issues.apache.org/jira/browse/NUTCH-59
>  Project: Nutch
> Type: New Feature
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of Nutch 
> features that need long-lived meta data. 
> Currently, page meta data needs to be regenerated or looked up every 30 days, 
> when a page is re-fetched; over the long term, web db meta data would bring a 
> dramatic performance improvement for such tasks.
> Furthermore, storing meta data in the webdb would make a new generation of 
> linklist generation filters possible.  




Some bugs I'm trying to characterize....

2006-02-02 Thread Bryan A. Pendleton
I've been putting mapreduce through its paces for some custom jobs lately.
Here're a few issues I've been running into, please help me clarify if I
misunderstand, or offer insight if they're worth filing bugs against.

1) If you fill up the space of a datanode, it appears to fail with the wrong
exception and reload. This, combined with the currently simple
block-allocation method (random), means that one "full" node can cause a big
dropoff in NDFS write performance, as clients end up timing out some percent
of the time when asked to talk to the "full" node, while the full node is
busy reloading.

2) Running out of disk space also kills the task tracker. This has an
especially bedeviling side effect - if you've finished all map tasks, and,
during reduce, a task tracker runs out of space, it disappears, taking all
of its completed maps with it. As far as I can tell, once mapping has
completed once, lost map jobs never get rescheduled. This, in effect, means
that running out of space on any task tracker node any time after map has
completed prevents the job from finishing. If you run out of space, maybe
you should stop taking tasks, and error out the *current* task, but still be
available to distribute completed map results to the reduce tasks, no?
Likewise, when map tasks are lost, shouldn't they be getting rescheduled on
still-available nodes?

3) It should be possible to improve the problems of 2, even given severe
space restrictions. Why can't a node be queried for how much space it has
available before accepting a reduce task? It should be straightforward to
figure out exactly how much space is needed to complete the reduce - 2x the
sum of the appropriate partitions (room for the appended version, and room
for a sorted output of that), right? If space is low, you don't allocate a
reduce to a tasktracker that doesn't have enough room. Likewise, once a
reduce job has entirely completed, the space from the original map/partition
files it reduced can be freed. Thus, so long as at least one node in the
cluster has room for at least the smallest reduce job, it should be possible
to make progress, rather than failing out.

4) Finally, performance-wise, why is it that we don't mimic the Google
Mapreduce technique of starting duplicates of tasks when the overall job
starts to near completion? I have a few slow machines in my cluster, which
can usefully complete work on large runs, but are, unsurprisingly, extending
the average completion time of my runs, somewhat needlessly.
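
The space check proposed in point 3 can be sketched as follows. This is only an illustration of the arithmetic in that paragraph (2x the sum of the partitions feeding a reduce), not existing Nutch or MapReduce code: the class ReducePlacementSketch, the tracker names, and the idea of passing free space in as a map are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ReducePlacementSketch {
    /**
     * Space needed for one reduce, per point 3: room for the appended
     * partitions plus room for a sorted copy, i.e. 2x their total size.
     */
    static long requiredBytes(long[] partitionSizes) {
        long sum = 0;
        for (long size : partitionSizes) {
            sum = Math.addExact(sum, size);
        }
        return Math.multiplyExact(2L, sum);
    }

    /**
     * Returns the first tracker reporting enough free space for the reduce,
     * or null if no tracker currently has room.
     */
    static String pickTracker(Map<String, Long> freeBytesByTracker, long[] partitionSizes) {
        long needed = requiredBytes(partitionSizes);
        for (Map.Entry<String, Long> e : freeBytesByTracker.entrySet()) {
            if (e.getValue() >= needed) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical free-space reports, in bytes.
        Map<String, Long> free = new LinkedHashMap<>();
        free.put("tracker-a", 500L);
        free.put("tracker-b", 5_000L);

        long[] partitions = {400L, 600L, 1_000L};
        System.out.println(requiredBytes(partitions));     // prints 4000
        System.out.println(pickTracker(free, partitions)); // prints tracker-b
    }
}
```

A null result corresponds to the case where no node has room for the reduce, in which case the scheduler would have to wait for space to be freed rather than assign the task.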

--
Bryan A. Pendleton
Ph: (877) geek-1-bp


[jira] Commented: (NUTCH-198) SWF parser

2006-02-02 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-198?page=comments#action_12364983 ] 

Doug Cutting commented on NUTCH-198:


+1

> SWF parser
> --
>
>  Key: NUTCH-198
>  URL: http://issues.apache.org/jira/browse/NUTCH-198
>  Project: Nutch
> Type: New Feature
>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 
>  Attachments: parse-swf.zip
>
> This is a parser for the Flash SWF files. It uses JavaSWF2 library (BSD 
> license), and uses some heuristic to extract as much text as possible 
> (including potential links) from ActionScript sections.
> If there are no objections, I'd like to add it soon.




Re: Cmd line for running plugins

2006-02-02 Thread Andrzej Bialecki

Andrzej Bialecki wrote:
> It works rather nicely. If other people find it useful, I can add this 
> to PluginRepository.




Committed.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Created: (NUTCH-198) SWF parser

2006-02-02 Thread Andrzej Bialecki (JIRA)
SWF parser
--

 Key: NUTCH-198
 URL: http://issues.apache.org/jira/browse/NUTCH-198
 Project: Nutch
Type: New Feature
  Components: fetcher  
Versions: 0.8-dev
Reporter: Andrzej Bialecki 
 Assigned to: Andrzej Bialecki  
 Attachments: parse-swf.zip

This is a parser for the Flash SWF files. It uses JavaSWF2 library (BSD 
license), and uses some heuristic to extract as much text as possible 
(including potential links) from ActionScript sections.

If there are no objections, I'd like to add it soon.




[jira] Updated: (NUTCH-198) SWF parser

2006-02-02 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-198?page=all ]

Andrzej Bialecki  updated NUTCH-198:


Attachment: parse-swf.zip

> SWF parser
> --
>
>  Key: NUTCH-198
>  URL: http://issues.apache.org/jira/browse/NUTCH-198
>  Project: Nutch
> Type: New Feature
>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 
>  Attachments: parse-swf.zip
>
> This is a parser for the Flash SWF files. It uses JavaSWF2 library (BSD 
> license), and uses some heuristic to extract as much text as possible 
> (including potential links) from ActionScript sections.
> If there are no objections, I'd like to add it soon.




Re: Cmd line for running plugins

2006-02-02 Thread Jérôme Charron
+1

On 2/1/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> +1
>
> On 01.02.2006 at 22:35, Andrzej Bialecki wrote:
>
> > Hi,
> >
> > I just found out that it's not possible to invoke main() methods of
> > plugins through the bin/nutch script. Sometimes it's useful for
> > testing and debugging - I can do it from within Eclipse, because I
> > have all plugins on the classpath, but from the command-line it's
> > not possible - in the code they are accessed through
> > PluginRepository. So I added this:
> >
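> > >    // requires: import java.lang.reflect.Method;
> > >    public static void main(String[] args) throws Exception {
> > >      // need at least a plugin ID and a class name
> > >      if (args.length < 2) {
> > >        System.err.println("Usage: <pluginId> <className> [args...]");
> > >        return;
> > >      }
> > >      NutchConf conf = new NutchConf();
> > >      PluginRepository repo = new PluginRepository(conf);
> > >      // args[0] - plugin ID
> > >      PluginDescriptor d = repo.getPluginDescriptor(args[0]);
> > >      if (d == null) {
> > >        System.err.println("Plugin '" + args[0] + "' not present or inactive.");
> > >        return;
> > >      }
> > >      ClassLoader cl = d.getClassLoader();
> > >      // args[1] - class name, loaded through the plugin's own class loader
> > >      Class clazz = Class.forName(args[1], true, cl);
> > >      Method m = clazz.getMethod("main", new Class[]{args.getClass()});
> > >      // remaining arguments are passed through to the plugin's main()
> > >      String[] subargs = new String[args.length - 2];
> > >      System.arraycopy(args, 2, subargs, 0, subargs.length);
> > >      m.invoke(null, new Object[]{subargs});
> > >    }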
> >
> > It works rather nicely. If other people find it useful, I can add
> > this to PluginRepository.
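
The reflective dispatch in the snippet above can be tried without Nutch on the classpath. The following is a minimal, self-contained sketch of the same technique (load a class through a given ClassLoader, look up its static main, and pass the remaining arguments through). MainDispatchSketch, EchoTool, and dispatch are hypothetical names introduced for the example; the plugin lookup via PluginRepository is left out, and the application class loader stands in for the plugin's class loader.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

final class MainDispatchSketch {
    /** Hypothetical stand-in for a plugin class that has its own main(). */
    public static final class EchoTool {
        static String lastArgs;

        public static void main(String[] args) {
            lastArgs = String.join(",", args);
        }
    }

    /**
     * Loads args[0] through the given class loader and invokes its static
     * main(String[]) with the remaining arguments.
     */
    static void dispatch(ClassLoader cl, String[] args) throws Exception {
        if (args.length < 1) {
            throw new IllegalArgumentException("usage: <className> [args...]");
        }
        Class<?> clazz = Class.forName(args[0], true, cl);
        Method m = clazz.getMethod("main", String[].class);
        String[] subargs = Arrays.copyOfRange(args, 1, args.length);
        // Cast to Object so the array is passed as one argument, not spread as varargs.
        m.invoke(null, (Object) subargs);
    }

    public static void main(String[] args) throws Exception {
        dispatch(MainDispatchSketch.class.getClassLoader(),
                 new String[] {EchoTool.class.getName(), "a", "b"});
        System.out.println(EchoTool.lastArgs); // prints a,b
    }
}
```

The cast to Object in the invoke call plays the same role as wrapping subargs in new Object[]{...} in the quoted code: it keeps the String array from being expanded into separate reflective arguments.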
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > ___. ___ ___ ___ _ _   __
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
>
> ---
> company:http://www.media-style.com
> forum:http://www.text-mining.org
> blog:http://www.find23.net
>
>
>
>


--
http://motrech.free.fr/
http://www.frutch.org/