patch for nutch and nutch-daemon.sh

2006-01-23 Thread Zaheed Haque
Hi:

Due to a bug in the if statement, it's not possible to use symlinks
for the shell scripts. Below you will find the patch.

Thanks
Zaheed

---

$ svn diff nutch
Index: nutch
===
--- nutch   (revision 371849)
+++ nutch   (working copy)
@@ -17,7 +17,7 @@
 while [ -h "$THIS" ]; do
   ls=`ls -ld "$THIS"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     THIS="$link"
   else
     THIS=`dirname "$THIS"`/"$link"
$ svn diff nutch-daemon.sh
Index: nutch-daemon.sh
===
--- nutch-daemon.sh (revision 371849)
+++ nutch-daemon.sh (working copy)
@@ -29,7 +29,7 @@
 while [ -h "$this" ]; do
   ls=`ls -ld "$this"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     this="$link"
   else
     this=`dirname "$this"`/"$link"
$
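The one-character change above is the whole fix: expr anchors its pattern at the start of the string, so '/.*' is true only for absolute symlink targets, while the old '.*/.*' also matched relative targets that merely contain a slash, skipping the dirname prepending. A quick illustration (the paths are made up):

```shell
# Illustrates the regex difference the patch relies on. expr matches from
# the start of the string, so '/.*' distinguishes absolute from relative
# symlink targets; the old '.*/.*' wrongly treated "../bin/nutch" as
# absolute too, because it contains a slash.
for link in /usr/local/bin/nutch ../bin/nutch nutch; do
  if expr "$link" : '/.*' > /dev/null; then
    echo "$link: absolute target, used as-is"
  else
    echo "$link: relative target, prepend dirname"
  fi
done
```

With the old pattern, the second path would have been classified as absolute and the loop would have lost track of the script's real directory.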


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jack Tang
Hi

Is it reasonable to guess language info from the target server's geographical info?

/Jack

On 1/23/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > Any plan to implement this? I mean move the LanguageIdentifier class
> > into the Nutch core.
>
> As I already suggested it on this list, I really would like to move the
> LanguageIdentifier class (and profiles) to
> an independent Lucene sub-project (and the MimeType repository too).
> I don't remember why, but there were some objections about this...
>
> Here is a short status of what I have in mind for the next improvements to the
> LanguageIdentifier / MultiLanguage support:
> * Enhance the LanguageIdentifier APIs by returning something like an ordered
> LangDetail[] array when guessing the language (each LangDetail should contain
> the language code and its score) - I have a prototype version of this on my
> disk but I haven't taken the time to finalize it.
> * I encountered some identification problems with some specific sites (with
> blogger, for instance), and I plan to investigate this point.
> * Another pending task: the analysis (and coding) of multilingual querying
> support.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


xml-parser plugin contribution

2006-01-23 Thread Rida Benjelloun
Hi,
I have developed an XML parser plugin. I have tested it with Nutch 0.7.2.
The parser uses namespaces and XPath to do the mapping between XML nodes and
Lucene fields.
I'm trying to send the source of the plugin in a zip file, but my message is
always rejected (it is considered spam).
How can I send the source code?
Best regards.


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:

I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in
the near future.

Does that sound ok?
  


+1 from me.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread ogjunk-nutch
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in
the near future.

Does that sound ok?

Otis


- Original Message 
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Mon 23 Jan 2006 02:55:46 PM EST
Subject: Re: lang identifier and nutch analyzer in trunk

>> As I already suggested it on this list, I really would like to move the
>> LanguageIdentifier class (and profiles) to
>> an independent Lucene sub-project (and the MimeType repository too).
>> I don't remember why, but there were some objections about this...
>>
>
> I think most people agree that it would be worthwhile to un-tie
> this component from Nutch internals. The only objections were
> related not to the idea itself, but to the management aspects of
> creating a full-blown sub-project, both wrt. the initial setup
> and the continuing maintenance. An alternative solution was
> proposed (creating a contrib/ package). This would still help to
> separate the code from Nutch internals, so that it can be used in
> other projects, but it would require much less effort to set up and
> maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can use just the jar.

Stefan








Re: Patch for NDFS's df.java

2006-01-23 Thread Stefan Groschupf

+1
Maybe you can clean things up yourself and create a Jira issue
where you can attach this patch.

Thanks.
Stefan

On 20.01.2006 at 16:33, Dominik Friedrich wrote:

This is a bugfixed version. I wasn't running a normal NDFS datanode
but a JMX-based version I'm working on, and I didn't recognize that
the first version always returned 0 when the NDFS datanode was
started with no data dir.


best regards,
Dominik

Dominik Friedrich schrieb:
This is a patch for NDFS's DF.java. With this patch I was able to
run the NDFS datanode on Windows systems without Cygwin. I haven't
found a way to get the partition size on Windows, so it's always
set to two times the available space. Maybe somebody can clean
this up so it can be included in the Nutch trunk.


best regards,
Dominik


Index: DF.java
===
--- DF.java (revision 370204)
+++ DF.java (working copy)
@@ -15,81 +15,167 @@
  */
 package org.apache.nutch.ndfs;
 
-import java.io.File;
+import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
-import java.io.BufferedReader;
-
 import java.util.StringTokenizer;
-import java.util.Iterator;
 
-/** Filesystem disk space usage statistics.  Uses the unix 'df' program.
- * Tested on Linux, FreeBSD and Cygwin. */
+/**
+ * Filesystem disk space usage statistics. Uses the unix 'df' program. Tested on Linux, FreeBSD and Cygwin.
+ */
 public class DF {
-  private String filesystem;
-  private long capacity;
-  private long used;
-  private long available;
-  private int percentUsed;
-  private String mount;
-
-  public DF(String path) throws IOException {
+   private String filesystem;
 
-    Process process = Runtime.getRuntime().exec(new String[] {"df","-k",path});
+   private long capacity;
 
-    try {
-      if (process.waitFor() == 0) {
-        BufferedReader lines =
-          new BufferedReader(new InputStreamReader(process.getInputStream()));
+   private long used;
 
-        lines.readLine(); // skip headings
+   private long available;
 
-        StringTokenizer tokens =
-          new StringTokenizer(lines.readLine(), " \t\n\r\f%");
-
-        this.filesystem = tokens.nextToken();
-        if (!tokens.hasMoreTokens()) {        // for long filesystem name
-          tokens = new StringTokenizer(lines.readLine(), " \t\n\r\f%");
-        }
-        this.capacity = Long.parseLong(tokens.nextToken()) * 1024;
-        this.used = Long.parseLong(tokens.nextToken()) * 1024;
-        this.available = Long.parseLong(tokens.nextToken()) * 1024;
-        this.percentUsed = Integer.parseInt(tokens.nextToken());
-        this.mount = tokens.nextToken();
+   private int percentUsed;
 
-      } else {
-        throw new IOException
-          (new BufferedReader(new InputStreamReader(process.getErrorStream()))
-           .readLine());
-      }
-    } catch (InterruptedException e) {
-      throw new IOException(e.toString());
-    } finally {
-      process.destroy();
-    }
-  }
+   private String mount;
 
-  /// ACCESSORS
+   public DF(String path) throws IOException {
+       String os = System.getProperty("os.name");
+       if (os.startsWith("Windows"))
+           dfWidows(path);
+       else
+           dfUnix(path);
+   }
 
-  public String getFilesystem() { return filesystem; }
-  public long getCapacity() { return capacity; }
-  public long getUsed() { return used; }
-  public long getAvailable() { return available; }
-  public int getPercentUsed() { return percentUsed; }
-  public String getMount() { return mount; }
-
-  public String toString() {
-    return
-      "df -k " + mount +"\n" +
-      filesystem + "\t" +
-      capacity / 1024 + "\t" +
-      used / 1024 + "\t" +
-      available / 1024 + "\t" +
-      percentUsed + "%\t" +
-      mount;
-  }
+   // / ACCESSORS
 
-  public static void main(String[] args) throws Exception {
-    System.out.println(new DF(args[0]));
-  }
+   /**
+    * @return  Returns the filesystem.
+    * @uml.property  name="filesystem"
+    */
+   public String getFilesystem() {
+       return filesystem;
+   }
+
+   /**
+    * @return  Returns the capacity.
+    * @uml.property  name="capacity"
+    */
+   public long getCapacity() {
+       return capacity;
+   }
+
+   /**
+    * @return  Returns the used.
+    * @uml.property  name="used"
+    */
+   public long getUsed() {
+       return used;
+   }
+
+   /**
+    * @return  Returns the available.
+    * @uml.property  name="available"
+    */
+   public long getAvailable() {
+       return available;
+   }
+
+   /**
+    * @return  Returns the percentUsed.
+    * @uml.property  name="percentUsed"
+    */
+   public int getPercentUsed() {
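As an aside on the cross-platform problem the patch works around: since Java 6, java.io.File can report partition statistics directly, which removes the need both for the external 'df' process and for the Windows guesswork. A minimal sketch (not part of the patch above; the class name is illustrative):

```java
import java.io.File;

// Minimal sketch: java.io.File (Java 6+) reports partition statistics
// directly, so no external 'df' process or OS-specific branch is needed.
// DfSketch is an illustrative name, not a Nutch class.
public class DfSketch {
    public static void main(String[] args) {
        File path = new File(args.length > 0 ? args[0] : ".");
        long capacity = path.getTotalSpace();   // total size of the partition, in bytes
        long available = path.getUsableSpace(); // bytes available to this JVM
        long used = capacity - available;
        int percentUsed = capacity == 0 ? 0 : (int) (used * 100 / capacity);
        System.out.println("capacity=" + capacity
            + " used=" + used
            + " available=" + available
            + " percentUsed=" + percentUsed + "%");
    }
}
```

Of course, this API did not exist when the thread was written (January 2006), which is why the patch resorts to estimating the Windows partition size.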

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Stefan Groschupf
As I already suggested it on this list, I really would like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections about this...


I think most people agree that it would be worthwhile to un-tie
this component from Nutch internals. The only objections were
related not to the idea itself, but to the management aspects of
creating a full-blown sub-project, both wrt. the initial setup
and the continuing maintenance. An alternative solution was
proposed (creating a contrib/ package). This would still help to
separate the code from Nutch internals, so that it can be used in
other projects, but it would require much less effort to set up and
maintain.


+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can use just the jar.


Stefan





Re: protocol-httpclient; maximum total connections

2006-01-23 Thread Stefan Groschupf
Thanks for finding this bug. Please open a bug report in Jira, and if
you like, patches are always welcome. :-)


On 23.01.2006 at 15:00, [EMAIL PROTECTED] wrote:


Hi,

Protocol-httpclient sets the maximum number of total connections to the
"fetcher.threads.fetch" configuration parameter for the underlying
commons-httpclient. However, if the -threads argument is used with the
fetcher, it doesn't change fetcher.threads.fetch. Whatever number of
threads is given to the -threads argument, httpclient will use its
default number of total connections (10). This will affect crawling
performance. It seems to be a bug. Any comment on this?

A possible solution is adding the line below to the setThreadCount
function of the Fetcher class.
 NutchConf.get().setInt("fetcher.threads.fetch", threadCount);

Also, the fetcher seems to be using lots of memory, maybe due to a
memory leak. It starts at 10~15%; after several hours the Linux top
command reports it's using 50~70% of the whole memory. Anyone
experiencing this behaviour?

Thanks,
-orkunt.



---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




[jira] Resolved: (NUTCH-127) incorrect values using -du, or ls does not return items

2006-01-23 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-127?page=all ]
 
Stefan Groschupf resolved NUTCH-127:


Resolution: Fixed

I guess it is solved, thanks. If I am able to reproduce it again, I will just
reopen this or file a new report.
Thanks!

> incorrect values using -du, or ls does not return items
> ---
>
>  Key: NUTCH-127
>  URL: http://issues.apache.org/jira/browse/NUTCH-127
>  Project: Nutch
> Type: Bug
>   Components: ndfs
> Versions: 0.8-dev, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker

>
> The NDFS client returns incorrect values when using -du, or -ls does not
> return items.
> It looks like there is a problem with the virtual file structure, since -du
> only reads the metadata, doesn't it?
> We had moved some data from folder to folder, and after that we noticed that a
> folder with zero items has a size.
> [EMAIL PROTECTED] bin/nutch ndfs -du indexes/
> 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 092409 No FS indicated, using default:192.168.200.3:5
> 051118 092409 Client connection to 192.168.200.3:5: starting
> Found 1 items
> /user/nutch/indexes/20051022033721  974606348
> [EMAIL PROTECTED] bin/nutch ndfs -du indexes/20051022033721/
> 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 092416 No FS indicated, using default:192.168.200.3:5
> 051118 092416 Client connection to 192.168.200.3:5: starting
> Found 0 items
> [EMAIL PROTECTED] bin/nutch ndfs -ls indexes/20051022033721
> 051118 093331 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 093332 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 093332 No FS indicated, using default:192.168.200.3:5
> 051118 093332 Client connection to 192.168.200.3:5: starting
> Found 0 items
> So maybe the mv tool has a problem, or the du or ls tool. :-O Any ideas where
> to search for the problem? Debugging NDFS is tricky.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



protocol-httpclient; maximum total connections

2006-01-23 Thread orkunt . sabuncu
Hi,

Protocol-httpclient sets the maximum number of total connections to the
"fetcher.threads.fetch" configuration parameter for the underlying
commons-httpclient. However, if the -threads argument is used with the fetcher,
it doesn't change fetcher.threads.fetch. Whatever number of threads is given to
the -threads argument, httpclient will use its default number of total
connections (10). This will affect crawling performance. It seems to
be a bug. Any comment on this?

A possible solution is adding the line below to the setThreadCount function of
the Fetcher class.
 NutchConf.get().setInt("fetcher.threads.fetch", threadCount);
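The idea can be sketched with a plain Properties object standing in for NutchConf (the stand-in class and all names here are illustrative; only the setInt call on "fetcher.threads.fetch" comes from the thread). The point is that -threads must be written back into the configuration, because protocol-httpclient sizes its connection pool from the config key rather than from the command-line argument:

```java
import java.util.Properties;

// Illustrative stand-in for the suggested fix. "conf" plays the role of
// NutchConf; setThreadCount() mirrors the Fetcher method quoted above.
public class ThreadCountSketch {
    static final Properties conf = new Properties();

    static void setThreadCount(int threadCount) {
        // The one-line fix from the thread, adapted to the stand-in config:
        // propagate -threads into the key the httpclient pool is sized from.
        conf.setProperty("fetcher.threads.fetch", Integer.toString(threadCount));
    }

    public static void main(String[] args) {
        conf.setProperty("fetcher.threads.fetch", "10"); // the default pool size
        setThreadCount(50);                              // simulates "-threads 50"
        System.out.println(conf.getProperty("fetcher.threads.fetch")); // prints "50"
    }
}
```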

Also, the fetcher seems to be using lots of memory, maybe due to a memory leak.
It starts at 10~15%; after several hours the Linux top command reports it's
using 50~70% of the whole memory. Anyone experiencing this behaviour?

Thanks,
-orkunt.


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
> +1. Other local modifications which I use frequently:
>
> * exporting a list of supported languages,
>
> * exporting an NGramProfile of the analyzed text,
>
> * allow processing of chunks of input (i.e.
> LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is
> very useful if the text to be analyzed is already present in memory, and
> the choice of sections (chunks) is made elsewhere, e.g. for documents
> with clearly outlined sections, or for multi-language documents.

Thanks for these interesting comments, Andrzej => I'll add them to my todo
list.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Andrzej Bialecki

Jérôme Charron wrote:

Any plan to implement this? I mean move the LanguageIdentifier class
into the Nutch core.



As I already suggested it on this list, I really would like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections about this...

  


I think most people agree that it would be worthwhile to un-tie this
component from Nutch internals. The only objections were related not to
the idea itself, but to the management aspects of creating a full-blown
sub-project, both wrt. the initial setup and the continuing
maintenance. An alternative solution was proposed (creating a contrib/
package). This would still help to separate the code from Nutch
internals, so that it can be used in other projects, but it would
require much less effort to set up and maintain.



Here is a short status of what I have in mind for the next improvements to the
LanguageIdentifier / MultiLanguage support:
* Enhance the LanguageIdentifier APIs by returning something like an ordered
LangDetail[] array when guessing the language (each LangDetail should contain
the language code and its score) - I have a prototype version of this on my
disk but I haven't taken the time to finalize it.
  


+1. Other local modifications which I use frequently:

* exporting a list of supported languages,

* exporting an NGramProfile of the analyzed text,

* allow processing of chunks of input (i.e. 
LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is 
very useful if the text to be analyzed is already present in memory, and 
the choice of sections (chunks) is made elsewhere, e.g. for documents 
with clearly outlined sections, or for multi-language documents.
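A rough sketch of what the proposed ordered-result API could look like (purely hypothetical: neither LangDetail nor the ordered() helper exists in Nutch, and all names are made up to illustrate the thread's suggestion):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of the API discussed in the thread: identification
// returns an ordered LangDetail[] (language code plus confidence score)
// instead of a single language string.
public class LangDetail {
    final String code;  // ISO 639 language code, e.g. "en"
    final float score;  // confidence; higher means more likely

    LangDetail(String code, float score) {
        this.code = code;
        this.score = score;
    }

    // Orders candidate languages best-first, as the thread suggests the
    // enhanced identify() should.
    static LangDetail[] ordered(LangDetail[] candidates) {
        LangDetail[] out = candidates.clone();
        Arrays.sort(out, new Comparator<LangDetail>() {
            public int compare(LangDetail a, LangDetail b) {
                return Float.compare(b.score, a.score); // descending by score
            }
        });
        return out;
    }

    public static void main(String[] args) {
        LangDetail[] guesses = {
            new LangDetail("fr", 0.31f),
            new LangDetail("en", 0.52f),
            new LangDetail("de", 0.17f)
        };
        for (LangDetail d : ordered(guesses)) {
            System.out.println(d.code + " " + d.score);
        }
    }
}
```

A chunk-based overload such as identify(char[] buf, int start, int len) would simply run the same n-gram scoring over the given slice and return the same ordered array.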



* I encountered some identification problems with some specific sites (with
blogger, for instance), and I plan to investigate this point.
* Another pending task: the analysis (and coding) of multilingual querying
support.
  


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
> Any plan to implement this? I mean move the LanguageIdentifier class
> into the Nutch core.

As I already suggested it on this list, I really would like to move the
LanguageIdentifier class (and profiles) to
an independent Lucene sub-project (and the MimeType repository too).
I don't remember why, but there were some objections about this...

Here is a short status of what I have in mind for the next improvements to the
LanguageIdentifier / MultiLanguage support:
* Enhance the LanguageIdentifier APIs by returning something like an ordered
LangDetail[] array when guessing the language (each LangDetail should contain
the language code and its score) - I have a prototype version of this on my
disk but I haven't taken the time to finalize it.
* I encountered some identification problems with some specific sites (with
blogger, for instance), and I plan to investigate this point.
* Another pending task: the analysis (and coding) of multilingual querying
support.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/