Modifying Nutch Indexer
Hi, I need to modify the Nutch Indexer class because it would be very useful for me to add some extra fields to the generated Lucene index. While experimenting I found that it is possible to add fields to the Document with doc.addField() in the reduce function. The problem is that building those fields requires the HTML content of the web page, but it does not seem to be present in the Document yet: getField(content) throws a NullPointerException, so maybe that is not the correct way, or the correct place, to access it. So, how and where can I access the HTML content of the document, so I can add a new field to the Lucene Document and thus to the generated index? Any advice would be very helpful. Thanks in advance, Javier.
Re: implement Thai language analyzer in Nutch
I think you should learn JavaCC and then study NutchAnalysis.jj; after that the Thai case should be resolved soon. Just try it. On 11/7/06, sanjeev [EMAIL PROTECTED] wrote: Hello, After playing around with Nutch for a few months I was trying to implement the Thai language analyzer for Nutch. I downloaded the Subversion version and compiled it using ant - everything fine. Next: I didn't see any tutorial for Thai, but I did see one for Chinese at http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153 I tried following the same steps outlined there but ran into compiler errors: a type mismatch between the Lucene Token and the Nutch Token. Suffice to say I am back at square one as far as implementing the Thai language analyzer for Nutch goes. Can someone please outline the exact procedure for this, or point me to a tutorial which explains how? I would be highly obliged. Thanks. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7214087 Sent from the Nutch - Dev mailing list archive at Nabble.com. -- www.babatu.com
Re: Modifying Nutch Indexer
Javier P. L. wrote: Hi, I need to modify the Nutch Indexer class because it would be very useful for me to add some extra fields to the generated Lucene index. While experimenting I found that it is possible to add fields to the Document with doc.addField() in the reduce function. The problem is that building those fields requires the HTML content of the web page, but it does not seem to be present in the Document yet: getField(content) throws a NullPointerException, so maybe that is not the correct way, or the correct place, to access it. So, how and where can I access the HTML content of the document, so I can add a new field to the Lucene Document and thus to the generated index? Any advice would be very helpful. Thanks in advance, Javier. Hi, You do not need to change the Indexer code to add new fields to the index. Instead, implement an indexing filter and add it to your configuration during indexing. You can look at the code of index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter). The IndexingFilter interface has a filter() method which takes the document, the parse, the URL, the CrawlDatum and the inlinks as arguments, so the content of the document to be indexed is readily available. You can also look at the tutorial on implementing a plugin on the wiki. Best wishes.
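As a stand-in sketch of the idea (the real Nutch types IndexingFilter, Document and Parse are deliberately not reproduced here, and the "snippet" field is purely an illustration), the field-building logic one might apply to the parsed text inside a filter() implementation could look like this:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the kind of logic one might put inside an
// IndexingFilter.filter() implementation. It takes the parsed text of a
// page (which the real filter() receives via its Parse argument) and
// builds the value for a new index field - here, the first N words.
public class SnippetFieldSketch {

    // Return the first maxWords whitespace-separated words of the text,
    // joined by single spaces; an empty string for null or blank input.
    public static String firstWords(String text, int maxWords) {
        if (text == null) {
            return "";
        }
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (w.isEmpty()) {
                continue;
            }
            words.add(w);
            if (words.size() == maxWords) {
                break;
            }
        }
        return String.join(" ", words);
    }
}
```

Inside a real filter() one would then add the computed value to the document under the chosen field name and return the document; see BasicIndexingFilter for how the existing fields are added.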
[jira] Updated: (NUTCH-389) a url tokenizer implementation for tokenizing index fields : url and host
[ http://issues.apache.org/jira/browse/NUTCH-389?page=all ] Enis Soztutar updated NUTCH-389: Attachment: urlTokenizer-improved.diff This is an improvement and a minor bug fix over the previous URL tokenizer. This version first replaces characters that are represented in hexadecimal (percent-encoded) format in the URL. For example, the URL file:///tmp/foo%20baz%20bar/foo/baz~bar/index.html is first converted to file:///tmp/foo baz bar/foo/baz~bar/index.html by replacing each %20 with a space. A NullPointerException is also corrected for the case where the input reader returns null for the URL. Further improvements to the URL tokenization can be discussed here. a url tokenizer implementation for tokenizing index fields : url and host - Key: NUTCH-389 URL: http://issues.apache.org/jira/browse/NUTCH-389 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 0.9.0 Reporter: Enis Soztutar Priority: Minor Attachments: urlTokenizer-improved.diff, urlTokenizer.diff NutchAnalysis.jj tokenizes the input by treating - and _ as non-token separators, which is not appropriate in the case of URLs. So I have written a URL tokenizer which produces tokens that match the regular expression [a-zA-Z0-9]+. As stated in http://www.gbiv.com/protocols/uri/rfc/rfc3986.html, which describes the grammar for URIs, URLs can be tokenized with the above expression. The NutchDocumentAnalyzer code is modified to use the UrlTokenizer with the url, site and host fields. see : http://www.mail-archive.com/nutch-user@lucene.apache.org/msg06247.html -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
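A minimal, self-contained sketch of the two steps described above (percent-decoding, then emitting maximal [a-zA-Z0-9]+ runs as tokens) might look like the following; this is only an illustration of the scheme, not the attached patch, and it treats each %XX escape as a single character:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the tokenization scheme described in NUTCH-389:
// first decode %XX escapes (e.g. %20 -> space), then emit the maximal
// runs of [a-zA-Z0-9] as tokens. Null input yields no tokens, mirroring
// the NullPointerException fix mentioned in the update.
public class UrlTokenizerSketch {

    private static final Pattern ESCAPE = Pattern.compile("%([0-9a-fA-F]{2})");
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    // Replace each %XX escape with the character it encodes.
    static String percentDecode(String url) {
        Matcher m = ESCAPE.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Decode, then return every maximal [a-zA-Z0-9]+ run as a token.
    public static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        if (url == null) {
            return tokens; // guard against a null url (the fixed NPE case)
        }
        Matcher m = TOKEN.matcher(percentDecode(url));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

On the example URL above this yields the tokens file, tmp, foo, baz, bar, foo, baz, bar, index, html - the %20 escapes become separators rather than leaking "20" into the token stream.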
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447787 ] Enis Soztutar commented on NUTCH-393: - Also, IndexingException is caught by the Indexer, in which case the whole document is not added to the writer (the function returns). Indexer, line 334:

try {
  // run indexing filters
  doc = this.filters.filter(doc, parse, (UTF8)key, fetchDatum, inlinks);
} catch (IndexingException e) {
  if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); }
  return;
}

IndexingException should instead be caught in IndexingFilters.filter(), so that when an IndexingException is thrown in one indexing plugin, the other plugins can still run. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: http://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch Plugins (like IndexingFilter) may return a null value, but this isn't handled by the Indexer. A trivial adjustment is all it takes:

@@ -237,6 +237,7 @@
   if (LOG.isWarnEnabled()) { LOG.warn("Error indexing " + key + ": " + e); } return; }
+  if (doc == null) return;
   float boost = 1.0f; // run scoring filters
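The behavior suggested in the comment (catching per filter, so one failing plugin does not stop the rest) can be sketched with a stand-in filter interface; FieldFilter, FilterException and the class names below are hypothetical illustrations, not the Nutch API:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the per-plugin catch suggested in the comment: each
// filter runs in its own try/catch, so an exception in one plugin
// only skips that plugin instead of aborting the whole document.
public class FilterChainSketch {

    // Hypothetical stand-ins for IndexingFilter / IndexingException.
    public static class FilterException extends Exception {
        public FilterException(String msg) { super(msg); }
    }

    public interface FieldFilter {
        void filter(Map<String, String> doc) throws FilterException;
    }

    // Run every filter; log and continue on failure.
    public static Map<String, String> runAll(Map<String, String> doc,
                                             List<FieldFilter> filters) {
        for (FieldFilter f : filters) {
            try {
                f.filter(doc);
            } catch (FilterException e) {
                // In Nutch this would be a LOG.warn(...); the point is
                // that we continue with the remaining filters.
                System.err.println("Skipping filter: " + e.getMessage());
            }
        }
        return doc;
    }

    public static Map<String, String> demo() {
        Map<String, String> doc = new HashMap<>();
        List<FieldFilter> filters = Arrays.asList(
            d -> d.put("host", "example.org"),
            d -> { throw new FilterException("boom"); },  // a failing plugin
            d -> d.put("lang", "en"));                    // still runs
        return runAll(doc, filters);
    }
}
```

The follow-up message below argues the opposite design (fail the whole document if any filter fails), which would mean letting the exception propagate out of runAll instead; either choice is a one-line difference in this sketch.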
[jira] Created: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0
porting clustering-carrot2 plugin to carrot2 v2.0 - Key: NUTCH-397 URL: http://issues.apache.org/jira/browse/NUTCH-397 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Priority: Trivial A rather trivial port of clustering-carrot2 to the new carrot2. I also added the necessary jars for Polish, so that Nutch will not throw the annoying exceptions when it is initializing clustering-carrot2. There is a small problem, though. AFAICS, a small patch has to be applied to carrot2, otherwise Nutch cannot start the plugin. (I am also attaching that here.)
[jira] Commented: (NUTCH-393) Indexer doesn't handle null documents returned by filters
[ http://issues.apache.org/jira/browse/NUTCH-393?page=comments#action_12447939 ] Eelco Lempsink commented on NUTCH-393: -- I'm not sure I agree with that. After running a document through a set of filters you'd expect that all the filters ran; if not, that's an exception. For instance, your index might depend on all numbers and non-English words being stripped. When one of those filters hits an exception but the other one still runs, your index will become dirty. Indexer doesn't handle null documents returned by filters - Key: NUTCH-393 URL: http://issues.apache.org/jira/browse/NUTCH-393 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 0.8.1 Reporter: Eelco Lempsink Attachments: NUTCH-393.patch
Re: implement Thai language analyzer in Nutch
Oh, by the way - I followed the Chinese tutorial and was able to compile, and everything was fine. Let me just test whether it is working properly - however, I didn't make any changes to NutchAnalysis.jj. I need more information, please. Thanks a bunch. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7232439 Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: implement Thai language analyzer in Nutch
Hi sanjeev and Kauu, I want to support Hindi, a language widely spoken in India. Can you guide me on what else I need to modify? I think there is currently no support for searching and indexing the Hindi language, and I want to work on this. But I need some information on what to modify and where exactly the changes are required. Can anybody help me? Thanks. ./Arun On 11/8/06, sanjeev [EMAIL PROTECTED] wrote: Oh, by the way - I followed the Chinese tutorial and was able to compile, and everything was fine. Let me just test whether it is working properly - however, I didn't make any changes to NutchAnalysis.jj. I need more information, please. Thanks a bunch. -- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7232439 Sent from the Nutch - Dev mailing list archive at Nabble.com.
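As a small, self-contained illustration of where language-specific analysis usually starts (this is not the Nutch plugin mechanism, just a sketch with made-up names): unlike Thai, Hindi text in the Devanagari script is whitespace-separated, so a first step is often detecting the script in order to route the text to a language-specific analyzer:

```java
// Sketch: detect whether a string is predominantly in Devanagari, the
// script used for Hindi. A language-analysis plugin could use such a
// check to route text to a Hindi-specific analyzer; the class and
// method names here are illustrative only.
public class DevanagariSketch {

    // True if more than half of the non-whitespace characters fall in
    // the Unicode Devanagari block (U+0900..U+097F).
    public static boolean isMostlyDevanagari(String text) {
        int chars = 0, devanagari = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            chars++;
            if (Character.UnicodeBlock.of(c) == Character.UnicodeBlock.DEVANAGARI) {
                devanagari++;
            }
        }
        return chars > 0 && devanagari * 2 > chars;
    }
}
```

For wiring this into Nutch itself, the Chinese-analyzer steps in NUTCH-36 mentioned earlier in this thread are still the closest available guide.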