hi,how to use the ICTCLASCall

2006-03-26 Thread kauu
Hi all,
  I have a problem integrating Nutch-0.7.1 with an intelligent Chinese
Lexical Analysis System.
  I followed this page:
http://www.nutchhacks.com/ftopic391.php&highlight=chinese
which was written by *caoyuzhong*.
  When I build my modified Java files with ant, javac reports that it cannot find the
symbol caomo.ICTCLASCaller
in this line:
private final static caomo.ICTCLASCaller spliter = new
caomo.ICTCLASCaller();

So my question is: how do I deal with it?
Any reply will be appreciated!

Re: hi,how to use the ICTCLASCall

2006-03-27 Thread kauu
Thanks anyway.

On 3/27/06, Yong-gang Cao <[EMAIL PROTECTED]> wrote:
> Please visit http://chiefadminofficer.googlepages.com/mycodes for the
> source
> code of ICTCLASCaller and the DLL used by it.
> You also need to get the data files from
> ICTCLAS<http://www.nlp.org.cn/project/project.php?proj_id=6>source
> site to run ICTCLASCaller.
> Notice: The codes and the DLL's usage is restricted by ICTCLAS copyright
> Details of usage are put into the comments of ICTCLASCaller.java.
> Good Luck!
> 2006/3/27, kauu <[EMAIL PROTECTED]>:
> >
> > hi all:
> >   i get a problem when I integrat Nutch-0.7.1 with an intelligent
> Chinese
> > Lexical Analysis System.
> >   and i follow the next page:
> > http://www.nutchhacks.com/ftopic391.php&highlight=chinese
> > which wrote by *caoyuzhong
> >   *when ant my modified java files , javac told me that couldn't find
> the
> > symbol caomo.ICTCLASCaller
> > in this line
> > private final static caomo.ICTCLASCaller spliter = new
> > caomo.ICTCLASCaller();
> >
> > so my question is how to deal with it?
> > any reply will be appreciated!
> > --
> > www.babatu.com
> >
> >
> --
> http://spaces.msn.com/members/caomo
> Beijing University of Aeronautics and Astronautics (BeiHang University)
> P.B.: 2-53# MailBox, 37 Xueyuan Road ,Beijing, 100083  P.R.China


Re: hi,how to use the ICTCLASCall

2006-03-27 Thread kauu
I got it.
Thank goodness!
I'm so happy to tell everyone that I got it working, and I will write it up for anyone else!

On 3/27/06, kauu <[EMAIL PROTECTED]> wrote:
> thanks any way
> On 3/27/06, Yong-gang Cao <[EMAIL PROTECTED]> wrote:
> >
> > Please visit http://chiefadminofficer.googlepages.com/mycodes for the
> > source
> > code of ICTCLASCaller and the DLL used by it.
> > You also need to get the data files from
> > ICTCLAS<http://www.nlp.org.cn/project/project.php?proj_id=6>source
> > site to run ICTCLASCaller.
> > Notice: The codes and the DLL's usage is restricted by ICTCLAS copyright
> >
> > (NOT MINE).
> > Details of usage are put into the comments of ICTCLASCaller.java.
> > Good Luck!


ICTCLAS with nutch 0.7.1.

2006-03-27 Thread kauu
Hi all,
  I have a big problem integrating ICTCLAS with Nutch 0.7.1.
I followed the page quoted below,
but when I build Nutch with ant, I get a lot of errors like this:

I have modified the files in the org.apache.nutch.analysis directory. My
question is: should I also modify Lucene,
and if so, how?

Any reply will be appreciated.

I have integrated Nutch with an intelligent Chinese
Lexical Analysis System, so Nutch can now segment
Chinese words effectively.

The following is my solution:

1. Modify NutchAnalysis.jj:

-| <#CJK: // non-alphabets
- [
- "\u3040"-"\u318f",
- "\u3300"-"\u337f",
- "\u3400"-"\u3d2d",
- "\u4e00"-"\u9fff",
- "\uf900"-"\ufaff"
- ]
- >

+| <#OTHER_CJK: //japanese and korean characters
+ [
+ "\u3040"-"\u318f",
+ "\u3300"-"\u337f",
+ "\u3400"-"\u3d2d",
+ "\uf900"-"\ufaff"
+ ]
+ >
+| <#CHINESE: //chinese characters
+ [
+ "\u4e00"-"\u9fff"
+ ]
+ >

-| <SIGRAM: <CJK> >

+| <SIGRAM: <OTHER_CJK> >
+| <CNWORD: (<CHINESE>)+ > //chinese words

- ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
+ ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)

This segments Chinese characters intelligently, while Japanese
and Korean characters keep single-gram segmentation.
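
The character ranges in the grammar change above can be checked in isolation. The following is a minimal sketch in plain Java, not part of Nutch: the class name CjkClassifier and the Kind enum are hypothetical, and only the numeric ranges are taken from the .jj fragment.

```java
final class CjkClassifier {
    /** Illustrative labels mirroring the CHINESE and OTHER_CJK character classes. */
    enum Kind { CHINESE, OTHER_CJK, OTHER }

    /** Classifies one UTF-16 char using the ranges listed in the grammar fragment. */
    static Kind classify(char c) {
        // U+4E00..U+9FFF: the range the post routes to the Chinese segmenter
        if (c >= 0x4E00 && c <= 0x9FFF) {
            return Kind.CHINESE;
        }
        // Remaining ranges of the old CJK class, still tokenized one char at a time
        if ((c >= 0x3040 && c <= 0x318F)
                || (c >= 0x3300 && c <= 0x337F)
                || (c >= 0x3400 && c <= 0x3D2D)
                || (c >= 0xF900 && c <= 0xFAFF)) {
            return Kind.OTHER_CJK;
        }
        return Kind.OTHER;
    }
}
```

The decimal bounds 19968 and 40959 used later in FastCharStream.java are the same U+4E00 and U+9FFF limits.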

2.modify NutchDocumentTokenizer.java

-case EOF: case WORD: case ACRONYM: case SIGRAM:
+case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:

3.modify FastCharStream.java

+private final static caomo.ICTCLASCaller spliter = new
+caomo.ICTCLASCaller();
+private final int IO_BUFFER_SIZE=2048;

-buffer = new char[2048];
+buffer = new char[IO_BUFFER_SIZE];

-int charsRead = input.read(buffer, newPosition,
+int charsRead=readString(newPosition);

+ // do intelligent Chinese word segmentation
+private int readString(int newPosition) throws java.io.IOException {
+ char[] tempBuffer = new char[IO_BUFFER_SIZE / 2]; //read from io
+ char[] hzBuffer = new char[IO_BUFFER_SIZE / 2]; //store Chinese characters string
+ int len=0;
+ len = input.read(tempBuffer, 0, IO_BUFFER_SIZE / 4);
+ int pos=-1; //position in buffer
+ if (len > 0) {
+ pos=0;
+ int hzPos=0; //position in hzBuffer
+ char c=' ';
+ int value=-1;
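+ for(int i=0;i<len;i++){ //this line and the next two are reconstructed; the archive dropped the text between '<' and '>'
+ c=tempBuffer[i];
+ value=(int)c;
+ if( (value<19968)||(value>40959) ){ //non-chinese characters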
+ buffer[pos + newPosition] = c;
+ pos++;
+ }
+ else{ //Chinese character unicode: '\u4e00---'\u9fff'
+ hzBuffer[hzPos++]=' ';
+ hzBuffer[hzPos] = c;
+ hzPos++;
+ i++;
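+ while(i<len){ //this line and the next two are reconstructed; the archive dropped the text between '<' and '>'
+ c=tempBuffer[i];
+ value=(int)c;
+ if( (value>=19968)&&(value<=40959) ){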
+ hzBuffer[hzPos] = c;
+ hzPos++;
+ i++;
+ }
+ else
+ break; //have extracted a Chinese String
+ }
+ i--;
+ if(hzPos>0){
+ String str = new String(hzBuffer, 0, hzPos);
+ String str2 = spliter.segSentence(str); // perform Chinese word segmentation
+ if(str2!=null){
+ while(str2.length()>buffer.length-newPosition){ //expand the buffer
+ char[] newBuffer = new char[buffer.length*2];
+ System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+ buffer = newBuffer;
+ }
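+ for(int j=0;j<str2.length();j++){ //loop header reconstructed; the rest of this method is missing from the archive

http://www.nlp.org.cn/project/project.php?proj_id=6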
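
Since the readString listing is incomplete in the archive, here is a minimal, self-contained sketch of the idea it describes: copy non-Chinese characters through unchanged, and hand each maximal run of Chinese characters to a segmenter that returns space-separated words. It is not the original method. The class ChineseRunSplitter and the unigrams stand-in segmenter are hypothetical, the real ICTCLASCaller is replaced by a function parameter, and it assumes Java 8 or later.

```java
import java.util.function.UnaryOperator;

final class ChineseRunSplitter {
    /** Same test as the post: U+4E00..U+9FFF (19968..40959). */
    static boolean isChinese(char c) {
        return c >= 0x4E00 && c <= 0x9FFF;
    }

    /**
     * Copies non-Chinese characters unchanged; each maximal run of Chinese
     * characters is replaced by a space followed by the segmenter's output.
     */
    static String process(String input, UnaryOperator<String> segmenter) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (!isChinese(c)) {
                out.append(c);
                i++;
                continue;
            }
            int start = i;
            while (i < input.length() && isChinese(input.charAt(i))) {
                i++;
            }
            String run = input.substring(start, i);
            String segmented = segmenter.apply(run);
            out.append(' ');
            // Fall back to the raw run if the segmenter returns null
            out.append(segmented != null ? segmented : run);
        }
        return out.toString();
    }

    /** Stand-in segmenter (assumption): one character per word. */
    static String unigrams(String run) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < run.length(); k++) {
            if (k > 0) {
                sb.append(' ');
            }
            sb.append(run.charAt(k));
        }
        return sb.toString();
    }
}
```

A call such as ChineseRunSplitter.process(text, ChineseRunSplitter::unigrams) shows the effect on the character stream. Because the output is longer than the input, token offsets no longer match the original text, which is why step 4 resets them.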

4.modify Summarizer.java

+ //reset startOffset and endOffset of tokens
+ private void resetTokenOffset(Token[] tokens,String text)
+ {
+ String text3=text.toLowerCase();
+ char[] textArray=text3.toCharArray();
+ int tokenStart=0;
+ char[] tokenArray=null;
+ int j;
+ Token preToken=new Token(" ",0,1);
+ Token curToken=new Token(" ",0,1);
+ Token nextToken=null;
+ int startSearch=0;
+ while(true){
+ tokenArray = null;
+ for (int i = startSearch; i < textArray.length; i++) {
+ if (tokenStart == tokens.length)
+ break;
+ if (tokenArray == null) {
+ tokenArray =
+ preToken = curToken;
+ curToken = tokens[tokenStart];
+ nextToken = null;
+ }
+ //deals with following situation:(common grams)
+ //text: about buaa a welcome from buaa president
+ //token sequences:about buaa buaa-a a a-welcome welcome from buaa president
+ if ((preToken.termText().charAt(0) ==
+ curToken.termText().charAt(0)) &&
+ (preToken.termText().length() < curToken.termText().length())) {
+ if (curToken.termText().startsWith(preToken.termText() + "-")) { //buaa-a starts with buaa-
+ if (tokenStart + 1 < tokens.length) {
+ nextToken = tokens[tokenStart + 1];
+ if (curToken.termText().endsWith("-" +
+ nextToken.termText())) { //meets buaa buaa-a a
+ int curTokenLength = curToken.endOffset() -
+ curToken.startOffset();
+ curToken.setEndOffset(preToken.startOffset() + curTokenLength);
+ tokenStart++;
+ tokenArray = null;
+ i = preToken.startOffset();
+ startSearch=i;//the start position in textArray for the next turn,if need.
+ continue;
+ }
+ }
+ }
+ }
+ //
+ j = 0;
+ if (textArray[i] == tokenArray[j]) {
+ if (i + tokenArray.length - 1 >= textArray.length) {
+ //do nothing?
+ } else {
+ int k = i + 1;
+ for (j = 1; j < tokenArray.length; j++) {
+ if (textArray[k++] != tokenArray[j])
+ break; //not meets
+ }
+ if (j >= tokenArray.length) { //meets
+ curToken.setStartOffset(i);
+ curToken.setEndOffset(i + tokenArray.length);
+ i = i + tokenArray.

Re: hi,how to use the ICTCLASCall

2006-03-27 Thread kauu
== tokenArray[j]) {
+ if (i + tokenArray.length - 1 >= textArray.length) {
+ //do nothing?
+ } else {
+ int k = i + 1;
+ for (j = 1; j < tokenArray.length; j++) {
+ if (textArray[k++] != tokenArray[j])
+ break; //not meets
+ }
+ if (j >= tokenArray.length) { //meets
+ curToken.setStartOffset(i);
+ curToken.setEndOffset(i + tokenArray.length);
+ i = i + tokenArray.length - 1;
+ tokenStart++;
+ startSearch=i;//the start position in textArray for the next turn,if need.
+ tokenArray = null;
+ }
+ }
+ }
+ }
+ if (tokenStart == tokens.length)
+ break; //have resetted all tokens
+ if (tokenStart < tokens.length ) { //next turn
+ curToken.setStartOffset(preToken.startOffset());
+ curToken.setEndOffset(preToken.endOffset());
+ tokenStart++; //skip this token
+ }
+ }//the end of while(true)
+ }

Under the line: Token[] tokens = getTokens(text)
in getSummary(String text, Query query), add:

+resetTokenOffset(tokens, text);

I perform Chinese word segmentation after the tokenizer and insert a space
between two Chinese words, so I need to reset every token's startOffset and
endOffset in Summarizer.java.
To do this, I added the method resetTokenOffset(Token[] tokens, String text)
to Summarizer.java, and I had to add two methods, setStartOffset(int start) and
setEndOffset(int end), to Lucene's Token.java.
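
The "cannot find symbol: method setStartOffset(int)" errors reported in this thread are what javac prints when Summarizer.java is compiled against a Token class that lacks those two setters. The sketch below shows the shape of the addition on a stand-in class. OffsetToken is hypothetical and is not Lucene's Token source; only the accessor and setter names are taken from the post.

```java
/** Stand-in for a token whose offsets can be corrected after tokenizing. */
final class OffsetToken {
    private final String termText;
    private int startOffset;
    private int endOffset;

    OffsetToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String termText() { return termText; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    // The two methods the post says must be added
    void setStartOffset(int start) { this.startOffset = start; }
    void setEndOffset(int end) { this.endOffset = end; }
}
```

With equivalent setters added to the real Token.java, the Lucene jar has to be rebuilt and placed on the Nutch build classpath before the Summarizer.java errors can go away.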

With the above four steps, Nutch can search Chinese web sites
nearly perfectly. You can try it. I just got Nutch to do it,
but my solution is less than perfect.

If Chinese word segmentation could be done in NutchAnalysis.jj
before the tokenizer, we would not need to reset the tokens' offsets in
Summarizer.java, and everything would be perfect.
But it seems too difficult to perform intelligent Chinese word
segmentation in NutchAnalysis.jj. Perhaps even impossible?

Any suggestions?

Buildfile: build.xml


[javac] Compiling 247 source files to E:\search\new\nutch-
[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Query.java:408: unreported
exception org.apache.nutch.analysis.ParseException ; must be caught or
declared to be thrown
[javac] return fixup(NutchAnalysis.parseQuery (queryString));
[javac]  ^
[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:140: cannot find
[javac] symbol  : method setStartOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac]  curToken.setStartOffset(
[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:141: cannot find
[javac] symbol  : method setEndOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac]   curToken.setEndOffset(
preToken.startOffset() + curTokenLength);
[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:164: cannot find
[javac] symbol  : method setStartOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac] curToken.setStartOffset(i);

[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:165: cannot find
[javac] symbol  : method setEndOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac] curToken.setEndOffset(i +

[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:179: cannot find
[javac] symbol  : method setStartOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac]   curToken.setStartOffset(preToken.startOffset
[javac]   ^
[javac] E:\search\new\nutch-
0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:180: cannot find
[javac] symbol  : method setEndOffset(int)
[javac] location: class org.apache.lucene.analysis.Token
[javac]   curToken.setEndOffset (preToken.endOffset());
[javac]   ^
[javac] Note: * uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 7 errors

E:\search\new\nutch-0.7.1\build.xml:70: Compile failed; see the compiler
error output for details.

Total time: 39 seconds


Re: implement thai language analyzer in nutch

2006-11-07 Thread kauu

I think you should learn JavaCC first, then study NutchAnalysis.jj;
after that, the Thai analyzer should be solvable quickly.
Just try it.

On 11/7/06, sanjeev <[EMAIL PROTECTED]> wrote:


After playing around with Nutch for a few months I was trying to implement
the Thai language analyzer for Nutch.

Downloaded the subversion version and compiled using ant - everything

Next - I didn't see any tutorial for thai - but i did see one for chinese


Tried following the same steps outlined above but ran into compiler errors
...type mismatch

between lucene Token and nutch Token.

Suffice to say I am back at square one as far as trying to implement the
thai language analyzer for nutch.

Can someone please outline for me the exact procedure for this ? Or point
to a tutorial which explains how to ?

Would be highly obliged.



why can't build in the Linux with ant

2006-11-08 Thread kauu

Hi,

I have a problem: I can't build Nutch on Linux with ant.
My ant version is:

Apache Ant version 1.5.2-20 compiled on September 25 2003

The error is below.
Has anyone run into the same problem? I need your help.

Buildfile: build.xml

file:/nutch/nutch-0.7.2/build.xml:20: Unexpected element "dirname"

Total time: 1 second


How to start working with MapReduce?

2006-11-09 Thread kauu

Does anyone know the details of the process described under the topic "how to start working
with MapReduce?"

I've read something in the FAQ, but I don't understand it very well. My
version is 0.7.2, not 0.8.x.

Re: How to start working with MapReduce?

2006-11-09 Thread kauu

Or is it the same as in version 0.8.x?
Any idea is appreciated.

On 11/9/06, kauu <[EMAIL PROTECTED]> wrote:

anyone kown the detail of the process with the topic "how to start working
with MapReduce?"

i'v read something in the FAQ ,but i don't understand it very well , my
version is 0.7.2, not 0.8x


Re: Question on adaptive re-fetch plugin

2006-11-23 Thread kauu

Yes, I'm on your side.

On 11/23/06, Scott Green <[EMAIL PROTECTED]> wrote:


NUTCH-61 (http://issues.apache.org/jira/browse/NUTCH-61) is about the
adaptive re-fetch plugin, and Jerome Charron commented: "Why not
making FetchSchedule a new ExtensionPoint and then
DefaultFetchSchedule and AdaptiveFetchSchedule some fetch schedule
plugins?" I am for it. Maintaining an unofficial Nutch source tree is
painful for me. So why not provide another plugin and test it? When it
is stable enough, we can merge them, right?

- Scott


Re: hi all:

2006-12-09 Thread kauu

Thanks very much, I'll try it.

On 12/9/06, Sami Siren <[EMAIL PROTECTED]> wrote:

吴志敏 wrote:
>  I want to dump the stored segments to an XML file, but after reading
> SegmentReader.java I find that it is not a simple thing:
> it runs a Hadoop job to dump a text file. I just want to dump the
> parts of the segments' content that interest me to XML.
> Can someone tell me how to do this? Any reply will be appreciated!

Segment data is basically just a bunch of files containing
key->value pairs, so there's always the possibility of reading the data
directly with help of:


To see what kind of object to expect you can just examine the beginning
of file where there is some metadata stored - like class used for key
and class used for value (that metadata is also available from methods
of SequenceFile.Reader class).

For example to read the contents of Content data from a segment one
could use something like:

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);

Text url = new Text();  //key
Content content = new Content();//value
while (reader.next(url, content)) {
  //now just use url and content the way you like
}

Sami Siren
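
Once the key/value pairs have been read as described above, writing the interesting parts to XML needs only the JDK. The following is a minimal sketch under stated assumptions: the records are taken to be already collected into a Map of URL to text (the Hadoop reading loop is not shown), and the class name SegmentXmlDump and the element names segment and record are hypothetical.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class SegmentXmlDump {
    /** Serializes url -> text pairs as <segment><record url="...">text</record>...</segment>. */
    static String toXml(Map<String, String> records) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("segment");
            for (Map.Entry<String, String> e : records.entrySet()) {
                xml.writeStartElement("record");
                xml.writeAttribute("url", e.getKey());   // attribute value is escaped by the writer
                xml.writeCharacters(e.getValue());       // text content is escaped by the writer
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException ex) {
            throw new IllegalStateException("could not write XML", ex);
        }
        return out.toString();
    }
}
```

Using a stream writer rather than string concatenation keeps characters such as & and < in URLs and page text from producing malformed XML.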


parse-rss test problem

2007-01-25 Thread kauu
I can't test my parse-rss plugin in nutch-0.8.1.


I can't even test the default "rsstest.rss" file.


2007-01-25 17:04:34,703 INFO  conf.Configuration
(Configuration.java:getConfResourceAsInputStream(340)) - found resource
parse-plugins.xml at file:/E:/work/digibot_news/build_tt/parse-plugins.xml

2007-01-25 17:04:35,328 WARN  parse.rss (?:invoke0(?)) -
java.lang.NoClassDefFoundError: org/jdom/Parent

2007-01-25 17:04:35,328 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
java.lang.reflect.Method.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke0(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,406 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - Caused by:
java.lang.NoClassDefFoundError: org/jdom/Parent

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - ... 22 more

2007-01-25 17:04:35,421 WARN  parse.rss (RSSParser.java:getParse(100)) -
nutch:parse-rss:RSSParser Exception: java.lang.NoClassDefFoundError:

2007-01-25 17:04:35,437 WARN  parse.ParseUtil
(ParseUtil.java:parseByExtensionId(138)) - Unable to successfully parse
content file:/E:/work/digibot_news/rsstest.rss of type 
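
A NoClassDefFoundError for org/jdom/Parent means the JVM could not load that class when the parser needed it. A likely cause, stated here as an assumption and not confirmed in this thread, is that the JDOM jar visible to the plugin is missing or is a version that does not contain the class. The sketch below is a small probe for checking whether a class name is loadable; ClasspathProbe is a hypothetical helper, not part of Nutch, and it assumes Java 7 or later.

```java
final class ClasspathProbe {
    /** Returns true if the named class can be found by this class's loader. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("org.jdom.Parent loadable: " + isLoadable("org.jdom.Parent"));
    }
}
```

Nutch plugins are loaded through their own class loaders, so a result from the application classpath is only a hint; the jar also has to be listed among the plugin's libraries.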


Re: Fetcher2

2007-01-25 Thread kauu

Please give us the URL, thanks.

On 1/25/07, chee wu <[EMAIL PROTECTED]> wrote:

Just appended the portion for .81  to NUTCH-339

- Original Message -
From: "Armel T. Nene" <[EMAIL PROTECTED]>
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2

> Chee,
> Can you make the code available through Jira.
> Thanks,
> Armel
> -
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483
> http://blog.idna-solutions.com
> -Original Message-
> From: chee wu [mailto:[EMAIL PROTECTED]
> Sent: 24 January 2007 03:59
> To: nutch-dev@lucene.apache.org
> Subject: Re: Fetcher2
> Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy...
> I can share the code if anyone wants to use it.
> - Original Message -
> From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
> To: 
> Sent: Tuesday, January 23, 2007 12:09 AM
> Subject: Re: Fetcher2
>> chee wu wrote:
>>> Fetcher2 should be a great help for me,but seems can't integrate with
> Nutch81.
>>> Any advice on how to use it based on .81?
>> You would have to port it to Nutch 0.8.1 - e.g. change all Text
>> occurrences to UTF8, and most likely make other changes too ...
>> --
>> Best regards,
>> Andrzej Bialecki <><
>> ___. ___ ___ ___ _ _   __
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com


parse-rss make them items as different pages

2007-01-25 Thread kauu

I want to crawl RSS feeds, parse them, index them, and finally,
when searching the content, have each hit look like an individual
page.

I don't know whether I have explained this clearly. For example:

   欧洲暴风雪后发制人 致航班延误交通混乱(组图)
   Thu, 25 Jan 2007 11:29:11 +0800

This is one item in an RSS file.

I want Nutch to treat an item like an individual page,

so that when I search for something in this item, Nutch returns it as a hit.

Can anyone tell me how to do this?
Any reply will be appreciated.
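
The request above amounts to turning one fetched feed into one record per item. The following is a minimal sketch of that split using only the JDK's DOM parser. It is not the parse-rss plugin: the class RssItemSplitter and its Item type are hypothetical, the feed is assumed to be RSS 2.0 style with item, title, link and description elements, and it assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class RssItemSplitter {
    /** One feed item, treated as its own "page". */
    static final class Item {
        final String link;
        final String title;
        final String description;

        Item(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    /** Parses the feed text and returns one Item per <item> element. */
    static List<Item> split(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("item");
            List<Item> items = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                items.add(new Item(text(e, "link"), text(e, "title"), text(e, "description")));
            }
            return items;
        } catch (Exception ex) {
            throw new IllegalArgumentException("feed is not well-formed XML", ex);
        }
    }

    /** Text of the first descendant element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }
}
```

Each returned Item carries its own link, which could serve as the per-item URL when the items are indexed separately.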


Re: parse-rss make them items as different pages

2007-01-26 Thread kauu

That's exactly it.

I think we should do something when Nutch fetches a page successfully:
check whether it is an RSS feed, then create as many pages as there are items. I don't know
whether that would work.
On the other hand, we could do something in the segment, just as you say.

I don't know whether we can write a plugin to get this functionality.

Can anyone give me a hint?

On 1/26/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:

Hi Kauu,

The functionality you require doesn't exist in the current parse-rss
plugin. I need the same functionality but it doesn't exist and I believe
it's not a simple task.

The functionality required basically is to create a page in a segment for
each item and the URL to the crawldb.

Since the data already exists in the item element there is no reason to
"fetch" the page (item). After that the only thing left is to index it.

Any thoughts on how to achieve that goal?


-Original Message-
From: kauu [mailto:[EMAIL PROTECTED]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss make them items as different pages

i want to crawl the rss feeds and parse them ,then index them and at last
when search the content I just want that the hit just like an individual

i don't know wether i tell u clearly.

欧洲暴风雪后发制人 致航班延误交通混乱(组图)

Thu, 25 Jan 2007 11:29:11 +0800

this one item in an rss file

i want nutch deal with an item like an individual page.

so i search something in this item,the nutch return it as a hit.

so ...
any one can tell me how to do about ?
any reply will be appreciated



Re: parse-rss make them items as different pages

2007-01-26 Thread kauu

Can anyone tell me where and how a Nutch document is built in nutch-0.8.1?

For example, one HTML page is one document, but I want to split a document
into several.

On 1/27/07, kauu <[EMAIL PROTECTED]> wrote:

that's the right thing.

i think we should to do some thing when nutch fetch a page successfully,
judge if a rss then create as many pages as the items'  number.i  don't
know whether it work.
In the other hand , we can do some thing in the segment just like what u
say .

i don't know that whether we can write a plugin to get the functionality.

anyone who can give me some hint?

On 1/26/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Kauu,
> The functionality you require doesn't exist in the current parse-rss
> plugin. I need the same functionality but it doesn't exist and I believe
> it's not a simple task.
> The functionality required basically is to create a page in a segment
> for each item and the URL to the crawldb.
> Since the data already exists in the item element there is no reason to
> "fetch" the page (item). After that the only thing left is to index it.
> Any thoughts on how to achieve that goal?
> Gal.
> -Original Message-
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 26, 2007 4:17 AM
> To: nutch-dev@lucene.apache.org
> Subject: parse-rss make them items as different pages
> i want to crawl the rss feeds and parse them ,then index them and at
> last
> when search the content I just want that the hit just like an individual
> page.
> i don't know wether i tell u clearly.
> 欧洲暴风雪后发制人 致航班延误交通混乱(组图)
> 暴风雪横扫欧洲,导致多次航班延误
> 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。
> 据报道,迟来的暴风雪连续两天横扫中...
> http://news.sohu.com/20070125/n247833568.shtml 
> 搜狐焦点图新闻
> Thu, 25 Jan 2007 11:29:11 +0800
> http://comment.news.sohu.com/comment/topic.jsp?id=247833847
> this one item in an rss file
> i want nutch deal with an item like an individual page.
> so i search something in this item,the nutch return it as a hit.
> so ...
> any one can tell me how to do about ?
> any reply will be appreciated
> --
> www.babatu.com



Re: parse-rss make them items as different pages

2007-01-26 Thread kauu

That's right, but to put it another way, I just need to index the exact
information in a page. Real-world pages contain a lot of
spam, so I just want to index the description.

On 1/27/07, sishen <[EMAIL PROTECTED]> wrote:

On 1/26/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Kauu,
> The functionality you require doesn't exist in the current parse-rss
> plugin. I need the same functionality but it doesn't exist and I believe
> it's not a simple task.
> The functionality required basically is to create a page in a segment
> each item and the URL to the crawldb.
> Since the data already exists in the item element there is no reason to
> "fetch" the page (item). After that the only thing left is to index it.

I don't think so. The data in the description is not complete, so fetching
the page through the link is still needed.

Any thoughts on how to achieve that goal?
> Gal.
> -Original Message-
> From: kauu [mailto:[EMAIL PROTECTED]
> Sent: Friday, January 26, 2007 4:17 AM
> To: nutch-dev@lucene.apache.org
> Subject: parse-rss make them items as different pages
> i want to crawl the rss feeds and parse them ,then index them and at
> when search the content I just want that the hit just like an individual
> page.
> i don't know wether i tell u clearly.
> 欧洲暴风雪后发制人 致航班延误交通混乱(组图)
> 暴风雪横扫欧洲,导致多次航班延误
> 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。
> 据报道,迟来的暴风雪连续两天横扫中...
> http://news.sohu.com/20070125/n247833568.shtml
> 搜狐焦点图新闻
> Thu, 25 Jan 2007 11:29:11 +0800
> this one item in an rss file
> i want nutch deal with an item like an individual page.
> so i search something in this item,the nutch return it as a hit.
> so ...
> any one can tell me how to do about ?
> any reply will be appreciated
> --
> www.babatu.com


Re: parse-rss make them items as different pages

2007-01-28 Thread kauu

It's a great idea, I think.
We can't have more than one document in the index with the same URL, because the unique
key is the URL.
The only problem is how to write a separate protocol for RSS.

On 1/28/07, Alan Tanaman <[EMAIL PROTECTED]> wrote:

This is a problem that we have encountered too (although in a different
context than RSS).  The problem is that the "unique key" is the URL - you
cannot have more than one document in the index with the same URL.

The way around this might be to have a separate protocol (instead of the
usual http one) that will be used only for RSS feeds, and which will append
a sequential number to the real URL (passing say 10 identical copies of
each page to the parse-rss). The parse-rss would need to extract only the
nth news item from the whole page.
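
The unique-key problem and the numbering idea can be illustrated with a plain map standing in for the index. This is a minimal sketch, not Nutch code: the class ItemKeys and the "#item-n" suffix are hypothetical, and a real URL normalizer might strip or rewrite such a suffix.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ItemKeys {
    /** Hypothetical key scheme: the feed URL plus a sequential item number. */
    static String key(String feedUrl, int n) {
        return feedUrl + "#item-" + n;
    }

    /**
     * Stores each item text under a key. With numbered=false every item uses
     * the bare feed URL, so later items overwrite earlier ones.
     */
    static Map<String, String> index(String feedUrl, List<String> itemTexts, boolean numbered) {
        Map<String, String> index = new LinkedHashMap<>();
        for (int n = 0; n < itemTexts.size(); n++) {
            String k = numbered ? key(feedUrl, n) : feedUrl;
            index.put(k, itemTexts.get(n)); // an existing entry with the same key is replaced
        }
        return index;
    }
}
```

With three items, the unnumbered variant ends up holding one entry and the numbered variant three, which is the behaviour the proposal relies on.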

Any comments?

Best regards,
Alan Tanaman
iDNA Solutions




why can't test the parse-xml plugin

2007-01-28 Thread kauu
2007-01-29 15:48:49,844 INFO  conf.Configuration
(Configuration.java:loadResource(397)) - parsing

2007-01-29 15:48:50,079 INFO  conf.Configuration
(Configuration.java:loadResource(397)) - parsing

2007-01-29 15:48:50,173 INFO  conf.Configuration
(Configuration.java:loadResource(397)) - parsing

2007-01-29 15:48:50,204 INFO  conf.Configuration
(Configuration.java:loadResource(397)) - parsing

2007-01-29 15:48:50,219 INFO  plugin.PluginRepository (PluginManifestParser.
java:parsePluginFolder(81)) - Plugins: looking in:

2007-01-29 15:48:50,641 WARN  plugin.PluginRepository (PluginManifestParser.
java:parsePluginFolder(102)) - java.io.FileNotFoundException:
E:\work\digibot_news\plugins\parse-xml\plugin.xml (系统找不到指定的文件。)

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(333)) - Plugin Auto-activation mode:

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(334)) - Registered Plugins:

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   CyberNeko HTML Parser

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Site Query Filter

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Html Parse Plug-in

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Jakarta Commons HTTP Client

2007-01-29 15:48:50,907 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Regex URL Filter Framework

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Basic Indexing Filter

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Basic Summarizer Plug-in

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   File Protocol Plug-in

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Text Parse Plug-in

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   JavaScript Parser (parse-js)

2007-01-29 15:48:50,923 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Regex URL Filter

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Basic Query Filter

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   XML Libraries (lib-xml)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   HTTP Framework (lib-http)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   URL Query Filter (query-url)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Log4j (lib-log4j)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Zip Parse Plug-in (parse-zip)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   Http Protocol Plug-in

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   RSS Parse Plug-in (parse-rss)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   the nutch core extension
points (nutch-extensionpoints)

2007-01-29 15:48:50,938 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(341)) -   OPIC Scoring Plug-in

2007-01-29 15:48:50,954 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(345)) - Registered Extension-Points:

2007-01-29 15:48:50,954 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(352)) -   Nutch Summarizer (org.apache.

2007-01-29 15:48:50,954 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(352)) -   Nutch Scoring

2007-01-29 15:48:50,954 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(352)) -   Nutch Protocol

2007-01-29 15:48:50,954 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(352)) -   Nutch URL Filter (org.apache.

2007-01-29 15:48:51,032 INFO  plugin.PluginRepository
(PluginRepository.java:displayStatus(352)) -   HTML Parse Filter

RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu
Hi folks,

   What I want to do is split an RSS file into several pages.

  As discussed before, I want to fetch an RSS page and index it as several
separate documents, so that a search returns each item's info as an
individual hit.

  My idea is to create a protocol plugin that fetches the RSS page and stores
it as several pages, each containing just one ITEM tag. But the unique key is
the URL, so how can I store them with each ITEM's link tag as the unique key
for a document?

  So my question is how to implement this in nutch-0.8.x.

  I've checked the code of the protocol-http plugin, but I can't find where a
page is stored as a document. I want to split the RSS page before it is
stored, so that it ends up as several documents rather than one.

  Can anyone give me some hints?

Any reply will be appreciated!



  ITEM’s structure 


欧洲暴风雪后发制人 致航班延误交通混乱(组图)

暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...




Thu, 25 Jan 2007 11:29:11 +0800
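
The splitting step asked about above can be sketched outside of Nutch. The following is a minimal illustration using only the JDK's DOM parser; the class name RssItemSplitter, its Item type and the example.com URLs are hypothetical and are not part of Nutch or of the parse-rss plugin. It only shows the idea of turning one feed into several records keyed by each item's link; where such records would be stored inside Nutch is not covered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: splits one RSS document into
// per-item records keyed by the item's <link>.
class RssItemSplitter {

    // One record per <item>; the fields mirror the hit described in the thread.
    static final class Item {
        final String title;
        final String description;
        final String link;

        Item(String title, String description, String link) {
            this.title = title;
            this.description = description;
            this.link = link;
        }
    }

    // Returns the items keyed by link, in feed order. Items without a link are skipped.
    static Map<String, Item> split(String rssXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse feed", e);
        }
        Map<String, Item> byLink = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = childText(item, "link");
            if (link.isEmpty()) {
                continue; // nothing usable as a unique key
            }
            byLink.put(link, new Item(childText(item, "title"),
                                      childText(item, "description"), link));
        }
        return byLink;
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>demo</title>"
                + "<item><title>First</title><description>one</description>"
                + "<link>http://example.com/a</link></item>"
                + "<item><title>Second</title><description>two</description>"
                + "<link>http://example.com/b</link></item>"
                + "</channel></rss>";
        for (Item it : split(rss).values()) {
            System.out.println(it.link + " | " + it.title + " | " + it.description);
        }
    }
}
```

Keying the map by link means two items with the same link collapse into one record, which matches the "link as unique key" requirement but would silently drop duplicates.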



Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu

Thanks for your reply.
Maybe I didn't explain it clearly.
I want to index each item as an individual page, so that when I search for
something, for example "nutch-open source", Nutch returns a hit that contains:

  title : nutch-open source
  description : nutch nutch nutch nutch  nutch
  url : http://lucene.apache.org/nutch
  category : news
  author  : kauu

So, can the parse-rss plugin satisfy what I need?

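   <item>
     <title>nutch--open source</title>
     <description>nutch nutch nutch nutch  nutch</description>
     <link>http://lucene.apache.org/nutch</link>
     <category>news</category>
     <author>kauu</author>
   </item>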

On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:

Hi there,

I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin?

The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?


On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:

> Hi folks :
>    What's I want to do is to separate a rss file into several pages .
>   Just as what has been discussed before. I want fetch a rss page and index
> it as different documents in the index. So the searcher can search the
> Item's info as a individual hit.
>  What's my opinion create a protocol for fetch the rss page and store it as
> several one which just contain one ITEM tag .but the unique key is the url ,
> so how can I store them with the ITEM's link tag as the unique key for a
> document.
>   So my question is how to realize this function in nutch-.0.8.x.
>   I've check the code of the plug-in protocol-http's code ,but I can't
> find the code where to store a page to a document. I want to separate the
> rss page to several ones before storing it as a document but several ones.
>   So any one can give me some hints?
> Any reply will be appreciated !
>   ITEM's structure
> 欧洲暴风雪后发制人 致航班延误交通混乱(组图)
> 暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
> 国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
> 清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...
> http://news.sohu.com/20070125/n247833568.shtml
> 搜狐焦点图新闻
> Thu, 25 Jan 2007 11:29:11 +0800
> http://comment.news.sohu.com/comment/topic.jsp?id=247833847


log4j problem

2007-01-30 Thread kauu
Why do I get an error after changing nutch/conf/log4j.properties?

I changed only the first line, from

  log4j.rootLogger=info,drfa to log4j.rootLogger=debug,drfa

Like this:

***  **

# RootLogger - DailyRollingFileAppender

# Logging Threshold

#special logging requirements for some commandline tools

***  **

The console then shows the error below:




log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified.)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(Unknown Source)
at java.io.FileOutputStream.<init>(Unknown Source)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)

at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
at org.apache.log4j.Logger.getLogger(Logger.java:104)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

Re: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread kauu

Hi,
Thanks anyway, but I don't think I explained it clearly enough.

What I want is for Nutch to fetch only the RSS seeds, to a depth of 1, so
Nutch should fetch just some XML pages. I don't want to fetch the pages
behind the items' outlinks, because there is too much spam in those pages.
So I just need to parse the RSS file.
Then when I search for words that appear in the description tag of one of
the XML's items, the returned hit should look like this:
title == one item's title
summary == one item's description
link == one item's outlink

So I don't know whether the parse-rss plugin provides this function?

On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:

Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do, is to allow you to associate the metadata fields
category:, and author: with the item Outlink...


On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:

> thx for ur reply .
> mybe i didn't tell clearly .
> I want to index the item as a individual page .then when i search the some
> thing for example "nutch-open source", the nutch return a hit which contain
>
>    title : nutch-open source
>    description : nutch nutch nutch nutch  nutch
>    url : http://lucene.apache.org/nutch
>    category : news
>    author  : kauu
>
> so , is the plugin parse-rss can satisfy what i need?
>
>    nutch--open source
>    nutch nutch nutch nutch  nutch
>    http://lucene.apache.org/nutch
>    news
>    kauu




Re: log4j problem

2007-01-31 Thread kauu

Sorry, I will be careful. Thanks anyway.

On 1/31/07, chee wu <[EMAIL PROTECTED]> wrote:

Setting the two Java arguments "-Dhadoop.log.file" and "-Dhadoop.log.dir"
should fix your problem.
By the way, please don't put so many Chinese characters in your mail.
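
One possible reading of why those two arguments matter, sketched in plain Java. It assumes the file appender's path in log4j.properties is built from the two properties as ${hadoop.log.dir}/${hadoop.log.file}; that line is not visible in the mail, so treat the template as an assumption. The class LogPathDemo and its resolve method are hypothetical and only imitate ${name} substitution with an empty default; they are not log4j code.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not log4j code: imitates ${name} substitution
// in a properties value, using "" for any property that is not set.
class LogPathDemo {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} in the template with its value from props, or "" if unset.
    static String resolve(String template, Properties props) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1), "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed appender setting; the actual line is not shown in the mail.
        String template = "${hadoop.log.dir}/${hadoop.log.file}";

        // Neither property set: only the separator is left.
        System.out.println(resolve(template, new Properties())); // prints "/"

        // Both properties set (illustrative values).
        Properties set = new Properties();
        set.setProperty("hadoop.log.dir", "logs");
        set.setProperty("hadoop.log.file", "hadoop.log");
        System.out.println(resolve(template, set)); // prints "logs/hadoop.log"
    }
}
```

Under that assumption, an unset pair leaves a bare path separator as the file name, which would be consistent with the FileNotFoundException on "\" reported in the original mail. Passing something like -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log on the java command line (values illustrative) would give the appender a real path.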

- Original Message -
From: "kauu" <[EMAIL PROTECTED]>
Sent: Wednesday, January 31, 2007 1:45 PM
Subject: log4j problem

why when I changed the nutch/conf/log4j.properties

I just changed the first line

  Log4j.rootLogger=info,drfa to log4j.rootLogger=debug,drfa

Like this:

***  **

# RootLogger - DailyRollingFileAppender

# Logging Threshold

#special logging requirements for some commandline tools

*  **

*  In the console ,it show me the error like below

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified.)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(Unknown Source)
at java.io.FileOutputStream.<init>(Unknown Source)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)

at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
at org.apache.log4j.Logger.getLogger(Logger.java:104)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)


Re: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread kauu

Hi all,
 What Gal said is exactly what I need from RSS parsing.
 I just want to fetch the RSS seeds once.

On 2/2/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current

1. The RSS (channels, items) page ends up as one Lucene document in the
2. Indeed the links are extracted and each <item> link will be fetched in
the next fetch as a separate page and will end up as one Lucene document.

IMHO the data that is needed i.e. the data that will be fetched in the
next fetch process is already available in the <item> element. Each <item>
element represents one web resource. And there is no reason to go to the
server and re-fetch that resource.

Another issue that arises from rss feeds is that once the feed page is
fetched you can not re-fetch it until its "time to fetch" expired. The feeds
TTL is usually very short. Since for now in Nutch, all pages created equal
:) it is one more thing to think about.



-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this

Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse "it" in the next fetch phase. Well, there are 2 options here for
what you refer to as "it":

1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)


On 1/31/07 10:40 PM, "Gal Nitzan" <[EMAIL PROTECTED]> wrote:

> Hi,
> Many sites provide RSS feeds for several reasons, usually to save
> to give the users concentrated data and so forth.
> Some of the RSS files supplied by sites are created specially for search
> engines where each RSS "item" represent a web page in the site.
> IMHO the only thing "missing" in the parse-rss plugin is storing the data in
> the CrawlDatum and "parsing" it in the next fetch phase. Maybe adding a
> flag to CrawlDatum, that would flag the URL as "parsable" not
> Just my two cents...
> Gal.
> -Original Message-
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 31, 2007 8:44 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this
> Hi there,
>   With the explanation that you give below, it seems like parse-rss as
> exists would address what you are trying to do. parse-rss parses an RSS
> channel as a set of items, and indexes overall metadata about the RSS
> including parse text, and index data, but it also adds each item (in the
> channel)'s URL as an Outlink, so that Nutch will process those pieces of
> content as well. The only thing that you suggest below that parse-rss
> currently doesn't do, is to allow you to associate the metadata fields
> category:, and author: with the item Outlink...
> Cheers,
>   Chris
> On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:
>> thx for ur reply .
> mybe i didn't tell clearly .
>  I want to index the item as a
>> individual page .then when i search the some
> thing for example "nutch-open
>> source", the nutch return a hit which contain
>title : nutc

Re: RSS-fecter and index individul-how can i realize this function

2007-02-04 Thread kauu

I've changed the code as you said, but I get an exception like the one below.
Why? Why is it an exception from the MD5Signature class?

2007-02-05 11:28:38,453 WARN  feedparser.FeedFilter (
FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO  crawl.SignatureFactory (
SignatureFactory.java:getSignature(45)) - Using Signature impl:
2007-02-05 11:28:40,078 WARN  mapred.LocalJobRunner
- job_f6j55m
   at org.apache.nutch.parse.ParseOutputFormat$1.write(
   at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(
   at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
   at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java

On 2/3/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:

Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along
the fetches, so that when Nutch fetches a feed entry, its  value
previously fetched is available.

This is how I tackled the issue:
- extend Outlinks.java and allow to create outlinks with more meta data.
So, in your feed parser, use this way to create outlinks
- pass on the metadata through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious, will blow the size of your outlinks db, makes
changes in the core code of Nutch, etc... But this is the only way I
came up with...
If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x :

+  public Outlink(String toUrl, String anchor, String entryContents,
Configuration conf) throws MalformedURLException {
+  this.toUrl = new
+  this.anchor = anchor;
+  this.entryContents= entryContents;
+  }
and update the other methods

ParseOutputFormat.java, around lines 140
+// set outlink info in metadata ME
+String entryContents= links[i].getEntryContents();
+if (entryContents.length() > 0) { // it's a feed entry
+MapWritable meta = new MapWritable();
+meta.put(new UTF8("entryContents"), new
+target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
+} else {
+target = new CrawlDatum(CrawlDatum.STATUS_LINKED,
interval); // no meta

Fetcher.java, around l. 266
+  // add feed info to metadata
+  try {
+  String entryContents = datum.getMetaData().get(new
+  metadata.set("entryContents", entryContents);
+  } catch (Exception e) { } //not found

// get entry metadata
String entryContents = content.getMetadata().get("entryContents");
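
The idea behind the fragments above, carrying a feed entry's text along with its outlink so that it is available when the entry URL is fetched later, can be sketched without Nutch. Everything in the following block is a hypothetical stand-in (FeedMetadataDemo, FeedOutlink, LinkRecord and the example.com URLs are not Nutch classes or real data); it only mirrors the shape of the approach: attach metadata when the outlink is recorded, and read it back in the next round.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; none of these are Nutch classes.
class FeedMetadataDemo {

    // A link found while parsing a feed, with the entry text found next to it.
    static final class FeedOutlink {
        final String toUrl;
        final String anchor;
        final String entryContents;

        FeedOutlink(String toUrl, String anchor, String entryContents) {
            this.toUrl = toUrl;
            this.anchor = anchor;
            this.entryContents = entryContents == null ? "" : entryContents;
        }
    }

    // What is remembered about a URL between two fetch rounds.
    static final class LinkRecord {
        final Map<String, String> meta = new HashMap<>();
    }

    // Round 1: record every outlink; only feed entries get the extra metadata.
    static Map<String, LinkRecord> recordOutlinks(List<FeedOutlink> links) {
        Map<String, LinkRecord> db = new HashMap<>();
        for (FeedOutlink link : links) {
            LinkRecord record = new LinkRecord();
            if (!link.entryContents.isEmpty()) {
                record.meta.put("entryContents", link.entryContents);
            }
            db.put(link.toUrl, record);
        }
        return db;
    }

    // Round 2: when a URL is fetched, look up what the feed said about it.
    static String entryContentsFor(String url, Map<String, LinkRecord> db) {
        LinkRecord record = db.get(url);
        return record == null ? "" : record.meta.getOrDefault("entryContents", "");
    }

    public static void main(String[] args) {
        List<FeedOutlink> links = new ArrayList<>();
        links.add(new FeedOutlink("http://example.com/story", "Story", "text from the feed"));
        links.add(new FeedOutlink("http://example.com/about", "About", ""));

        Map<String, LinkRecord> db = recordOutlinks(links);
        System.out.println(entryContentsFor("http://example.com/story", db));
        System.out.println(entryContentsFor("http://example.com/about", db).isEmpty());
    }
}
```

As noted in the mail, the cost of this design is that every recorded link may carry a copy of the entry text, so the stored link data grows with the size of the feeds.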


Gal Nitzan wrote:
> Hi Chris,
> I'm sorry I wasn't clear enough. What I mean is that in the current
> 1. The RSS (channels, items) page ends up as one Lucene document in the
> 2. Indeed the links are extracted and each  link will be fetched
in the next fetch as a separate page and will end up as one Lucene document.
> IMHO the data that is needed i.e. the data that will be fetched in the
next fetch process is already available in the  element. Each 
element represents one web resource. And there is no reason to go to the
server and re-fetch that resource.
> Another issue that arises from rss feeds is that once the feed page is
fetched you can not re-fetch it until its "time to fetch" expired. The feeds
TTL is usually very short. Since for now in Nutch, all pages created equal
:) it is one more thing to think about.
> HTH,
> Gal.
> -Original Message-
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 01, 2007 7:01 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this
> Hi Gal, et al.,
>   I'd like to be explicit when we talk about what the issue with the RSS
> parsing plugin is here; I think we have had conversations similar to
> before and it seems that we keep talking around each other. I'd like to
> to the heart of this matter so that the issue (if there is an actual
> gets addressed ;)
>   Okay, so you mention below that the thing that you see missing from
> current RSS parsing plugin is the ability to store data in the
> and parse "it" in the next fetch phase. Well, t