Re: hi all:
吴志敏 wrote:
> I want to dump the stored segments to an XML file, but when I read
> SegmentReader.java, I found that it's not a simple thing.
>
> It's a Hadoop job that dumps a text file. I just want to dump the
> parts of the segments' content that I'm interested in to XML.
>
> Can someone tell me how to do this? Any reply will be appreciated!

Segment data is basically just a bunch of files containing key->value
pairs, so there's always the possibility of reading the data directly
with the help of:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html

To see what kind of objects to expect, you can examine the beginning of
the file, where some metadata is stored, such as the class used for the
key and the class used for the value (that metadata is also available
from methods of the SequenceFile.Reader class).

For example, to read the Content data from a segment, one could use
something like:

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text url = new Text();           // key
    Content content = new Content(); // value
    while (reader.next(url, content)) {
      // now just use url and content the way you like
    }

--
 Sami Siren
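For the XML half of the original question, a minimal sketch follows. It is only an illustration, not Nutch or Hadoop code: the class name `SegmentXmlDump`, the element names `segment`, `record`, `url` and `content`, and the `Map<String, String>` standing in for the url/content pairs produced by the `reader.next(url, content)` loop are all assumptions made for this example. It relies on the JDK's StAX `XMLStreamWriter`, which escapes markup characters in text content.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper for illustration; not part of Nutch or Hadoop.
final class SegmentXmlDump {

  /** Serializes url -> content pairs, one <record> element per pair. */
  static String toXml(Map<String, String> records) {
    StringWriter out = new StringWriter();
    try {
      XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartDocument();
      xml.writeStartElement("segment");      // assumed root element name
      for (Map.Entry<String, String> e : records.entrySet()) {
        xml.writeStartElement("record");
        xml.writeStartElement("url");
        xml.writeCharacters(e.getKey());     // text is escaped by the writer
        xml.writeEndElement();
        xml.writeStartElement("content");
        xml.writeCharacters(e.getValue());
        xml.writeEndElement();
        xml.writeEndElement();               // closes <record>
      }
      xml.writeEndElement();                 // closes <segment>
      xml.writeEndDocument();
      xml.close();
    } catch (XMLStreamException ex) {
      throw new IllegalStateException("could not write XML", ex);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Stand-in data; in practice the pairs would come from the segment files.
    Map<String, String> records = new LinkedHashMap<>();
    records.put("http://example.com/", "a < b & c");
    System.out.println(toXml(records));
  }
}
```

In the loop from the message above, the same write calls could be issued once per `reader.next(...)` iteration instead of collecting the pairs in a map first, which avoids holding a whole segment in memory.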
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami,

On 12/9/06 2:27 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Author: siren
> Date: Sat Dec 9 14:27:07 2006
> New Revision: 485076
>
> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
> Log:
> Optimize SpellCheckedMetadata further by taking into account the fact that it
> is used only for http-headers.
>
> I am starting to believe that spellchecking should just be an utility method
> used by http protocol plugins.

I think that right now I'm -1 for this change. I would make note of all
the comments on NUTCH-139, from which this code was born. In the end, I
think what we all realized was that the spell checking capability is
necessary, but not everywhere, as you point out. However, I don't think
it's limited entirely to HTTP headers (which is what you've currently
changed the code to). I think it should be implemented as a protocol
layer service, also providing spell checking support to other protocol
plugins, like protocol-file, etc., where field headers run the risk of
being misspelled as well. What's to stop someone from implementing
protocol-file++ that returns different file header keys than those of
protocol-file? Just because HTTP is the most pervasively used plugin
right now, I think it's convenient to assume that only HTTP protocol
field keys may need spell checking services.

Just my 2 cents...

Cheers,
 Chris
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Chris Mattmann wrote:
> Hi Sami,
>
> On 12/9/06 2:27 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>
>> Author: siren
>> Date: Sat Dec 9 14:27:07 2006
>> New Revision: 485076
>>
>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>> Log:
>> Optimize SpellCheckedMetadata further by taking into account the fact that it
>> is used only for http-headers.
>>
>> I am starting to believe that spellchecking should just be an utility method
>> used by http protocol plugins.
>
> I think that right now I'm -1 for this change. I would make note of all the
> comments on NUTCH-139, from which this code was born. In the end, I think
> what we all realized was that the spell checking capability is necessary,
> but not everywhere, as you point out. However, I don't think it's limited
> entirely to HTTP headers (which is what you've currently changed the code
> to). I think it should be implemented as a protocol layer service, also
> providing spell checking support to other protocol plugins, like
> protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code,
so if there's a spelling mistake there, we should fix the code
generating the headers and not rely on spellchecking in the first place.

> where field headers run the risk of being misspelled as well. What's to stop
> someone from implementing protocol-file++ that returns different file header
> keys than those of protocol-file? Just because HTTP is the most pervasively
> used plugin right now, I think it's convenient to assume that only HTTP
> protocol field keys may need spell checking services.

If there's a real need for spell checking on other keys, one can just
add more classes to the array; no big deal.

--
 Sami Siren
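To make "add more classes to the array" concrete, here is a minimal sketch of the general idea: an index of canonical header names built by reflecting over the String constants of a list of interfaces. It is not the actual SpellCheckedMetadata code. The interfaces `DemoHttpHeaders` and `DemoFileHeaders`, the `File-Owner` key, and the class `HeaderNameIndex` are invented for this example, and the sketch only matches names that are identical after normalization; it does not attempt any edit-distance matching.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for header-name interfaces; not the real Nutch classes.
interface DemoHttpHeaders {
  String CONTENT_TYPE = "Content-Type";
  String LAST_MODIFIED = "Last-Modified";
}

interface DemoFileHeaders {
  String FILE_OWNER = "File-Owner"; // invented key, for illustration only
}

final class HeaderNameIndex {
  // Supporting another protocol's keys means adding its interface to this array.
  private static final Class<?>[] SOURCES = {DemoHttpHeaders.class, DemoFileHeaders.class};
  private static final Map<String, String> CANONICAL = new HashMap<>();

  static {
    // Index every static String constant under its normalized form.
    for (Class<?> source : SOURCES) {
      for (Field f : source.getFields()) {
        if (Modifier.isStatic(f.getModifiers()) && f.getType() == String.class) {
          try {
            String name = (String) f.get(null);
            CANONICAL.put(normalize(name), name);
          } catch (IllegalAccessException e) {
            throw new ExceptionInInitializerError(e);
          }
        }
      }
    }
  }

  /** Lower-cases the name and drops everything except ASCII letters and digits. */
  static String normalize(String name) {
    return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
  }

  /** Returns the canonical spelling if known, otherwise the input unchanged. */
  static String canonical(String name) {
    return CANONICAL.getOrDefault(normalize(name), name);
  }

  public static void main(String[] args) {
    System.out.println(canonical("content_type"));
    System.out.println(canonical("LAST modified"));
    System.out.println(canonical("X-Unknown"));
  }
}
```

With this layout the lookup code never changes when new keys are needed; only the `SOURCES` array grows.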
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami,

Indeed, I see your point. I guess what I was advocating for was more of
a ProtocolHeaders interface that lives in org.apache.nutch.metadata.
Then we could update the code that you have below to use
ProtocolHeaders.class rather than HttpHeaders.class. We would make
ProtocolHeaders extend HttpHeaders, so that it inherits all of the
HttpHeaders keys by default, while still allowing more protocol-specific
metadata ("met") keys (e.g., we could have an interface for FileHeaders,
etc.). What do you think about that?

Alternatively, we could just create a ProtocolHeaders interface in
org.apache.nutch.metadata that aggregates all the met key fields from
HttpHeaders, and it would also be the place where the met key fields for
FileHeaders, etc. could go.

Let me know what you think, and thanks!

Cheers,
 Chris

On 12/9/06 3:53 PM, "Sami Siren" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Sami,
>>
>> On 12/9/06 2:27 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>>
>>> Author: siren
>>> Date: Sat Dec 9 14:27:07 2006
>>> New Revision: 485076
>>>
>>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>>> Log:
>>> Optimize SpellCheckedMetadata further by taking into account the fact that it
>>> is used only for http-headers.
>>>
>>> I am starting to believe that spellchecking should just be an utility method
>>> used by http protocol plugins.
>>
>> I think that right now I'm -1 for this change. I would make note of all the
>> comments on NUTCH-139, from which this code was born. In the end, I think
>> what we all realized was that the spell checking capability is necessary,
>> but not everywhere, as you point out. However, I don't think it's limited
>> entirely to HTTP headers (which is what you've currently changed the code
>> to). I think it should be implemented as a protocol layer service, also
>> providing spell checking support to other protocol plugins, like
>> protocol-file, etc.,
>
> In protocol-file all headers are artificial and generated in Nutch code,
> so if there's a spelling mistake there, we should fix the code
> generating the headers and not rely on spellchecking in the first place.
>
>> where field headers run the risk of being misspelled as well. What's to stop
>> someone from implementing protocol-file++ that returns different file header
>> keys than those of protocol-file? Just because HTTP is the most pervasively
>> used plugin right now, I think it's convenient to assume that only HTTP
>> protocol field keys may need spell checking services.
>
> If there's a real need for spell checking on other keys, one can just
> add more classes to the array; no big deal.
>
> --
> Sami Siren
>
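The first proposal (an aggregating interface that extends the per-protocol ones) can be sketched as follows. This is an illustration under assumptions, not Nutch code: `SketchHttpHeaders`, `SketchFileHeaders`, `SketchProtocolHeaders`, the `File-Owner` key and the helper `ProtocolHeaderKeys` are all hypothetical names. The point it shows is that Java's `Class.getFields()` also returns the public constants inherited from superinterfaces, so code that reflects over a single aggregate class sees every key.

```java
import java.lang.reflect.Field;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical per-protocol key interfaces, invented for this sketch.
interface SketchHttpHeaders {
  String CONTENT_TYPE = "Content-Type";
  String CONTENT_LENGTH = "Content-Length";
}

interface SketchFileHeaders {
  String FILE_OWNER = "File-Owner"; // invented key, for illustration only
}

// Aggregates the keys purely by inheritance; it declares nothing itself.
interface SketchProtocolHeaders extends SketchHttpHeaders, SketchFileHeaders {}

final class ProtocolHeaderKeys {

  /** Collects the values of all String constants visible on the given interface. */
  static Set<String> keysOf(Class<?> headers) {
    Set<String> keys = new TreeSet<>();
    // getFields() includes public fields inherited from superinterfaces.
    for (Field f : headers.getFields()) {
      if (f.getType() == String.class) {
        try {
          // Interface fields are implicitly public static final, so no instance is needed.
          keys.add((String) f.get(null));
        } catch (IllegalAccessException e) {
          throw new IllegalStateException(e);
        }
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    System.out.println(keysOf(SketchProtocolHeaders.class));
  }
}
```

Under this design, a consumer that currently reflects over one header interface would only need to point at the aggregate interface; new protocol key sets are added by extending one more interface.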
Re: hi all:
Thanks very much, I'll try it.

On 12/9/06, Sami Siren <[EMAIL PROTECTED]> wrote:
> 吴志敏 wrote:
>> I want to dump the stored segments to an XML file, but when I read
>> SegmentReader.java, I found that it's not a simple thing.
>>
>> It's a Hadoop job that dumps a text file. I just want to dump the
>> parts of the segments' content that I'm interested in to XML.
>>
>> Can someone tell me how to do this? Any reply will be appreciated!
>
> Segment data is basically just a bunch of files containing key->value
> pairs, so there's always the possibility of reading the data directly
> with the help of:
>
> http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html
>
> To see what kind of objects to expect, you can examine the beginning of
> the file, where some metadata is stored, such as the class used for the
> key and the class used for the value (that metadata is also available
> from methods of the SequenceFile.Reader class).
>
> For example, to read the Content data from a segment, one could use
> something like:
>
>     SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
>     Text url = new Text();           // key
>     Content content = new Content(); // value
>     while (reader.next(url, content)) {
>       // now just use url and content the way you like
>     }
>
> --
> Sami Siren

--
www.babatu.com