[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501921 ]
Doğacan Güney edited comment on NUTCH-466 at 6/6/07 6:08 AM:
-------------------------------------------------------------

I still haven't tested it yet, but the code looks solid. I have a couple of comments, though:

* One can't define order of execution for ParseFilter-s. It seems we always need it in one way or another in filters so it may be good to just add ordering and be done with it.
* ParseFilters.filter method throws IOException. I think it will be better if it throws a ParseFilterException or whatever, keeping in spirit with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.
* There are a few uses of iterating over Map.keySet() then getting the value with Map.get(key). FindBugs suggests that it is better to iterate over Map.entrySet() in these cases.
* When someone requests more than 1 part-data, we start a couple of threads, receive data and join threads. Nutch also does this for summary. Is starting and joining threads again and again a problem? Especially, if you are clustering you may end up starting and joining _100_ threads for each query. Perhaps a thread pool? This is not completely related to this patch, it is just something that bugs me.
* I just realized that there is no ParseFilter class either :)

was:
I still haven't tested it yet, but the code looks solid. I have a couple of comments, though:

* One can't define order of execution for ParseFilter-s. It seems we always need it in one way or another in filters so it may be good to just add ordering and be done with it.
* ParseResult.filter method throws IOException. I think it will be better if it throws a ParseFilterException or whatever, keeping in spirit with IndexingFilters -> IndexingException and ScoringFilters -> ScoringFilterException.
* There are a few uses of iterating over Map.keySet() then getting the value with Map.get(key). FindBugs suggests that it is better to iterate over Map.entrySet() in these cases.
* When someone requests more than 1 part-data, we start a couple of threads, receive data and join threads. Nutch also does this for summary. Is starting and joining threads again and again a problem? Especially, if you are clustering you may end up starting and joining _100_ threads for each query. Perhaps a thread pool? This is not completely related to this patch, it is just something that bugs me.
* I just realized that there is no ParseFilter class either :)

> Flexible segment format
> -----------------------
>
>                 Key: NUTCH-466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-466
>             Project: Nutch
>          Issue Type: Improvement
>          Components: searcher
>    Affects Versions: 1.0.0
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>         Attachments: ParseFilters.java, segmentparts.patch
>
>
> In many situations it is necessary to store more data associated with pages
> than is possible now with the current segment format. Quite often it's
> binary data. There are two common workarounds for this: one is to use
> per-page metadata, either in Content or ParseData, the other is to use an
> external independent database using page ID-s as foreign keys.
> Currently segments can consist of the following predefined parts: content,
> crawl_fetch, crawl_generate, crawl_parse, parse_text and parse_data. I
> propose a third option, which is a natural extension of this existing segment
> format, i.e. to introduce the ability to add arbitrarily named segment
> "parts", with the only requirement that they should be MapFile-s that store
> Writable keys and values. Alternatively, we could define a
> SegmentPart.Writer/Reader to accommodate even more sophisticated scenarios.
> Existing segment API and searcher API (NutchBean, DistributedSearch
> Client/Server) should be extended to handle such arbitrary parts.
> Example applications:
> * storing HTML previews of non-HTML pages, such as PDF, PS and Office
> documents
> * storing pre-tokenized version of plain text for faster snippet generation
> * storing linguistically tagged text for sophisticated data mining
> * storing image thumbnails
> etc, etc ...
> I'm going to prepare a patchset shortly. Any comments and suggestions are
> welcome.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
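
For reference, the thread pool idea raised in the comment above could look roughly like the sketch below. It is only an illustration under stated assumptions, not code from the patch: the class PartFetchSketch, the methods fetchPart and fetchAll, and the pool size of 4 are all hypothetical, and fetchPart merely stands in for whatever actually reads a segment part. The point is that worker threads are created once and reused, so a query asking for 100 parts submits 100 tasks rather than starting and joining 100 threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from the Nutch code base.
class PartFetchSketch {

    // One pool shared by all queries, created once instead of per request.
    // Daemon threads are used so an idle pool does not keep the JVM alive.
    private static final ExecutorService POOL =
            Executors.newFixedThreadPool(4, r -> {
                Thread t = new Thread(r, "part-fetch");
                t.setDaemon(true);
                return t;
            });

    // Stand-in for reading one named segment part for one hit (assumption).
    static String fetchPart(String partName, int hit) {
        return partName + "#" + hit;
    }

    // Submits one task per hit and returns the results in submission order.
    static List<String> fetchAll(String partName, int hits) {
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 0; i < hits; i++) {
            final int hit = i;
            tasks.add(() -> fetchPart(partName, hit));
        }
        List<String> results = new ArrayList<>();
        try {
            // invokeAll blocks until every task has completed and returns
            // the futures in the same order as the task list.
            for (Future<String> f : POOL.invokeAll(tasks)) {
                results.add(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching parts", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("part fetch failed", e.getCause());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll("thumbnail", 3));
        POOL.shutdown();
    }
}
```

Whether a fixed pool, and what size, would suit the searcher is an open question; the sketch only shows the shape of the change.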