[jira] Updated: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-434:
-------------------------------

    Attachment: NUTCH-434.patch

This patch adds two new classes: GenericWritableConfigurable, which extends GenericWritable and can inject confs, and NutchWritable, which can wrap most of the Nutch-specific writables (I may have missed some). The patch also updates Indexer, Segment{Reader,Merger} and MetaWrapper to use NutchWritable instead of ObjectWritable.

> Replace usage of ObjectWritable with something based on GenericWritable
> -----------------------------------------------------------------------
>
>          Key: NUTCH-434
>          URL: https://issues.apache.org/jira/browse/NUTCH-434
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Sami Siren
>  Attachments: NUTCH-434.patch
>
>
> We should replace the usage of ObjectWritable, and of classes extending it,
> with a class extending GenericWritable. Classes based on GenericWritable have
> a smaller footprint on disk, and the base class does not contain any classes
> that are deprecated.
> There is one problem, though: ParseData currently needs a Configuration
> object before it can deserialize itself, and GenericWritable does not
> provide a way to inject a configuration. We could either a) remove the need
> for Configuration, or b) write a class similar to GenericWritable that does
> the conf injection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
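For readers following option b) in the issue description, here is a minimal, self-contained sketch of the idea: a wrapper that writes a one-byte type index and injects a configuration into the wrapped object before it deserializes. This is not the content of NUTCH-434.patch. SketchWritable, SketchConfigurable, ConfDependentText, ConfInjectingWrapper and WrapperDemo are hypothetical names, and a plain Map stands in for Hadoop's Configuration so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for Hadoop's Writable and Configurable interfaces.
interface SketchWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

interface SketchConfigurable {
  void setConf(Map<String, String> conf);
}

// Hypothetical payload that, like ParseData in the issue, needs a
// configuration before it can deserialize itself.
class ConfDependentText implements SketchWritable, SketchConfigurable {
  private Map<String, String> conf;
  String value = "";

  public void setConf(Map<String, String> conf) { this.conf = conf; }

  public void write(DataOutput out) throws IOException { out.writeUTF(value); }

  public void readFields(DataInput in) throws IOException {
    if (conf == null) throw new IOException("configuration was not injected");
    value = conf.getOrDefault("prefix", "") + in.readUTF();
  }
}

// Wrapper in the spirit of option b): stores a one-byte type index rather
// than a class name, and injects the conf before delegating readFields.
class ConfInjectingWrapper implements SketchWritable, SketchConfigurable {
  private static final Class<?>[] TYPES = { ConfDependentText.class };
  private Map<String, String> conf;
  private SketchWritable instance;

  public void setConf(Map<String, String> conf) { this.conf = conf; }
  void set(SketchWritable instance) { this.instance = instance; }
  SketchWritable get() { return instance; }

  public void write(DataOutput out) throws IOException {
    int index = -1;
    for (int i = 0; i < TYPES.length; i++) {
      if (TYPES[i].equals(instance.getClass())) index = i;
    }
    if (index < 0) throw new IOException("unregistered type: " + instance.getClass());
    out.writeByte(index);
    instance.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    int index = in.readByte();
    try {
      instance = (SketchWritable) TYPES[index].getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IOException("cannot instantiate type " + index, e);
    }
    // The conf injection step: done before the wrapped object reads its fields.
    if (instance instanceof SketchConfigurable) {
      ((SketchConfigurable) instance).setConf(conf);
    }
    instance.readFields(in);
  }
}

class WrapperDemo {
  // Serializes a payload through the wrapper and reads it back with a conf.
  static String roundTrip(String text, Map<String, String> conf) {
    try {
      ConfDependentText payload = new ConfDependentText();
      payload.value = text;
      ConfInjectingWrapper writer = new ConfInjectingWrapper();
      writer.set(payload);
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      writer.write(new DataOutputStream(bytes));

      ConfInjectingWrapper reader = new ConfInjectingWrapper();
      reader.setConf(conf);
      reader.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return ((ConfDependentText) reader.get()).value;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put("prefix", "nutch:");
    System.out.println(roundTrip("parse-data", conf)); // prints "nutch:parse-data"
  }
}
```

Without the setConf call on the reading side, the wrapped object in this sketch refuses to deserialize, which is the problem the issue attributes to ParseData under a plain GenericWritable.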
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
----------------------------------

    Attachment: remover.diff

Just FYI, you can further filter which elements neko should keep and which it should remove. See the patch for an example, and http://people.apache.org/~andyc/neko/doc/html/settings.html

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>               Key: NUTCH-369
>               URL: https://issues.apache.org/jira/browse/NUTCH-369
>           Project: Nutch
>        Issue Type: Bug
>        Components: fetcher
>  Affects Versions: 0.9.0
>       Environment: all
>          Reporter: King Kong
>          Priority: Minor
>       Attachments: patch.diff, remover.diff
>
>
> Even after we define an encoding alias map in StringUtil, HTML parsing still
> uses the original encoding.
> I found that nekohtml, which HtmlParser uses, reads the charset from the
> meta tag.
> We can set its feature
> "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
> so that nekohtml uses the encoding we set.
> Concretely:
>   private DocumentFragment parseNeko(InputSource input) throws Exception {
>     DOMFragmentParser parser = new DOMFragmentParser();
>     // some plugins, e.g., creativecommons, need to examine html comments
>     try {
> +     parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
>       parser.setFeature("http://apache.org/xml/features/include-comments", true);
>
> BTW, it must be added at the front of the try block, because the statement
> that follows it
> (parser.setFeature("http://apache.org/xml/features/include-comments", true);)
> will throw an exception.
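The ordering point at the end of the issue description can be shown with a small, self-contained sketch. It does not use nekohtml: StubParser is a hypothetical stand-in that rejects any feature id it was not told about, and the sketch assumes, as the reporter states, that the include-comments feature is the one that throws. FeatureOrderDemo and its configure method are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a parser that throws on unrecognized feature ids.
class StubParser {
  private final Set<String> supported;
  final List<String> enabled = new ArrayList<>();

  StubParser(String... supportedFeatures) {
    this.supported = new HashSet<>(Arrays.asList(supportedFeatures));
  }

  void setFeature(String id, boolean value) throws Exception {
    if (!supported.contains(id)) {
      throw new Exception("feature not recognized: " + id);
    }
    if (value) enabled.add(id);
  }
}

class FeatureOrderDemo {
  static final String IGNORE_CHARSET =
      "http://cyberneko.org/html/features/scanner/ignore-specified-charset";
  static final String INCLUDE_COMMENTS =
      "http://apache.org/xml/features/include-comments";

  // Returns the features that ended up enabled for a given statement order.
  static List<String> configure(boolean charsetFirst) {
    // Assumption taken from the issue text: include-comments is rejected.
    StubParser parser = new StubParser(IGNORE_CHARSET);
    try {
      if (charsetFirst) parser.setFeature(IGNORE_CHARSET, true);
      parser.setFeature(INCLUDE_COMMENTS, true); // throws in this sketch
      if (!charsetFirst) parser.setFeature(IGNORE_CHARSET, true); // never reached
    } catch (Exception e) {
      // Swallowed, so that a rejected feature does not abort parsing.
    }
    return parser.enabled;
  }

  public static void main(String[] args) {
    System.out.println(configure(true).size());  // prints 1
    System.out.println(configure(false).size()); // prints 0
  }
}
```

Any statement placed after the throwing call inside the same try block is skipped, so the charset feature only takes effect when it is set first.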
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
----------------------------------

             Priority: Minor  (was: Major)
    Affects Version/s: 0.9.0  (was: 0.8)

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>               Key: NUTCH-369
>               URL: https://issues.apache.org/jira/browse/NUTCH-369
>           Project: Nutch
>        Issue Type: Bug
>        Components: fetcher
>  Affects Versions: 0.9.0
>       Environment: all
>          Reporter: King Kong
>          Priority: Minor
>       Attachments: patch.diff
[jira] Updated: (NUTCH-369) StringUtil.resolveEncodingAlias is unuseful.
[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-369:
----------------------------------

    Attachment: patch.diff

Unified diff against head.
- fixes encoding, as described by King Kong
- removes non-valid features
- fixes logging

> StringUtil.resolveEncodingAlias is unuseful.
> --------------------------------------------
>
>               Key: NUTCH-369
>               URL: https://issues.apache.org/jira/browse/NUTCH-369
>           Project: Nutch
>        Issue Type: Bug
>        Components: fetcher
>  Affects Versions: 0.8
>       Environment: all
>          Reporter: King Kong
>       Attachments: patch.diff