[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse
[ https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518629 ]

Hudson commented on NUTCH-535:
------------------------------

Integrated in Nutch-Nightly #175 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/175/])

ParseData's contentMeta accumulates unnecessary values during parse
-------------------------------------------------------------------

Key: NUTCH-535
URL: https://issues.apache.org/jira/browse/NUTCH-535
Project: Nutch
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Fix For: 1.0.0
Attachments: NUTCH-535.patch, NUTCH_535_v2.patch

After NUTCH-506, if you run parse on a segment, ParseData's contentMeta accumulates the metadata of every content parsed so far. This is because NUTCH-506 changed the constructor to create the metadata object (before NUTCH-506, a new metadata object was created on every call to readFields). It seems Hadoop somehow caches the Content instance, so each new call to Content.readFields during ParseSegment increases the size of the metadata. Because of this, one can end up with a *huge* parse_data directory (something like 10 times larger than the content directory).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
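The accumulation described in NUTCH-535 can be illustrated with a small, self-contained sketch. It does not use Hadoop or Nutch: the Record class, its readFields(Map) method, and the clearOnRead flag are hypothetical stand-ins for a reused Writable-style object whose metadata container is created once in the constructor. Clearing the container at the start of each read is shown as one possible remedy, not necessarily what the attached patches do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is NOT Nutch's Content class. */
public class ReusedRecordDemo {

    /** A record whose metadata map is created once, in the constructor. */
    static class Record {
        private final Map<String, String> metadata = new LinkedHashMap<>();
        private final boolean clearOnRead;

        Record(boolean clearOnRead) {
            this.clearOnRead = clearOnRead;
        }

        /** Mimics a readFields() call that loads one record into this instance. */
        void readFields(Map<String, String> incoming) {
            if (clearOnRead) {
                metadata.clear(); // drop state left over from the previous record
            }
            metadata.putAll(incoming);
        }

        int metadataSize() {
            return metadata.size();
        }
    }

    /** Reads n single-entry records into ONE reused instance; returns the final metadata size. */
    static int sizeAfterReading(int n, boolean clearOnRead) {
        Record reused = new Record(clearOnRead);
        for (int i = 0; i < n; i++) {
            Map<String, String> incoming = new LinkedHashMap<>();
            incoming.put("key-" + i, "value-" + i);
            reused.readFields(incoming);
        }
        return reused.metadataSize();
    }

    public static void main(String[] args) {
        // Without clearing, entries from all five records pile up in the reused instance.
        System.out.println("without clear: " + sizeAfterReading(5, false));
        // With clearing, only the last record's single entry remains.
        System.out.println("with clear:    " + sizeAfterReading(5, true));
    }
}
```

In this sketch the reused instance ends up holding one entry per record read unless it is reset, which matches the growth pattern reported for parse_data.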
[jira] Commented: (NUTCH-522) Use URLValidator in the Injector
[ https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518627 ]

Hudson commented on NUTCH-522:
------------------------------

Integrated in Nutch-Nightly #175 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/175/])

Use URLValidator in the Injector
--------------------------------

Key: NUTCH-522
URL: https://issues.apache.org/jira/browse/NUTCH-522
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0
Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, NUTCH_522_v4.patch

As in NUTCH-505, we should use the UrlValidator to check URLs in the Injector.
[jira] Created: (NUTCH-540) some problem about the Nutch cache
some problem about the Nutch cache
----------------------------------

Key: NUTCH-540
URL: https://issues.apache.org/jira/browse/NUTCH-540
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0
Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
Priority: Blocker
Fix For: 0.9.0

I am Chinese, and I am testing searches for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux, and the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese website whose charset is also UTF-8.

When I search for a Chinese word through Nutch on Tomcat, the title and description of each search result display correctly. But when I click the cache link, the cached page is displayed with the wrong character encoding, even though the cached page's charset is also UTF-8.

I found another website that uses Nutch, http://www.synoo.com:8080/zh/, and searching for a Chinese word there produces the same error. When I use Luke to look at the segments, the Chinese words display correctly, so I think this may be a bug.
[jira] Created: (NUTCH-541) Index url field untokenized
Index url field untokenized
---------------------------

Key: NUTCH-541
URL: https://issues.apache.org/jira/browse/NUTCH-541
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 1.0.0

The url field is indexed as Store.YES, Index.TOKENIZED. We also need an untokenized version of the url field in some contexts:

1. For deleting duplicates by url (at search time). See NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS search, where each entry in the RSS feed is added as a distinct document with (possibly) the same url). query-url extends FieldQueryFilter, so:
   Query: url:http://www.apache.org/
   Parsed: url:http http-www http-www-apache www www-apache apache org
   Translated: +url:http-http-www http-www-http-www-apache http-www-apache-www www-www-apache www-apache apache org
3. For accessing a document (or documents) in the search servers (using a query plugin).

I suggest we add url as follows in index-basic and implement a query-url-untoken plugin:

    doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
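To make the motivation in NUTCH-541 concrete, here is a rough, dependency-free sketch of why an exact (untokenized) url value is a convenient key for de-duplication and exact lookup. The tokenize method below is a deliberately simplified assumption (lower-casing and splitting on non-alphanumeric characters); it is not Nutch's URL analyzer, which also emits combined terms such as http-www. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; no Lucene or Nutch classes are used. */
public class UrlKeyDemo {

    /** Simplified tokenizer (an assumption): lower-case, split on non-alphanumerics. */
    static List<String> tokenize(String url) {
        List<String> tokens = new ArrayList<>();
        for (String t : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if two URLs produce the same set of tokens, ignoring order. */
    static boolean sameTokenSet(String a, String b) {
        return new HashSet<>(tokenize(a)).equals(new HashSet<>(tokenize(b)));
    }

    /** Keeps the first title seen for each exact URL string; docs are {url, title} pairs. */
    static Map<String, String> dedupByExactUrl(String[][] docs) {
        Map<String, String> byUrl = new LinkedHashMap<>();
        for (String[] doc : docs) {
            byUrl.putIfAbsent(doc[0], doc[1]);
        }
        return byUrl;
    }

    public static void main(String[] args) {
        String a = "http://www.apache.org/";
        String b = "http://apache.org/www";

        System.out.println(tokenize(a));        // tokens of the first URL
        System.out.println(sameTokenSet(a, b)); // distinct URLs can share every token
        System.out.println(a.equals(b));        // the exact strings still differ

        String[][] docs = {
            {a, "first entry"},
            {a, "second entry, same url"},
            {b, "different url"},
        };
        System.out.println(dedupByExactUrl(docs).keySet());
    }
}
```

Under these assumptions the two distinct URLs yield the same token set, while the exact string comparison keeps them apart; the untokenized field proposed in the issue plays the role of that exact key.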
[jira] Updated: (NUTCH-540) some problem about the Nutch cache
[ https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Richardet updated NUTCH-540:
----------------------------------

Priority: Major (was: Blocker)

Could you please attach log files and error messages? Thanks.

some problem about the Nutch cache
----------------------------------

Key: NUTCH-540
URL: https://issues.apache.org/jira/browse/NUTCH-540
Project: Nutch
Issue Type: Bug
Components: searcher
Affects Versions: 0.9.0
Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
Fix For: 0.9.0

I am Chinese, and I am testing searches for Chinese words in Nutch. I installed Nutch 0.9 in Tomcat 5 on Linux, and the Tomcat charset is UTF-8. I used Nutch to crawl a Chinese website whose charset is also UTF-8.

When I search for a Chinese word through Nutch on Tomcat, the title and description of each search result display correctly. But when I click the cache link, the cached page is displayed with the wrong character encoding, even though the cached page's charset is also UTF-8.

I found another website that uses Nutch, http://www.synoo.com:8080/zh/, and searching for a Chinese word there produces the same error. When I use Luke to look at the segments, the Chinese words display correctly, so I think this may be a bug.