[jira] Created: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
Upgrade the Carrot2 plug-in to release 3.0
------------------------------------------

Key: NUTCH-673
URL: https://issues.apache.org/jira/browse/NUTCH-673
Project: Nutch
Issue Type: Improvement
Components: web gui
Affects Versions: 0.9.0
Environment: All Nutch deployments.
Reporter: Sean Dean
Fix For: 1.0.0

Release 3.0 of the Carrot2 plug-in came out recently. We currently have version 2.1 in the source tree, and upgrading to the latest version before the 1.0 release might make sense. Details on the release can be found here: http://project.carrot2.org/release-3.0-notes.html

One major change in requirements is that JDK 1.5 is now needed, but JDK 1.5 is also required by Hadoop 0.19, so this wouldn't be the only reason for the switch.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12459073 ]

Sean Dean commented on NUTCH-417:
---------------------------------

Speculative execution is now off by default as of Hadoop 0.9.2, per issue HADOOP-827. Since there were only two other fixes in that distribution, neither of which should affect Nutch in a bad way, can trunk be updated to it?

After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
-----------------------------------------------------------------

Key: NUTCH-417
URL: http://issues.apache.org/jira/browse/NUTCH-417
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Dogacan Güney
Attachments: index.patch

If you parse while fetching, everything is fine, but if you run parsing as a separate job, it creates an essentially empty parse_data directory (one that has index files but no data files). I am not sure why this is happening. Indexing also fails at Indexer.OutputFormat.getRecordWriter: the fs parameter appears to be an instance of PhasedFileSystem, which throws exceptions on delete and {start,complete}LocalOutput.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12455064 ]

Sean Dean commented on NUTCH-224:
---------------------------------

I just tested this today using 0.9-dev, and it seems the Lucene changes picked up back in 0.7.2 didn't fix the issue. Somewhere in the Nutch code, Korean is not being handled the same way as Chinese and Japanese. I'm also aware that searching in Chinese has its own issue, tracked as NUTCH-36, but Chinese at least still shows exactly matching results.

Testing details: I searched for the word 뉴스, which is "news" in English. I have fetched Korean pages containing this word, so I know for sure it is part of the index. Zero results were displayed.

Nutch doesn't handle Korean text at all
---------------------------------------

Key: NUTCH-224
URL: http://issues.apache.org/jira/browse/NUTCH-224
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.7.1
Reporter: KuroSaka TeruHiko

I was browsing NutchAnalysis.jj and found that Hangul Syllables (U+AC00 ... U+D7AF; U+ means a Unicode character of the given hex value) are not part of the LETTER or CJK class. This suggests to me that Nutch cannot handle Korean documents at all. I posted the above message on the nutch-user mailing list and Cheolgoo Kang [EMAIL PROTECTED] replied: "There was a similar issue with Lucene's StandardTokenizer.jj. See http://issues.apache.org/jira/browse/LUCENE-444 and http://issues.apache.org/jira/browse/LUCENE-461. I have almost no experience with Nutch, but you can handle it like those issues above." Both fixes should probably be ported to NutchAnalysis.jj.
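The reported range is easy to verify against Java's own Unicode tables. A minimal sketch (the class and method names here are illustrative, not Nutch code):

```java
// Illustrative check of the range named in the report: Hangul Syllables,
// U+AC00 .. U+D7AF. Class and method names are mine, not part of Nutch.
public class HangulCheck {

    static boolean inHangulSyllables(char c) {
        return c >= '\uAC00' && c <= '\uD7AF';
    }

    public static void main(String[] args) {
        char nyu = '\uB274'; // 뉴, the first syllable of 뉴스 ("news")

        System.out.println(inHangulSyllables(nyu));          // true
        System.out.println(Character.UnicodeBlock.of(nyu)
            == Character.UnicodeBlock.HANGUL_SYLLABLES);     // true

        // Java's own character classes already treat Hangul syllables as
        // letters; the problem described above is that the hand-written
        // LETTER/CJK classes in NutchAnalysis.jj omit this range, so the
        // tokenizer silently drops Korean text.
        System.out.println(Character.isLetter(nyu));         // true
    }
}
```

So any Korean query or document text falls through the grammar's character classes before Lucene ever sees it, which is consistent with the zero-results test above.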
[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12455065 ]

Sean Dean commented on NUTCH-224:
---------------------------------

Just a note on my comment above: it seems JIRA can't (or won't) display Korean text after I submit the comment. If you're trying to test this and can't write Korean, my best suggestion is to visit http://babelfish.altavista.com/, type in "news" (or whatever word you're using), and translate it to Korean. If you're using Windows, you might need East Asian language support installed, which can be found under Regional and Language Options in the Control Panel.
[jira] Commented: (NUTCH-233) Wrong regular expression hangs reduce process forever
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]

Sean Dean commented on NUTCH-233:
---------------------------------

Could I suggest that this change, from

.*(/.+?)/.*?\1/.*?\1/

to

.*(/[^/]+)/[^/]+\1/[^/]+\1/

be committed to at least trunk for the time being? I recently created a segment with exactly 1M URLs. When I ran the fetch, it did indeed stall in the reduce part of the operation because of the regex filter; this was verified with a thread dump (kill -3 pid) on FreeBSD. I then made the suggested change in the config file and re-fetched the exact same segment, and it completed without issue.

I'm aware we might lose some filtering functionality with this new expression, but is that not better than knowing there is always a chance your whole-web crawl fetch will fail because of this?

Wrong regular expression hangs reduce process forever
-----------------------------------------------------

Key: NUTCH-233
URL: http://issues.apache.org/jira/browse/NUTCH-233
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.9.0

It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt is not compatible with java.util.regex, which is the engine the regex URL filter actually uses; maybe it was missed when the regular expression package was changed. The result was that, while reducing a fetch map output, the reducer hung forever, because the output format applied the URL filter to a URL that triggered the hang:

060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the fetch job works. (Thanks to Grant and Chris B. for helping to find the new regex.) However, people should review it and suggest improvements. The old regex would match abcd/foo/bar/foo/bar/foo/ and so does the new one; but the old regex would also match abcd/foo/bar/xyz/foo/bar/foo/, which the new regex will not.
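The difference between the two expressions can be checked directly against java.util.regex, the engine the filter uses. A small sketch (the class name is mine, and find() semantics are an assumption for illustration; the patterns and sample paths are taken from the report):

```java
import java.util.regex.Pattern;

// Compare the old and proposed regex-urlfilter.txt expressions from NUTCH-233.
public class RegexCompare {

    // Old pattern: lazy ".+?"/".*?" plus backreferences, which can backtrack
    // catastrophically on long URLs and stall the reduce phase.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

    // Proposed replacement: each repeated piece is a single "[^/]+" path
    // segment, so the engine never has to retry segment boundaries.
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    static boolean matchesOld(String url) { return OLD.matcher(url).find(); }
    static boolean matchesNew(String url) { return NEW.matcher(url).find(); }

    public static void main(String[] args) {
        // A segment repeating at every second position is caught by both:
        System.out.println(matchesOld("abcd/foo/bar/foo/bar/foo/"));     // true
        System.out.println(matchesNew("abcd/foo/bar/foo/bar/foo/"));     // true

        // With an extra segment between repeats, only the old pattern matches;
        // this is the filtering functionality the new expression gives up:
        System.out.println(matchesOld("abcd/foo/bar/xyz/foo/bar/foo/")); // true
        System.out.println(matchesNew("abcd/foo/bar/xyz/foo/bar/foo/")); // false
    }
}
```

The trade-off is exactly the one described above: the stricter pattern still catches repeating-segment crawler traps with a fixed stride, while giving up matches where arbitrary text separates the repeats.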
[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12416108 ]

Sean Dean commented on NUTCH-224:
---------------------------------

I'm still using 0.7.1 and also see this problem. The Nutch 0.7.2 release upgraded to Lucene 1.9.1, which included the above fixes for Korean language support. Have you tried 0.7.2 or 0.8-dev, with any luck?