[jira] Created: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2008-12-15 Thread Sean Dean (JIRA)
Upgrade the Carrot2 plug-in to release 3.0
--

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
 Fix For: 1.0.0


Release 3.0 of the Carrot2 plug-in came out recently.

We currently have version 2.1 in the source tree, and upgrading it to the 
latest version before the 1.0 release might make sense.

Details on the release can be found here: 
http://project.carrot2.org/release-3.0-notes.html

One major change in requirements is that JDK 1.5 is now needed, but Hadoop 0.19 
also requires it, so this wouldn't be the only reason for the switch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-16 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12459073 ]

Sean Dean commented on NUTCH-417:
-

Speculative execution is now off by default in Hadoop 0.9.2, as per issue 
HADOOP-827. Since that release contains only two other fixes, neither of which 
should affect Nutch adversely, can Hadoop be updated to it in trunk?

 After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.
 -

 Key: NUTCH-417
 URL: http://issues.apache.org/jira/browse/NUTCH-417
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Dogacan Güney
 Attachments: index.patch


 If you parse while fetching then it is fine, but if you run parse as a 
 different job, it creates an essentially empty parse_data directory (which has 
 index files, but doesn't have data files). I am not sure why this is 
 happening.
 Also, indexing fails at Indexer.OutputFormat.getRecordWriter. The parameter 
 fs seems to be an instance of PhasedFileSystem which throws exceptions on 
 delete and {start,complete}LocalOutput.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

2006-12-01 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12455064 ]

Sean Dean commented on NUTCH-224:
-

I just tested this today using 0.9-dev, and it seems the changes made to Lucene 
back in 0.7.2 didn't fix the issue. Somewhere in the Nutch code, Korean isn't 
handled the same way as Chinese and Japanese. I'm also aware that searching in 
Chinese has an issue, tracked in ticket NUTCH-36, but it still shows exactly 
matching results.

Testing details:

I searched for the word 뉴스, which means "news" in English. I have fetched 
Korean pages containing this word, so I know for sure it's part of the index. 
Zero results were displayed.

 Nutch doesn't handle Korean text at all
 ---

 Key: NUTCH-224
 URL: http://issues.apache.org/jira/browse/NUTCH-224
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 0.7.1
Reporter: KuroSaka TeruHiko

 I was browsing NutchAnalysis.jj and found that
 Hangul Syllables (U+AC00 ... U+D7AF; U+ denotes
 a Unicode character with the given hex value) are not
 part of the LETTER or CJK class.  It seems to me that
 Nutch cannot handle Korean documents at all.
 I posted the above message to the nutch-user ML and Cheolgoo Kang
 replied:
 
 There was a similar issue with Lucene's StandardTokenizer.jj.
 http://issues.apache.org/jira/browse/LUCENE-444
 and
 http://issues.apache.org/jira/browse/LUCENE-461
 I have almost no experience with Nutch, but you can handle it like
 those issues above.
 
 Both fixes should probably be ported back to NutchAnalysis.jj.
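
The code point range quoted in the description can be checked directly. The
following is only a minimal sketch: the class and method names are hypothetical
and are not part of Nutch or Lucene, and it does not touch the NutchAnalysis.jj
grammar. It simply tests whether every character of a term lies in
U+AC00 ... U+D7AF, using the search term from the test comment above.

```java
// Hypothetical helper, not part of Nutch or Lucene: checks whether a term
// consists only of characters in the Hangul Syllables range from the issue.
class HangulCheck {
    // Range boundaries as quoted in the issue description.
    static final int HANGUL_FIRST = 0xAC00;
    static final int HANGUL_LAST = 0xD7AF;

    static boolean isHangulSyllable(int codePoint) {
        return codePoint >= HANGUL_FIRST && codePoint <= HANGUL_LAST;
    }

    static boolean allHangulSyllables(String term) {
        // An empty term is treated as "not Hangul" to avoid a vacuous true.
        return !term.isEmpty() && term.codePoints().allMatch(HangulCheck::isHangulSyllable);
    }

    public static void main(String[] args) {
        // U+B274 U+C2A4 is the Korean word for "news" used in the test above.
        String news = "\uB274\uC2A4";
        System.out.println("Korean term in range: " + allHangulSyllables(news));
        System.out.println("ASCII term in range:  " + allHangulSyllables("news"));
    }
}
```

A tokenizer whose letter classes leave out this range has no rule that turns
such characters into terms, which would be consistent with the zero results
reported above.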





[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

2006-12-01 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12455065 ]

Sean Dean commented on NUTCH-224:
-

Just a note on my comment above: it seems JIRA can't (or won't) display Korean 
text after I submit the comment.

If you're trying to test this and can't write Korean, my best suggestion is to 
visit http://babelfish.altavista.com/ and type in "news" or whatever word 
you're using, then translate it to Korean. If you're using Windows, you might 
need to have East Asian language support installed, which can be found under 
Regional and Language Options in the Control Panel.






[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-11-28 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12453919 ]

Sean Dean commented on NUTCH-233:
-

Could I suggest that the change from .*(/.+?)/.*?\1/.*?\1/ to 
.*(/[^/]+)/[^/]+\1/[^/]+\1/ be committed to at least trunk for the time being?

I recently created a segment with exactly 1M URLs and ran the fetch, and it did 
indeed stall in the reduce phase because of the regex filter. This was verified 
with a thread dump (kill -3 pid) on FreeBSD.

I then made the suggested change in the config file and re-fetched the exact 
same segment. It completed without issue.

I'm aware we might be losing some filtering functionality with this new 
expression, but is that not better than knowing there is always a chance your 
whole-web crawl fetch will fail because of this?

 wrong regular expression hang reduce process for ever
 -

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.9.0


 It looks like the expression .*(/.+?)/.*?\1/.*?\1/ in regex-urlfilter.txt 
 isn't compatible with java.util.regex, which is what the regex URL filter 
 actually uses. 
 Maybe it was overlooked when the regular expression package was changed.
 The problem was that while reducing a fetch map output, the reducer hung 
 forever because the output format was applying the URL filter to a URL that 
 causes the hang.
 060315 230823 task_r_3n4zga at java.lang.Character.codePointAt(Character.java:2335)
 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Dot.match(Pattern.java:4092)
 060315 230823 task_r_3n4zga at java.util.regex.Pattern$Curly.match1(Pattern.java:
 I changed the regular expression to .*(/[^/]+)/[^/]+\1/[^/]+\1/ and now the 
 fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
 regex.)
 However, maybe people can review it and suggest improvements, since the old 
 regex would match:
 abcd/foo/bar/foo/bar/foo/
 and so does the new one. But the old regex would also match:
 abcd/foo/bar/xyz/foo/bar/foo/
 which the new regex will not match.
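
The difference between the two expressions on the sample paths above can be
reproduced with plain java.util.regex. This is a rough sketch, not the Nutch
filter itself: the class and method names are hypothetical, and it assumes the
filter searches for the pattern anywhere in the URL, so it uses find(). The
inputs are short, so it only shows what each expression matches and does not
reproduce the hang.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch: compares the old and new
// expressions from the issue on the two sample paths it mentions.
class RegexFilterComparison {
    // Old expression from regex-urlfilter.txt, as quoted in the issue.
    static final Pattern OLD = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");
    // Replacement expression proposed in the issue.
    static final Pattern NEW = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Assumption: the pattern may occur anywhere in the URL, hence find().
    static boolean oldMatches(String url) {
        return OLD.matcher(url).find();
    }

    static boolean newMatches(String url) {
        return NEW.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "abcd/foo/bar/foo/bar/foo/",
            "abcd/foo/bar/xyz/foo/bar/foo/"
        };
        for (String s : samples) {
            System.out.println(s + " old=" + oldMatches(s) + " new=" + newMatches(s));
        }
    }
}
```

In the new expression each piece is limited to a single path segment by
[^/]+, which gives the matcher far fewer ways to split the input than the
unbounded .+? and .*? pieces do. That same restriction is why the second
sample, where two segments sit between the repeats, is no longer matched.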





[jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all

2006-06-13 Thread Sean Dean (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-224?page=comments#action_12416108 ]

Sean Dean commented on NUTCH-224:
-

I'm still using 0.7.1 and also see this problem.

In the Nutch 0.7.2 release they upgraded to Lucene 1.9.1, which included the 
above fixes for Korean language support.

Have you tried 0.7.2 or 0.8-dev with any luck?

