.BAT file for running nutch in Windows (no cygwin)

2011-07-23 Thread Radim Kolar
I ported the shell start-up script to a standard Windows .BAT file (tested on Windows XP). Where can I upload it? I need help with testing Nutch under native Windows. Should I open a bug report and attach the .BAT file to it?

Re: Unreleased Gora dependencies in Nutch Trunk build

2011-08-19 Thread Radim Kolar
> in a nutshell you can't use Ivy or Maven for the Gora dependency, which is why we are currently stuck with the trunk and can't compile it without first downloading and compiling GORA locally.
I compiled gora-*-0.2-incubating.jars locally. Where should I put them to get Nutch trunk compiled?

Re: The crawl command, keep or get rid of

2011-08-23 Thread Radim Kolar
I agree. Nuke the crawl command.

Re: [VOTE] Move 2.0 out of trunk

2011-09-18 Thread Radim Kolar
-1. I don't want to mark release 2.0 as unmaintained. The Cassandra backend works really well for us and fixed our performance problems with the Hadoop database. Instead of moving it out of trunk, recruit more people to come and fix the open problems. Don't give up.

Re: [VOTE] Move 2.0 out of trunk

2011-09-19 Thread Radim Kolar
> I'm glad to hear that there are at least 2 people in the community that do business in their field and proudly use a Nutch-based crawler together with Cassandra to store the data through Gora. That would not have been possible with Nutch 1.x version.
What about dropping Gora, because it is progr

Re: [DISCUSS] What will happen to Nutch Gora aka Nutchbase (was Re: [VOTE] Move 2.0 out of trunk)

2011-09-19 Thread Radim Kolar
> The nutchgora branch will still be there, and if there's a desire to have a nutchcassandra or nutchhbase pure branch, and you have some spare cycles to help see it come about, we would welcome it.
It needs to be done in a more long-term, strategic way. 1. Research what people expect from Nutch 2?

Re: Prepare for 1.4 release?

2011-09-27 Thread Radim Kolar
Can you add NUTCH-1098 to 1.4?

1.4 release - newer hadoop jars

2011-09-30 Thread Radim Kolar
Can you package 1.4 with updated Hadoop jars? I have problems with running Nutch in local mode. If I run multiple tasks at once, they delete each other's temporary files. It's worth a try to see whether newer Hadoop libs fix that.

injector in nutch-1.4

2011-10-13 Thread Radim Kolar
I have problems running the injector in nutch-1.4 on Hadoop; the same command with nutch-1.3 works fine. As you can see, the list of URLs is loaded from HDFS correctly (Map input records=66906), but there are no records in the map output. Could it be a problem with broken filtering? ponto:(crawler)runtime/depl

Re: injector in nutch-1.4

2011-10-13 Thread Radim Kolar
Let me know if anybody got the injector to work in the 1.4 branch. I have Hadoop 0.20.204.0 and can't make it insert a single URL.

[jira] [Created] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-07-27 Thread Radim Kolar (JIRA)
Environment: Windows XP Home Reporter: Radim Kolar Priority: Minor It's possible to run Nutch on Windows without Cygwin. 1. The startup script needs to be ported from SH to BAT. 2. Because Hadoop runs on Unix only, we must emulate Unix commands to make it work. Luckily only chmod, bash

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-07-27 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: chmod.c bash.c nutch.bat > Run nutch under nat

[jira] [Commented] (NUTCH-990) protocol-httpclient fails with short pages

2011-08-21 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088443#comment-13088443 ] Radim Kolar commented on NUTCH-990: --- I have this problem too protocol-httpclient f

[jira] [Created] (NUTCH-1098) better url-normalizer basic

2011-08-25 Thread Radim Kolar (JIRA)
Environment: Any Reporter: Radim Kolar Fix For: 1.4 Attachments: urlnormalizer.patch The basic URL normalizer lacks 2 important features: encoding spaces in URLs as %20, to unbreak httpclient and possibly others that do not expect a space inside a URL; ability to decode %33 encoding in
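
For readers following the NUTCH-1098 thread, here is a minimal sketch of the kind of percent-encoding normalization the report describes. It is not the attached patch and not Nutch's normalizer; the function name normalize_url and the rule for which escapes get decoded (only RFC 3986 "unreserved" characters) are assumptions made for illustration. That rule happens to leave %23 (#) and %2F (/) encoded and upper-cases the hex digits of escapes that stay, which matches points raised in later updates on this ticket.

```python
import re

# Assumption: only escapes of RFC 3986 "unreserved" characters are decoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# Matches one percent escape such as %33 or %2f.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escape of an unreserved character, else upper-case its hex."""
    ch = chr(int(match.group(1), 16))
    if ch in UNRESERVED:
        return ch
    return "%" + match.group(1).upper()


def normalize_url(url):
    """Hypothetical helper: tidy percent escapes, then encode literal spaces."""
    url = _ESCAPE.sub(_fix_escape, url)
    return url.replace(" ", "%20")
```

With this sketch, a URL containing a literal space gets %20 in its place, %33 becomes the digit 3, and %2f is kept but rewritten as %2F.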

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-08-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: urlnormalizer.patch Patch against branch-1.4 > better url-normalizer ba

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091740#comment-13091740 ] Radim Kolar commented on NUTCH-937: --- We should stick with Hadoop 0.20.203.0, not CDH

[jira] [Commented] (NUTCH-937) When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)

2011-08-26 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092059#comment-13092059 ] Radim Kolar commented on NUTCH-937: --- nutch-1.4 contains hadoop-core 0.20.2. If nutch

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-09-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: nutch.diff Updated patch. It also normalizes unprintable % sequences to upper case. Like

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-09-25 Thread Radim Kolar (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: urlnormalizer.patch) > better url-normalizer ba

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-05 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120912#comment-13120912 ] Radim Kolar commented on NUTCH-1098: 1. Some servers send spaces in URLs 2. Base

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-05 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120966#comment-13120966 ] Radim Kolar commented on NUTCH-1098: Actually it might be even better to

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-13 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13126594#comment-13126594 ] Radim Kolar commented on NUTCH-1098: Patch is good. I will add replace high bit c

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-14 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127440#comment-13127440 ] Radim Kolar commented on NUTCH-1098: I did, but due to lack of time to test

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-10-15 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128264#comment-13128264 ] Radim Kolar commented on NUTCH-1098: Browsers seem to send spaces in URL enc

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-19 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: patch-urlnormalizer.diff > better url-normalizer ba

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-24 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-urlnormalizer.diff) > better url-normalizer ba

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-24 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: nutch.diff) > better url-normalizer ba

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-10-24 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: patch-urlnormalizer.diff Do not decode # and / characters during %XX decoding. Unit

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-01 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141233#comment-13141233 ] Radim Kolar commented on NUTCH-1098: 1. patch --ignore-whitespace 2. only s

[jira] [Commented] (NUTCH-1194) CrawlDB lock should be released earlier

2011-11-02 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142544#comment-13142544 ] Radim Kolar commented on NUTCH-1194: locking should be done in setup/cleanup

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13142699#comment-13142699 ] Radim Kolar commented on NUTCH-1098: a/ Please direct your complaints about qualit

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-urlnormalizer.diff) > better url-normalizer ba

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-02 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: patch-with-utf8-encoding.diff Added support for encoding string to UTF-8 and then URL

[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1194: --- Comment: was deleted (was: locking should be done in setup/cleanup task. Currently if you kill

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: nutch.bat) > Run nutch under native windows (no cyg

[jira] [Resolved] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar resolved NUTCH-1070. Resolution: Won't Fix > Run nutch under native windows (n

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: bash.c) > Run nutch under native windows (no cyg

[jira] [Updated] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-03 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1070: --- Attachment: (was: chmod.c) > Run nutch under native windows (no cyg

[jira] [Commented] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13143984#comment-13143984 ] Radim Kolar commented on NUTCH-1070: I closed it because I removed my patches, I

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar updated NUTCH-1098: --- Attachment: (was: patch-with-utf8-encoding.diff) > better url-normalizer ba

[jira] [Resolved] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Resolved) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar resolved NUTCH-1098. Resolution: Invalid Attached patch was in improper format. > better

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144020#comment-13144020 ] Radim Kolar commented on NUTCH-1098: By removing my patch I also withdraw permis

[jira] [Closed] (NUTCH-1070) Run nutch under native windows (no cygwin)

2011-11-04 Thread Radim Kolar (Closed) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Radim Kolar closed NUTCH-1070. -- > Run nutch under native windows (no cyg

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144160#comment-13144160 ] Radim Kolar commented on NUTCH-1098: If you are so clever and hard working then

[jira] [Commented] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Radim Kolar (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144183#comment-13144183 ] Radim Kolar commented on NUTCH-1098: Remove my patch from this ticket. I