[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339609#comment-14339609 ]

Lewis John McGibbney commented on NUTCH-1946:
---------------------------------------------

bq. Where can I see which tests are failing?

The test which is failing is in
{code}
$NUTCH_HOME/build/test/org.apache.nutch.fetcher.TestFetcher
{code}

> Upgrade to Gora 0.6
> -------------------
>
>                 Key: NUTCH-1946
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1946
>             Project: Nutch
>          Issue Type: Improvement
>          Components: storage
>    Affects Versions: 2.3.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-1946.patch
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy
> for the new Docker containers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
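For anyone else hunting for the failing suite: besides looking at the report file named above, the captured console output of ant test can be scanned for any suite whose summary line shows a non-zero failure or error count. The snippet below is only a minimal sketch in Python. The function name failing_suites is made up for illustration, and it assumes the output uses the same "[junit] Running ..." and "[junit] Tests run: ..." line format quoted in this thread.

```python
import re

# Patterns for the two line shapes that appear in the junit output quoted in this thread.
RUNNING = re.compile(r"\[junit\] Running (\S+)")
SUMMARY = re.compile(r"Tests run: (\d+), Failures: (\d+), Errors: (\d+)")


def failing_suites(console_output):
    """Return the test classes whose summary line reports failures or errors.

    Hypothetical helper, not part of Nutch or Ant.
    """
    failing = []
    current = None  # suite named by the most recent "Running" line
    for line in console_output.splitlines():
        running = RUNNING.search(line)
        if running:
            current = running.group(1)
            continue
        summary = SUMMARY.search(line)
        if summary and current is not None:
            _tests, failures, errors = (int(g) for g in summary.groups())
            if failures or errors:
                failing.append(current)
            current = None  # a summary line closes the current suite
    return failing
```

A summary line with no preceding "Running" line (as at the top of a truncated paste) is skipped, since the suite it belongs to is unknown.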
[jira] [Comment Edited] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339252#comment-14339252 ]

Henry Saputra edited comment on NUTCH-1946 at 2/26/15 9:56 PM:
---------------------------------------------------------------

I tried to replicate the error stack, but when I ran ant test I just saw this at the end:

    [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
    [junit] Running org.apache.nutch.util.TestSuffixStringMatcher
    [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
    [junit] Running org.apache.nutch.util.TestTableUtil
    [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.021 sec
    [junit] Running org.apache.nutch.util.TestURLUtil
    [junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.376 sec
    [junit] Running org.apache.nutch.webui.client.TestCrawlCycle
    [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.247 sec
    [junit] Running org.apache.nutch.webui.client.TestNutchClientFactory
    [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.673 sec
    [junit] Running org.apache.nutch.webui.client.TestRemoteCommandExecutor
    [junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.249 sec
    [junit] Running org.apache.nutch.webui.client.TestRemoteCommandsBatchFactory
    [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.085 sec
    [junit] Running org.apache.nutch.webui.view.TestColorEnumLabel
    [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.451 sec

BUILD FAILED
/Users/hsaputra/open/asf/nutch/branches/2_x/build.xml:450: Tests failed!

Where can I see which tests are failing?
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339224#comment-14339224 ]

Lewis John McGibbney commented on NUTCH-1946:
---------------------------------------------

Grand
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339203#comment-14339203 ]

Henry Saputra commented on NUTCH-1946:
--------------------------------------

Ah thanks, it compiled now =)
Re: Unsubscribe
Massimo,

http://nutch.apache.org/mailing_lists.html => dev-unsubscr...@nutch.apache.org

Thanks

On 26 February 2015 at 19:11, Massimo Miccoli wrote:

> Massimo
Re: MetaData for near duplicates
Ya, I know about that. But since parse_data already does that for us, I did not want to do the same processing again. I will try to figure something out.

Thanks a lot.

Regards,
Ami Parikh
(213)590-0005

On Thu, Feb 26, 2015 at 12:39 PM, Renxia Wang wrote:

> I am not sure how you implemented it, so it is hard to tell. You may want to
> take a look at SegmentReader's get and getMapRecords methods; those may give
> you ideas. You can also use SegmentReader.get directly to get the segment
> data. However, it is slow because it calls sleep(5000) every time you call
> it, so slow that you definitely cannot get the result by tomorrow by running
> it on your 50K-URL data set. Calling SegmentReader.get on all the segments at
> the same time from multiple threads can speed this up, but if you have a lot
> of segments (like me, > 20), you will run into OutOfMemory errors, even if
> you set the Java heap size to 4 GB (or even more). I am stuck here. T_T
>
> Zhique
Re: MetaData for near duplicates
I am not sure how you implemented it, so it is hard to tell. You may want to take a look at SegmentReader's get and getMapRecords methods; those may give you ideas. You can also use SegmentReader.get directly to get the segment data. However, it is slow because it calls sleep(5000) every time you call it, so slow that you definitely cannot get the result by tomorrow by running it on your 50K-URL data set. Calling SegmentReader.get on all the segments at the same time from multiple threads can speed this up, but if you have a lot of segments (like me, > 20), you will run into OutOfMemory errors, even if you set the Java heap size to 4 GB (or even more). I am stuck here. T_T

Zhique

On Thu, Feb 26, 2015 at 11:54 AM, Ami Akshay Parikh wrote:

> I am using the MapFileReader to iterate through the file. And I read the
> key into a Text object and the MetaData into a ParseData object. I get the
> following exception:
>
> Exception in thread "main" java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at org.apache.hadoop.io.Text.readString(Text.java:402)
> at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
> at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
> at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
> at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
> at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
> at NearDuplicates.main(NearDuplicates.java:58)
>
> Thanks,
>
> Regards,
> Ami Parikh
> (213)590-0005
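One general way to sit between "one segment at a time" and "all segments at once" is to cap how many segments are being read concurrently and to keep only a small summary per segment. The Python sketch below only illustrates that idea and does not call Nutch. The names process_segments and summarize are hypothetical, and whether capping concurrency avoids the OutOfMemory errors described above depends on where the memory actually goes, which this thread does not establish.

```python
from concurrent.futures import ThreadPoolExecutor


def process_segments(segments, summarize, max_workers=4):
    """Run summarize(segment) for every segment, at most max_workers at a time.

    summarize is a caller-supplied function (hypothetical here) that reads one
    segment and returns something small, so that full segment data is never
    held for more than max_workers segments at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the input order and re-raises any worker exception
        # when its result is consumed.
        return list(pool.map(summarize, segments))
```

With more than 20 segments and max_workers=4, at most four reads are in flight at any moment; the remaining segments wait in the pool's queue.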
[jira] [Commented] (NUTCH-1933) nutch-selenium plugin
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339119#comment-14339119 ]

Lewis John McGibbney commented on NUTCH-1933:
---------------------------------------------

Hi [~jorgelbg], thanks for noticing this. Evidently I did not.

bq. I see that it is possible to use a phantomjs driver with selenium to provide headless browsing. Is there any way to configure the selenium driver used?

Please see NUTCH-1948

> nutch-selenium plugin
> ---------------------
>
>                 Key: NUTCH-1933
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1933
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>            Reporter: Mo Omer
>            Assignee: Mohammad Al-Mohsin
>             Fix For: 1.10
>
>         Attachments: NUTCH-selenium-trunk.patch, NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] plugin to run against trunk.
> I feel that there is a good bit of work to be done here; however, early testing
> on my system suggests that it works.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339103#comment-14339103 ]

Lewis John McGibbney commented on NUTCH-1946:
---------------------------------------------

Hi [~hsaputra], can you try clearing your ~/.ivy2 cache and trying again? I know this sounds pretty extreme, but I just removed mine and it applies fine with no dependency issues.
{code}
$ rm -r ~/.ivy2
$ ant clean runtime
...
$ [copy] Copied 2 empty directories to 2 empty directories under /usr/local/2webgui/runtime/local/test

BUILD SUCCESSFUL
Total time: 6 minutes 36 seconds
{code}
[jira] [Commented] (NUTCH-1933) nutch-selenium plugin
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339080#comment-14339080 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1933:
-------------------------------------------------------

I see a {{target}} folder in /nutch/trunk/src/plugin/protocol-selenium/src/target/. Is this supposed to be there?

I see that it is possible to use a phantomjs driver with selenium to provide headless browsing. Is there any way to configure the selenium driver used?
[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339053#comment-14339053 ]

Henry Saputra commented on NUTCH-1946:
--------------------------------------

Tried to run ant in the 2.0 branch with your patch and saw this:

BUILD FAILED
/Users/hsaputra/open/asf/nutch/branches/2_x/build.xml:468: impossible to ivy retrieve: java.lang.RuntimeException: problem during retrieve of org.apache.nutch#nutch: java.lang.RuntimeException: Multiple artifacts of the module org.apache.avro#avro-ipc;1.7.6 are retrieved to the same file! Update the retrieve pattern to fix this error.
	at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:211)
	at org.apache.ivy.Ivy.retrieve(Ivy.java:555)
	at org.apache.ivy.ant.IvyRetrieve.doExecute(IvyRetrieve.java:97)
	at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
	at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.

Any idea how to get past this?
Re: MetaData for near duplicates
I am using MapFile.Reader to iterate through the file. I read the key into a Text object and the MetaData into a ParseData object. I get the following exception:

Exception in thread "main" java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at org.apache.hadoop.io.Text.readString(Text.java:402)
	at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
	at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
	at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
	at NearDuplicates.main(NearDuplicates.java:58)

Thanks,

Regards,
Ami Parikh
(213)590-0005

On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang wrote:

> Hi Ami,
>
> What method of what class do you use to get the meta data? Please provide
> more info about this, log etc.
>
> Zhique
>
> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh wrote:
>
>> Hello,
>>
>> When I try to use the parse_data from the segment directory for getting
>> the MetaData for finding near duplicates, my code runs into an EOFException.
>> I found something about a bug in nutch in the archives, but I wanted to
>> know if anyone else is facing this problem and how I can possibly resolve
>> it.
>>
>> Thanks,
>>
>> Regards,
>> Ami Parikh
>> (213)590-0005
[jira] [Commented] (NUTCH-1933) nutch-selenium plugin
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338951#comment-14338951 ]

Hudson commented on NUTCH-1933:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #2991 (See [https://builds.apache.org/job/Nutch-trunk/2991/])
NUTCH-1933 nutch-selenium plugin (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1662530)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/lib-selenium
* /nutch/trunk/src/plugin/lib-selenium/build.xml
* /nutch/trunk/src/plugin/lib-selenium/ivy.xml
* /nutch/trunk/src/plugin/lib-selenium/plugin.xml
* /nutch/trunk/src/plugin/lib-selenium/src
* /nutch/trunk/src/plugin/lib-selenium/src/java
* /nutch/trunk/src/plugin/lib-selenium/src/java/org
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
* /nutch/trunk/src/plugin/protocol-selenium
* /nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml
* /nutch/trunk/src/plugin/protocol-selenium/build.xml
* /nutch/trunk/src/plugin/protocol-selenium/ivy.xml
* /nutch/trunk/src/plugin/protocol-selenium/plugin.xml
* /nutch/trunk/src/plugin/protocol-selenium/src
* /nutch/trunk/src/plugin/protocol-selenium/src/java
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
* /nutch/trunk/src/plugin/protocol-selenium/src/target
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
Unsubscribe
Massimo

> On 26 Feb 2015, at 19:31, lewi...@apache.org wrote:
>
> Author: lewismc
> Date: Thu Feb 26 18:31:39 2015
> New Revision: 1662530
>
> URL: http://svn.apache.org/r1662530
> Log:
> NUTCH-1933 nutch-selenium plugin
>
> Added:
>    nutch/trunk/src/plugin/lib-selenium/
>    nutch/trunk/src/plugin/lib-selenium/build.xml
>    nutch/trunk/src/plugin/lib-selenium/ivy.xml
>    nutch/trunk/src/plugin/lib-selenium/plugin.xml
>    nutch/trunk/src/plugin/lib-selenium/src/
>    nutch/trunk/src/plugin/lib-selenium/src/java/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/
>    nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
>    nutch/trunk/src/plugin/protocol-selenium/
>    nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml
>    nutch/trunk/src/plugin/protocol-selenium/build.xml
>    nutch/trunk/src/plugin/protocol-selenium/ivy.xml
>    nutch/trunk/src/plugin/protocol-selenium/plugin.xml
>    nutch/trunk/src/plugin/protocol-selenium/src/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
>    nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
>    nutch/trunk/src/plugin/protocol-selenium/src/target/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/
>    nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
> Modified:
>    nutch/trunk/CHANGES.txt
>    nutch/trunk/build.xml
>    nutch/trunk/ivy/ivy.xml
>    nutch/trunk/src/plugin/build.xml
>
> Modified: nutch/trunk/CHANGES.txt
> URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/CHANGES.txt (original)
> +++ nutch/trunk/CHANGES.txt Thu Feb 26 18:31:39 2015
> @@ -2,6 +2,8 @@ Nutch Change Log
>
> Nutch Current Development 1.10-SNAPSHOT
>
> +* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin, lewismc)
> +
> * NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn, snagel, lewismc)
>
> * NUTCH-1724 LinkDBReader to support regex output filtering (markus)
>
> Modified: nutch/trunk/build.xml
> URL: http://svn.apache.org/viewvc/nutch/trunk/build.xml?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/build.xml (original)
> +++ nutch/trunk/build.xml Thu Feb 26 18:31:39 2015
> @@ -184,6 +184,7 @@
> +
> @@ -197,6 +198,7 @@
> +
> @@ -591,6 +593,7 @@
> +
> @@ -604,6 +607,7 @@
> +
> @@ -985,6 +989,8 @@
> +
> +
> @@ -1008,6 +1014,8 @@
> +
> +
>
> Modified: nutch/trunk/ivy/ivy.xml
> URL: http://svn.apache.org/viewvc/nutch/trunk/ivy/ivy.xml?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/ivy/ivy.xml (original)
> +++
nutch/trunk/ivy/ivy.xml Thu Feb 26 18:31:39 2015 > @@ -23,24 +23,24 @@ >database etc. > > > - > + > > > > - > + > > > > > - > + > >conf="*->master" /> >conf="*->master" /> > -
Re: MetaData for near duplicates
Hi Ami, Which method of which class do you use to get the metadata? Please provide more information, such as the log output. Zhique On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh wrote: > Hello, > > When I try to use the parse_data from the segment directory to get > the MetaData for finding near duplicates, my code runs into an EOFException. > I found something about a bug in Nutch in the archives, but I wanted to > know if anyone else is facing this problem and how I can possibly resolve > it. > > Thanks, > > Regards, > Ami Parikh > (213)590-0005 >
MetaData for near duplicates
Hello, When I try to use the parse_data from the segment directory to get the MetaData for finding near duplicates, my code runs into an EOFException. I found something about a bug in Nutch in the archives, but I wanted to know if anyone else is facing this problem and how I can possibly resolve it. Thanks, Regards, Ami Parikh (213)590-0005
[jira] [Resolved] (NUTCH-1933) nutch-selenium plugin
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1933. - Resolution: Fixed Committed @revision 1662530 in trunk > nutch-selenium plugin > - > > Key: NUTCH-1933 > URL: https://issues.apache.org/jira/browse/NUTCH-1933 > Project: Nutch > Issue Type: New Feature > Components: protocol >Reporter: Mo Omer >Assignee: Mohammad Al-Mohsin > Fix For: 1.10 > > Attachments: NUTCH-selenium-trunk.patch, > NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch > > > I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] > plugin to run against trunk. > I feel that there is a good bit of work to be done here; however, early testing > on my system indicates that it works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1933) nutch-selenium plugin
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1933: Assignee: Mohammad Al-Mohsin (was: Lewis John McGibbney) > nutch-selenium plugin > - > > Key: NUTCH-1933 > URL: https://issues.apache.org/jira/browse/NUTCH-1933 > Project: Nutch > Issue Type: New Feature > Components: protocol >Reporter: Mo Omer >Assignee: Mohammad Al-Mohsin > Fix For: 1.10 > > Attachments: NUTCH-selenium-trunk.patch, > NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch > > > I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] > plugin to run against trunk. > I feel that there is a good bit of work to be done here; however, early testing > on my system indicates that it works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338684#comment-14338684 ] Sebastian Nagel commented on NUTCH-1950: Great! For an MD5 calculation, see o.a.hadoop.io.MD5Hash (example usage in src/java/org/apache/nutch/crawl/TextMD5Signature.java). Since an MD5 sum should guarantee a unique name, why not remove or replace ugly characters in the prefix altogether? They may also cause errors if the file system does not allow them. E.g., {noformat} http://en.wikipedia.org/wiki/$100 -> http_en_wikipedia_org_wiki_100_d7a09ded039d2833ff602ac9d4cd5a8d http://en.wikipedia.org/wiki/100 -> http_en_wikipedia_org_wiki_100_483a8ae86d3af6b656cdb3ec67753c24 {noformat} > File name too long when bin/nutch dump > -- > > Key: NUTCH-1950 > URL: https://issues.apache.org/jira/browse/NUTCH-1950 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.10 >Reporter: Chong Li >Priority: Minor > Fix For: 1.10 > > Original Estimate: 48h > Remaining Estimate: 48h > > When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File > name too long". When crawling, the length of the URL may be longer than 255 > bytes, and Nutch saves the file using the URL as the file name. It can be saved in > segments, but when dumping the files to the local file system, the length of the > filename cannot be longer than 255 bytes. > FileDumper.java needs to be changed to handle such an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
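The naming scheme proposed in the comment above (a short, sanitized prefix of the URL plus the MD5 of the full URL) can be sketched roughly as follows. This is only an illustration, not the FileDumper.java patch: the class name DumpFileNames, the methods md5Hex and toFileName, and the 32-character prefix limit are assumptions made for the example, and it uses java.security.MessageDigest from the JDK in place of o.a.hadoop.io.MD5Hash so that it runs without Hadoop on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: derives a short, stable file name from a URL.
public class DumpFileNames {

    // Assumed prefix length, taken from the "first 32 characters" suggestion in the thread.
    private static final int PREFIX_LENGTH = 32;

    // Hex-encoded MD5 of the UTF-8 bytes of s (always 32 hex characters).
    public static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Readable prefix (non-alphanumeric runs collapsed to '_') plus the MD5 of the full URL.
    public static String toFileName(String url) {
        String cleaned = url.replaceAll("[^A-Za-z0-9]+", "_");
        cleaned = cleaned.replaceAll("^_+|_+$", "");
        if (cleaned.length() > PREFIX_LENGTH) {
            cleaned = cleaned.substring(0, PREFIX_LENGTH);
        }
        // At most 32 + 1 + 32 = 65 characters, well below the 255-byte limit.
        return cleaned + "_" + md5Hex(url);
    }

    public static void main(String[] args) {
        System.out.println(toFileName("http://en.wikipedia.org/wiki/$100"));
        System.out.println(toFileName("http://en.wikipedia.org/wiki/100"));
    }
}
```

Because the hash is computed over the original URL rather than the sanitized prefix, the two Wikipedia URLs from the example share the prefix http_en_wikipedia_org_wiki_100 but still get different file names, and the same URL always maps to the same name, unlike a system-time fall-back.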
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338211#comment-14338211 ] Chong Li commented on NUTCH-1950: - I have thought about that; at first we just wanted every new filename to be unique. I previously tried keeping exactly the first 255 characters, and then the first 128 characters, as the filename, but the result was still not human readable because it contained a lot of random characters, which is also why those filenames are so long. I think it is a good idea to keep the first 32 characters, or just the domain name, and then append a unique key. Thanks for the advice! I will change my solution! > File name too long when bin/nutch dump > -- > > Key: NUTCH-1950 > URL: https://issues.apache.org/jira/browse/NUTCH-1950 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.10 >Reporter: Chong Li >Priority: Minor > Fix For: 1.10 > > Original Estimate: 48h > Remaining Estimate: 48h > > When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File > name too long". When crawling, the length of the URL may be longer than 255 > bytes, and Nutch saves the file using the URL as the file name. It can be saved in > segments, but when dumping the files to the local file system, the length of the > filename cannot be longer than 255 bytes. > FileDumper.java needs to be changed to handle such an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338198#comment-14338198 ] Sebastian Nagel commented on NUTCH-1950: Is it really a good idea to use the system time as the fall-back file name? One could take, e.g., the first 32 characters (for human readability) plus the MD5 of the filename/URL: this would make the filename predictable and constant over time. > File name too long when bin/nutch dump > -- > > Key: NUTCH-1950 > URL: https://issues.apache.org/jira/browse/NUTCH-1950 > Project: Nutch > Issue Type: Bug > Components: segment >Affects Versions: 1.10 >Reporter: Chong Li >Priority: Minor > Fix For: 1.10 > > Original Estimate: 48h > Remaining Estimate: 48h > > When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File > name too long". When crawling, the length of the URL may be longer than 255 > bytes, and Nutch saves the file using the URL as the file name. It can be saved in > segments, but when dumping the files to the local file system, the length of the > filename cannot be longer than 255 bytes. > FileDumper.java needs to be changed to handle such an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332)