[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339609#comment-14339609
 ] 

Lewis John McGibbney commented on NUTCH-1946:
-

bq. Where can I see which tests are failing?
The test which is failing is in 

{code}
$NUTCH_HOME/build/test/org.apache.nutch.fetcher.TestFetcher
{code}
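
When a long ant test run ends with only "Tests failed!", one way to find the offending suite is to capture the console output and look for summary lines with non-zero failure or error counts. The following is a minimal sketch, assuming the log has the "[junit] Running ..." / "[junit] Tests run: ..." shape quoted in this thread. The class and method names are hypothetical and not part of Nutch, and the sample counts for TestFetcher are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: scans Ant junit console output for suites with failures or errors. */
final class JUnitLogScan {
    private static final Pattern RUNNING =
        Pattern.compile("\\[junit\\] Running (\\S+)");
    private static final Pattern SUMMARY =
        Pattern.compile("\\[junit\\] Tests run: (\\d+), Failures: (\\d+), Errors: (\\d+)");

    /** Returns the suites whose summary line shows a non-zero failure or error count. */
    static List<String> failingSuites(List<String> logLines) {
        List<String> failing = new ArrayList<>();
        String current = null; // suite named by the most recent "Running" line
        for (String line : logLines) {
            Matcher running = RUNNING.matcher(line);
            if (running.find()) {
                current = running.group(1);
                continue;
            }
            Matcher summary = SUMMARY.matcher(line);
            if (summary.find() && current != null) {
                int failures = Integer.parseInt(summary.group(2));
                int errors = Integer.parseInt(summary.group(3));
                if (failures > 0 || errors > 0) {
                    failing.add(current);
                }
                current = null; // wait for the next "Running" line
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        // Illustrative input only; the TestFetcher counts are made up.
        List<String> log = Arrays.asList(
            "[junit] Running org.apache.nutch.fetcher.TestFetcher",
            "[junit] Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.2 sec",
            "[junit] Running org.apache.nutch.util.TestURLUtil",
            "[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.376 sec");
        System.out.println(failingSuites(log));
    }
}
```

Summary lines that were wrapped after "Time elapsed:" still match, because the pattern only needs the counts at the start of the line.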

> Upgrade to Gora 0.6
> -------------------
>
>                 Key: NUTCH-1946
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1946
>             Project: Nutch
>          Issue Type: Improvement
>          Components: storage
>    Affects Versions: 2.3.1
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-1946.patch
>
>
> Apache Gora was released recently.
> We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
> for the new Docker containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339252#comment-14339252
 ] 

Henry Saputra edited comment on NUTCH-1946 at 2/26/15 9:56 PM:
---

Tried to replicate the error stack, but when I ran ant test I just saw this at 
the end:

[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
[junit] Running org.apache.nutch.util.TestSuffixStringMatcher
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.046 sec
[junit] Running org.apache.nutch.util.TestTableUtil
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.021 sec
[junit] Running org.apache.nutch.util.TestURLUtil
[junit] Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.376 sec
[junit] Running org.apache.nutch.webui.client.TestCrawlCycle
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.247 sec
[junit] Running org.apache.nutch.webui.client.TestNutchClientFactory
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.673 sec
[junit] Running org.apache.nutch.webui.client.TestRemoteCommandExecutor
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.249 sec
[junit] Running org.apache.nutch.webui.client.TestRemoteCommandsBatchFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.085 sec
[junit] Running org.apache.nutch.webui.view.TestColorEnumLabel
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.451 sec

BUILD FAILED
/Users/hsaputra/open/asf/nutch/branches/2_x/build.xml:450: Tests failed!

Where can I see which tests are failing?






[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339224#comment-14339224
 ] 

Lewis John McGibbney commented on NUTCH-1946:
-

Grand



[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339203#comment-14339203
 ] 

Henry Saputra commented on NUTCH-1946:
--

Ah thanks, it compiled now =)



Re: Unsubscribe

2015-02-26 Thread Julien Nioche
Massimo,

http://nutch.apache.org/mailing_lists.html

=> dev-unsubscr...@nutch.apache.org

Thanks


Re: MetaData for near duplicates

2015-02-26 Thread Ami Akshay Parikh
Yes, I know about that. But since Parse_Data already does that for us, I did
not want to do the same processing again. I will try to figure something
out. Thanks a lot.

Regards,
Ami Parikh
(213)590-0005



Re: MetaData for near duplicates

2015-02-26 Thread Renxia Wang
I am not sure how you implemented it, so it is hard to tell. You may want to
take a look at SegmentReader's get and getMapRecords methods; those may give
you ideas. You can also use SegmentReader.get directly to get the segment
data. However, it is slow because it calls sleep(5000) every time you call
it, so slow that you definitely will not have the result by tomorrow if you
run it on your 50K-URL data set. Calling SegmentReader.get on all the
segments at the same time from multiple threads can speed this up, but if
you have a lot of segments (like me, > 20), you will run into OutOfMemory
errors, even if you set the Java heap size to 4 GB or more. That is where I
am stuck. T_T

Zhique
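
One common way to trade some speed for a lower peak memory footprint is to read the segments through a fixed-size thread pool instead of starting one reader per segment at once. The following is a minimal sketch of that idea only. The class name, the segment paths, and the readSegment callback are hypothetical placeholders; it does not call SegmentReader, and whether it avoids the OutOfMemory errors depends on how much each reader holds in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: processes segments with a bounded number of worker threads. */
final class BoundedSegmentScan {
    /**
     * Applies readSegment to every segment path using at most maxThreads
     * workers and returns the results in input order.
     */
    static <R> List<R> scan(List<String> segments, int maxThreads,
                            Function<String, R> readSegment) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String segment : segments) {
                Callable<R> task = () -> readSegment.apply(segment);
                pending.add(pool.submit(task)); // queued until a worker is free
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : pending) {
                try {
                    results.add(future.get()); // blocks until that segment is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            segments.add("crawl/segments/seg-" + i); // hypothetical paths
        }
        // Placeholder work: a real callback would open and read the segment.
        List<Integer> sizes = scan(segments, 4, String::length);
        System.out.println(sizes.size() + " segments processed");
    }
}
```

All results are still collected in one list, so each per-segment result should be kept small (for example a digest or a count rather than the full records).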





[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339119#comment-14339119
 ] 

Lewis John McGibbney commented on NUTCH-1933:
-

Hi [~jorgelbg], thanks for noticing this. Evidently I did not.
bq. I see that it is possible to use a phantomjs driver with selenium to provide 
headless browsing. Is there any way to configure the selenium driver used?
Please see NUTCH-1948

> nutch-selenium plugin
> ---------------------
>
>                 Key: NUTCH-1933
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1933
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>            Reporter: Mo Omer
>            Assignee: Mohammad Al-Mohsin
>             Fix For: 1.10
>
>         Attachments: NUTCH-selenium-trunk.patch, 
> NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> I feel that there is a good bit of work to be done here; however, early 
> testing on my system indicates that it works. 





[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339103#comment-14339103
 ] 

Lewis John McGibbney commented on NUTCH-1946:
-

Hi [~hsaputra], can you try clearing your ~/.ivy2 cache and trying again? I 
know this sounds pretty extreme but I just removed mine and it applies fine 
with no dependency issues.

{code}
$ rm -r ~/.ivy2
$ ant clean runtime
...
     [copy] Copied 2 empty directories to 2 empty directories under /usr/local/2webgui/runtime/local/test

BUILD SUCCESSFUL
Total time: 6 minutes 36 seconds
{code}



[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339080#comment-14339080
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1933:
---

I see a {{target}} folder in 
/nutch/trunk/src/plugin/protocol-selenium/src/target/. Is this supposed to be 
there? I also see that it is possible to use a phantomjs driver with selenium 
to provide headless browsing. Is there any way to configure the selenium 
driver used?



[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-26 Thread Henry Saputra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339053#comment-14339053
 ] 

Henry Saputra commented on NUTCH-1946:
--

Tried to run ant in the 2.x branch with your patch and saw this:

BUILD FAILED
/Users/hsaputra/open/asf/nutch/branches/2_x/build.xml:468: impossible to ivy retrieve: java.lang.RuntimeException: problem during retrieve of org.apache.nutch#nutch: java.lang.RuntimeException: Multiple artifacts of the module org.apache.avro#avro-ipc;1.7.6 are retrieved to the same file! Update the retrieve pattern to fix this error.
at org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:211)
at org.apache.ivy.Ivy.retrieve(Ivy.java:555)
at org.apache.ivy.ant.IvyRetrieve.doExecute(IvyRetrieve.java:97)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.

Any idea how to get past this?



Re: MetaData for near duplicates

2015-02-26 Thread Ami Akshay Parikh
I am using MapFile.Reader to iterate through the file. I read the key into a
Text object and the metadata into a ParseData object. I get the following
exception:

Exception in thread "main" java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at org.apache.hadoop.io.Text.readString(Text.java:402)
at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
at NearDuplicates.main(NearDuplicates.java:58)
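
The top frames of this trace show DataInputStream.readFully running out of bytes while a string was being read. The following is a minimal sketch of that general failure mode, using a simplified length-prefixed record rather than Hadoop's actual encoding. The class and method names are hypothetical, and the sketch only illustrates the mechanics of the exception, not the cause in this particular case.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical demo: a length-prefixed record that is cut short fails in readFully. */
final class TruncatedRecordDemo {
    /** Writes a 4-byte length followed by the UTF-8 bytes of s. */
    static byte[] encode(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(payload.length);
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the length prefix, then exactly that many bytes. */
    static String decode(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload); // throws EOFException if fewer bytes remain
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Returns true when decoding reaches end-of-stream before the record is complete. */
    static boolean hitsEof(byte[] record) {
        try {
            decode(record);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] full = encode("title");
        byte[] cut = Arrays.copyOf(full, full.length - 2); // drop the last two bytes
        System.out.println(hitsEof(full)); // prints "false"
        System.out.println(hitsEof(cut));  // prints "true"
    }
}
```

The same symptom can appear whenever the reader expects more bytes than the stored value contains, for instance when the data is truncated or when the value is deserialized with a class or format version that does not match what was written.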

Thanks,

Regards,
Ami Parikh
(213)590-0005

On Thu, Feb 26, 2015 at 11:00 AM, Renxia Wang  wrote:

> Hi Ami,
>
> What method of what class do you use to get the meta data? Please provide
> more info about this, log etc.
>
> Zhique
>
> On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh 
> wrote:
>
>> Hello,
>>
>> When I try to use the parse_data from the segment directory to get the
>> metadata for finding near duplicates, my code runs into an EOFException.
>> I found something about a bug in Nutch in the archives, but I wanted to
>> know whether anyone else is facing this problem and how I can resolve
>> it.
>>
>> Thanks,
>>
>> Regards,
>> Ami Parikh
>> (213)590-0005
>>
>
>


[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338951#comment-14338951
 ] 

Hudson commented on NUTCH-1933:
---

SUCCESS: Integrated in Nutch-trunk #2991 (See 
[https://builds.apache.org/job/Nutch-trunk/2991/])
NUTCH-1933 nutch-selenium plugin (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1662530)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/lib-selenium
* /nutch/trunk/src/plugin/lib-selenium/build.xml
* /nutch/trunk/src/plugin/lib-selenium/ivy.xml
* /nutch/trunk/src/plugin/lib-selenium/plugin.xml
* /nutch/trunk/src/plugin/lib-selenium/src
* /nutch/trunk/src/plugin/lib-selenium/src/java
* /nutch/trunk/src/plugin/lib-selenium/src/java/org
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol
* 
/nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium
* 
/nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
* /nutch/trunk/src/plugin/protocol-selenium
* /nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml
* /nutch/trunk/src/plugin/protocol-selenium/build.xml
* /nutch/trunk/src/plugin/protocol-selenium/ivy.xml
* /nutch/trunk/src/plugin/protocol-selenium/plugin.xml
* /nutch/trunk/src/plugin/protocol-selenium/src
* /nutch/trunk/src/plugin/protocol-selenium/src/java
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol
* 
/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium
* 
/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
* 
/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
* 
/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
* /nutch/trunk/src/plugin/protocol-selenium/src/target
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit
* /nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html


> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Reporter: Mo Omer
>Assignee: Mohammad Al-Mohsin
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch, 
> NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> I feel that there is a good bit of work still to be done here; however, early 
> testing on my system indicates that it works. 





Unsubscribe

2015-02-26 Thread Massimo Miccoli


Massimo

> On 26 Feb 2015, at 19:31, lewi...@apache.org wrote:
> 
> Author: lewismc
> Date: Thu Feb 26 18:31:39 2015
> New Revision: 1662530
> 
> URL: http://svn.apache.org/r1662530
> Log:
> NUTCH-1933 nutch-selenium plugin
> 
> Added:
>nutch/trunk/src/plugin/lib-selenium/
>nutch/trunk/src/plugin/lib-selenium/build.xml
>nutch/trunk/src/plugin/lib-selenium/ivy.xml
>nutch/trunk/src/plugin/lib-selenium/plugin.xml
>nutch/trunk/src/plugin/lib-selenium/src/
>nutch/trunk/src/plugin/lib-selenium/src/java/
>nutch/trunk/src/plugin/lib-selenium/src/java/org/
>nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/
>nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/
>nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/
>
> nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/
>
> nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
>nutch/trunk/src/plugin/protocol-selenium/
>nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml
>nutch/trunk/src/plugin/protocol-selenium/build.xml
>nutch/trunk/src/plugin/protocol-selenium/ivy.xml
>nutch/trunk/src/plugin/protocol-selenium/plugin.xml
>nutch/trunk/src/plugin/protocol-selenium/src/
>nutch/trunk/src/plugin/protocol-selenium/src/java/
>nutch/trunk/src/plugin/protocol-selenium/src/java/org/
>nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/
>nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/
>
> nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/
>
> nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/
>
> nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
>
> nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
>
> nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
>nutch/trunk/src/plugin/protocol-selenium/src/target/
>nutch/trunk/src/plugin/protocol-selenium/src/target/classes/
>nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/
>nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/
>
> nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/
>
> nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/
>
> nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/
>
> nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
> Modified:
>nutch/trunk/CHANGES.txt
>nutch/trunk/build.xml
>nutch/trunk/ivy/ivy.xml
>nutch/trunk/src/plugin/build.xml
> 
> Modified: nutch/trunk/CHANGES.txt
> URL: 
> http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/CHANGES.txt (original)
> +++ nutch/trunk/CHANGES.txt Thu Feb 26 18:31:39 2015
> @@ -2,6 +2,8 @@ Nutch Change Log
> 
> Nutch Current Development 1.10-SNAPSHOT
> 
> +* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin, lewismc)
> +
> * NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn, snagel, 
> lewismc)
> 
> * NUTCH-1724 LinkDBReader to support regex output filtering (markus)
> 
> Modified: nutch/trunk/build.xml
> URL: 
> http://svn.apache.org/viewvc/nutch/trunk/build.xml?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/build.xml (original)
> +++ nutch/trunk/build.xml Thu Feb 26 18:31:39 2015
> @@ -184,6 +184,7 @@
>   
>   
>   
> +  
>   
>   
>   
> @@ -197,6 +198,7 @@
>   
>   
>   
> +  
>   
>   
>   
> @@ -591,6 +593,7 @@
>   
>   
>   
> +  
>   
>   
>   
> @@ -604,6 +607,7 @@
>   
>   
>   
> +  
>   
>   
>   
> @@ -985,6 +989,8 @@
> 
> 
> 
> +
> +
> 
> 
> 
> @@ -1008,6 +1014,8 @@
> 
> 
> 
> +
> +
> 
> 
> 
> 
> Modified: nutch/trunk/ivy/ivy.xml
> URL: 
> http://svn.apache.org/viewvc/nutch/trunk/ivy/ivy.xml?rev=1662530&r1=1662529&r2=1662530&view=diff
> ==
> --- nutch/trunk/ivy/ivy.xml (original)
> +++ nutch/trunk/ivy/ivy.xml Thu Feb 26 18:31:39 2015
> @@ -23,24 +23,24 @@
>database etc.
>
>
> -
> +
>
>
>
> -
> +
>
>
>
>
> -
> +
>
>conf="*->master" />
>conf="*->master" />
> -

Re: MetaData for near duplicates

2015-02-26 Thread Renxia Wang
Hi Ami,

Which class and method do you use to read the metadata? Please provide
more details, such as the relevant log output.

Zhique

On Thu, Feb 26, 2015 at 10:53 AM, Ami Akshay Parikh 
wrote:

> Hello,
>
> When I try to read parse_data from the segment directory to get the
> MetaData for finding near duplicates, my code runs into an EOFException.
> I found something about a bug in Nutch in the archives, but I wanted to
> know whether anyone else is facing this problem and how I can resolve
> it.
>
> Thanks,
>
> Regards,
> Ami Parikh
> (213)590-0005
>


MetaData for near duplicates

2015-02-26 Thread Ami Akshay Parikh
Hello,

When I try to read parse_data from the segment directory to get the
MetaData for finding near duplicates, my code runs into an EOFException. I
found something about a bug in Nutch in the archives, but I wanted to know
whether anyone else is facing this problem and how I can resolve it.

Thanks,

Regards,
Ami Parikh
(213)590-0005


[jira] [Resolved] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1933.
-
Resolution: Fixed

Committed @revision 1662530 in trunk

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Reporter: Mo Omer
>Assignee: Mohammad Al-Mohsin
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch, 
> NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> I feel that there is a good bit of work still to be done here; however, early 
> testing on my system indicates that it works. 





[jira] [Updated] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1933:

Assignee: Mohammad Al-Mohsin  (was: Lewis John McGibbney)

> nutch-selenium plugin
> -
>
> Key: NUTCH-1933
> URL: https://issues.apache.org/jira/browse/NUTCH-1933
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Reporter: Mo Omer
>Assignee: Mohammad Al-Mohsin
> Fix For: 1.10
>
> Attachments: NUTCH-selenium-trunk.patch, 
> NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch
>
>
> I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
> plugin to run against trunk.
> I feel that there is a good bit of work still to be done here; however, early 
> testing on my system indicates that it works. 





[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338684#comment-14338684
 ] 

Sebastian Nagel commented on NUTCH-1950:


Great! For an MD5 calculation, see o.a.hadoop.io.MD5Hash (example usage in 
src/java/org/apache/nutch/crawl/TextMD5Signature.java). Since the MD5 sum should 
already make the name unique, why not remove or replace ugly characters in the 
prefix altogether? They may also cause errors if the file system does not allow them. E.g.,
{noformat}
 http://en.wikipedia.org/wiki/$100  ->  http_en_wikipedia_org_wiki_100_d7a09ded039d2833ff602ac9d4cd5a8d
 http://en.wikipedia.org/wiki/100   ->  http_en_wikipedia_org_wiki_100_483a8ae86d3af6b656cdb3ec67753c24
{noformat}
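
A minimal sketch of the character replacement described above. It uses plain JDK classes instead of o.a.hadoop.io.MD5Hash so that it runs on its own. The class and method names (SanitizedDumpName, sanitize, md5Hex, fileNameFor) are hypothetical and not part of FileDumper. The rule "replace every run of non-alphanumeric characters with one underscore" is an assumption chosen to reproduce the prefixes in the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
class SanitizedDumpName {

  // Replace each run of characters outside [A-Za-z0-9] with a single '_',
  // then drop any underscores left at either end.
  static String sanitize(String url) {
    return url.replaceAll("[^A-Za-z0-9]+", "_").replaceAll("^_+|_+$", "");
  }

  // MD5 of the UTF-8 bytes of s, as 32 lowercase hex digits.
  static String md5Hex(String s) {
    try {
      byte[] digest =
          MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
      return String.format("%032x", new BigInteger(1, digest));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Sanitized prefix for readability, MD5 of the original URL for uniqueness.
  static String fileNameFor(String url) {
    return sanitize(url) + "_" + md5Hex(url);
  }

  public static void main(String[] args) {
    System.out.println(fileNameFor("http://en.wikipedia.org/wiki/$100"));
    System.out.println(fileNameFor("http://en.wikipedia.org/wiki/100"));
  }
}
```

Both URLs yield the same sanitized prefix and differ only in the hash suffix. The sketch does not shorten the prefix, so a length limit would still have to be applied for very long URLs.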


> File name too long when bin/nutch dump
> --
>
> Key: NUTCH-1950
> URL: https://issues.apache.org/jira/browse/NUTCH-1950
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.10
>Reporter: Chong Li
>Priority: Minor
> Fix For: 1.10
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When running bin/nutch dump in version 1.10-trunk, an exception saying "File 
> name too long" is thrown. When crawling, the length of the URL may be longer 
> than 255 bytes, and Nutch saves the file using the URL as the file name. It can 
> be saved in segments, but when dumping the files to the local file system, the 
> filename cannot be longer than 255 bytes. 
> FileDumper.java needs to be changed to handle this exception.





[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Chong Li (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338211#comment-14338211
 ] 

Chong Li commented on NUTCH-1950:
-

I have thought about that. At first we just wanted every new filename to be 
unique. 

Earlier I tried keeping exactly the first 255 characters, and then the first 
128 characters, as the filename, but the result was still not human readable 
because it contained a lot of random characters, which is also the reason why 
those filenames are so long.

I think it is a good idea to keep the first 32 characters, or just the 
domain name, and then append a unique key. 
Thanks for the advice! I will change my solution!

> File name too long when bin/nutch dump
> --
>
> Key: NUTCH-1950
> URL: https://issues.apache.org/jira/browse/NUTCH-1950
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.10
>Reporter: Chong Li
>Priority: Minor
> Fix For: 1.10
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When running bin/nutch dump in version 1.10-trunk, an exception saying "File 
> name too long" is thrown. When crawling, the length of the URL may be longer 
> than 255 bytes, and Nutch saves the file using the URL as the file name. It can 
> be saved in segments, but when dumping the files to the local file system, the 
> filename cannot be longer than 255 bytes. 
> FileDumper.java needs to be changed to handle this exception.





[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-02-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338198#comment-14338198
 ] 

Sebastian Nagel commented on NUTCH-1950:


Is it really a good idea to take the system time as the fall-back file name? We 
could take, e.g., the first 32 characters (for human readability) plus the MD5 of 
the filename/URL: this would make the filename predictable and constant over time.
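
A rough sketch of that fall-back, assuming `name` is the over-long file name the dumper derived from the URL and that it contains no path separators. The class PrefixPlusMd5Name and its methods are hypothetical. java.security.MessageDigest stands in for Hadoop's MD5Hash so that the snippet is self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
class PrefixPlusMd5Name {

  // Number of leading characters kept for human readability.
  static final int PREFIX_LENGTH = 32;

  // MD5 of the UTF-8 bytes of s, as 32 lowercase hex digits.
  static String md5Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  // Keep at most PREFIX_LENGTH characters of the name and append the MD5 of
  // the full URL. The result depends only on the inputs, never on the clock.
  static String fileNameFor(String name, String url) {
    String prefix = name.length() > PREFIX_LENGTH ? name.substring(0, PREFIX_LENGTH) : name;
    return prefix + "_" + md5Hex(url);
  }

  public static void main(String[] args) {
    System.out.println(fileNameFor("index.html", "http://example.com/index.html"));
  }
}
```

Unlike a timestamp, the same URL maps to the same file name on every run, and the result is at most 65 characters long (32 for the prefix, one separator, 32 hex digits).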

> File name too long when bin/nutch dump
> --
>
> Key: NUTCH-1950
> URL: https://issues.apache.org/jira/browse/NUTCH-1950
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.10
>Reporter: Chong Li
>Priority: Minor
> Fix For: 1.10
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When running bin/nutch dump in version 1.10-trunk, an exception saying "File 
> name too long" is thrown. When crawling, the length of the URL may be longer 
> than 255 bytes, and Nutch saves the file using the URL as the file name. It can 
> be saved in segments, but when dumping the files to the local file system, the 
> filename cannot be longer than 255 bytes. 
> FileDumper.java needs to be changed to handle this exception.


