Re: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException
Hi Taichi:

Which plugins do you have enabled in nutch-site.xml?

----- Original message -----
From: "Taichi Ho"
To: dev@nutch.apache.org
Sent: Wednesday, 30 September 2015 16:57:39
Subject: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException

Hi,

I have the same problem. The following is part of my log:
http://pastebin.com/JjkJ1qe6

It seems there is a read timeout, but when I paste the URL into the browser it works fine. Any ideas what could be causing this problem?

Thanks.

On Mon, Sep 28, 2015 at 7:46 AM Michael Joyce < jo...@apache.org > wrote:

> I don't see any null pointer exceptions coming up in your log. Do you have any more info or perhaps I'm missing something?
>
> -- Jimmy
>
> On Sun, Sep 27, 2015 at 3:04 PM, mithun < mithun626...@gmail.com > wrote:
>
>> Hi All
>>
>> While crawling my seed list, I ran into this NullPointerException for a few URLs. What could be the problem? Please find the Pastebin link to my hadoop.log file:
>> http://pastebin.com/SyyybtEx
>>
>> Thanks
>> Mithun
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939939#comment-14939939 ]

Michael Joyce commented on NUTCH-2129:
--------------------------------------

Thanks Julien. I figured there would probably be a few thoughts on this, so I appreciate the feedback. I'll check out the stuff you mentioned. Thanks for the ideas.

> Track Protocol Status in Crawl Datum
> ------------------------------------
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.3, 1.10
> Reporter: Michael Joyce
> Fix For: 2.4, 1.11
>
> It's become necessary on a few crawls that I run to get protocol status code
> stats. After speaking with [~lewismc] it seemed that there might not be a
> super convenient way of doing this as is, but it would be great to be able to
> add the functionality necessary to pull this information out.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
Atomic update and optimistic concurrency in Solr
Hi all:

I'm trying to perform an atomic update or an optimistic concurrency update in Solr. Can anyone help me?
[jira] [Commented] (NUTCH-2128) Refactor configuration end point
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940064#comment-14940064 ]

ASF GitHub Bot commented on NUTCH-2128:
---------------------------------------

GitHub user sujen1412 opened a pull request:

    https://github.com/apache/nutch/pull/69

    fix for NUTCH-2128 Refactor config endpoint by Sujen shah

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/nutch NUTCH-2128

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/69.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #69

commit f9c80a4bba43c0a117804d4997303a5a974f4cc2
Author: Sujen Shah
Date:   2015-09-29T19:07:13Z

    Refactor config endpoint

> Refactor configuration end point
> --------------------------------
>
> Key: NUTCH-2128
> URL: https://issues.apache.org/jira/browse/NUTCH-2128
> Project: Nutch
> Issue Type: Sub-task
> Components: REST_api
> Reporter: Sujen Shah
> Assignee: Sujen Shah
> Priority: Minor
> Fix For: 1.11
>
> To better define the endpoint to create a new configuration and add a new
> endpoint to update a particular property value of a configuration.
[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data
[ https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940025#comment-14940025 ]

Asitang Mishra commented on NUTCH-2108:
---------------------------------------

[~chrismattmann]

> Add a function to the selenium interactive plugin interface to do multiple
> manipulation of driver and then return the data
> --------------------------------------------------------------------------
>
> Key: NUTCH-2108
> URL: https://issues.apache.org/jira/browse/NUTCH-2108
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 1.10
> Reporter: Asitang Mishra
> Labels: memex
>
> In the interactive selenium plugin we have to create a handler class for each
> manipulation of a page. Sometimes we need to manipulate a page in many ways
> and keep track of those manipulations, for example clicking on each link in a
> table and then refreshing to get the original page back, since even one click
> can make all the other links go away. This can be done in a single loop, but
> it would be too much work and overly complicated using multiple handlers. So I
> am proposing a new function "String multiProcessDriver(WebDriver driver)"
> that takes the driver and returns a concatenated String, alongside the
> already present "void processDriver(WebDriver driver)".
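The single-loop idea described in the issue can be sketched roughly as follows. This is only an illustration in Python, not the plugin's Java code: `FakeDriver` and all of its methods are hypothetical stand-ins invented here (they are not the Selenium WebDriver API), and `multi_process_driver` merely mirrors the shape of the proposed `String multiProcessDriver(WebDriver driver)`.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; NOT the Selenium API."""

    def __init__(self, links):
        self._all_links = list(links)
        self._visible = list(links)
        self._content = "table"

    def visible_links(self):
        return list(self._visible)

    def click(self, link):
        # One click replaces the table, so every other link disappears.
        self._visible = []
        self._content = "detail for " + link

    def page_text(self):
        return self._content

    def refresh(self):
        # Reloading brings back the original page with all of its links.
        self._visible = list(self._all_links)
        self._content = "table"


def multi_process_driver(driver):
    """Click each link in turn, collect the resulting text, refresh, concatenate."""
    collected = []
    for index in range(len(driver.visible_links())):
        link = driver.visible_links()[index]  # re-read the links after each refresh
        driver.click(link)
        collected.append(driver.page_text())
        driver.refresh()
    return "\n".join(collected)


print(multi_process_driver(FakeDriver(["row-1", "row-2"])))
```

Keeping the click, collect, and refresh steps in one function is what lets the loop carry state between manipulations, which separate per-manipulation handlers would have to share some other way.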
[GitHub] nutch pull request: fix for NUTCH-2128 Refactor config endpoint by...
GitHub user sujen1412 opened a pull request:

    https://github.com/apache/nutch/pull/69

    fix for NUTCH-2128 Refactor config endpoint by Sujen shah

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sujen1412/nutch NUTCH-2128

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/69.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #69

commit f9c80a4bba43c0a117804d4997303a5a974f4cc2
Author: Sujen Shah
Date:   2015-09-29T19:07:13Z

    Refactor config endpoint

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Assigned] (NUTCH-2128) Refactor configuration end point
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujen Shah reassigned NUTCH-2128:
---------------------------------

Assignee: Sujen Shah
[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data
[ https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940036#comment-14940036 ]

Michael Joyce commented on NUTCH-2108:
--------------------------------------

Good stuff [~asitang], glad to see the workaround proved fruitful and great example handlers!
[jira] [Updated] (NUTCH-2123) Seed List REST API returns Text but headers indicate/require JSON
[ https://issues.apache.org/jira/browse/NUTCH-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sujen Shah updated NUTCH-2123:
------------------------------

Attachment: NUTCH-2123.patch

Patch for correcting the response headers.

> Seed List REST API returns Text but headers indicate/require JSON
> -----------------------------------------------------------------
>
> Key: NUTCH-2123
> URL: https://issues.apache.org/jira/browse/NUTCH-2123
> Project: Nutch
> Issue Type: Bug
> Components: REST_api
> Affects Versions: 1.11
> Reporter: Aron Ahmadia
> Priority: Minor
> Labels: memex
> Fix For: 1.11
> Attachments: NUTCH-2123.patch
>
> nutch.py: POST Endpoint: /seed/create
> nutch.py: POST Request data: {'seedUrls': [{'id': 0, 'url': 'http://aron.ahmadia.net', 'seedList': None}], 'id': '12345', 'name': 'aron'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)', 'content-length': '64', 'date': 'Fri, 25 Sep 2015 05:49:09 GMT'}
> nutch.py: Response status: 200
>
> resp.headers
> {'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)', 'content-length': '64', 'date': 'Fri, 25 Sep 2015 05:49:09 GMT'}
>
> resp.text
> '/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0'
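The mismatch reported in NUTCH-2123 can be reproduced offline with a small check. The sketch below is only an illustration: the header and body values are copied from the nutch.py log in the issue, while the helper name `body_matches_declared_json` is made up here and is not part of nutch.py or Nutch.

```python
import json

# Values copied from the nutch.py log quoted in the issue description.
resp_headers = {"content-type": "application/json"}
resp_text = "/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0"


def body_matches_declared_json(headers, text):
    """Return False when the header claims JSON but the body does not parse as JSON."""
    if "application/json" not in headers.get("content-type", ""):
        return True  # no JSON claim was made, so there is nothing to check
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(body_matches_declared_json(resp_headers, resp_text))
```

A bare file path is not a valid JSON document, so a client that trusts the `content-type` header and calls a JSON parser on this body will fail, which is the behaviour the patch to the response headers is meant to address.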
[Nutch Wiki] Update of "Nutch_1.X_RESTAPI" by SujenShah
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch_1.X_RESTAPI" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI?action=diff=7=8

  = Nutch 1.x REST API v1.0 =
- <>
+ < >
  == Introduction ==
  This page documents the Nutch 1.X REST API v1.0.
@@ -222, +222 @@
  __Response__ is created job's id.
  job-id-43243
+
+
+ === Seed List creation ===
+
+ The /seed/create endpoint lets the user create a seed list and returns the temporary path of the file created. This path should be passed to the url_dir parameter of the INJECT job.
+
+ {{{
+ POST /seed/create
+ {
+   "name":"name-of-seedlist",
+   "seedUrls":["http://www.example.com"]
+ }
+ }}}
+
+ __Response__ is the file directory path
+
+ /var/folders/m9/hsls1krx12x968plt2brlhr0gn/T/1443721976324-0
  === Database ===
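As a rough illustration of the seed list request documented above, the following Python sketch builds the POST request without sending it. The server address in `BASE_URL` and the helper name `build_seed_create_request` are assumptions made for this example, and the payload shape simply mirrors the wiki snippet; a real server may expect a different shape.

```python
import json
import urllib.request

# Assumption: placeholder address for a locally running Nutch REST server.
BASE_URL = "http://localhost:8081"


def build_seed_create_request(name, urls):
    """Build (but do not send) a POST /seed/create request."""
    payload = {"name": name, "seedUrls": urls}
    return urllib.request.Request(
        BASE_URL + "/seed/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_seed_create_request("name-of-seedlist", ["http://www.example.com"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it; per the wiki text, the response body is the temporary path to hand to the `url_dir` parameter of the INJECT job.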
[GitHub] nutch pull request: Fix for NUTCH-2086 Contributed by Sujen Shah
Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/61
[jira] [Commented] (NUTCH-2086) Nutch 1.X Webui
[ https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939406#comment-14939406 ]

ASF GitHub Bot commented on NUTCH-2086:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/61

> Nutch 1.X Webui
> ---------------
>
> Key: NUTCH-2086
> URL: https://issues.apache.org/jira/browse/NUTCH-2086
> Project: Nutch
> Issue Type: New Feature
> Components: REST_api, web gui
> Reporter: Sujen Shah
> Assignee: Chris A. Mattmann
> Labels: memex
> Fix For: 1.11
> Attachments: NUTCH-2086.patch
>
> To port the Apache Wicket based webui in Nutch 2.X to 1.X
Re: [VOTE] Release Apache Nutch 2.3.1
Hi Lewis,

-1 until I verify nutch actually crawls. Right now it finds 0 URLs with no errors.

2.3.1 is an improvement over 2.3.0, which didn't work with Mongo at all.

Cheers,
Sherban

On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" wrote:

>Hi Folks,
>Is anyone else able to test and run the release candidate for 2.3.1?
>It would be great to get a release if we can get the VOTEs and the RC is
>suitable.
>Thanks in advance.
>Best
>Lewis
>
>On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Folks,
>> It turns out the formatting for the original email below was terrible.
>> Sorry about that.
>> I've hopefully corrected formatting now. Please VOTE away!
>>
>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>> Hi user@ & dev@,
>>>
>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>
>>> We addressed 32 issues in all, which can be seen in the release report
>>> http://s.apache.org/nutch_2.3.1
>>>
>>> The release candidate comprises the following components.
>>>
>>> * A staging repository [0] containing various Maven artifacts
>>> * A branch-2.3.1 of the 2.x code [1]
>>> * The tagged source upon which we are VOTE'ing [2]
>>> * Finally, the release artifacts [3] which I would encourage you to
>>> verify for signatures and test.
>>>
>>> You should use the following KEYS [4] file to verify the signatures of
>>> all release artifacts.
>>>
>>> Please VOTE as follows
>>>
>>> [ ] +1 Push the release, I am happy :)
>>> [ ] +/-0 I am not bothered either way
>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>
>>> Firstly, thank you to everyone that contributed to Nutch. Secondly, thank
>>> you to everyone that VOTEs. It is appreciated.
>>>
>>> Thanks
>>> Lewis
>>> (on behalf of Nutch PMC)
>>>
>>> p.s. Here's my +1
>>>
>>> [0] https://repository.apache.org/content/repositories/orgapachenutch-1005
>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>
>>> --
>>> *Lewis*
>>
>> --
>> *Lewis*
>
>--
>*Lewis*
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939503#comment-14939503 ]

Julien Nioche commented on NUTCH-2129:
--------------------------------------

I'd rather keep it simple and not modify the CrawlDatum so much. Why don't you simply add a config element and optionally store the code in the metadata?

BTW we already have the option to store the response headers - see [https://github.com/apache/nutch/commit/23c7761aff830db82a1e44b84bf81265639c9a26]. You could use that and simply reparse the first line to get the code.
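The "reparse the first line" suggestion could look roughly like the sketch below. It is written in Python for brevity rather than as Nutch code, it assumes the stored headers begin with a standard HTTP status line, and both the function name `status_code_from_status_line` and the sample header text are made up for illustration.

```python
def status_code_from_status_line(status_line):
    """Extract the numeric code from an HTTP status line such as 'HTTP/1.1 200 OK'.

    Returns None when the line does not look like a status line.
    """
    parts = status_line.strip().split(None, 2)
    if len(parts) < 2:
        return None
    if not parts[0].startswith("HTTP/") or not parts[1].isdecimal():
        return None
    return int(parts[1])


# Made-up sample of stored response headers, used only for this example.
raw_headers = "HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n\r\n"
first_line = raw_headers.split("\r\n", 1)[0]
print(status_code_from_status_line(first_line))
```

Returning `None` for anything that is not a status line keeps the stats job from miscounting records whose stored headers are missing or truncated.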