Re: Request for inclusion in the Nutch email list
Hi Pramod, To subscribe to the list you need to send a mail to dev-subscr...@nutch.apache.org. For more instructions have a look at - http://nutch.apache.org/mailing_lists.html Cheers, Sujen Shah On Tue, Sep 29, 2015 at 10:22 PM, Pramod Nagarajarao wrote: > Hello Team, > > I'm Pramod and am a graduate student studying Computer Science at USC. I > want to be a part of Nutch mailing lists and I request you to add me on it. > Thanks. > > Regards, > Pramod Nagarajarao >
Request for inclusion in the Nutch email list
Hello Team, I'm Pramod and am a graduate student studying Computer Science at USC. I want to be a part of Nutch mailing lists and I request you to add me on it. Thanks. Regards, Pramod Nagarajarao
Re: Team 18: Selenium handler question
Hi folks, The handler interface requires you to implement two functions: void processDriver(..) and boolean shouldProcessURL(..). The processDriver function can do any manipulation of the web driver that you’d like. The content will be pulled out of the body tag of the document when this function returns. It is given to your handler preloaded with the URL for the current page being fetched. You should be able to take that and do the manipulations necessary. shouldProcessURL is used to check whether the handler should be loaded for a particular URL. If you want the handler to run over every URL then just have it return true. If you want to have it run on only certain URLs then you can implement that logic in there. As for documentation, the Selenium docs [1] are pretty good. If you need to handle authentication that can be a pain. I don’t have too many recommendations there. You’ll have to just search around and figure out the best approach. Stackoverflow is always good =D [2] [1] http://www.seleniumhq.org/docs/03_webdriver.jsp [2] https://stackoverflow.com/questions/24304752/how-to-handle-authentication-popup-with-selenium-webdriver-using-java Hope that helps -- Michael J. Joyce Scientific Applications Software Engineer Instrument Software and Science Data Systems NASA Jet Propulsion Laboratory California Institute of Technology 4800 Oak Grove Drive Pasadena, California 91109 Mail Stop: 158-242 Cel: (626) 788-7511 Tel: (818) 354-7550 Fax: (818) 393-1370 On 10/1/15, 6:50 AM, "Christian Alan Mattmann" wrote: >Hi Team 18, > >This is great and you are headed in the right direction. > >MikeJ - can you suggest a sample reference to take a look >at for the team? > >Cheers, >Chris > >+ >Chris Mattmann, Ph.D. 
>Adjunct Associate Professor, Computer Science Department >University of Southern California >Los Angeles, CA 90089 USA >Email: mattm...@usc.edu >WWW: http://sunset.usc.edu/ >+ > > > > >-Original Message- >From: Mithun Maragiri >Date: Thursday, October 1, 2015 at 12:21 AM >To: jpluser >Cc: "ramac...@usc.edu" , Charan Shampur >, Sharan Kadagad >Subject: Team 18: Selenium handler question > >>Hello Professor, >> >> >>We want some help in writing selenium handler code. >>We crawled the URLs for 30 rounds and we ended up with a few URLs which >>were not fetched. >>We wrote a python script to filter these URLs whose status code is not >>OK/SUCCESS. >>Once we had these URLs we manually checked any one of the URLs as to why >>it is not fetched. >>We discovered that the website was behind the form and needed >>authentication to access its web pages. >>All the fetch requests made by crawler are http GET requests but for >>these unfetched URLs we need to make POST request. We are thinking of >>this approach >> >> >>Approach: >>> Write a script which filters all the URLs whose status code is not >>>success. >>> create a webDriver for each of these URLs in the DefaultHandler() >>> manually sign up to each of these unfetched URLs with the same login >>>credentials: Example: login= Team18; Password= team18Password >>> once driver is created, create a POST request with the URL and append >>>our login credentials and then make an AJAX call >>> after studying materials online we realized that the purpose of >>>selenium is exactly the same. But we cannot find any examples online >>>where someone has written a handler. We are finding it hard to >>>understand how to write the handler. >> >> >>Can you please provide some example code writing the handler? we will use >>that as the reference and try to write as per our need >> >> >>Thanks, >> >>Team 18 >> >> >> >> >
Re: [VOTE] Release Apache Nutch 2.3.1
Hi Lewis, -1 until I verify Nutch actually crawls. Right now it finds 0 URLs with no errors. 2.3.1 is an improvement over 2.3.0 which didn't work with Mongo at all. Cheers, Sherban On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" wrote: >Hi Folks, >Is anyone else able to test and run the release candidate for 2.3.1? >It would be great to get a release if we can get the VOTE's and the RC is >suitable. >Thanks in advance. >Best >Lewis > >On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney < >lewis.mcgibb...@gmail.com> wrote: > >> Hi Folks, >> It turns out the formatting for the original email below was terrible. >> Sorry about that. >> I've hopefully corrected formatting now. Please VOTE away! >> >> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney < >> lewis.mcgibb...@gmail.com> wrote: >> >>> Hi user@ & dev@, >>> >>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1. >>> >>> We addressed 32 issues in all which can be seen at the release report >>> http://s.apache.org/nutch_2.3.1 >>> >>> The release candidate comprises the following components. >>> >>> * A staging repository [0] containing various Maven artifacts >>> * A branch-2.3.1 of the 2.x code [1] >>> * The tagged source upon which we are VOTE'ing [2] >>> * Finally, the release artifacts [3] which i would encourage you to >>> verify for signatures and test. >>> >>> You should use the following KEYS [4] file to verify the signatures of >>> all release artifacts. >>> >>> Please VOTE as follows >>> >>> [ ] +1 Push the release, I am happy :) >>> [ ] +/-0 I am not bothered either way >>> [ ] -1 I am not happy with this release candidate (please state why) >>> >>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank >>> you to everyone that VOTE's. It is appreciated. >>> >>> Thanks >>> Lewis >>> (on behalf of Nutch PMC) >>> >>> p.s. 
Here's my +1 >>> >>> [0] >>> https://repository.apache.org/content/repositories/orgapachenutch-1005 >>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1 >>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1 >>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1 >>> [4] http://www.apache.org/dist/nutch/KEYS >>> >>> -- >>> *Lewis* >>> >> >> >> >> -- >> *Lewis* >> > > > >-- >*Lewis*
[Nutch Wiki] Update of "Nutch_1.X_RESTAPI" by SujenShah
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "Nutch_1.X_RESTAPI" page has been changed by SujenShah: https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI?action=diff&rev1=7&rev2=8 = Nutch 1.x REST API v1.0 = - <> + <> == Introduction == This page documents the Nutch 1.X REST API v1.0. @@ -222, +222 @@ __Response__ is created job's id. job-id-43243 + + + === Seed List creation === + + The /seed/create endpoint enables the user to create a seedlist and return the temporary path of the file created. This path should be passed to the url_dir parameter of the INJECT job. + + {{{ + POST /seed/create + { + "name":"name-of-seedlist", + "seedUrls":["http://www.example.com",] + } + }}} + + __Response__ is the file directory path + + /var/folders/m9/hsls1krx12x968plt2brlhr0gn/T/1443721976324-0 === Database ===
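A minimal sketch of building the request described on the wiki page above, written in Python (the language of the nutch.py client quoted elsewhere in this archive). The function name build_seed_create_request is hypothetical and not part of Nutch. The payload follows the wiki snippet, which lists seed URLs as plain strings; the request logged in NUTCH-2123 sends objects with id and url fields instead, so the exact shape may depend on the server version. No network call is made here.

```python
import json


def build_seed_create_request(name, seed_urls):
    """Return (method, path, body) for a seed list creation call.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    payload = {"name": name, "seedUrls": list(seed_urls)}
    return "POST", "/seed/create", json.dumps(payload)


method, path, body = build_seed_create_request(
    "name-of-seedlist", ["http://www.example.com"]
)
```

According to the wiki text, the response to this call is a temporary file path, which would then be passed as the url_dir parameter of the INJECT job.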
[jira] [Updated] (NUTCH-2123) Seed List REST API returns Text but headers indicate/require JSON
[ https://issues.apache.org/jira/browse/NUTCH-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujen Shah updated NUTCH-2123: -- Attachment: NUTCH-2123.patch Patch for correcting the response headers. > Seed List REST API returns Text but headers indicate/require JSON > - > > Key: NUTCH-2123 > URL: https://issues.apache.org/jira/browse/NUTCH-2123 > Project: Nutch > Issue Type: Bug > Components: REST_api >Affects Versions: 1.11 >Reporter: Aron Ahmadia >Priority: Minor > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2123.patch > > > nutch.py: POST Endpoint: /seed/create > nutch.py: POST Request data: {'seedUrls': [{'id': 0, 'url': > 'http://aron.ahmadia.net', 'seedList': None}], 'id': '12345', 'name': 'aron'} > nutch.py: POST Request headers: {'Accept': 'application/json'} > nutch.py: Response headers: {'content-type': 'application/json', 'server': > 'Jetty(8.1.15.v20140411)', 'content-length': '64', 'date': 'Fri, 25 Sep 2015 > 05:49:09 GMT'} > nutch.py: Response status: 200 > resp.headers > {'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)', > 'content-length': '64', 'date': 'Fri, 25 Sep 2015 05:49:09 GMT'} > resp.text > '/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0' -- This message was sent by Atlassian JIRA (v6.3.4#6332)
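The mismatch reported above can be checked on the client side. This is a rough sketch, not code from nutch.py: the helper name body_matches_declared_json is hypothetical, and it only tests whether a body sent with a JSON content type actually parses as JSON.

```python
import json


def body_matches_declared_json(headers, text):
    """Return False when the headers claim JSON but the body does not parse."""
    content_type = headers.get("content-type", "")
    if "application/json" not in content_type:
        # Nothing was promised, so there is nothing to contradict.
        return True
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Values taken from the issue report above.
resp_headers = {"content-type": "application/json"}
resp_text = "/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0"
consistent = body_matches_declared_json(resp_headers, resp_text)
```

With the values from the report, the bare path is not valid JSON, so the check flags the response as inconsistent with its headers.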
[jira] [Commented] (NUTCH-2128) Refactor configuration end point
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940064#comment-14940064 ] ASF GitHub Bot commented on NUTCH-2128: --- GitHub user sujen1412 opened a pull request: https://github.com/apache/nutch/pull/69 fix for NUTCH-2128 Refactor config endpoint by Sujen shah You can merge this pull request into a Git repository by running: $ git pull https://github.com/sujen1412/nutch NUTCH-2128 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/69.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #69 commit f9c80a4bba43c0a117804d4997303a5a974f4cc2 Author: Sujen Shah Date: 2015-09-29T19:07:13Z Refactor config endpoint > Refactor configuration end point > > > Key: NUTCH-2128 > URL: https://issues.apache.org/jira/browse/NUTCH-2128 > Project: Nutch > Issue Type: Sub-task > Components: REST_api >Reporter: Sujen Shah >Assignee: Sujen Shah >Priority: Minor > Fix For: 1.11 > > > To better define the endpoint to create a new configuration and add a new > endpoint to update a particular property value of a configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: fix for NUTCH-2128 Refactor config endpoint by...
GitHub user sujen1412 opened a pull request: https://github.com/apache/nutch/pull/69 fix for NUTCH-2128 Refactor config endpoint by Sujen shah You can merge this pull request into a Git repository by running: $ git pull https://github.com/sujen1412/nutch NUTCH-2128 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/69.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #69 commit f9c80a4bba43c0a117804d4997303a5a974f4cc2 Author: Sujen Shah Date: 2015-09-29T19:07:13Z Refactor config endpoint --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Assigned] (NUTCH-2128) Refactor configuration end point
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sujen Shah reassigned NUTCH-2128: - Assignee: Sujen Shah > Refactor configuration end point > > > Key: NUTCH-2128 > URL: https://issues.apache.org/jira/browse/NUTCH-2128 > Project: Nutch > Issue Type: Sub-task > Components: REST_api >Reporter: Sujen Shah >Assignee: Sujen Shah >Priority: Minor > Fix For: 1.11 > > > To better define the endpoint to create a new configuration and add a new > endpoint to update a particular property value of a configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data
[ https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940036#comment-14940036 ] Michael Joyce commented on NUTCH-2108: -- Good stuff [~asitang], glad to see the workaround proved fruitful and great example handlers! > Add a function to the selenium interactive plugin interface to do multiple > manipulation of driver and then return the data > -- > > Key: NUTCH-2108 > URL: https://issues.apache.org/jira/browse/NUTCH-2108 > Project: Nutch > Issue Type: Sub-task > Components: fetcher >Affects Versions: 1.10 >Reporter: Asitang Mishra > Labels: memex > > In the interactive selenium plugin we have to create handler classes for each > manipulation of a page. Sometimes we need to manipulate a page in many ways > and keep track of those manipulations. Like clicking on say each link in a > table and then refreshing to get the original page back as even one click can > make all other links go away. This can be done in a single loop. Which will > be a little too much work and way complicated using multiple handlers. So, I > am proposing a new function "String multiProcessDriver(WebDriver driver)" > that takes the driver and returns a concatenated String along with the > already present "void processDriver(WebDriver driver)". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data
[ https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940025#comment-14940025 ] Asitang Mishra commented on NUTCH-2108: --- [~chrismattmann] > Add a function to the selenium interactive plugin interface to do multiple > manipulation of driver and then return the data > -- > > Key: NUTCH-2108 > URL: https://issues.apache.org/jira/browse/NUTCH-2108 > Project: Nutch > Issue Type: Sub-task > Components: fetcher >Affects Versions: 1.10 >Reporter: Asitang Mishra > Labels: memex > > In the interactive selenium plugin we have to create handler classes for each > manipulation of a page. Sometimes we need to manipulate a page in many ways > and keep track of those manipulations. Like clicking on say each link in a > table and then refreshing to get the original page back as even one click can > make all other links go away. This can be done in a single loop. Which will > be a little too much work and way complicated using multiple handlers. So, I > am proposing a new function "String multiProcessDriver(WebDriver driver)" > that takes the driver and returns a concatenated String along with the > already present "void processDriver(WebDriver driver)". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939939#comment-14939939 ] Michael Joyce commented on NUTCH-2129: -- Thanks Julien. I figured there would probably be a few thoughts on this, so I appreciate the feedback. I'll checkout the stuff you mentioned. Thanks for the ideas. > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Atomic update and optimistic concurrency in Solr
Hi all: I'm trying to make an atomic update or optimistic concurrency update in Solr. Can anyone help me?
Re: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException
Hi Taichi: Which plugins do you have enabled in nutch-site.xml? - Original Message - From: "Taichi Ho" To: dev@nutch.apache.org Sent: Wednesday, September 30, 2015 16:57:39 Subject: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException Hi, I have the same problem. The following is part of my log: http://pastebin.com/JjkJ1qe6 It seems there is a read timeout, but when I paste the URL into the browser it works fine. Any ideas what could be causing this problem? Thanks. On Mon, Sep 28, 2015 at 7:46 AM Michael Joyce < jo...@apache.org > wrote: I don't see any null pointer exceptions coming up in your log. Do you have any more info or perhaps I'm missing something? -- Jimmy On Sun, Sep 27, 2015 at 3:04 PM, mithun < mithun626...@gmail.com > wrote: Hi All While crawling my seed list, I ran into this NullPointerException for a few URLs. What could be the problem? Please find the pastebin link to my hadoop.log file: http://pastebin.com/SyyybtEx Thanks Mithun
[jira] [Created] (NUTCH-2131) Problem running nutch(crawl) with selenium
Ashwini created NUTCH-2131: -- Summary: Problem running nutch(crawl) with selenium Key: NUTCH-2131 URL: https://issues.apache.org/jira/browse/NUTCH-2131 Project: Nutch Issue Type: Bug Components: nutch server Affects Versions: 1.10 Environment: Ubuntu 12.04 32-bit Reporter: Ashwini Hello, I had a few issues with running Selenium on Ubuntu. I am trying to follow the tutorial that describes how to install the Nutch Selenium plugin, https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium I was able to include the plugin and build Nutch again. But during the crawling process, I get the error "Unable to connect to host 127.0.0.1 on port 7055 after 45000 ms". I did some research on this and I think that the Firefox version I am using and the Selenium jars are incompatible (I'm not sure if this is the issue). So I downgraded Firefox from version 41 to 33, but I am still getting the same error. Is there a compatible version of Firefox that I need to install, or is there some other problem? I am using the Selenium that is integrated in Nutch 1.10. I have used the Selenium 2.44.0 standalone software with Firefox version 33 and everything works fine. Please help me with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939503#comment-14939503 ] Julien Nioche commented on NUTCH-2129: -- I'd rather keep it simple and not modify the CrawlDatum so much. Why don't you simply add a config element and optionally store the code in the metadata? BTW we already have the option to store the response headers - see [https://github.com/apache/nutch/commit/23c7761aff830db82a1e44b84bf81265639c9a26]. You could use that and simply reparse the first line to get the code. > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
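A small sketch of the "reparse the first line" idea suggested above, in Python for brevity rather than Nutch's Java. It assumes the stored response headers are available as one string beginning with a status line such as "HTTP/1.1 200 OK"; how Nutch actually stores them is not shown in this thread, and the function name status_code_from_headers is hypothetical.

```python
def status_code_from_headers(raw_headers):
    """Return the integer status code from the first header line, or None."""
    lines = raw_headers.splitlines()
    if not lines:
        return None
    parts = lines[0].split()
    # A status line looks like: HTTP-version SP status-code SP reason-phrase
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


code = status_code_from_headers("HTTP/1.1 404 Not Found\r\nServer: Jetty\r\n")
```

Returning None for anything that does not look like a status line keeps the caller from counting malformed entries as a real status code.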
[jira] [Commented] (NUTCH-2086) Nutch 1.X Webui
[ https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939406#comment-14939406 ] ASF GitHub Bot commented on NUTCH-2086: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/61 > Nutch 1.X Webui > > > Key: NUTCH-2086 > URL: https://issues.apache.org/jira/browse/NUTCH-2086 > Project: Nutch > Issue Type: New Feature > Components: REST_api, web gui >Reporter: Sujen Shah >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.11 > > Attachments: NUTCH-2086.patch > > > To port the Apache Wicket based webui in Nutch 2.X to 1.X -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: Fix for NUTCH-2086 Contributed by Sujen Shah
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/61