Re: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException

2015-10-01 Thread Roannel Fernández Hernández
Hi Taichi: 

Which plugins do you have enabled in nutch-site.xml? 

- Original message -

From: "Taichi Ho"  
To: dev@nutch.apache.org 
Sent: Wednesday, 30 September 2015 16:57:39 
Subject: [MASSMAIL]Re: Fetch failed : java.lang.NullPointerException 

Hi, I have the same problem. The following is part of my log: 
http://pastebin.com/JjkJ1qe6 

It seems there is a read timeout, but when I paste the URL into a browser it 
works fine. 

Any ideas what could be causing this problem? 

Thanks. 

On Mon, Sep 28, 2015 at 7:46 AM Michael Joyce < jo...@apache.org > wrote: 



I don't see any null pointer exceptions coming up in your log. Do you have any 
more info or perhaps I'm missing something? 


-- Jimmy 

On Sun, Sep 27, 2015 at 3:04 PM, mithun < mithun626...@gmail.com > wrote: 



Hi All 

While crawling my seed list, I ran into this NullPointerException for a few 
URLs. What could be the problem? 

Please find the Pastebin link to my hadoop.log file below: 

http://pastebin.com/SyyybtEx 


Thanks 
Mithun 











[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939939#comment-14939939
 ] 

Michael Joyce commented on NUTCH-2129:
--

Thanks Julien. I figured there would probably be a few thoughts on this, so I 
appreciate the feedback. I'll check out the stuff you mentioned. Thanks for the 
ideas.

> Track Protocol Status in Crawl Datum
> ------------------------------------
>
>                 Key: NUTCH-2129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2129
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3, 1.10
>            Reporter: Michael Joyce
>             Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Atomic update and optimistic concurrency in Solr

2015-10-01 Thread Roannel Fernández Hernández
Hi all:

I'm trying to perform an atomic update or an optimistic concurrency update in 
Solr. Can anyone help me?


[jira] [Commented] (NUTCH-2128) Refactor configuration end point

2015-10-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940064#comment-14940064
 ] 

ASF GitHub Bot commented on NUTCH-2128:
---

GitHub user sujen1412 opened a pull request:

https://github.com/apache/nutch/pull/69

fix for NUTCH-2128 Refactor config endpoint by Sujen shah



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/nutch NUTCH-2128

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/69.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #69


commit f9c80a4bba43c0a117804d4997303a5a974f4cc2
Author: Sujen Shah 
Date:   2015-09-29T19:07:13Z

Refactor config endpoint




> Refactor configuration end point
> --------------------------------
>
>                 Key: NUTCH-2128
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2128
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: REST_api
>            Reporter: Sujen Shah
>            Assignee: Sujen Shah
>            Priority: Minor
>             Fix For: 1.11
>
>
> To better define the endpoint to create a new configuration and add a new 
> endpoint to update a particular property value of a configuration. 





[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data

2015-10-01 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940025#comment-14940025
 ] 

Asitang Mishra commented on NUTCH-2108:
---

[~chrismattmann]

> Add a function to the selenium interactive plugin interface to do multiple 
> manipulation of driver and then return the data
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-2108
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2108
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> In the interactive selenium plugin we have to create a handler class for each 
> manipulation of a page. Sometimes we need to manipulate a page in many ways 
> and keep track of those manipulations, for example clicking each link in a 
> table and then refreshing to get the original page back, since even one click 
> can make all the other links go away. This can be done in a single loop, but 
> it would be too much work and overly complicated with multiple handlers. So I 
> am proposing a new function "String multiProcessDriver(WebDriver driver)" 
> that takes the driver and returns a concatenated String, alongside the 
> already present "void processDriver(WebDriver driver)".





[GitHub] nutch pull request: fix for NUTCH-2128 Refactor config endpoint by...

2015-10-01 Thread sujen1412
GitHub user sujen1412 opened a pull request:

https://github.com/apache/nutch/pull/69

fix for NUTCH-2128 Refactor config endpoint by Sujen shah



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sujen1412/nutch NUTCH-2128

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/69.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #69


commit f9c80a4bba43c0a117804d4997303a5a974f4cc2
Author: Sujen Shah 
Date:   2015-09-29T19:07:13Z

Refactor config endpoint




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Assigned] (NUTCH-2128) Refactor configuration end point

2015-10-01 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah reassigned NUTCH-2128:
-

Assignee: Sujen Shah



[jira] [Commented] (NUTCH-2108) Add a function to the selenium interactive plugin interface to do multiple manipulation of driver and then return the data

2015-10-01 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940036#comment-14940036
 ] 

Michael Joyce commented on NUTCH-2108:
--

Good stuff [~asitang], glad to see the workaround proved fruitful and great 
example handlers!



[jira] [Updated] (NUTCH-2123) Seed List REST API returns Text but headers indicate/require JSON

2015-10-01 Thread Sujen Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujen Shah updated NUTCH-2123:
--
Attachment: NUTCH-2123.patch

Patch for correcting the response headers.

> Seed List REST API returns Text but headers indicate/require JSON
> -----------------------------------------------------------------
>
>                 Key: NUTCH-2123
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2123
>             Project: Nutch
>          Issue Type: Bug
>          Components: REST_api
>    Affects Versions: 1.11
>            Reporter: Aron Ahmadia
>            Priority: Minor
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: NUTCH-2123.patch
>
>
> nutch.py: POST Endpoint: /seed/create
> nutch.py: POST Request data: {'seedUrls': [{'id': 0, 'url': 
> 'http://aron.ahmadia.net', 'seedList': None}], 'id': '12345', 'name': 'aron'}
> nutch.py: POST Request headers: {'Accept': 'application/json'}
> nutch.py: Response headers: {'content-type': 'application/json', 'server': 
> 'Jetty(8.1.15.v20140411)', 'content-length': '64', 'date': 'Fri, 25 Sep 2015 
> 05:49:09 GMT'}
> nutch.py: Response status: 200
> resp.headers
> {'content-type': 'application/json', 'server': 'Jetty(8.1.15.v20140411)', 
> 'content-length': '64', 'date': 'Fri, 25 Sep 2015 05:49:09 GMT'}
> resp.text
> '/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0'
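
The mismatch reported above (a JSON content-type header on a plain-text body) can be detected on the client side. The following is only a minimal sketch, not part of nutch.py or the attached patch: the function name `body_matches_json_header` is hypothetical, and the header dictionary is assumed to use lower-case keys as in the log output.

```python
import json


def body_matches_json_header(headers, body):
    """Return True unless the response claims JSON but the body is not JSON.

    `headers` is assumed to be a dict with lower-case keys, as in the log above.
    """
    content_type = headers.get("content-type", "")
    if not content_type.startswith("application/json"):
        return True  # the server did not claim JSON, so there is nothing to check
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Values copied from the report: JSON header, plain path in the body.
headers = {"content-type": "application/json"}
body = "/var/folders/3s/pw2prx7n7vd22qqrlssmtn90gp/T/1443160149187-0"
print(body_matches_json_header(headers, body))
```

A bare path is not a valid JSON document, so the check fails for the reported response; a quoted JSON string would pass.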





[Nutch Wiki] Update of "Nutch_1.X_RESTAPI" by SujenShah

2015-10-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch_1.X_RESTAPI" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI?action=diff=7=8

  = Nutch 1.x REST API v1.0 =
  
- <>
+ <>
  
  == Introduction ==
  This page documents the Nutch 1.X REST API v1.0. 
@@ -222, +222 @@

  __Response__ is created job's id.
  
  job-id-43243
+ 
+ 
+ === Seed List creation ===
+ 
+ The /seed/create endpoint enables the user to create a seed list and returns the temporary path of the file created. This path should be passed to the url_dir parameter of the INJECT job.
+ 
+ {{{
+ POST /seed/create
+ {
+ "name":"name-of-seedlist", 
+ "seedUrls":["http://www.example.com",]
+ }
+ }}}
+ 
+ __Response__ is the file directory path
+ 
+ /var/folders/m9/hsls1krx12x968plt2brlhr0gn/T/1443721976324-0
  
  
  === Database ===
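
The request body for the /seed/create endpoint described in the diff above could be assembled as follows. This is a sketch under assumptions: it follows the shape shown on the wiki page (a name plus a flat list of URL strings), the helper name `build_seed_request` is hypothetical, and the server may expect a different structure, as the request logged in NUTCH-2123 suggests.

```python
import json


def build_seed_request(name, urls):
    """Serialize a seed list in the shape shown on the wiki page (assumed)."""
    payload = {"name": name, "seedUrls": list(urls)}
    return json.dumps(payload)


# The resulting string would be sent as the body of POST /seed/create.
body = build_seed_request("name-of-seedlist", ["http://www.example.com"])
print(body)
```

Building the body with `json.dumps` rather than by hand avoids quoting mistakes and trailing commas in the URL list.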


[GitHub] nutch pull request: Fix for NUTCH-2086 Contributed by Sujen Shah

2015-10-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/61




[jira] [Commented] (NUTCH-2086) Nutch 1.X Webui

2015-10-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939406#comment-14939406
 ] 

ASF GitHub Bot commented on NUTCH-2086:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/61


> Nutch 1.X Webui
> ---------------
>
>                 Key: NUTCH-2086
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2086
>             Project: Nutch
>          Issue Type: New Feature
>          Components: REST_api, web gui
>            Reporter: Sujen Shah
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>         Attachments: NUTCH-2086.patch
>
>
> To port the Apache Wicket based webui in Nutch 2.X to 1.X





Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-01 Thread Drulea, Sherban
Hi Lewis,

-1 until I verify that Nutch actually crawls. Right now it finds 0 URLs with no
errors.

2.3.1 is an improvement over 2.3.0, which didn't work with Mongo at all.

Cheers,
Sherban



On 9/30/15, 5:35 PM, "Lewis John Mcgibbney" 
wrote:

>Hi Folks,
>Is anyone else able to test and run the release candidate for 2.3.1?
>It would be great to get a release if we can get the VOTE's and the RC is
>suitable.
>Thanks in advance.
>Best
>Lewis
>
>On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney <
>lewis.mcgibb...@gmail.com> wrote:
>
>> Hi Folks,
>> It turns out the formatting for the original email below was terrible.
>> Sorry about that.
>> I've hopefully corrected formatting now. Please VOTE away!
>>
>> On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com> wrote:
>>
>>> Hi user@ & dev@,
>>>
>>> This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1.
>>>
>>> We addressed 32 issues in all, which can be seen in the release report
>>> http://s.apache.org/nutch_2.3.1
>>>
>>> The release candidate comprises the following components.
>>>
>>> * A staging repository [0] containing various Maven artifacts
>>> * A branch-2.3.1 of the 2.x code [1]
>>> * The tagged source upon which we are VOTE'ing [2]
>>> * Finally, the release artifacts [3] which I would encourage you to
>>> verify for signatures and test.
>>>
>>> You should use the following KEYS [4] file to verify the signatures of
>>> all release artifacts.
>>>
>>> Please VOTE as follows
>>>
>>> [ ] +1 Push the release, I am happy :)
>>> [ ] +/-0 I am not bothered either way
>>> [ ] -1 I am not happy with this release candidate (please state why)
>>>
>>> Firstly thank you to everyone that contributed to Nutch. Secondly, thank
>>> you to everyone that VOTE's. It is appreciated.
>>>
>>> Thanks
>>> Lewis
>>> (on behalf of Nutch PMC)
>>>
>>> p.s. Here's my +1
>>>
>>> [0]
>>> https://repository.apache.org/content/repositories/orgapachenutch-1005
>>> [1] https://svn.apache.org/repos/asf/nutch/branches/branch-2.3.1
>>> [2] https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>>> [3] https://dist.apache.org/repos/dist/dev/nutch/2.3.1
>>> [4] http://www.apache.org/dist/nutch/KEYS
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
>-- 
>*Lewis*





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939503#comment-14939503
 ] 

Julien Nioche commented on NUTCH-2129:
--

I'd rather keep it simple and not modify the CrawlDatum so much. Why don't you 
simply add a config element and optionally store the code in the metadata?
BTW we already have the option to store the response headers - see 
[https://github.com/apache/nutch/commit/23c7761aff830db82a1e44b84bf81265639c9a26].
You could use that and simply reparse the first line to get the code.
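
The reparsing suggested above can be sketched roughly as follows. This is an illustration only: it assumes the stored response headers are available as one raw string whose first line is an HTTP status line such as "HTTP/1.1 200 OK", and the function name `status_code_from_headers` is hypothetical. How Nutch actually stores the headers is not shown in this thread.

```python
def status_code_from_headers(raw_headers):
    """Extract the numeric status code from the first line of raw headers.

    Returns None when the first line does not look like an HTTP status line.
    """
    lines = raw_headers.splitlines()
    if not lines:
        return None
    # A status line has the form: HTTP-version SP status-code SP reason-phrase
    parts = lines[0].split(None, 2)
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


sample = "HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\n"
print(status_code_from_headers(sample))
```

Returning None for anything that is not a status line keeps malformed or missing headers from being counted under a bogus code when collecting stats.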

