[Nutch Wiki] Update of "Nutch_1.X_RESTAPI" by SujenShah

2015-02-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch_1.X_RESTAPI" page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI

New page:
= Nutch 1.x REST API =

== Introduction ==
This page documents the Nutch 1.x REST API. 

It describes the REST calls that can be made to the Nutch 1.x REST API. Many of 
the API points are adapted from those provided by the 
[[https://wiki.apache.org/nutch/NutchRESTAPI|Nutch 2.x REST API]]. One 
motivation for the REST API is to integrate D3 and visualize how a Nutch crawl 
is working. 


== REST API Calls ==
=== Administration ===
This API point reports the server status and manages the server's state.
==== Get server status ====

GET /admin


__Response__ contains the server startup date, available configuration names, 
job history and currently running jobs.

{
   "startDate":142457250,
   "configuration":[
  "default"
   ],
   "jobs":[

   ],
   "runningJobs":[

   ]
}
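
A minimal sketch of how a client might read this response, assuming the server is reachable at `http://localhost:8081` (the host and port are placeholders, since this page does not state them). The helper names `parse_admin_status` and `fetch_admin_status` are hypothetical and only illustrate the shape of the JSON shown above.

```python
import json
from urllib.request import urlopen

# Assumption: placeholder address; this page does not state the host or port.
BASE_URL = "http://localhost:8081"


def parse_admin_status(body: str) -> dict:
    """Parse the JSON body returned by GET /admin and summarise it."""
    status = json.loads(body)
    return {
        "start_date": status["startDate"],
        "configurations": list(status["configuration"]),
        "job_count": len(status["jobs"]),
        "running_count": len(status["runningJobs"]),
    }


def fetch_admin_status(base_url: str = BASE_URL) -> dict:
    """Issue GET /admin against a running server (needs a live server)."""
    with urlopen(base_url + "/admin") as response:
        return parse_admin_status(response.read().decode("utf-8"))


# The sample response from this page, used here instead of a live call.
SAMPLE = (
    '{"startDate":142457250,"configuration":["default"],'
    '"jobs":[],"runningJobs":[]}'
)
summary = parse_admin_status(SAMPLE)
print(summary)
```

Parsing is kept separate from the HTTP call so the response handling can be checked against the sample without a server.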


==== Stop server ====
The running server can be stopped using ''/admin/stop''.

GET /admin/stop


__Response__

Stopping in 5 seconds.


=== Jobs ===
This API point manages jobs: creating a job, retrieving job information, and 
stopping or killing a job.
==== Listing all jobs ====

GET /job


__Response__ contains a list of all jobs (running and history)

[
   {
  "id":"job-id-5977",
  "type":"FETCH",
  "confId":"default",
  "args":null,
  "result":null,
  "state":"FINISHED",
  "msg":"",
  "crawlId":"crawl-01"
   },
   {
  "id":"job-id-5978",
  "type":"PARSE",
  "confId":"default",
  "args":null,
  "result":null,
  "state":"RUNNING",
  "msg":"",
  "crawlId":"crawl-01"
   }
]
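
As a rough illustration of consuming this list, the sketch below picks out the IDs of jobs in a given state. The function name `jobs_in_state` is hypothetical, and the sample body is a shortened copy of the response above, not output from a real server.

```python
import json


def jobs_in_state(body: str, state: str) -> list:
    """Return the ids of all jobs in the GET /job response with the given state."""
    return [job["id"] for job in json.loads(body) if job["state"] == state]


# Shortened copy of the sample response from this page.
SAMPLE_JOBS = (
    '[{"id":"job-id-5977","type":"FETCH","state":"FINISHED"},'
    '{"id":"job-id-5978","type":"PARSE","state":"RUNNING"}]'
)
running = jobs_in_state(SAMPLE_JOBS, "RUNNING")
finished = jobs_in_state(SAMPLE_JOBS, "FINISHED")
print(running, finished)
```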


==== Get job info ====

GET /job/job-id-5977


__Response__

   {
  "id":"job-id-5977",
  "type":"FETCH",
  "confId":"default",
  "args":null,
  "result":null,
  "state":"FINISHED",
  "msg":"",
  "crawlId":"crawl-01"
   }


==== Stop job ====

GET /job/job-id-5977/stop


__Response__

  true



==== Kill job ====

GET /job/job-id-5977/abort


__Response__

  true


==== Create job ====
Creates a job with the given parameters. Specify either a job type (such as 
INJECT, GENERATE, FETCH or PARSE) or a jobClassName.

POST /job/create
   {
  "crawlId":"crawl-01",
  "type":"FETCH",
  "confId":"default",
  "args":{"someParam":"someValue"}
   }

POST /job/create
   {
  "crawlId":"crawl-01",
  "jobClassName":"org.apache.nutch.fetcher.FetcherJob",
  "confId":"default",
  "args":{"someParam":"someValue"}
   }


__Response__ is the ID of the created job.

job-id-43243
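
The request bodies above could be assembled along these lines. This is only a sketch: the helper `build_create_job_body` is hypothetical, and rejecting a request that has both or neither of `type` and `jobClassName` is one reading of "either ... or" in the description, not documented server behaviour.

```python
import json


def build_create_job_body(crawl_id, conf_id="default", job_type=None,
                          job_class_name=None, args=None):
    """Build the JSON body for POST /job/create.

    Assumption: exactly one of job_type / job_class_name must be given.
    """
    if (job_type is None) == (job_class_name is None):
        raise ValueError("specify exactly one of job_type or job_class_name")
    body = {"crawlId": crawl_id, "confId": conf_id, "args": args or {}}
    if job_type is not None:
        body["type"] = job_type
    else:
        body["jobClassName"] = job_class_name
    return json.dumps(body, sort_keys=True)


fetch_body = build_create_job_body(
    "crawl-01", job_type="FETCH", args={"someParam": "someValue"}
)
class_body = build_create_job_body(
    "crawl-01", job_class_name="org.apache.nutch.fetcher.FetcherJob"
)
print(fetch_body)
print(class_body)
```

Serialising with `json.dumps` avoids hand-written JSON mistakes such as a missing comma between fields.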


=== URL ===

This API point returns the information about a URL, or a list of URLs, that is 
needed to generate a D3 visualization. The information obtained from this 
API point will help 

GET /url/{filtered-url}

__Response__ contains information about the URL, obtained from the 
CrawlDbReader.java class. The parameters are:

   {
  "url" : "",
  "statusCode" : "",
  "fetchTime" : "",
  "score" : "",
  "numOfInlinks" : "",
  "numOfOutlinks" : ""
   }
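
Because the URL being looked up is itself placed inside the request path, a client will probably need to percent-encode it. The sketch below shows one way to build the path; whether the server expects this encoding is an assumption, and `url_info_path` is a hypothetical helper name.

```python
from urllib.parse import quote


def url_info_path(page_url: str) -> str:
    """Build the request path for GET /url/{filtered-url}.

    Assumption: the target URL is percent-encoded as a single path segment.
    """
    return "/url/" + quote(page_url, safe="")


path = url_info_path("http://nutch.apache.org/")
print(path)
```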


== More ==
Description of more API points coming soon.


Build failed in Jenkins: Nutch-nutchgora #1346

2015-02-21 Thread Apache Jenkins Server
See 

--
[...truncated 3225 lines...]

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


compile:

job:
  [jar] Building jar: 


resolve-test:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. 
It could not be found.

copy-libs:

compile-core-test:
[javac] Compiling 43 source files to 


test-core:
[mkdir] Created dir: 

 [copy] Copying 91 files to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

 [copy] Copying 1 file to 

[junit] Running org.apache.nutch.api.TestAPI
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.021 sec
[junit] Running org.apache.nutch.crawl.TestAdaptiveFetchSchedule
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.557 sec
[junit] Running org.apache.nutch.crawl.TestGenerator
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 4, Time elapsed: 
0.016 sec
[junit] Running org.apache.nutch.crawl.TestInjector
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.017 sec
[junit] Running org.apache.nutch.crawl.TestSignatureFactory
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.26 sec
[junit] Running org.apache.nutch.crawl.TestURLPartitioner
[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.413 sec
[junit] Running org.apache.nutch.crawl.TestUrlWithScore
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.121 sec
[junit] Running org.apache.nutch.fetcher.TestFetcher
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
1.058 sec
[junit] Running org.apache.nutch.indexer.TestIndexingFilters
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.919 sec
[junit] Running org.apache.nutch.metadata.TestMetadata
[junit] Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.126 sec
[junit] Running org.apache.nutch.metadata.TestSpellCheckedMetadata
[junit] Tests run: 11, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
3.111 sec
[junit] Running org.apache.nutch.net.TestURLFilters
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.563 sec
[junit] Running org.apache.nutch.net.TestURLNormalizers
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.379 sec
[junit] Running org.apache.nutch.parse.TestOutlinkExtractor
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.044 sec
[junit] Running org.apache.nutch.parse.TestParserFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.072 sec
[junit] Running org.apache.nutch.plugin.TestPluginSystem
[junit] Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.095 sec
[junit] Running org.apache.nutch.protocol.TestContent
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.559 sec
[junit] Running org.apache.nutch.protocol.TestProtocolFactory
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.948 sec
[junit] Running org.apache.nutch.storage.TestGoraStorage
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 
0.018 sec
[junit] Running org.apache.nutch.util.TestEncodingDetector
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.422 sec
[junit] Running org.apache.nutch.util.TestGZIPUtils
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.531 sec
[junit] Running org.apache.nutch.util.TestMimeUtil
[junit] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.366 sec
[junit] Running org.apache.nutch.util.TestNodeWa

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
Hi Jiaxin,

In *HttpResponse.java*, you can check the 'Content-Type' header and then
decide whether to:

- Set the response content to be the binary http response. (Check out
protocol-httpclient's source code for hints)
or
- Continue executing *readPlainContent(url)*, which in turn will set the
'content' from the html body by Selenium Firefox driver.

By the way, since nutch-selenium will be looking for the html body, I think
we should check for 'text/html' and 'application/xhtml+xml' content types,
not just anything that starts with 'text/.'
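
The check could look roughly like the sketch below. It is written in Python only to keep it short (the plugin itself is Java), and the function name, along with the choice to treat a missing header as non-HTML, are assumptions for illustration.

```python
# Content types that Selenium should render, per the suggestion above.
RENDERABLE_TYPES = {"text/html", "application/xhtml+xml"}


def should_render_with_selenium(content_type_header):
    """Decide whether a response should go through the Selenium driver.

    Assumption: a missing or empty header is treated as non-HTML, so the
    raw HTTP response body would be used instead.
    """
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in RENDERABLE_TYPES


for header in ("text/html; charset=UTF-8", "image/png", "text/plain", None):
    print(header, should_render_with_selenium(header))
```

Matching the exact media type, rather than a 'text/' prefix, keeps types such as 'text/plain' or 'text/css' on the plain HTTP path.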


Best regards,
Mohammad Al-Mohsin

On Sat, Feb 21, 2015 at 12:05 PM, Jiaxin Ye  wrote:

> Hi Mohammad,
>
> Hey, I think that's a very good idea! Any hints about how to change the
> selenium plugin? I am thinking about the same thing but struggling on how
> to do it.
>
> Best,
> Jiaxin
>
> On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin  wrote:
>
>> I am using nutch-selenium 
>> plugin and I also have Tesseract 
>> installed for parsing text off images.
>>
>> While crawling with Nutch & selenium, I noticed that binary data (e.g.
>> images, pdf) are always truncated and thus skip/fail parsing. Here is a
>> sample of the log:
>>
>> *Content of size 800750 was truncated to 368. Content is truncated, parse
>> may fail!*
>> When I turn selenium off, parsing works fine and the content is not
>> truncated.
>>
>> I found that nutch-selenium gets the html body of whatever Firefox
>> displays. So even though you're fetching an image, selenium will just give
>> you the image html tag instead of the image itself.
>> e.g. 
>>
>> To get around this, I modified selenium plugin to handle the fetch only
>> if the Content-Type header starts with 'text', i.e. to catch 'text/html'.
>> Otherwise, if the content is not textual, it just returns the content as
>> protocol-httpclient does.
>>
>> Now, I am getting binary data properly parsed and also getting selenium
>> handle page rendering with javascript.
>>
>> Is this the proper way to tackle this? What do you think?
>>
>>
>> Best regards,
>> Mohammad Al-Mohsin
>>
>
>


[Nutch Wiki] Update of "NutchTutorial" by SujenShah

2015-02-21 Thread Apache Wiki
The "NutchTutorial" page has been changed by SujenShah:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=76&rev2=77

  
  If all has gone to plan, you are now ready to search with 
http://localhost:8983/solr/admin/.
  
+ == What's Next ==
+ 
+ You may want to check out the documentation for the 
[[https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI|Nutch 1.X REST API]] to get 
an overview of the work going on towards providing Apache CXF based REST 
services for Nutch 1.X branch.
+ 


Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Nikunj Gala
Being completely new to patch files, I don't know how they work, but after
looking at the patch file, the new ivy.xml, ivy.xml.rej and ivy.xml.orig, I
could see that the patch had not added the Selenium dependencies to ivy.xml.
I added them manually, and I could then build Nutch 1.10 trunk with Tika
dependency 1.7 and Selenium.
This build runs perfectly fine.


Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Shuo Li
Yop,

Here's a correct ivy.xml. I think something goes wrong when we apply the
patch: it generates some duplicate  tags, which you may need to delete
manually. If anybody could provide a complete tutorial or a correct patch,
that'd be great.

PS0: I didn't read the whole conversation. I hope this helped.
PS1: Please remove all the lines in that patch about ivy.xml and replace
with the attachment.

Regards,
Shuo Li

On Sat, Feb 21, 2015 at 11:43 AM, Nikunj Gala  wrote:

> Hey, you are correct. I see failures while patching ivy.xml on the latest
> GitHub Nutch trunk.
> The patch logs are as follows:
>
> ---
> patching file build.xml
> patching file ivy/ivy.xml
> Hunk #3 FAILED at 59.
> 1 out of 3 hunks FAILED -- saving rejects to file ivy/ivy.xml.rej
> patching file src/plugin/build.xml
> Hunk #2 succeeded at 148 (offset 2 lines).
> patching file src/plugin/lib-selenium/build.xml
> patching file src/plugin/lib-selenium/ivy.xml
> patching file src/plugin/lib-selenium/plugin.xml
> patching file
> src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
> patching file src/plugin/lib-selenium/src/pom.xml
> patching file src/plugin/protocol-selenium/.idea/.name
> patching file src/plugin/protocol-selenium/.idea/compiler.xml
> patching file
> src/plugin/protocol-selenium/.idea/copyright/profiles_settings.xml
> patching file src/plugin/protocol-selenium/.idea/encodings.xml
> patching file src/plugin/protocol-selenium/.idea/misc.xml
> patching file src/plugin/protocol-selenium/.idea/modules.xml
> patching file src/plugin/protocol-selenium/.idea/scopes/scope_settings.xml
> patching file src/plugin/protocol-selenium/.idea/vcs.xml
> patching file src/plugin/protocol-selenium/.idea/workspace.xml
> patching file src/plugin/protocol-selenium/build.xml
> patching file src/plugin/protocol-selenium/ivy.xml
> patching file src/plugin/protocol-selenium/plugin.xml
> patching file
> src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
> patching file
> src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
> patching file
> src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
> patching file src/plugin/protocol-selenium/src/pom.xml
> patching file
> src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
>
> ---
>
> Trying to understand and fix the patch now.
> Has anybody else done any changes in the patch?
>





	
[Attachment: ivy.xml (XML markup not preserved)]
License: http://www.apache.org/licenses/LICENSE-2.0.txt/
Homepage: http://nutch.apache.org
Description: Nutch is an open source web-search software. It builds on
Hadoop, Tika and Solr, adding web-specifics, such as a crawler, a link-graph
database etc.



Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Jiaxin Ye
Hi Mohammad,

Hey, I think that's a very good idea! Any hints about how to change the
selenium plugin? I am thinking about the same thing but struggling on how
to do it.

Best,
Jiaxin

On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin  wrote:

> I am using nutch-selenium 
> plugin and I also have Tesseract 
> installed for parsing text off images.
>
> While crawling with Nutch & selenium, I noticed that binary data (e.g.
> images, pdf) are always truncated and thus skip/fail parsing. Here is a
> sample of the log:
>
> *Content of size 800750 was truncated to 368. Content is truncated, parse
> may fail!*
> When I turn selenium off, parsing works fine and the content is not
> truncated.
>
> I found that nutch-selenium gets the html body of whatever Firefox
> displays. So even though you're fetching an image, selenium will just give
> you the image html tag instead of the image itself.
> e.g. 
>
> To get around this, I modified selenium plugin to handle the fetch only if
> the Content-Type header starts with 'text', i.e. to catch 'text/html'.
> Otherwise, if the content is not textual, it just returns the content as
> protocol-httpclient does.
>
> Now, I am getting binary data properly parsed and also getting selenium
> handle page rendering with javascript.
>
> Is this the proper way to tackle this? What do you think?
>
>
> Best regards,
> Mohammad Al-Mohsin
>


Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Nikunj Gala
Hey, you are correct. I see failures while patching ivy.xml on the latest
GitHub Nutch trunk.
The patch logs are as follows:
---
patching file build.xml
patching file ivy/ivy.xml
Hunk #3 FAILED at 59.
1 out of 3 hunks FAILED -- saving rejects to file ivy/ivy.xml.rej
patching file src/plugin/build.xml
Hunk #2 succeeded at 148 (offset 2 lines).
patching file src/plugin/lib-selenium/build.xml
patching file src/plugin/lib-selenium/ivy.xml
patching file src/plugin/lib-selenium/plugin.xml
patching file
src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
patching file src/plugin/lib-selenium/src/pom.xml
patching file src/plugin/protocol-selenium/.idea/.name
patching file src/plugin/protocol-selenium/.idea/compiler.xml
patching file
src/plugin/protocol-selenium/.idea/copyright/profiles_settings.xml
patching file src/plugin/protocol-selenium/.idea/encodings.xml
patching file src/plugin/protocol-selenium/.idea/misc.xml
patching file src/plugin/protocol-selenium/.idea/modules.xml
patching file src/plugin/protocol-selenium/.idea/scopes/scope_settings.xml
patching file src/plugin/protocol-selenium/.idea/vcs.xml
patching file src/plugin/protocol-selenium/.idea/workspace.xml
patching file src/plugin/protocol-selenium/build.xml
patching file src/plugin/protocol-selenium/ivy.xml
patching file src/plugin/protocol-selenium/plugin.xml
patching file
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
patching file
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
patching file
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
patching file src/plugin/protocol-selenium/src/pom.xml
patching file
src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
---

Trying to understand and fix the patch now.
Has anybody else done any changes in the patch?


Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Jiaxin Ye
If you use the newest version of Nutch 1.10, you should see some failures
when you install the patch. Follow the failures and change the corresponding
files according to the patch.

On Sat, Feb 21, 2015 at 11:18 AM, Nikunj Gala  wrote:

> Does it mean that if I take Nutch 1.10 without the update that is
> available on GitHub, apply the patch, and change the Tika dependency to 1.7
> manually, then it might build successfully?


Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Nikunj Gala
Does it mean that if I take Nutch 1.10 without the update that is available
on GitHub, apply the patch, and change the Tika dependency to 1.7 manually,
then it might build successfully?


Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Jiaxin Ye
Hi, I guess the reason is that Nutch 1.10 had an update recently which
changes the Tika version from 1.6 to 1.7 in ivy.xml. I am also guessing
that your patch installation had some failures in ivy.xml. If that is the
case, the patch is no longer compatible with the newest version of
Nutch 1.10. You need to fix the ivy.xml file manually by looking at the
patch.

Jiaxin

On Friday, February 20, 2015, Nikunj Gala wrote:

> I followed steps mentioned by Jiaxin Ye to Configure Nutch-Selenium in
> Nutch 1.10
> But the build did not succeed and there are errors in compilation since
> files on storage do not exist.
> Is there any work around for this?
>
> On Friday, February 20, 2015, Yash Sangani  wrote:
>
>> I thought we have to build nutch again after this and thus I tried to
>> build it but then I get the error mentioned in the first email.
>> So I didnt try to crawl it as yet.
>>
>> On Fri, Feb 20, 2015 at 1:26 AM, zhangxin0804 
>> wrote:
>>
>>> I got you. I met the same problem and stuck at here. Did you try to open
>>> another terminal to crawl data again? You can have a try to do it and to
>>> check whether the Firefox is still pop-up repeats again and again.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/Problem-installing-Selenium-on-Ubuntu-with-Nutch-trunk-1-10-tp4187576p4187584.html
>>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>>
>>
>>
>>
>> --
>> Regards,
>> Yash Sangani
>> MS student, Computer Science
>> University of Southern California.
>>
>


Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
I am using nutch-selenium  plugin
and I also have Tesseract  installed
for parsing text off images.

While crawling with Nutch and Selenium, I noticed that binary data (e.g.
images, PDFs) is always truncated, so parsing is skipped or fails. Here is a
sample of the log:

*Content of size 800750 was truncated to 368. Content is truncated, parse
may fail!*
When I turn selenium off, parsing works fine and the content is not
truncated.

I found that nutch-selenium gets the html body of whatever Firefox
displays. So even though you're fetching an image, selenium will just give
you the image html tag instead of the image itself.
e.g. 

To get around this, I modified the selenium plugin to handle the fetch only
if the Content-Type header starts with 'text', i.e. to catch 'text/html'.
Otherwise, if the content is not textual, it just returns the content as
protocol-httpclient does.

Now binary data is parsed properly, and Selenium still handles page
rendering with JavaScript.

Is this the proper way to tackle this? What do you think?


Best regards,
Mohammad Al-Mohsin