Hi Jiaxin,
In *HttpResponse.java*, you can check the 'Content-Type' header and then
decide whether to:
- Set the response content to be the binary http response. (Check out
protocol-httpclient's source code for hints)
or
- Continue executing *readPlainContent(url)*, which in turn will set the
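
The decision described above (look at 'Content-Type', then either keep the binary response or fall through to *readPlainContent(url)*) might be sketched roughly as below. This is only an illustration: the class name ContentTypeSketch, the method isBinary, and the list of textual MIME types are assumptions made for the example and are not taken from Nutch or from the nutch-selenium plugin.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Nutch or nutch-selenium. */
final class ContentTypeSketch {

    private ContentTypeSketch() {}

    /**
     * Returns true when the Content-Type value suggests a non-text body
     * (for example an image or a PDF), false when it looks textual.
     */
    static boolean isBinary(String contentType) {
        if (contentType == null) {
            return false; // assumption: a missing header is treated as plain content
        }
        // Drop parameters such as "; charset=UTF-8" and normalize the case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.isEmpty() || mime.startsWith("text/")) {
            return false;
        }
        // Assumed list of textual application/* types; extend as needed.
        return !(mime.equals("application/xhtml+xml")
                || mime.equals("application/xml")
                || mime.equals("application/json")
                || mime.equals("application/javascript"));
    }
}
```

In *HttpResponse.java*, a call such as ContentTypeSketch.isBinary(headerValue) could then guard the choice between storing the raw bytes and calling *readPlainContent(url)*; how the header value is obtained there is not shown here.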
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The NutchTutorial page has been changed by SujenShah:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=76&rev2=77
If all has gone to plan, you are now ready to search with
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The Nutch_1.X_RESTAPI page has been changed by SujenShah:
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
New page:
= Nutch 1.x REST API =
<<TableOfContents(4)>>
== Introduction ==
This
See https://builds.apache.org/job/Nutch-nutchgora/1346/
--
[...truncated 3225 lines...]
compile:
jar:
deps-test:
deploy:
copy-generated-lib:
deploy:
[copy] Copying 1 file to
I am using the nutch-selenium <https://github.com/momer/nutch-selenium> plugin
and I also have Tesseract <https://wiki.apache.org/tika/TikaOCR> installed
for parsing text off images.
While crawling with Nutch selenium, I noticed that binary data (e.g.
images, pdf) are always truncated and thus skip/fail
Hi, I guess the reason is that Nutch 1.10 was recently updated, which
changed the Tika version from 1.6 to 1.7 in ivy.xml. I am also guessing
that parts of your patch failed to apply to ivy.xml. If that is the case,
that means the patch is no longer compatible with the newest version of
Nutch
Hi Mohammad,
Hey, I think that's a very good idea! Any hints about how to change the
selenium plugin? I am thinking about the same thing but struggling with how
to do it.
Best,
Jiaxin
On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin m...@mem9.net wrote:
I am using nutch-selenium
Yop,
Here's a correct ivy.xml. I think something may go wrong when we
install the patch: it generates some duplicate <?xml?> tags, which you may
need to delete manually. If anybody could provide a complete tutorial
or a correct patch, that'd be great.
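
If deleting the duplicates by hand gets tedious, something like the sketch below might help. It assumes the duplicate tags are repeated XML declarations (lines such as <?xml version="1.0"?>) and keeps only the first one; the class name XmlDeclSketch and the method keepFirstDeclaration are made up for this example, and the result should still be compared against the patch before building.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch or Ivy. */
final class XmlDeclSketch {

    // Matches an XML declaration plus any whitespace that follows it.
    // "<?xml-stylesheet ...?>" is not matched because "xml" must be followed by whitespace.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>\\s*");

    private XmlDeclSketch() {}

    /** Returns the input with every XML declaration after the first one removed. */
    static String keepFirstDeclaration(String xml) {
        Matcher m = XML_DECL.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        boolean seen = false;
        while (m.find()) {
            // Keep the first declaration; for later ones copy only the text before them.
            out.append(xml, last, seen ? m.start() : m.end());
            last = m.end();
            seen = true;
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }
}
```

Reading ivy.xml into a string, passing it through keepFirstDeclaration, and writing it back is left out here to keep the sketch short.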
PS0: I didn't read the whole
Does it mean that if I take Nutch 1.10 without the update that is available
on GitHub, apply the patch, and change the Tika dependency to 1.7 manually,
it might build successfully?
If you use the newest version of Nutch 1.10, you should see some failures
when you install the patch. Follow the failures and change the corresponding
files according to the patch.
On Sat, Feb 21, 2015 at 11:18 AM, Nikunj Gala nikun...@usc.edu wrote:
Does it mean that if I take Nutch 1.10 without
Hey, you are correct: I see failures while patching ivy.xml on the latest
GitHub Nutch trunk.
The patch logs are as follows:
---
patching file