Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
Hi Jiaxin, In *HttpResponse.java*, you can check the 'Content-Type' header and then decide whether to: - Set the response content to the binary HTTP response (check out protocol-httpclient's source code for hints), or - Continue executing *readPlainContent(url)*, which in turn will set the
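
A minimal sketch of the Content-Type check suggested in the message above. The class and method names (`ContentTypeCheck`, `isHtmlLike`) are hypothetical and are not part of Nutch or the nutch-selenium plugin. The set of media types treated as HTML, and the fallback when the header is missing, are assumptions made for illustration only.

```java
import java.util.Locale;

/** Hypothetical helper; not part of Nutch or the nutch-selenium plugin. */
final class ContentTypeCheck {

    private ContentTypeCheck() {}

    /**
     * Returns true when the Content-Type header value names an HTML-like
     * media type, i.e. a page that is worth rendering in a browser.
     * Anything else (images, PDFs, ...) would be treated as binary.
     */
    static boolean isHtmlLike(String contentType) {
        if (contentType == null) {
            // Assumption: with no header at all, fall back to the text path.
            return true;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize the case.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }
}
```

With a check like this, the response handling could keep the raw bytes whenever `isHtmlLike` returns false and only call *readPlainContent(url)* when it returns true. How that is wired into *HttpResponse.java* depends on the plugin's own code, which is not shown in this thread.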

[Nutch Wiki] Update of NutchTutorial by SujenShah

2015-02-21 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The NutchTutorial page has been changed by SujenShah: https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=76&rev2=77 If all has gone to plan, you are now ready to search with

[Nutch Wiki] Update of Nutch_1.X_RESTAPI by SujenShah

2015-02-21 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The Nutch_1.X_RESTAPI page has been changed by SujenShah: https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI New page: = Nutch 1.x REST API = TableOfContents(4) == Introduction == This

Build failed in Jenkins: Nutch-nutchgora #1346

2015-02-21 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/1346/ -- [...truncated 3225 lines...] compile: jar: deps-test: deploy: copy-generated-lib: deploy: [copy] Copying 1 file to

Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
I am using the nutch-selenium plugin (https://github.com/momer/nutch-selenium) and I also have Tesseract (https://wiki.apache.org/tika/TikaOCR) installed for parsing text off images. While crawling with Nutch selenium, I noticed that binary data (e.g. images, PDFs) is always truncated and thus skip/fail

Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Jiaxin Ye
Hi, I guess the reason is that Nutch 1.10 was updated recently, changing the Tika version from 1.6 to 1.7 in ivy.xml. I am also guessing that your patch partly failed to apply to ivy.xml. If that is the case, it means the patch is no longer compatible with the newest version of Nutch

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Jiaxin Ye
Hi Mohammad, I think that's a very good idea! Any hints about how to change the selenium plugin? I am thinking about the same thing but struggling with how to do it. Best, Jiaxin On Sat, Feb 21, 2015 at 6:03 AM, Mohammad Al-Mohsin m...@mem9.net wrote: I am using nutch-selenium

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Shuo Li
Yop, Here's a correct ivy.xml. I think something may go wrong when we apply the patch: it generates some duplicate <?xml?> tags. You may need to delete them manually. If anybody could provide a complete tutorial or a correct patch that'd be great. PS0: I didn't read the whole

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Nikunj Gala
Does it mean that if I take Nutch 1.10 without the update that is available on GitHub, apply the patch, and change the Tika dependency to 1.7 manually, then it might build successfully?

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Jiaxin Ye
If you use the newest version of Nutch 1.10, you should see some failures when you install the patch. Follow the failures and change the corresponding files according to the patch. On Sat, Feb 21, 2015 at 11:18 AM, Nikunj Gala nikun...@usc.edu wrote: Does it mean that if I take Nutch 1.10 without

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Nikunj Gala
Hey, you are correct. I see failures while patching ivy.xml on the latest GitHub Nutch trunk. The patch logs are as follows: --- patching file