[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335641#comment-14335641
]
Mohammad Al-Mohsin commented on NUTCH-1933:
---
Thanks for your comments [~lew
[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohammad Al-Mohsin updated NUTCH-1933:
--
Attachment: NUTCH-selenium-trunk.v2.1.patch
Hi Lewis,
Patch updated with comment files
nt
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++
>
> -Original Message-
> From: Mohammad Al-Mohsin
> Reply-To: "dev@nutch.apache.org"
> Date: Saturday, February 21, 2015 at 6:03 AM
> To: "dev@nutch
[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333206#comment-14333206
]
Mohammad Al-Mohsin edited comment on NUTCH-1933 at 2/23/15 11:1
[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mohammad Al-Mohsin updated NUTCH-1933:
--
Attachment: NUTCH-selenium-trunk.v2.patch
Takes care of Tika 1.7 update and handles
rn will set the
'content' from the html body by Selenium Firefox driver.
By the way, since nutch-selenium will be looking for the html body, I think
we should check for 'text/html' and 'application/xhtml+xml' content types,
not just anything that starts with 'text
Otherwise, if the content is not textual, it just returns the content as
protocol-httpclient does.
Now binary data is parsed properly, and Selenium handles page rendering
with JavaScript.
Is this the proper way to tackle this? What do you think?
Best regards,
Mohammad Al-Mohsin
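
For illustration, the content-type check suggested in the message above (treat only 'text/html' and 'application/xhtml+xml' as pages to hand to Selenium, and pass everything else through unchanged) might look roughly like the sketch below. This is not code from the attached patch: the class name HtmlContentTypes and the method isHtml are hypothetical, and the sketch assumes the header value may carry parameters such as a charset.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the NUTCH-1933 patch.
final class HtmlContentTypes {
    private HtmlContentTypes() {}

    /** Returns true only for the two HTML media types named in the message. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=UTF-8",
            "application/xhtml+xml",
            "text/plain",
            "application/pdf"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isHtml(sample));
        }
    }
}
```

Matching on the exact media type rather than on a 'text' prefix keeps types such as text/plain or text/css away from the browser-based rendering path, which is the point raised in the message.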
m to extract the MIME types you
encountered. You can do this natively with Java or, if you prefer Python
like me, you can use nutchpy <https://github.com/ContinuumIO/nutchpy>.
Best regards,
Mohammad Al-Mohsin
On Fri, Feb 20, 2015 at 8:45 PM, Pranshu Kumar wrote:
>
> I just wante
e deleting
runtime directory.
Best regards,
Mohammad Al-Mohsin
On Thu, Feb 19, 2015 at 10:16 PM, Nagarjun Pola wrote:
> Yes. I should do that.
>
> Thank You Jiaxin.
>
> Best,
> Nagarjun Pola
>
>
> On Thu, Feb 19, 2015 at 10:15 PM, Jiaxin Ye wrote:
>
>> Hmm..
(HttpFormAuthentication.java:95)
at
org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468)
... 5 more
Fetch failed with protocol status: exception(16), lastModified=0:
java.lang.RuntimeException: *java.lang.IllegalArgumentException: No form
exists: username*
Best regards,
Mohammad Al-Mohsin
On Wed, F
ta.nasa.gov/login>). *The documentation says (loginUrl
- the URL containing the actual ) but is it really the case?
I am using the latest Nutch 1.10 trunk, which includes the NUTCH-827v3 patch
<https://issues.apache.org/jira/browse/NUTCH-827>, on the latest OS X Yosemite
(10.10.2).
Please let me know if I'm missing something!
Best regards,
Mohammad Al-Mohsin
FYI, the issue was resolved by deleting the 'runtime' directory and then
recompiling Nutch:
cd nutch/trunk
rm -r runtime
ant runtime
Best regards,
Mohammad Al-Mohsin
On Mon, Feb 16, 2015 at 2:56 AM, Mohammad Al-Mohsin wrote:
> Here is the error stack:
>
> 2015-02-16
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206)
at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:758)
Best regards,
Mohammad Al-Mohsin
On Mon, Feb 16, 2015 at 1:57 AM, Mohammad Al-Mohsin wrote:
> Hi,
>
> I'm trying to use
rror?
Best regards,
Mohammad Al-Mohsin
[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321843#comment-14321843
]
Mohammad Al-Mohsin commented on NUTCH-1933:
---
Since Nutch trunk has been upd