[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-24 Thread Mohammad Al-Mohsin (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335641#comment-14335641 ] Mohammad Al-Mohsin commented on NUTCH-1933: --- Thanks for your comments [~lew

[jira] [Updated] (NUTCH-1933) nutch-selenium plugin

2015-02-24 Thread Mohammad Al-Mohsin (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Al-Mohsin updated NUTCH-1933: -- Attachment: NUTCH-selenium-trunk.v2.1.patch Hi Lewis, Patch updated with comment files

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-23 Thread Mohammad Al-Mohsin
nt > University of Southern California, Los Angeles, CA 90089 USA > ++++++++++ > > > > > > > -Original Message- > From: Mohammad Al-Mohsin > Reply-To: "dev@nutch.apache.org" > Date: Saturday, February 21, 2015 at 6:03 AM > To: "dev@nutch

[jira] [Comment Edited] (NUTCH-1933) nutch-selenium plugin

2015-02-23 Thread Mohammad Al-Mohsin (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333206#comment-14333206 ] Mohammad Al-Mohsin edited comment on NUTCH-1933 at 2/23/15 11:1

[jira] [Updated] (NUTCH-1933) nutch-selenium plugin

2015-02-23 Thread Mohammad Al-Mohsin (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohammad Al-Mohsin updated NUTCH-1933: -- Attachment: NUTCH-selenium-trunk.v2.patch Takes care of Tika 1.7 update and handles

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
rn will set the 'content' from the html body by Selenium Firefox driver. By the way, since nutch-selenium will be looking for the html body, I think we should check for 'text/html' and 'application/xhtml+xml' content types, not just anything that starts with 'text
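A minimal sketch of the stricter content-type check suggested above, assuming the Content-Type header value is available as a plain string. The class and method names are hypothetical and are not taken from the nutch-selenium patch. It accepts only 'text/html' and 'application/xhtml+xml', ignoring parameters such as charset, rather than anything that starts with 'text'.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch or the nutch-selenium plugin.
public final class HtmlContentTypeCheck {

    private HtmlContentTypeCheck() {}

    /**
     * Returns true only when the media type is text/html or
     * application/xhtml+xml. Parameters after ';' (e.g. charset) are ignored.
     */
    public static boolean isHtmlLike(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        // HTML-like responses would be handed to the browser-based fetch;
        // everything else would be returned as fetched, untouched.
        System.out.println(isHtmlLike("text/html; charset=UTF-8"));
        System.out.println(isHtmlLike("application/xhtml+xml"));
        System.out.println(isHtmlLike("text/plain"));
        System.out.println(isHtmlLike("application/pdf"));
    }
}
```

With a check like this, a 'text/plain' or 'text/css' response would bypass the browser rendering step, which a prefix test on 'text' would not do.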

Nutch-Selenium Plugin Truncates Binary Data

2015-02-21 Thread Mohammad Al-Mohsin
Otherwise, if the content is not textual, it just returns the content as protocol-httpclient does. Now, binary data is parsed properly and Selenium still handles page rendering with JavaScript. Is this the proper way to tackle this? What do you think? Best regards, Mohammad Al-Mohsin

Re: Nutchpy crawled statistics

2015-02-20 Thread Mohammad Al-Mohsin
m to extract the MIME types you encountered. You can do this natively with Java or if you prefer Python ~ like me, you can use nutchpy <https://github.com/ContinuumIO/nutchpy>. Best regards, Mohammad Al-Mohsin On Fri, Feb 20, 2015 at 8:45 PM, Pranshu Kumar wrote: > > I just wante

Re: Problem Fetching with Selenium Installed

2015-02-19 Thread Mohammad Al-Mohsin
e deleting runtime directory. Best regards, Mohammad Al-Mohsin On Thu, Feb 19, 2015 at 10:16 PM, Nagarjun Pola wrote: > Yes. I should do that. > > Thank You Jiaxin. > > Best, > Nagarjun Pola > > > On Thu, Feb 19, 2015 at 10:15 PM, Jiaxin Ye wrote: > >> Hmm..

Re: HttpPostAuthentication Cannot Find Authentication Form

2015-02-18 Thread Mohammad Al-Mohsin
(HttpFormAuthentication.java:95) at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:468) ... 5 more Fetch failed with protocol status: exception(16), lastModified=0: java.lang.RuntimeException: *java.lang.IllegalArgumentException: No form exists: username* Best regards, Mohammad Al-Mohsin On Wed, F

HttpPostAuthentication Cannot Find Authentication Form

2015-02-17 Thread Mohammad Al-Mohsin
ta.nasa.gov/login>). *The documentation says (loginUrl - the URL containing the actual ) but is it really the case? I am using the latest Nutch 1.10 trunk version, which includes the NUTCH-827v3 patch <https://issues.apache.org/jira/browse/NUTCH-827>, on the latest OS X Yosemite (10.10.2). Please let me know if I'm missing something! Best regards, Mohammad Al-Mohsin

Re: Nutch-Selenium Error

2015-02-16 Thread Mohammad Al-Mohsin
FYI, the issue was resolved by deleting the 'runtime' directory and then recompiling Nutch.
cd nutch/trunk
rm -r runtime
ant runtime
Best regards, Mohammad Al-Mohsin On Mon, Feb 16, 2015 at 2:56 AM, Mohammad Al-Mohsin wrote: > Here is the error stack: > > 2015-02-16

Re: Nutch-Selenium Error

2015-02-16 Thread Mohammad Al-Mohsin
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:206) at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:758) Best regards, Mohammad Al-Mohsin On Mon, Feb 16, 2015 at 1:57 AM, Mohammad Al-Mohsin wrote: > Hi, > > I'm trying to use

Nutch-Selenium Error

2015-02-16 Thread Mohammad Al-Mohsin
rror? Best regards, Mohammad Al-Mohsin

[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-15 Thread Mohammad Al-Mohsin (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321843#comment-14321843 ] Mohammad Al-Mohsin commented on NUTCH-1933: --- Since Nutch trunk has been upd