Selenium error

2015-02-22 Thread Puranjay Rajpal
Hi, I have installed Selenium on my Mac, but when I try to crawl any website I get the following lines and then the crawl just stops. org.openqa.selenium.firefox.NotConnectedException: Unable to connect to host 127.0.0.1 on port 7055 after 45000 ms. I am not sure how to solve this.

Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread veeresh beeram
Hi, I was unable to reproduce the linkdb error. The NSIDC ADE 403 forbidden error occurs because NSIDC seems to be blocking User-Agents containing nutch. -- Thanks, Veeresh On 20 February 2015 at 15:26, Shuo Li sli...@usc.edu wrote: Hi, I'm trying to crawl NSF ACADIS with

[jira] [Updated] (NUTCH-1944) Add raw content to indexes

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1944: Fix Version/s: (was: 2.3.1) 2.4 Add raw content to indexes

[jira] [Assigned] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1946: --- Assignee: Lewis John McGibbney Upgrade to Gora 0.6 ---

[jira] [Commented] (NUTCH-1944) Add raw content to indexes

2015-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332417#comment-14332417 ] Sebastian Nagel commented on NUTCH-1944: This issue duplicates NUTCH-1785 but this

How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi, I want to develop a URLFilter which takes a URL, takes its metadata or even the fetched content, then uses some duplicate detection algorithm to determine if it is a duplicate of any URL in the batch. However, the only parameter passed into the URLFilter is the URL; is it possible to get the

[jira] [Resolved] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1925. - Resolution: Fixed Committed @revision 1661539 in 2.3.1 HEAD Upgrade Tika to

[jira] [Created] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1946: --- Summary: Upgrade to Gora 0.6 Key: NUTCH-1946 URL: https://issues.apache.org/jira/browse/NUTCH-1946 Project: Nutch Issue Type: Improvement

[Nutch Wiki] New attachment added to page RunNutchInEclipse

2015-02-22 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page RunNutchInEclipse for change notification. An attachment has been added to that page by SebastianNagel. Following detailed information is available: Attachment name: nutch_eclipse_javadoc_loc.png Attachment size: 70201 Attachment link:

[jira] [Updated] (NUTCH-1923) Nutch + Cassandra Docker

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1923: Fix Version/s: (was: 2.4) 2.3.1 Nutch + Cassandra Docker

[Nutch Wiki] Update of RunNutchInEclipse by SebastianNagel

2015-02-22 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The RunNutchInEclipse page has been changed by SebastianNagel: https://wiki.apache.org/nutch/RunNutchInEclipse?action=diff&rev1=50&rev2=51 Comment: add section how to make Eclipse display

[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332350#comment-14332350 ] Hudson commented on NUTCH-1925: --- SUCCESS: Integrated in Nutch-nutchgora #1347 (See

[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-840: --- Attachment: NUTCH-840-2.x.patch Patch for 2.X. There currently appears to be a

[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-22 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332384#comment-14332384 ] Sebastian Nagel commented on NUTCH-1925: Great to see again successful Jenkins

[jira] [Updated] (NUTCH-1709) Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1709: Fix Version/s: (was: 2.4) 2.3.1 Generated classes

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
I just added a counter in my URLFilter, and it shows that the URLFilter instances in each fetch cycle are different. Sample logs: 2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links 2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links 2015-02-22

Subscribe to the mailing list

2015-02-22 Thread Chetan Vazirabadkar
Thanks, Chetan Vazirabadkar.

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
I logged the instance ID and got this result: 2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 423250256 2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560 2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
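The kind of check described in this thread can be sketched as follows. This is only a minimal illustration under stated assumptions: `CountingFilter`, `instanceId` and `processedCount` are hypothetical names, and the class is a plain stand-in that does not implement the real Nutch URLFilter interface. It logs an identity hash plus a per-instance counter, so a counter that restarts together with a new ID suggests a new object was created.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a URL filter; it does not implement the real Nutch interface.
public class CountingFilter {
    // Per-instance counter: it starts at zero for every newly created instance.
    private final AtomicLong processed = new AtomicLong();

    // Identity hash of this object. It is stable for the object's lifetime,
    // but two different objects are not guaranteed to get different values.
    public int instanceId() {
        return System.identityHashCode(this);
    }

    // Accepts every URL unchanged and records that it was seen.
    public String filter(String url) {
        long n = processed.incrementAndGet();
        System.out.println("URLFilter ID: " + instanceId() + " - Processed " + n + " links");
        return url;
    }

    public long processedCount() {
        return processed.get();
    }

    public static void main(String[] args) {
        CountingFilter a = new CountingFilter();
        a.filter("http://example.com/1");
        a.filter("http://example.com/2");
        // A second instance keeps its own counter, independent of the first.
        CountingFilter b = new CountingFilter();
        b.filter("http://example.com/3");
    }
}
```

Identical IDs across log lines are a hint, not proof, that one object is being reused. Instances living in separate JVMs are necessarily distinct objects, whatever the plugin system caches within one JVM.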

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
Cool, good test. I thought the Nutch plugin system cached instances of plugins - I am not sure if it creates a new one each time. Are you sure you don’t have the same URLFilter instance, just called on different datasets and thus producing different counts? Either way, so you should simply

Re: Tesseract OCR and GDAL in Tika plugin for Nutch?

2015-02-22 Thread Mattmann, Chris A (3980)
You need to install the 1.8-SNAPSHOT version of Tika in your assignment. Please read the assignment instructions again. http://sunset.usc.edu/classes/cs572_2015/ Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
In the constructor of your URLFilter, why not consider passing in a NutchConfiguration object, and then reading the path to, e.g., the LinkDb from the config? Then have a private member variable for the LinkDbReader (maybe static initialized for efficiency) and use that in your interface method.
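The pattern suggested here (configuration passed to the constructor, a shared reader initialized once, then used inside the filter method) might look roughly like the sketch below. Everything in it is an assumption made for illustration: `java.util.Properties` stands in for the Nutch configuration, a `Set` of URLs stands in for whatever a LinkDbReader would return, and `ConfiguredFilter` and the property name `example.seen.urls` are invented. Only the return-null-to-reject convention is taken from Nutch's `URLFilter.filter`.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch: Properties stands in for the Nutch configuration, and a
// Set of URLs stands in for the data a LinkDb reader would provide.
public class ConfiguredFilter {
    // Hypothetical property name, invented for this example.
    static final String SEEN_KEY = "example.seen.urls";

    // Shared across instances so the data is loaded only once per JVM.
    private static volatile Set<String> seen;

    private final Properties conf;

    public ConfiguredFilter(Properties conf) {
        this.conf = conf;
    }

    // Loads the shared set on first use; later calls reuse it.
    private Set<String> seenUrls() {
        Set<String> local = seen;
        if (local == null) {
            synchronized (ConfiguredFilter.class) {
                local = seen;
                if (local == null) {
                    Set<String> loaded = new HashSet<>();
                    for (String u : conf.getProperty(SEEN_KEY, "").split(",")) {
                        if (!u.trim().isEmpty()) {
                            loaded.add(u.trim());
                        }
                    }
                    local = Collections.unmodifiableSet(loaded);
                    seen = local;
                }
            }
        }
        return local;
    }

    // Returns the URL to keep it, or null to drop it.
    public String filter(String url) {
        return seenUrls().contains(url) ? null : url;
    }
}
```

One consequence of the static field is that the first instance to call `filter` decides what gets loaded; instances created later with a different configuration still see the same shared set. That trade-off is the price of loading the data only once.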

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
Going to implement more configuration in the plugin, but based on the student emails I think your advice helped :) ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi Prof Mattmann, You mention training a model; are we expected to use machine learning algorithms to train a model for duplicate detection? Thanks, Renxia On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: There is nothing stating in your

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Hi Majisha, From the comments in the source code of the URLFilter interface, the URLFilter is called in the injector and db updater, which means that you do have the data of the URL you are processing in the filter crawled. You may want to take a look at this article, which illustrates the workflow

Re:

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Jiaxin, great answer. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop:

Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Mohammad, did you get this fixed? Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop:

[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-22 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332441#comment-14332441 ] Chris A. Mattmann commented on NUTCH-1933: -- Thank you [~almohsin], I will update

[no subject]

2015-02-22 Thread Ankit Singhaniya

Re: Nutchpy crawled statistics

2015-02-22 Thread Mattmann, Chris A (3980)
Exactly, Mohammad, thank you. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop:

[no subject]

2015-02-22 Thread Akhil Ramachandran

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Is there only one instance of a plugin for all fetch cycles? I am assuming that when the job is started, a plugin instance is initialized and used in every fetch cycle. Is that correct? On Sunday, February 22, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: In the

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
I believe the Plugin system caches plugins, but you will need to confirm (haven’t looked in a long time). ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion

[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jorge Luis Betancourt Gonzalez updated NUTCH-1928: -- Attachment: NUTCH-1928v6.patch Indexing filter of documents by

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332506#comment-14332506 ] Lewis John McGibbney commented on NUTCH-1946: - Right now I am bumping in to

Re: Nutch-Selenium Error

2015-02-22 Thread Mattmann, Chris A (3980)
Good to hear! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email:

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mattmann, Chris A (3980)
Thanks Mo, great advice. ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email:

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Jiaxin Ye
Hey, I haven't started working on the deduplication yet, but if I were you I would use the Tika library to retrieve the MIME type and metadata. The code is presented in the Tika book. Why not try that out? :) Best, Jiaxin On Sunday, February 22, 2015, Renxia Wang renxi...@usc.edu wrote: Hi I want

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thank you for your suggestion. I will take a look at that. There is a URLUtil class in Nutch's source code, but I am wondering whether it will send a request to the URL again to get the data. Because the URL's metadata has already been downloaded, it is better if we can get the data locally. On

[Nutch Wiki] Update of AdvancedAjaxInteraction by ChrisMattmann

2015-02-22 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The AdvancedAjaxInteraction page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=2&rev2=3 Comment: - add links for install Selenium

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Jiaxin Ye
You are absolutely right! I am just throwing ideas :) If you are looking at local data, org.apache.nutch.segment.SegmentReader may be helpful I guess. As all data contents parsed are located there. On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang renxi...@usc.edu wrote: Thank you for you

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
You are using the Github version of the patch which only works with Nutch2 - you need to use NUTCH-1933. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet

Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-22 Thread Mattmann, Chris A (3980)
Hi Nikunj, Please see this: https://en.wikipedia.org/wiki/Patch_(Unix) Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA

Re: Nutch-Selenium Plugin Truncates Binary Data

2015-02-22 Thread Mattmann, Chris A (3980)
I think this is fantastic Mohammad! Can you update the patch on NUTCH-1933 with this improvement, so we can get it into the sources? Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Nagarjun Pola
I have just started looking along those lines and found that the URLFilter interface has a method named filter. I think this is our point of interest. Maybe you should look at how to use this method in your plugin. On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye jiaxi...@usc.edu wrote: You are

Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Mattmann, Chris A (3980)
What command are you using to crawl? Are you using bin/crawl, and/or doing incremental crawling? Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion

Re: Vagrant Crushed When using Nutch-Selenium

2015-02-22 Thread Mo Omer
No problem! How'd it work out? Mo This message was drafted on a tiny touch screen; please forgive brevity tpyos On Feb 22, 2015, at 6:19 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Mo, great advice.

[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332502#comment-14332502 ] Lewis John McGibbney commented on NUTCH-1928: - Fantastic [~jorgelbg] If you

[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332504#comment-14332504 ] Hudson commented on NUTCH-1928: --- SUCCESS: Integrated in Nutch-trunk #2986 (See

[jira] [Commented] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332505#comment-14332505 ] Lewis John McGibbney commented on NUTCH-1946: - [Ongoing discussion on Gora

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
There is nothing stating in your assignment that you can’t use *previously* crawled data to train your model - you should have at least 2 full sets of this. Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software

Re: [MASSMAIL] Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-22 Thread Yusniel Hidalgo Delgado
Congratulations Jorge! Yusniel Hidalgo Delgado Semantic Web Research Group University of Informatics Sciences http://gws-uci.blogspot.com/ Havana, Cuba - Original message - From: Julien Nioche lists.digitalpeb...@gmail.com To: dev@nutch.apache.org, u...@nutch.apache.org Sent:

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thanks. I will take a look at that. On Sunday, February 22, 2015, Jiaxin Ye jiaxi...@usc.edu wrote: You are absolutely right! I am just throwing ideas :) If you are looking at local data, org.apache.nutch.segment.SegmentReader may be helpful I guess. As all data contents parsed are located

Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Shuo Li
I was using ./bin/crawl and not incremental crawling at that time. This file appears after I start crawling *.gif, *.jpg, *.mov, etc. I will provide more information if I can reproduce this error. Thanks =) On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Yes, I tried that out, but that method takes only the URL as input. The problem is how to get the data of that URL locally. On Sunday, February 22, 2015, Nagarjun Pola np...@usc.edu wrote: I have just started looking up in those lines and found that interface URLFilter has a method named filter. And

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Renxia Wang
Thanks. That's what I was trying to figure out, but I didn't know which class to use to get the path to the data files. Thanks for pointing it out. On Sunday, February 22, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: In the constructor of your URLFilter, why not consider passing in a

[jira] [Work started] (NUTCH-1946) Upgrade to Gora 0.6

2015-02-22 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-1946 started by Lewis John McGibbney. --- Upgrade to Gora 0.6 --- Key:

[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332500#comment-14332500 ] Jorge Luis Betancourt Gonzalez commented on NUTCH-1928: --- [~lewismc]

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Majisha Parambath
My understanding is that the LinkDB or CrawlDB will contain the results of previously fetched and parsed pages. However, if we want to get the contents of a URL/page in the URL filtering stage (*which is not yet fetched*), is there any util in Nutch that we can use to fetch the contents of the

Re: How to read metadata/content of an URL in URLFilter?

2015-02-22 Thread Mattmann, Chris A (3980)
That’s one way - for sure - but what I was implying is that you can train (read: feed data into) your model (read: algorithm) using previously crawled information. So, no I wasn’t implying machine learning. ++ Chris Mattmann, Ph.D.