Hi, I have installed Selenium on my Mac, but when I try to crawl any website
I get the following lines and then the crawl just stops.
org.openqa.selenium.firefox.NotConnectedException: Unable to connect to
host 127.0.0.1 on port 7055 after 45000 ms.
I am not sure how to solve this.
Hi,
I was unable to reproduce the linkdb error.
The NSIDC ADE 403 Forbidden error occurs because NSIDC seems to be blocking
User-Agent strings that contain "nutch".
--
Thanks,
Veeresh
On 20 February 2015 at 15:26, Shuo Li sli...@usc.edu wrote:
Hi,
I'm trying to crawl NSF ACADIS with
[
https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-1944:
Fix Version/s: (was: 2.3.1)
2.4
Add raw content to indexes
[
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney reassigned NUTCH-1946:
---
Assignee: Lewis John McGibbney
Upgrade to Gora 0.6
---
[
https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332417#comment-14332417
]
Sebastian Nagel commented on NUTCH-1944:
This issue duplicates NUTCH-1785 but this
Hi
I want to develop a URLFilter which takes a URL, takes its metadata or
even the fetched content, then uses some duplicate detection algorithm to
determine if it is a duplicate of any URL in Nutch. However, the only
parameter passed into the URLFilter is the URL; is it possible to get the
[
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney resolved NUTCH-1925.
-
Resolution: Fixed
Committed @revision 1661539 in 2.3.1 HEAD
Upgrade Tika to
Lewis John McGibbney created NUTCH-1946:
---
Summary: Upgrade to Gora 0.6
Key: NUTCH-1946
URL: https://issues.apache.org/jira/browse/NUTCH-1946
Project: Nutch
Issue Type: Improvement
Dear Wiki user,
You have subscribed to a wiki page RunNutchInEclipse for change notification.
An attachment has been added to that page by SebastianNagel. Following detailed
information is available:
Attachment name: nutch_eclipse_javadoc_loc.png
Attachment size: 70201
Attachment link:
[
https://issues.apache.org/jira/browse/NUTCH-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-1923:
Fix Version/s: (was: 2.4)
2.3.1
Nutch + Cassandra Docker
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The RunNutchInEclipse page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/RunNutchInEclipse?action=diff&rev1=50&rev2=51
Comment:
add section how to make Eclipse display
[
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332350#comment-14332350
]
Hudson commented on NUTCH-1925:
---
SUCCESS: Integrated in Nutch-nutchgora #1347 (See
[
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-840:
---
Attachment: NUTCH-840-2.x.patch
Patch for 2.X.
There currently appears to be a
[
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332384#comment-14332384
]
Sebastian Nagel commented on NUTCH-1925:
Great to see again successful Jenkins
[
https://issues.apache.org/jira/browse/NUTCH-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-1709:
Fix Version/s: (was: 2.4)
2.3.1
Generated classes
I just added a counter in my URLFilter, and it shows that the URLFilter
instances in each fetch cycle are different.
Sample logs:
2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69
links
2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70
links
2015-02-22
Thanks,
Chetan Vazirabadkar.
I log the instance id and get the result:
2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID:
423250256
2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID:
828433560
2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID:
828433560
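One way to tell a fresh filter per fetch cycle apart from a single cached filter is to log an instance identifier next to a per-instance counter. The sketch below illustrates the idea in Python purely for brevity; a real Nutch URLFilter is a Java class, where System.identityHashCode(this) would play the role of id(self). The names CountingFilter and run_cycle are hypothetical and not part of Nutch.

```python
class CountingFilter:
    """Hypothetical stand-in for a URL filter that counts the URLs it sees."""

    def __init__(self):
        self.processed = 0

    def filter(self, url):
        self.processed += 1
        # Log the instance identity together with the per-instance counter.
        print(f"filter id={id(self)} processed={self.processed} url={url}")
        return url  # accept every URL unchanged


def run_cycle(urls, shared=None):
    """Simulate one fetch cycle; reuse `shared` if given, else build a new filter."""
    f = shared if shared is not None else CountingFilter()
    for u in urls:
        f.filter(u)
    return f
```

If the counter restarts from 1 in every cycle while the logged identifier changes, the instances are probably different; if the identifier stays the same and the counter keeps growing, the instance is being reused.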
Cool, good test. I thought the Nutch plugin system cached instances
of plugins - I am not sure if it creates a new one each time. Are you
sure you don’t have the same URLFilter instance that is just called on
different datasets and thus produces different counts?
Either way, so you should simply
You need to install the 1.8-SNAPSHOT version of Tika for your assignment.
Please read the assignment instructions again.
http://sunset.usc.edu/classes/cs572_2015/
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument
In the constructor of your URLFilter, why not consider passing
in a NutchConfiguration object, and then reading the path to, e.g.,
the LinkDb from the config? Then have a private member variable
for the LinkDbReader (maybe statically initialized for efficiency)
and use that in your interface method.
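The pattern described above (read a path from configuration, build the reader once, share it across filter instances) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Nutch code: InMemoryLinkReader, ConfiguredFilter, the config key "linkdb.path" and the rejection rule are all hypothetical stand-ins. In an actual plugin the filter would be a Java class and the reader would be Nutch's LinkDbReader.

```python
class InMemoryLinkReader:
    """Hypothetical stand-in for a reader over previously crawled link data."""

    opened = 0  # counts how many readers have been constructed

    def __init__(self, path):
        InMemoryLinkReader.opened += 1
        self.path = path
        # A real reader would load data from `path`; this sketch hard-codes it.
        self.inlinks = {"http://example.org/a": ["http://example.org/"]}

    def get_inlinks(self, url):
        return self.inlinks.get(url, [])


class ConfiguredFilter:
    """Hypothetical filter that takes its data path from a config mapping."""

    _reader = None  # shared by all instances, built on first use

    def __init__(self, conf):
        self.path = conf["linkdb.path"]  # assumed config key, not a Nutch property

    def _shared_reader(self):
        # Build the reader only once; later instances reuse it.
        # Note: this ignores the case where instances are given different paths.
        if ConfiguredFilter._reader is None:
            ConfiguredFilter._reader = InMemoryLinkReader(self.path)
        return ConfiguredFilter._reader

    def filter(self, url):
        # Illustrative rule only: reject (return None) URLs with known inlinks.
        return None if self._shared_reader().get_inlinks(url) else url
```

Sharing the reader at class level means that even if the plugin system creates a new filter per cycle, the data is opened only once per process.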
Going to implement more configuration in the plugin, but
based on the student emails I think your advice helped :)
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet
Hi Prof Mattmann,
You mention training a model; are we expected to use machine learning
algorithms to train a model for duplicate detection?
Thanks,
Renxia
On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
There is nothing stating in your
Hi Majisha,
According to the comments in the source code of the URLFilter interface, the
URLFilter is called in the injector and the db updater, which means that the
data for the URL you are processing in the filter has already been crawled.
You may want to take a look at this article, which illustrates the workflow
Exactly, Jiaxin, great answer.
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop:
Hi Mohammad, did you get this fixed?
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop:
[
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332441#comment-14332441
]
Chris A. Mattmann commented on NUTCH-1933:
--
Thank you [~almohsin], I will update
Exactly, Mohammad, thank you.
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop:
Is there only one instance of a plugin for all fetch cycles? I am assuming
that when the job is started, a plugin instance is initialized and used in
every fetch cycle. Is that correct?
On Sunday, February 22, 2015, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
In the
I believe the Plugin system caches plugins, but you will need
to confirm (haven’t looked in a long time).
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion
[
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: NUTCH-1928v6.patch
Indexing filter of documents by
[
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332506#comment-14332506
]
Lewis John McGibbney commented on NUTCH-1946:
-
Right now I am bumping in to
Good to hear!
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email:
Thanks Mo, great advice.
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email:
Hey,
I haven't started working on the deduplication yet, but if I were you I
would use the Tika library to retrieve the MIME type and metadata. The code is
presented in the Tika book. Why not try that out? :)
Best,
Jiaxin
On Sunday, February 22, 2015, Renxia Wang renxi...@usc.edu wrote:
Hi
I want
Thank you for your suggestion. I will take a look at that. There is a
URLUtil class in Nutch's source code, but I am wondering whether it will
send a request to the URL again to get the data. Because the URL's metadata
has already been downloaded, it would be better if we could get the data locally.
On
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki for change
notification.
The AdvancedAjaxInteraction page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=2&rev2=3
Comment:
- add links for install Selenium
You are absolutely right! I am just throwing out ideas :) If you are looking at
local data, org.apache.nutch.segment.SegmentReader may be helpful, I guess,
as all the parsed content is located there.
On Sun, Feb 22, 2015 at 4:33 PM, Renxia Wang renxi...@usc.edu wrote:
Thank you for you
You are using the Github version of the patch which only works
with Nutch2 - you need to use NUTCH-1933.
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet
Hi Nikunj,
Please see this:
https://en.wikipedia.org/wiki/Patch_(Unix)
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA
I think this is fantastic Mohammad!
Can you update the patch on NUTCH-1933 with this improvement,
so we can get it into the sources?
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data
I have just started looking along those lines and found that the interface
URLFilter has a method named filter. I think this is our point of
interest.
Maybe you should look at how to use this method in your plugin.
On Sun, Feb 22, 2015 at 4:41 PM, Jiaxin Ye jiaxi...@usc.edu wrote:
You are
What command are you using to crawl? Are you using bin/crawl, and/or
doing incremental crawling?
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion
No problem! How'd it work out?
Mo
This message was drafted on a tiny touch screen; please forgive brevity tpyos
On Feb 22, 2015, at 6:19 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
Thanks Mo, great advice.
[
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332502#comment-14332502
]
Lewis John McGibbney commented on NUTCH-1928:
-
Fantastic [~jorgelbg]
If you
[
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332504#comment-14332504
]
Hudson commented on NUTCH-1928:
---
SUCCESS: Integrated in Nutch-trunk #2986 (See
[
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332505#comment-14332505
]
Lewis John McGibbney commented on NUTCH-1946:
-
[Ongoing discussion on Gora
There is nothing stating in your assignment that you can’t
use *previously* crawled data to train your model - you
should have at least 2 full sets of this.
Cheers,
Chris
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software
Congratulations Jorge!
Yusniel Hidalgo Delgado
Semantic Web Research Group
University of Informatics Sciences
http://gws-uci.blogspot.com/
Havana, Cuba
- Original Message -
From: Julien Nioche lists.digitalpeb...@gmail.com
To: dev@nutch.apache.org, u...@nutch.apache.org
Sent:
Thanks. I will take a look at that.
On Sunday, February 22, 2015, Jiaxin Ye jiaxi...@usc.edu wrote:
You are absolutely right! I am just throwing ideas :) If you are looking
at local data, org.apache.nutch.segment.SegmentReader may be helpful I
guess. As all data contents parsed are located
I was using ./bin/crawl and not incremental crawling at that time. This
file appears after I start crawling *.gif, *.jpg, *.mov, etc. I will
provide more information if I can reproduce this error.
Thanks =)
On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov
Yes, I tried that out, but that method takes only the URL as input. The problem is
how to get the data for that URL locally.
On Sunday, February 22, 2015, Nagarjun Pola np...@usc.edu wrote:
I have just started looking up in those lines and found that interface
URLFilter has a method named filter. And
Thanks. That's what I was trying to figure out, but I didn't know which class
to use to get the path to the data files. Thanks for pointing it out.
On Sunday, February 22, 2015, Mattmann, Chris A (3980)
chris.a.mattm...@jpl.nasa.gov wrote:
In the constructor of your URLFilter, why not consider passing
in a
[
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on NUTCH-1946 started by Lewis John McGibbney.
---
Upgrade to Gora 0.6
---
Key:
[
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332500#comment-14332500
]
Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---
[~lewismc]
My understanding is that the LinkDB or CrawlDB will contain the results of
previously fetched and parsed pages.
However, if we want to get the contents of a URL/page in the URL filtering
stage (*which is not yet fetched*), is there any util in Nutch that we
can use to fetch the contents of the
That’s one way - for sure - but what I was implying is that
you can train (read: feed data into) your model (read: algorithm)
using previously crawled information. So no, I wasn’t implying
machine learning.
++
Chris Mattmann, Ph.D.
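As a rough illustration of "training" in the sense of feeding previously crawled data into an algorithm, the sketch below builds a table of content digests from an earlier crawl and then checks new content against it. It is a minimal Python sketch of exact-duplicate detection under assumptions: ExactDupDetector and its methods are hypothetical names, SHA-256 is just one possible digest, and how the (URL, content) pairs are read from a crawl is left out.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a hex digest that is identical for byte-identical content."""
    return hashlib.sha256(content).hexdigest()


class ExactDupDetector:
    """Hypothetical exact-duplicate detector fed with previously crawled pages."""

    def __init__(self):
        self.seen = {}  # digest -> first URL observed with that content

    def train(self, pages):
        """Feed (url, content) pairs from an earlier crawl into the table."""
        for url, content in pages:
            # Keep the first URL seen for each digest.
            self.seen.setdefault(digest(content), url)

    def duplicate_of(self, content):
        """Return the earlier URL with identical content, or None if unseen."""
        return self.seen.get(digest(content))
```

This only catches byte-identical pages; near-duplicate detection would need a different technique.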