Hi Dinesh,
This article
<https://opensourceconnections.com/blog/2014/05/24/crawling-with-nutch/> is
quite old (Nutch 1.x, Solr 4.x), but the high-level steps are still pretty
much the same: get your Java set up, kick off a Solr
<http://lucene.apache.org/solr/guide/7_5/installing-
To highlight Shawn's point, Nutch leverages Solr. That means that Nutch
defines the integration and is responsible for providing its
documentation.
On Mon, Oct 22, 2018, 4:14 PM Atita Arora wrote:
> and
> https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
>
and
https://lobster1234.github.io/2017/08/14/search-with-nutch-mongodb-solr/
On Tue, Oct 23, 2018 at 1:12 AM Atita Arora wrote:
> I think this should be kind of useful :
>
>
> https://blog.building-blocks.com/building-a-search-engine-with-nutch-and-solr-in-10-minutes/
>
> I i
I think this should be kind of useful :
https://blog.building-blocks.com/building-a-search-engine-with-nutch-and-solr-in-10-minutes/
I integrated Aperture with Solr way back in 2008.
On Mon, Oct 22, 2018 at 11:27 PM Dinesh Sundaram
wrote:
> Thanks Shawn for the reply, yes I do have s
On 10/22/2018 3:26 PM, Dinesh Sundaram wrote:
Thanks Shawn for the reply. Yes, I do have some questions on Solr too.
Can you please share the steps on the Solr side to integrate Nutch, or are no
steps needed in Solr?
Since I have no idea what has to happen on the nutch side, I really
can't
Thanks Shawn for the reply. Yes, I do have some questions on Solr too.
Can you please share the steps on the Solr side to integrate Nutch, or are no
steps needed in Solr?
On Thu, Oct 18, 2018 at 8:35 PM Shawn Heisey wrote:
> On 10/18/2018 12:35 PM, Dinesh Sundaram wrote:
> > Can you please
On 10/18/2018 12:35 PM, Dinesh Sundaram wrote:
Can you please share the steps to integrate Nutch 2.3.1 with SolrCloud
7.1.0?
You will need to speak to the nutch project about how to configure their
software to interact with Solr. If you have questions about Solr
itself, we can answer those.
Hi Team,
Can you please share the steps to integrate Nutch 2.3.1 with SolrCloud
7.1.0?
Thanks,
Dinesh Sundaram
domain in nutch in solr
Hi,
Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding a website http://www.techcrunch.com/ in seed.txt.
The following property is added in nutch-site.xml:
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links
I had a similar error. I couldn't find any documentation on which Nutch and
Solr versions are compatible. For instance, we're using Nutch 1.6 on
Hadoop 1.0.4 with SolrJ 3.4.0 and index crawled segments to Solr 4.2.0. But
I remember that I could find a compatible version of SolrJ for Nutch 1.4
I am trying to configure nutch 1.4 with solr 3.4.
I configured everything and when I run the command:
./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2 -topN 2
I get the following error:
java.io.IOException: Job failed!
SolrDeleteDuplicates: starting at 2013-06-06 15:49:30
Can you check whether you have the correct SolrJ client library version in both
Nutch and the Solr server?
--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-1-4-solr-3-4-configuration-error-tp4068724p4068733.html
Sent from the Solr - User mailing list archive at Nabble.com.
information:
: 1) The solr admin screen comes up fine in the browser.
At which URL does the Solr admin screen come up fine in your browser?
Best guess...
1) you have solr installed such that it uses the webcontext /solr but
you gave the wrong url to nutch (ie: try -solr
http://localhost:8080/solr
to confirm an expected behavior of solr:
Assuming we have <uniqueKey>id</uniqueKey> in schema.xml for solr, when
we send the same URL from nutch to solr multiple times, would there be ONLY
ONE entry for that URL, but the content (if changed) and timestamp would be
updated
we have <uniqueKey>id</uniqueKey> in schema.xml for solr, when
we send the same URL from nutch to solr multiple times, would there be ONLY
ONE entry for that URL, but the content (if changed) and timestamp would be
updated?
Thanks!
Joe
--
Regards,
David Shen
http://about.me/davidshen
wrote:
Dear list,
I just want to confirm an expected behavior of solr:
Assuming we have <uniqueKey>id</uniqueKey> in schema.xml for solr, when
we send the same URL from nutch to solr multiple times, would there be ONLY
ONE entry for that URL, but the content (if changed) and timestamp would be
updated?
Thanks!
Joe
--
Regards,
David Shen
http://about.me/davidshen
https://twitter.com/#!/davidshen84
Now all works!
I have another problem when I use a connector with my Solr-Nutch setup.
This is the error:
Grave: java.lang.RuntimeException:
org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068
The empty path message is because Nutch is unable to find a URL in the URL
location that you provide.
Kindly ensure there is a URL there.
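For what it's worth, a minimal sketch of preparing the URL location before injecting could look like the following. The urls directory and the inject command are the ones used elsewhere in this thread; the file name seed.txt and the example URL are borrowed from another message in this archive, and the guard around bin/nutch is only an assumption so that the snippet does nothing harmful outside a Nutch install directory.

```shell
# Create the directory that is passed to Nutch as the URL location.
mkdir -p urls

# Put at least one URL in it, one per line (example URL, replace with your own).
printf '%s\n' 'http://www.techcrunch.com/' > urls/seed.txt

# Run the inject step only when this is executed from a Nutch install directory.
if [ -x bin/nutch ]; then
  bin/nutch inject crawl/crawldb urls
fi
```

If urls is empty or missing, there is nothing for the inject step to read, which would be consistent with the empty path message described above.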
--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
Sent from the Solr - User mailing list archive
I tried to configure Nutch (1.4) with my Solr 3.2.
But when I try a crawl command,
bin/nutch inject crawl/crawldb urls
it doesn't work, and it replies with "can't convert a empty path".
Why, in your opinion?
tx
a.
-- folder name
The folder name will be for different domains. So for each domain folder in
urls folder there has to be a corresponding folder (with the same name) in
the crawl folder.
--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3765607.html
Sent from
Hi All,
I have some problems with the integration of Nutch with Solr and Tomcat.
I followed the Nutch tutorial for integration and now I can crawl a website: all
works right.
But if I try the Solr integration, I can't index into Solr.
Below is the Nutch output after the command:
bin/nutch crawl urls -solr
Doesn't Tomcat run on port 8080, and not port 8983? Or did you change
Tomcat's default port to 8983?
On Feb 5, 2012 5:17 AM, alessio crisantemi alessio.crisant...@gmail.com
wrote:
Hi All,
I have some problems with integration of Nutch in Solr and Tomcat.
I followed the Nutch tutorial
alessio crisantemi-2,
I think you got it. Check the jars in the Nutch lib and see if the Solr and SolrJ
jars are the same. That could be the issue.
--
View this message in context:
http://lucene.472066.n3.nabble.com/nutch-in-solr-tp3716969p3717542.html
Sent from the Solr - User mailing list archive
in Solr and Tomcat.
I followed the Nutch tutorial for integration and now I can crawl a website:
all works right.
But if I try the Solr integration, I can't index into Solr.
Below is the Nutch output after the command:
bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3
It looks like the SolrJ version in the Nutch classpath is different from the Solr
version on the server.
Can you post the versions for both Nutch and Solr?
On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
no, all run on port 8983.
..
2012/2/5 Matthew Parker mpar
SolrJ is the Solr Java client library,
so there seem to be two versions, 1.4.1 and 3.4.0, which are
incompatible. You can do the following:
refer to
https://github.com/geek4377/nutch/commit/c66bf35ff4f86393413621b3b889b1c78281df4d
to see how to upgrade the Solr version in Nutch; the above example
replaces Solr 1.4.0 with 3.1.0.
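As a rough way to spot this kind of mismatch, the SolrJ jar names on both sides can be compared. This is only a sketch: the two helper functions are hypothetical names made up for illustration, and the jar naming pattern (*solrj*.jar) is assumed from the apache-solr-solrj-1.4.1.jar file name mentioned in this thread.

```shell
# list_solrj_jars DIR: print the file names of all SolrJ jars found under DIR.
list_solrj_jars() {
  find "$1" -type f -name '*solrj*.jar' -exec basename {} \; | sort
}

# same_solrj NUTCH_LIB SOLR_LIB: succeed only if both directories contain
# the same set of SolrJ jar file names (two empty results also compare equal).
same_solrj() {
  [ "$(list_solrj_jars "$1")" = "$(list_solrj_jars "$2")" ]
}
```

It could then be called with the Nutch lib directory and the lib directory of the deployed Solr webapp (both paths depend on your installation), for example: same_solrj /path/to/nutch/lib /path/to/solr/WEB-INF/lib || echo "SolrJ versions differ"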
On Sun, Feb 5, 2012 at 11:02 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
If I look at the Solr and Nutch libs I find:
apache-solr-solrj-1.4.1.jar on Solr
...@zudiewiener.com wrote:
I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
built) and tomcat6 following this (and some other) links
http://wiki.apache.org/nutch/RunningNutchAndSolr
I have added the nutch schema and can access/view this schema via the
admin page. nutch also works as I can
hope that helps,
On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions
llsub...@zudiewiener.com wrote:
I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
built) and tomcat6 following this (and some other) links
http://wiki.apache.org/nutch/RunningNutchAndSolr
I have
If you're using Solr anyway, you'd better upgrade to Nutch 1.3 with Solr 3.x
support.
Works like a charm.
Thanks,
Leo
On Wed, 2011-07-13 at 11:31 +0530, Geek Gamer wrote:
You need to update the SolrJ libs to a 3.x version; the javabin format
has changed.
I made the change a few
I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not
built) and tomcat6 following this (and some other) links
http://wiki.apache.org/nutch/RunningNutchAndSolr
I have added the nutch schema and can access/view this schema via the
admin page. nutch also works as I can perform
Hello Friends,
I am a newbie to Solr and trying to integrate Apache Nutch 1.3 and Solr 3.2
. I did the steps explained in the following two URL's :
http://wiki.apache.org/nutch/RunningNutchAndSolr
http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html
I
Can you let me know when and where you were getting the error? A screen-shot
will be helpful.
On Tue, Jul 5, 2011 at 8:15 AM, serenity keningston
serenity.kenings...@gmail.com wrote:
Hello Friends,
I am a newbie to Solr and trying to integrate Apache Nutch 1.3 and Solr 3.2
. I did
You are using the crawl job so you must specify the URL to your Solr instance.
The newly updated wiki has your answer:
http://wiki.apache.org/nutch/bin/nutch_crawl
to integrate Apache Nutch 1.3 and
Solr
3.2
. I did the steps explained in the following two URL's :
http://wiki.apache.org/nutch/RunningNutchAndSolr
http://thetechietutorials.blogspot.com/2011/06/how-to-build-and-start-apache-solr.html
I downloaded both the softwares
after it's being parsed, you need to do it later
on.
On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote:
Hi all,
I am a newbie to nutch and solr. Well relatively much newer to Solr than
Nutch :)
I have been using nutch for past two weeks, and I wanted to know if I can
query
in the seed.txt and does
not proceed further from there. So I am just a bit confused. Why is it not
crawling the linked pages (a.html, b.html, c.html and d.html)? I get a
feeling that I am missing something that the author of the blog (
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed
everyone would know.
Thanks,
Abi
On Wed, Feb 9
Hi Erick,
Thanks a bunch for the response
Could be a chance..but all I am wondering is where to specify the depth in
the whole entire process in the URL
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
specifying it during the fetcher phase but it was just ignored
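On the depth question: as far as I know, the -depth value of the one-step crawl command corresponds to the number of generate/fetch/update rounds, so in a step-by-step setup the depth is simply how many times the cycle is repeated. A hedged sketch follows. The function name crawl_rounds and the NUTCH variable are made up for illustration, the crawl/crawldb and crawl/segments paths follow the inject command quoted elsewhere in this archive, and whether a separate parse step is needed depends on how the fetcher is configured. This is not the blog author's procedure.

```shell
# crawl_rounds DEPTH: run DEPTH rounds of generate / fetch / parse / updatedb.
# NUTCH is the launcher to call; it defaults to bin/nutch.
# Setting NUTCH="echo bin/nutch" prints the commands instead of running them.
crawl_rounds() {
  depth=$1
  nutch=${NUTCH:-bin/nutch}
  round=1
  while [ "$round" -le "$depth" ]; do
    # Select the next batch of URLs and create a new segment for them.
    $nutch generate crawl/crawldb crawl/segments || return 1
    # Pick the newest segment directory (empty during a dry run).
    segment=$(ls -d crawl/segments/* 2>/dev/null | tail -n 1)
    $nutch fetch "$segment" || return 1
    $nutch parse "$segment" || return 1
    # Feed newly discovered links back so the next round can fetch them.
    $nutch updatedb crawl/crawldb "$segment" || return 1
    round=$((round + 1))
  done
}
```

For a dry run, NUTCH="echo bin/nutch" crawl_rounds 3 prints the twelve commands that three rounds would issue; with only one round, pages linked from the seed page are discovered but never fetched, which matches the symptom described in this thread.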
:. ab1s...@gmail.com wrote:
Hi Erick,
Thanks a bunch for the response
Could be a chance..but all I am wondering is where to specify the depth in
the whole entire process in the URL
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried
specifying it during the fetcher phase
Hi Charan,
Thanks for the clarifications.
The link I have been referring to(
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say
anything about using the crawl? Do I have to do it after the last step
mentioned?
Thanks,
Abi
On Thu, Feb 10, 2011 at 12:58 AM, charan kumar
, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote:
Why are you using solrindex in the argument? It is used when we need to index
the crawled data in Solr.
For more read http://wiki.apache.org/nutch/NutchTutorial .
Also for nutch-solr integration this is very useful blog
http
BLEH! facepalm This is entirely possible to do in a single step AS LONG AS
YOU GET THE SYNTAX CORRECT ;-)
http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/
bin/nutch crawl urls -dir crawl
All,
I have a couple websites that I need to crawl and the following command line
used to work I think. Solr is up and running and everything is fine there
and I can go through and index the site but I really need the results added
to Solr after the crawl. Does anyone have any idea on how to make
Why are you using solrindex in the argument? It is used when we need to index
the crawled data in Solr.
For more read http://wiki.apache.org/nutch/NutchTutorial .
Also for nutch-solr integration this is very useful blog
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
I integrated nutch and solr and it works well.
Thanks
On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene]
ml-node+2122347-622655030-146...@n3.nabble.comml-node%2b2122347
hi, i'm new in using apache nutch and solr... has anyone from the list
experiences in indexing nutch crawls into solr? the main problem is, that e.g.
nutch crawled pdf documents (with the other stuff from the crawled site) after
solr-indexing isn't queryable... e.g.
query in nutch:
bin
Tony Wang wrote:
Thanks Otis.
I've just downloaded NUTCH-442_v8.patch
<https://issues.apache.org/jira/secure/attachment/12391810/NUTCH-442_v8.patch> from
https://issues.apache.org/jira/browse/NUTCH-442, but the patching process
gave me lots of errors, see below:
This patch will be integrated within
is CentOS 5.2 by
the way.
Thanks!
Tony
On Sun, Dec 28, 2008 at 10:18 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
Tony,
I think you should ignore the advice/code from foofactory blog and just go
with NUTCH-442, as that's most likely going to result in the official
Nutch-Solr
understanding either Nutch or Solr. My suggestion is to first play only with
Nutch and learn how to run various Nutch steps, all the way to generating an
index. Then play with Solr (and forget about Nutch) by following the Solr
tutorial. Once you get Solr by itself working, you will understand
=on&hl.fl=content
I followed this guide to integrate Nutch with Solr
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html.
I wonder what could be wrong with my integration.
I use CentOS 5.2, Tomcat6 and the latest nightly builds of Nutch and Solr.
Thanks!
Tony
--
Signature
/ -- Lucene - Solr - Nutch
- Original Message
From: Tony Wang ivyt...@gmail.com
To: solr-user@lucene.apache.org
Sent: Friday, December 26, 2008 11:20:06 AM
Subject: Please help me integrate Nutch with Solr
I got the web interface to work at here
http://208.64.71.46:8080/search.jsp
wrong from the information you provided
below. Are there any errors in the log? Are you sure your solr home is set
correctly?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Tony Wang ivyt...@gmail.com
To: solr-user
/example/solr/data/index/ directory. You will need to adjust the schema to
match the Lucene/Nutch index fields, too.
But honestly, it looks like you are starting from the middle without really
following things step-by-step and without really understanding either Nutch or
Solr. My suggestion
blog, or post it as a patch for inclusion in
nutch/contrib (if sami is ok with that). If you have issues with
how to use the solr client api, solr-user is here to help.
I've done this. Apparently someone else has taken on the solr-nutch
job and made it a bit more complicated (which is good
On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote:
NUTCH-442 is one of the issues that I want to really see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.
I am probably your target customer but to be honest all we care about
with SOLR
Daniel,
We just started to test and research the possibility of integrating Nutch and Solr,
so it would be nice to hear any advice as well.
Thanks,
DT
www.ejizn.com
- Original Message -
From: Daniel Clark [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, September 25, 2007 1:23 PM
/contrib (if sami is ok with that). If you have issues with
how to use the solr client api, solr-user is here to help.
I've done this. Apparently someone else has taken on the solr-nutch
job and made it a bit more complicated (which is good for the long
term) than sami's original patch
But we still use a version of Sami's patch that works on both trunk
nutch and trunk solr (solrj.) I sent my changes to sami when we did
it, if you need it let me know...
I put my files up here: http://variogr.am/latest/?p=26
-b
Thanks Brian.
I'm sure this will help lots of people.
Brian Whitman wrote:
But we still use a version of Sami's patch that works on both trunk
nutch and trunk solr (solrj.) I sent my changes to sami when we did
it, if you need it let me know...
I put my files up here: http://variogr.am