I'd also be very much interested in knowing these!
On 11/18/2012 07:32 PM, kiran chitturi wrote:
Hi!
I have been running crawls with Nutch over 13,000 documents (protocol http)
on a single machine, and it takes 2-3 days to finish. I am
using the 2.x version of Nutch.
I use a depth of
Solr sent me to Nutch list, but okay. Thanks,
On 10/16/2012 02:27 PM, Lewis John Mcgibbney wrote:
Hi Tolga,
Please take this to the Solr user@ list.
Thank you
Lewis
On Tue, Oct 16, 2012 at 12:13 PM, Tolga wrote:
Hi,
I've tried url:fass\.sabanciuniv\.edu AND content:this, and
more details of what you have tried and
what issues you are having.
On Fri, Oct 12, 2012 at 5:03 PM, Tolga wrote:
Not really. Let me elaborate. If I pass it multiple URLs such as
http://example.com, example.net and example.org, how can I search only in
example.net?
Regards,
On 12 October 2012 23:55, Tejas Patil wrote:
> Hi Tolga,
>
> For searching a specific content from a specific website, crawl it firs
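For reference, one way to limit a Solr search to a single crawled site is a filter query on the host field. A sketch, assuming the stock Nutch Solr schema (the host field, query keyword, and Solr URL are assumptions here, not confirmed by the thread):

```shell
# Sketch: search for "keyword" only in documents from example.net
# (assumes Solr on the default port and a "host" field in the schema)
curl 'http://localhost:8983/solr/select?q=content:keyword&fq=host:example.net'
```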
Hi,
I use nutch to crawl my website and index to solr. However, how can I
search for a piece of content in a specific website? I use multiple URLs.
Regards,
On 10/05/2012 04:22 PM, Asha Chhikara wrote:
which type of field is it.
On Fri, Oct 5, 2012 at 6:40 PM, Tolga wrote:
In solr schema.xml
On 10/05/2012 03:41 PM, Asha Chhikara wrote:
where is it present.
On Fri, Oct 5, 2012 at 6:04 PM, Tolga wrote:
Hi,
What is meant by "ERROR: [doc=http://bilgisayarciniz.org/] Error adding
field 'title'='bilgisayarciniz web hizmetleri'"? I have title defined as
a field; it also has the multiValued property.
Regards,
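For anyone hitting this later: the error usually means the document carried more title values than the schema allows. A sketch of the relevant schema.xml declaration (the type and attribute values here are illustrative, not the stock schema):

```xml
<!-- Sketch: allow multiple values for title; adjust type/attributes
     to match your own schema -->
<field name="title" type="text" stored="true" indexed="true" multiValued="true"/>
```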
Sorry for reposting, but it must have been missed.
Original Message
Subject:Nutch and CAS
Date: Thu, 27 Sep 2012 15:37:04 +0300
From: Tolga
To: user@nutch.apache.org
Hi,
Can Nutch crawl a website behind CAS (Central Authentication System)?
Regards,
Oh, and by way I have the 'title' field.
On 09/05/2012 04:48 PM, Tolga wrote:
I changed the encoding to ISO-8859-9 and restarted Solr, it didn't
work :S Below is the full error:
SEVERE: org.apache.solr.common.SolrException: ERROR:
[doc=http://www.sabanciuniv.edu/] Error adding field 'title'='Sabancı Üniversitesi'
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:297)
... 26 more
Regards,
On 09/05/2012 02:14 PM, Lewis John Mcgibbney wrote:
Most likely
On Wed, Sep 5, 2012 at 12:12 PM, Tolga wrote:
Sorry for replying to this, I can start a new thread if
See the various tutorials here to get you going
http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29
hth
Lewis
On Tue, Sep 4, 2012 at 2:27 PM, Tolga wrote:
Hi,
I set up my Solr, and when I tried to crawl my website, it gave me
[mtozses@atlas bin]$ time ./nutch crawl urls -dir crawl-$(date
+%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 5 -topN 5
Exception in thread "main" org.apache.gora.util.GoraException:
java.io.IOException: java.sql.SQ
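For context, Nutch 2.x stores crawl data through Gora, and a GoraException wrapping a java.sql.SQLException usually points at the SQL backend configuration. A sketch of the HSQLDB-flavoured conf/gora.properties from that era (the property names and database URL are assumptions; verify against your own gora.properties, and note the HSQLDB server must actually be running at that URL):

```properties
# Sketch: Gora SQL store settings for a local HSQLDB server
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
```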
What brackets? I don't see brackets.
On 08/28/2012 03:39 PM, Lewis John Mcgibbney wrote:
I assume you are not using the brackets?
This command should most certainly work.
Lewis
On Tue, Aug 28, 2012 at 7:48 AM, Tolga wrote:
"ant runtime" gives me:
Buildfile: build.xml
You should check out the following tutorials below
http://wiki.apache.org/nutch/Nutch2Tutorial
http://nlp.solutions.asia/?p=180
Lewis
On Mon, Aug 27, 2012 at 12:34 PM, Tolga wrote:
Hi, and thanks for your fast reply.
I found a tutorial on the interwebz, and it said to use ant in $NUTCH.
However
Hi, and thanks for your fast reply.
I found a tutorial on the interwebz, and it said to use ant in $NUTCH.
However, when I used it, I got:
[mtozses@atlas NUTCH]$ time ant
Buildfile: build.xml
[taskdef] Could not load definitions from resource
org/sonar/ant/antlib.xml. It could not be found
Hi,
I can find only src/bin/nutch in 2.0-src.zip. There's no bin/nutch. Does
that mean I have to compile it?
Regards,
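For the record, the 2.0 source release ships no prebuilt launcher; the runtime has to be generated first. A sketch, assuming Ant is installed and you are at the top of the unpacked source tree:

```shell
# Build the local runtime (this is where bin/nutch is generated)
ant runtime
# The launcher then lives at runtime/local/bin/nutch
```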
and not nutch-default, and my bet is
that you are doing this in NUTCH_HOME/conf and not in
NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki)
On 29 May 2012 07:31, Tolga wrote:
Hi,
I know this issue should have been closed, but I thought I'd continue this
rather than starting a new thread.
On Tue, May 22, 2012 at 9:20 PM, Tolga wrote:
Hi,
I crawl / index PDF files just fine, but I get the following warning.
parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType
application/pdf via parse-plugins.xml, but not enabled via plugin.includes
in nutch-default.xml.
Hi,
Do you have to use Nutch for this purpose? I believe you can use wget -m
http://www.example.com and get everything in a much more structured way.
On 25 May 2012 11:07, vlad.paunescu wrote:
> Hello,
>
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only
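The suggested wget mirror, spelled out as a sketch (flags per the GNU wget manual; the URL is a placeholder):

```shell
# -m  mirror (recursion + timestamping), -np  don't ascend to the parent,
# -k  rewrite links for local browsing
wget -m -np -k http://www.example.com/
```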
Hi,
Isn't tika responsible for XML parsing? Because I got this:
parse.ParserFactory - ParserFactory: Plugin:
org.apache.nutch.parse.feed.FeedParser mapped to contentType
application/rss+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml. Should I just include
I would respectfully suggest that you go through the basic information that
is available online to get familiar with Nutch. Copying the online
information into this mailing list is not helping anybody.
On Thu, May 24, 2012 at 10:19 AM, Tolga wrote:
On 5/24/12 11:00 AM, Piet van Remortel wrote:
On Thu, Ma
On 5/24/12 11:00 AM, Piet van Remortel wrote:
On Thu, May 24, 2012 at 9:35 AM, Tolga wrote:
- I don't fully understand the use of topN parameter. Should I increase it?
yes
What would a sensible topN value be a for a large university website?
- You mean parse-pdf thing? I'v
hadoop logs have overrun the local disk on which the crawler was
running (i.e. disk full)
Piet
On Thu, May 24, 2012 at 9:17 AM, Tolga wrote:
Hi,
I am crawling a large website, which is our university's. From the logs
and some grep'ing, I see that some pdf files were not crawled. Why could
this happen? I'm crawling with -depth 100 -topN 5.
Regards,
If you look at the XML file in question, one of the first XML
configuration blocks says
Just remove your unnecessary config and Tika will do the work for you :0)
Lewis
On Wed, May 23, 2012 at 11:44 AM, Tolga wrote:
Hi,
I put the lines
in parse-plugins.
Hi,
I put the lines
in parse-plugins.xml, but I still can't crawl xls files. Why is that?
Regards,
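For reference, a sketch of what the parse-plugins.xml mapping for Excel files might look like. Note that, as the warnings elsewhere in this thread say, the mapped plugin must also appear in plugin.includes (nutch-site.xml) or the mapping is ignored. The MIME type and plugin id below are assumptions:

```xml
<!-- Sketch: route Excel's MIME type to the Tika parser -->
<mimeType name="application/vnd.ms-excel">
  <plugin id="parse-tika"/>
</mimeType>
```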
Yes, a redirect.
On 5/23/12 11:37 AM, Lewis John Mcgibbney wrote:
Can you please elaborate on a re-write rule? Do you mean a redirect?
On Wed, May 23, 2012 at 7:39 AM, Tolga wrote:
Thank you all, especially Lewis, Markus, and whomever I might have
forgotten! It is working; I can crawl, index and search.
One last question though. On my drupal website, I am redirecting
www.example.com to example.com. However, I noticed that nutch doesn't
crawl the web site if there is a re
Hi,
I crawl / index PDF files just fine, but I get the following warning.
parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
contentType application/pdf via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml.
I've got the value
protocol-http|urlfilter-
to conf/parse-plugins.xml?
Regards,
On 5/22/12 2:06 PM, Piet van Remortel wrote:
another option is
protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
which uses Tika, which parses PDF.
On Tue, May 22, 2012 at 1:00 PM,
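That plugin.includes value goes into conf/nutch-site.xml so that it overrides nutch-default.xml. A minimal sketch wrapping the value above:

```xml
<!-- Sketch: overriding plugin.includes in conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```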
nvisage to be obtained during your
crawl. The first option has the downside that on occasion the parser
can choke on rather large files...
On Tue, May 22, 2012 at 10:36 AM, Tolga wrote:
What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
On 5/22/12 12:34 PM, Lewis John Mcgibbney wro
to use the http.verbose and fetcher.verbose properties as
well.
On Tue, May 22, 2012 at 10:31 AM, Tolga wrote:
The value is 65536
On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
try your http.content.limit and also make sure that you haven't
changed anything within the tika mimeType mappings.
On Tue, May 22, 2012 at 9:06 AM, Tolga wrote:
Sorry, I forgot to also add my original problem. PDF file
Hmm, okay. I never touched that file.
On 5/22/12 12:26 PM, Lewis John Mcgibbney wrote:
Sorry I should have been more explicit about the exact file locationb
http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
hth
On Tue, May 22, 2012 at 10:19 AM, Tolga wrote:
By, tika mimeType settings, do you mean protocol-http?
On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
try your http.content.limit and also make sure that you haven't
changed anything within the tika mimeType mappings.
On Tue, May 22, 2012 at 9:06 AM, Tolga wrote:
Sorry, I forgot to also add my original problem. PDF files are not
crawled. I even modified -topN to be 10.
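To tie the http.content.limit discussion together: the value is in bytes (default 65536, i.e. 64 KB, which truncates a multi-megabyte PDF before the parser ever sees it). A sketch of raising it in conf/nutch-site.xml; the 10 MB figure is just an example, and -1 should disable the limit entirely:

```xml
<!-- Sketch: raise the per-document fetch cap to ~10 MB -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
```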
Original Message
Subject:PDF not crawled/indexed
Date: Tue, 22 May 2012 10:48:15 +0300
From: Tolga
To: user@nutch.apache.org
Hi,
I am crawling my website with this command:
bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
http://localhost:8983/solr/ -depth 20 -topN 5
Is it a good idea to modify the directory name? Should I always delete
indexes prior to crawling and stick to the same directory name?
Regards,
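On the directory-name question: the $(date ...) substitution in the command above just produces a fresh, sortable directory name per run. A sketch of what it expands to (the crawl- prefix follows the command above):

```shell
# Build the same dated name the crawl command uses
# (%F is YYYY-MM-DD; the time is hyphen-separated to stay filename-safe)
CRAWL_DIR="crawl-$(date +%FT%H-%M-%S)"
echo "$CRAWL_DIR"
```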
Okay I'm coming to the end of my questions.
Do I need to read
http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F
to index files as well on a web site?
Thanks a lot :)
Hi,
I am getting this error while crawling my website with nutch:
[doc=null] missing required field: id
request: http://localhost:8983/solr/update?wt=javabin&version=2 at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at
org.apache.solr.client.so
Hi Cameron,
I've been dealing with the same issue, and taking care of it by adding
the field, in your case 'site', to solr schema.xml, and restarting solr.
On 5/18/12 7:58 AM, cameron tran wrote:
Hello
I am trying to get Nutch 1.4 (downloaded binary) to do solrindex to
http://127.0.0.1:8983/
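A sketch of the kind of schema.xml line that fixes this class of error. The field name comes from the message above; the type and attributes are illustrative:

```xml
<!-- Sketch: declare the field Solr complained about, then restart Solr -->
<field name="site" type="string" stored="true" indexed="true"/>
```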
On Tuesday 15 May 2012 13:40:26 Tolga wrote:
I'm a little confused. How can I not use the crawl command and execute
the separate crawl cycle commands at the same time?
Regards,
On 5/11/12 9:40 AM, Markus Jelsma wrote:
Ah, that means don't use the crawl command and do a little shell
Hi,
I have been trying for a week. I really want to get a start, so what
should I use? curl or nutch? I want to be able to index pdf, xml etc.
and search within them as well.
Regards,
I'm going nuts.
I issued the command bin/nutch crawl urls -solr
http://localhost:8983/solr/ -depth 3 -topN 5, went on to
http://localhost:8983/solr/admin/stats.jsp and verified the index, but
can't search within a web page. What am I doing wrong?
Regards,
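One quick sanity check, assuming the default Solr URL: query the index directly and see whether any documents, and the fields you expect, are actually there:

```shell
# Sketch: pull one document back from the index
curl 'http://localhost:8983/solr/select?q=*:*&rows=1'
```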
won't crawl?
On 5/15/12 2:05 PM, Markus Jelsma wrote:
Please follow the step-by-step tutorial, it's explained there:
http://wiki.apache.org/nutch/NutchTutorial
On Tuesday 15 May 2012 13:40:26 Tolga wrote:
I'm a little confused. How can I not use the crawl command and execute
e commands, see the nutch
wiki for examples. And don't do solrdedup. Search the Solr wiki for
deduplication.
cheers
On Fri, 11 May 2012 07:39:36 +0300, Tolga wrote:
Hi,
How do I exactly "omit solrdedup and use Solr's internal
deduplication" instead? I don't even know
omit solrdedup and use Solr's internal
deduplication instead, it works similar and uses the same signature
algorithm as Nutch has. Please consult the Solr wiki page on
deduplication.
Good luck
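For later readers, a sketch of the solrconfig.xml deduplication chain the Solr wiki describes. The class names follow that wiki, the signature field must exist in your schema, and every name here should be verified against your Solr version:

```xml
<!-- Sketch: signature-based deduplication at update time -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```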
On Thu, 10 May 2012 22:54:37 +0300, Tolga wrote:
Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga wrote:
Hi,
This will sound like a duplicate, but actually it differs from the
other one. Please bear with me. Following
http://wiki.apache.org/nutch/NutchTutorial, I first issued the
Thanks. *heads to the Solr list*
On 5/10/12 9:42 AM, Markus Jelsma wrote:
Hi,
This will sound like a duplicate, but actually it differs from the other
one. Please bear with me. Following
http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Then when I got the message
Excepti
Subject: [Fwd: RE: Weekly Report]
Hi,
It seems you have the same error as me. Did you solve it? If
yes, how?
Regards,
On 05/09/2012 05:04 PM, Stephan Kristyn wrote:
Hi, it seems like I forgot to fetch the crawled URLs, as mentioned
t 12:35 PM, Tolga wrote:
I've read that and done accordingly, I still get that error.
On 5/9/12 2:31 PM, Lewis John Mcgibbney wrote:
good to hear.
please see the tutorial for all required configuration
http://wiki.apache.org/nutch/NutchTutorial
On Wed, May 9, 2012 at 11:51 AM, Tolga wrote:
Dear Lewis,
Thanks a lot for your help. Now my crawler is indexing to Solr properly,
as requested. What I did was forget about the other tutorial, and follow
Nutch FAQ :)
Regards,
Dear Lewis,
I've done as you said, and it's beginning to work. Except that it's
complaining about http.agent.name not being set. The tutorial I
read states I don't need to fill it in, but apparently I do. What should
this be?
On 5/9/12 1:25 PM, Lewis John Mcgibbney wrote:
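For reference, http.agent.name is set in conf/nutch-site.xml; the string itself is up to you (the value below is a placeholder):

```xml
<!-- Sketch: the crawler must identify itself before fetching -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
```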
Sorry, there are actually .jar files under the directory, but I still
can't figure out what path to export to CLASSPATH
Original Message
Subject:CLASSPATH
Date: Wed, 09 May 2012 10:00:53 +0300
From: Tolga
To: user@nutch.apache.org
Hi,
This is my very first post to the list. In fact, I heard of nutch only
yesterday.
Anyway, I'm trying to figure out what path to export CLASSPATH to.
Tutorials tell me it needs to be where my .jar files are. However, there
are no .jar files under apache-nutch directory. So, please help me
Please remove me from the mailing list