Re: Best practices for running Nutch

2012-11-18 Thread Muzaffer Tolga Özses
I'd also be very much interested in knowing these! On 11/18/2012 07:32 PM, kiran chitturi wrote: Hi! I have been running crawls using Nutch for 13000 documents (protocol http) on a single machine, and it takes 2-3 days to finish. I am using the 2.x version of Nutch. I use a depth of
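
On a single machine crawling one host, throughput is usually bounded by Nutch's per-host politeness settings rather than by hardware. A minimal sketch of the knobs involved, for conf/nutch-site.xml; the property names are from nutch-default.xml and the values are illustrative only, not recommendations:

    <property>
      <name>fetcher.server.delay</name>
      <value>1.0</value>
      <description>Seconds the fetcher waits between successive requests to the same host.</description>
    </property>
    <property>
      <name>fetcher.threads.per.queue</name>
      <value>2</value>
      <description>Concurrent fetches allowed against a single host queue.</description>
    </property>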

Re: Search in specific website

2012-10-16 Thread Tolga
Solr sent me to Nutch list, but okay. Thanks, On 10/16/2012 02:27 PM, Lewis John Mcgibbney wrote: Hi Tolga, Please take this to the Solr user@ list. Thank you Lewis On Tue, Oct 16, 2012 at 12:13 PM, Tolga wrote: Hi, I've tried url:fass\.sabanciuniv\.edu AND content:this, and

Re: Search in specific website

2012-10-16 Thread Tolga
more details of what you have tried and what issues you are having. On Fri, Oct 12, 2012 at 5:03 PM, Tolga wrote: Not really. Let me elaborate. If I pass it multiple URLs such as http://example.com, example.net and example.org, how can I search only in example.net? Regards, On 12 October 2012 23:55, Teja

Re: Search in specific website

2012-10-12 Thread Tolga
Not really. Let me elaborate. If I pass it multiple URLs such as http://example.com, example.net and example.org, how can I search only in example.net? Regards, On 12 October 2012 23:55, Tejas Patil wrote: > Hi Tolga, > > For searching a specific content from a specific website, crawl it firs

Search in specific website

2012-10-12 Thread Tolga
Hi, I use nutch to crawl my website and index to solr. However, how can I search for a piece of content in a specific website? I use multiple URLs. Regards,
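
If all the sites share one Solr index, one hedged option is to restrict the search at query time with a filter query on a per-host field; the field names here assume the schema.xml shipped with Nutch (index-basic adds host and url) and the core URL assumes the single-core tutorial setup:

    curl "http://localhost:8983/solr/select?q=content:this&fq=host:example.net"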

Re: Error adding title

2012-10-05 Thread Tolga
On 10/05/2012 04:22 PM, Asha Chhikara wrote: which type of field is it. On Fri, Oct 5, 2012 at 6:40 PM, Tolga wrote: In solr schema.xml On 10/05/2012 03:41 PM, Asha Chhikara wrote: where is it present. On Fri, Oct 5, 2012 at 6:04 PM, Tolga wrote: Hi, What is meant by "

Re: Error adding title

2012-10-05 Thread Tolga
In solr schema.xml On 10/05/2012 03:41 PM, Asha Chhikara wrote: where is it present. On Fri, Oct 5, 2012 at 6:04 PM, Tolga wrote: Hi, What is meant by "ERROR: [doc=http://bilgisayarciniz.org/] Error adding field 'title'='bilgisaya

Error adding title

2012-10-05 Thread Tolga
Hi, What is meant by "ERROR: [doc=http://bilgisayarciniz.org/] Error adding field 'title'='bilgisayarciniz web hizmetleri'"? I have title defined as a field; it also has the multiValued property. Regards,
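
The exact cause depends on the schema, but a declaration roughly along these lines in Solr's schema.xml is the usual shape for a Nutch title field; the type name "text" follows the schema.xml shipped with Nutch and may differ in yours:

    <field name="title" type="text" stored="true" indexed="true" multiValued="true"/>

If the error persists after a schema change, remember that Solr has to be restarted (or the core reloaded) before the new definition takes effect.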

Fwd: Nutch and CAS

2012-10-02 Thread Tolga
Sorry for reposting, but it must have been missed. Original Message Subject:Nutch and CAS Date: Thu, 27 Sep 2012 15:37:04 +0300 From: Tolga To: user@nutch.apache.org Hi, Can Nutch crawl a website behind CAS (Central Authentication System)? Regards,

Nutch and CAS

2012-09-27 Thread Tolga
Hi, Can Nutch crawl a website behind CAS (Central Authentication System)? Regards,

Re: Crawl errors

2012-09-05 Thread Tolga
Oh, and by the way, I have the 'title' field. On 09/05/2012 04:48 PM, Tolga wrote: I changed the encoding to ISO-8859-9 and restarted Solr, it didn't work :S Below is the full error: SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=http://www.sabanciuniv.edu/] Error addi

Re: Crawl errors

2012-09-05 Thread Tolga
bancı Üniversitesi at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:297) ... 26 more Regards, On 09/05/2012 02:14 PM, Lewis John Mcgibbney wrote: Most likely On Wed, Sep 5, 2012 at 12:12 PM, Tolga wrote: Sorry for replying to this, I can start a new thread if

Re: Crawl errors

2012-09-05 Thread Tolga
e various tutorials here to get you going http://wiki.apache.org/nutch/#Nutch_2.X_tutorial.28s.29 hth Lewis On Tue, Sep 4, 2012 at 2:27 PM, Tolga wrote: Hi, I set up my Solr, and when I tried to crawl my website, it gave me [mtozses@atlas bin]$ time ./nutch crawl urls -dir crawl-$(date +%FT%H

Crawl errors

2012-09-04 Thread Tolga
Hi, I set up my Solr, and when I tried to crawl my website, it gave me [mtozses@atlas bin]$ time ./nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 5 -topN 5 Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQ
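
In Nutch 2.x a GoraException wrapping a java.sql.SQLException usually points at the Gora storage backend rather than at the crawl itself. A minimal sketch of conf/gora.properties for the default gora-sql/HSQLDB store; the JDBC URL and credentials are the stock example values and are assumptions for any real setup:

    gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
    gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
    gora.sqlstore.jdbc.user=sa
    gora.sqlstore.jdbc.password=

The HSQLDB server named in the URL has to be running before the crawl starts, otherwise the injector can fail with exactly this kind of wrapped SQLException.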

Re: bin/nutch

2012-08-28 Thread Tolga
What brackets? I don't see brackets. On 08/28/2012 03:39 PM, Lewis John Mcgibbney wrote: I assume you are not using the brackets? This command should most certainly work. Lewis On Tue, Aug 28, 2012 at 7:48 AM, Tolga wrote: "ant runtime" gives me: Buildfile: build.xml

Re: bin/nutch

2012-08-27 Thread Tolga
You should check out the following tutorials below http://wiki.apache.org/nutch/Nutch2Tutorial http://nlp.solutions.asia/?p=180 Lewis On Mon, Aug 27, 2012 at 12:34 PM, Tolga wrote: Hi, and thanks for your fast reply. I found a tutorial on the interwebz, and it said to use ant in $NUTCH. However

Re: bin/nutch

2012-08-27 Thread Tolga
elow http://wiki.apache.org/nutch/Nutch2Tutorial http://nlp.solutions.asia/?p=180 Lewis On Mon, Aug 27, 2012 at 12:34 PM, Tolga wrote: Hi, and thanks for your fast reply. I found a tutorial on the interwebz, and it said to use ant in $NUTCH. However, when I used it, I got: [mtozses@atlas NUTCH]

Re: bin/nutch

2012-08-27 Thread Tolga
Hi, and thanks for your fast reply. I found a tutorial on the interwebz, and it said to use ant in $NUTCH. However, when I used it, I got: [mtozses@atlas NUTCH]$ time ant Buildfile: build.xml [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found

bin/nutch

2012-08-26 Thread Tolga
Hi, I can find only src/bin/nutch in 2.0-src.zip. There's no bin/nutch. Does that mean I have to compile it? Regards,
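
Yes: the 2.0 source release has to be built before there is a runnable script. A sketch of the usual steps, assuming the archive was unpacked into apache-nutch-2.0:

    cd apache-nutch-2.0
    ant runtime
    runtime/local/bin/nutch

After "ant runtime" finishes, the working installation (including bin/nutch and its conf/ directory) lives under runtime/local/, not at the top of the source tree.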

Re: parse.ParserFactory

2012-05-29 Thread Tolga
NUTCH_HOME/conf and not in NUTCH_HOME/runtime/local/conf (see tutorial on WIKI) On 29 May 2012 07:31, Tolga wrote: Hi, I know this issue should have been closed, but I thought I'd continue this rather than starting a new thread. Anyway, I'm getting this: parse.ParserFactory - Par

Re: parse.ParserFactory

2012-05-29 Thread Tolga
and not nutch-default and my bet is that your are doing this in NUTCH_HOME/conf and not in NUTCH_HOME/runtime/local/conf (see tutorial on WIKI) On 29 May 2012 07:31, Tolga wrote: Hi, I know this issue should have been closed, but I thought I'd continue this rather than starting a new t

Re: parse.ParserFactory

2012-05-28 Thread Tolga
On Tue, May 22, 2012 at 9:20 PM, Tolga wrote: Hi, I crawl / index PDF files just fine, but I get the following warning. parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType application/pdf via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml.

Re: Using Nutch for Web Site Mirroring

2012-05-25 Thread Tolga
Hi, Do you have to use Nutch for this purpose? I believe you can use wget -m http://www.example.com and get everything in a much more structured way. On 25 May 2012 11:07, vlad.paunescu wrote: > Hello, > > I am currently trying to use Nutch as a web site mirroring tool. To be more > explicit, I only

XML parsing

2012-05-24 Thread Tolga
Hi, Isn't tika responsible for XML parsing? Because I got this: parse.ParserFactory - ParserFactory: Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml. Should I just include
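
Tika does handle generic XML, but the warning above is about the feed parser specifically: parse-plugins.xml maps application/rss+xml to org.apache.nutch.parse.feed.FeedParser, and that parser only runs if its plugin is listed in plugin.includes. A hedged sketch for conf/nutch-site.xml, assuming the plugin id for the feed parser is "feed" (check the plugin folder name in your release); the rest of the value mirrors the default:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>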

Re: Large website not fully crawled

2012-05-24 Thread Tolga
ctfully suggest that you go through the basic information that is available online to get familiar with Nutch. Copying the online information into this mailing list is not helping anybody. On Thu, May 24, 2012 at 10:19 AM, Tolga wrote: On 5/24/12 11:00 AM, Piet van Remortel wrote: On Thu, Ma

Re: Large website not fully crawled

2012-05-24 Thread Tolga
On 5/24/12 11:00 AM, Piet van Remortel wrote: On Thu, May 24, 2012 at 9:35 AM, Tolga wrote: - I don't fully understand the use of the topN parameter. Should I increase it? yes What would a sensible topN value be for a large university website? - You mean the parse-pdf thing? I'v

Re: Large website not fully crawled

2012-05-24 Thread Tolga
hadoop logs have overrun the local disk on which the crawler was running (i.e. disk full) Piet On Thu, May 24, 2012 at 9:17 AM, Tolga wrote: Hi, I am crawling a large website, which is our university's. From the logs and some grep'ing, I see that some pdf files were not crawled. W

Large website not fully crawled

2012-05-24 Thread Tolga
Hi, I am crawling a large website, which is our university's. From the logs and some grep'ing, I see that some pdf files were not crawled. Why could this happen? I'm crawling with -depth 100 -topN 5. Regards,
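
One likely reason, independent of PDF parsing, is the topN value: topN caps how many of the top-scoring URLs are generated for fetching in each round, so -depth 100 -topN 5 can never fetch more than roughly 500 pages in total. A sketch with a larger per-round budget; the numbers are illustrative only:

    bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 10 -topN 1000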

Re: Apparently far from last question :)

2012-05-23 Thread Tolga
le. If you look at the XML file in question, one of the first XML configuration blocks says Just remove your unnecessary config and Tika will do the work for you :0) Lewis On Wed, May 23, 2012 at 11:44 AM, Tolga wrote: Hi, I put the lines in parse-plugins.

Re: Apparently far from last question :)

2012-05-23 Thread Tolga
first XML configuration blocks says Just remove your unnecessary config and Tika will do the work for you :0) Lewis On Wed, May 23, 2012 at 11:44 AM, Tolga wrote: Hi, I put the lines in parse-plugins.xml, but I still can't crawl xls files. Why is that? Regards,

Apparently far from last question :)

2012-05-23 Thread Tolga
Hi, I put the lines in parse-plugins.xml, but I still can't crawl xls files. Why is that? Regards,

Re: One last question

2012-05-23 Thread Tolga
Yes, a redirect. On 5/23/12 11:37 AM, Lewis John Mcgibbney wrote: Can you please elaborate on a re-write rule? Do you mean a redirect? On Wed, May 23, 2012 at 7:39 AM, Tolga wrote: Thank you all, especially Lewis, Markus, and whomever I might have forgotten! It is working; I can crawl, index

One last question

2012-05-22 Thread Tolga
Thank you all, especially Lewis, Markus, and whomever I might have forgotten! It is working; I can crawl, index and search. One last question though. On my drupal website, I am redirecting www.example.com to example.com. However, I noticed that nutch doesn't crawl the web site if there is a re
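
Whether the redirect is followed within the same fetch round is controlled by http.redirect.max; the default of 0 only records the redirect target for a later round instead of following it immediately. A hedged sketch for conf/nutch-site.xml, with an illustrative value:

    <property>
      <name>http.redirect.max</name>
      <value>3</value>
    </property>

With the default setting the redirected target should still be picked up eventually, provided the non-www form is not excluded by the URL filters.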

parse.ParserFactory

2012-05-22 Thread Tolga
Hi, I crawl / index PDF files just fine, but I get the following warning. parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType application/pdf via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml. I've got the value protocol-http|urlfilter-
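
The warning means the mapping and the enabled plugin list disagree: parse-plugins.xml routes application/pdf to parse-pdf, but plugin.includes only enables parse-(html|tika). Since parse-tika also handles PDF, the simpler route suggested in this thread is to keep the default plugin.includes, roughly:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

and let the stock parse-plugins.xml (the version linked later in this thread) map application/pdf to parse-tika, rather than enabling parse-pdf separately.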

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
to conf/parse-plugins.xml? Regards, On 5/22/12 2:06 PM, Piet van Remortel wrote: another option is protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic) which uses Tika, which parses PDF. On Tue, May 22, 2012 at 1:00 PM,

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
nvisage to be obtained during your crawl. The first option has the downside that on occasion the parser can choke on rather large files... On Tue, May 22, 2012 at 10:36 AM, Tolga wrote: What is that value's unit? kilobytes? My PDF file is 4.7mb. On 5/22/12 12:34 PM, Lewis John Mcgibbney wro
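
http.content.limit is in bytes, and the default of 65536 (64 KB) will truncate a 4.7 MB PDF long before Tika sees most of it. A sketch of raising it in conf/nutch-site.xml; the exact value is a judgment call, and -1 disables the limit entirely (with the caveat mentioned above that very large files can choke the parser):

    <property>
      <name>http.content.limit</name>
      <value>10485760</value> <!-- 10 MB -->
    </property>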

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
to use the http.verbose and fetcher.verbose properties as well. On Tue, May 22, 2012 at 10:31 AM, Tolga wrote: The value is 65536 On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote: try your http.content.limit and also make sure that you haven't changed anything within the tika mimeType mappings. On Tu

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
The value is 65536 On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote: try your http.content.limit and also make sure that you haven't changed anything within the tika mimeType mappings. On Tue, May 22, 2012 at 9:06 AM, Tolga wrote: Sorry, I forgot to also add my original problem. PDF file

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
Hmm, okay. I never touched that file. On 5/22/12 12:26 PM, Lewis John Mcgibbney wrote: Sorry I should have been more explicit about the exact file location http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml hth On Tue, May 22, 2012 at 10:19 AM, Tolga wrote: By tika

Re: PDF not crawled/indexed

2012-05-22 Thread Tolga
By tika mimeType settings, do you mean protocol-http? On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote: try your http.content.limit and also make sure that you haven't changed anything within the tika mimeType mappings. On Tue, May 22, 2012 at 9:06 AM, Tolga wrote: Sorry, I forgot to

Fwd: PDF not crawled/indexed

2012-05-22 Thread Tolga
Sorry, I forgot to also add my original problem. PDF files are not crawled. I even modified -topN to be 10. Original Message Subject:PDF not crawled/indexed Date: Tue, 22 May 2012 10:48:15 +0300 From: Tolga To: user@nutch.apache.org Hi, I am crawling my

PDF not crawled/indexed

2012-05-22 Thread Tolga
Hi, I am crawling my website with this command: bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr http://localhost:8983/solr/ -depth 20 -topN 5 Is it a good idea to modify the directory name? Should I always delete indexes prior to crawling and stick to the same directory name? Re
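
Nothing requires a new directory per run; keeping one directory lets the crawldb inside it accumulate state across runs instead of starting from scratch each time. A sketch, assuming the one-shot crawl command is still being used (behaviour with an existing directory can vary between Nutch versions):

    bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 20 -topN 50

Deleting the Solr index beforehand is not normally needed either, since re-indexed documents overwrite the old ones by their unique key.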

Crawl / index files as well

2012-05-21 Thread Tolga
Okay I'm coming to the end of my questions. Do I need to read http://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F to index files as well on a web site? Thanks a lot :)
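
Yes, that FAQ entry is the right direction: local files are fetched with the protocol-file plugin and seeded with file: URLs. A hedged sketch of the plugin.includes change for conf/nutch-site.xml (the rest of the value mirrors the default):

    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

The regex URL filters also need to allow file: URLs, since the default rules are written with http in mind.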

org.apache.solr.common.SolrException: ERROR: [doc=null] missing required field: id

2012-05-21 Thread Tolga
Hi, I am getting this error while crawling my website with nutch: [doc=null] missing required field: id request: http://localhost:8983/solr/update?wt=javabin&version=2 at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430) at org.apache.solr.client.so
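
This usually means documents are reaching Solr without a value for the field that schema.xml marks as required. With Nutch the id is normally the page URL, and the schema.xml shipped with Nutch declares it roughly along these lines (the required attribute is what triggers the error above):

    <field name="id" type="string" stored="true" indexed="true" required="true"/>
    ...
    <uniqueKey>id</uniqueKey>

A common fix is simply to copy Nutch's conf/schema.xml over the Solr example schema and restart Solr, so the fields Nutch sends and the fields Solr expects match.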

Re: ERROR solr.SolrIndexer - java.io.IOException: Job failed!

2012-05-17 Thread Tolga
Hi Cameron, I've been dealing with the same issue, and taking care of it by adding the field, in your case 'site', to solr schema.xml, and restarting solr. On 5/18/12 7:58 AM, cameron tran wrote: Hello I am trying to get Nutch 1.4 (downloaded binary) to do solrindex to http://127.0.0.1:8983/

Re: HTTP error 400

2012-05-17 Thread Tolga
On Tuesday 15 May 2012 13:40:26 Tolga wrote: I'm a little confused. How can I not use the crawl command and execute the separate crawl cycle commands at the same time? Regards, On 5/11/12 9:40 AM, Markus Jelsma wrote: Ah, that means don't use the crawl command and do a little shell
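
"Not using the crawl command" just means running the individual steps of one crawl round yourself instead of the all-in-one wrapper. A sketch of the cycle from the NutchTutorial; paths are illustrative and the exact solrindex arguments differ slightly between Nutch versions:

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments -topN 5
    s1=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s1
    bin/nutch parse $s1
    bin/nutch updatedb crawl/crawldb $s1
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Repeating generate/fetch/parse/updatedb deepens the crawl one level per pass, which is what the -depth argument of the wrapper does internally.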

curl or nutch

2012-05-16 Thread Tolga
Hi, I have been trying for a week. I really want to get a start, so what should I use? curl or nutch? I want to be able to index pdf, xml etc. and search within them as well. Regards,

solrindex

2012-05-15 Thread Tolga
I'm going nuts. I issued the command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5, went on to http://localhost:8983/solr/admin/stats.jsp and verified the index, but can't search within a web page. What am I doing wrong? Regards,
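
Before blaming the query syntax it is worth confirming that page text actually reached the index. A hedged check, assuming the single-core setup from the tutorial and the content field from Nutch's schema (the search phrase is only a placeholder):

    curl "http://localhost:8983/solr/select?q=content:somephrase&wt=json&fl=url,title"

If that returns numFound=0 for a phrase that is definitely on the page, the problem is on the indexing side (parsing, content limit, or the solrindex step) rather than in how the search is written.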

Re: HTTP error 400

2012-05-15 Thread Tolga
PM, Markus Jelsma wrote: Please follow the step-by-step tutorial, it's explained there: http://wiki.apache.org/nutch/NutchTutorial On Tuesday 15 May 2012 13:40:26 Tolga wrote: I'm a little confused. How can I not use the crawl command and execute the separate crawl cycle commands at the

Re: HTTP error 400

2012-05-15 Thread Tolga
on't crawl? On 5/15/12 2:05 PM, Markus Jelsma wrote: Please follow the step-by-step tutorial, it's explained there: http://wiki.apache.org/nutch/NutchTutorial On Tuesday 15 May 2012 13:40:26 Tolga wrote: I'm a little confused. How can I not use the crawl command and execute

Re: HTTP error 400

2012-05-15 Thread Tolga
e commands, see the nutch wiki for examples. And don't do solrdedup. Search the Solr wiki for deduplication. cheers On Fri, 11 May 2012 07:39:36 +0300, Tolga wrote: Hi, How do I exactly "omit solrdedup and use Solr's internal deduplication" instead? I don't even know

Re: HTTP error 400

2012-05-10 Thread Tolga
omit solrdedup and use Solr's internal deduplication instead, it works similar and uses the same signature algorithm as Nutch has. Please consult the Solr wiki page on deduplication. Good luck On Thu, 10 May 2012 22:54:37 +0300, Tolga wrote: Hi Markus, On 05/10/2012 09:42 AM, Markus Jelsma w
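
For reference, the Solr-side deduplication Markus describes is configured in solrconfig.xml as an update processor chain; this is a sketch modelled on the Solr deduplication wiki page, and it assumes a signature field exists in schema.xml and that the chain is wired into the /update handler via update.chain:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <bool name="overwriteDupes">true</bool>
        <str name="fields">content</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

TextProfileSignature is the same fuzzy signature Nutch's own dedup uses, which is why the two approaches behave similarly.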

Re: HTTP error 400

2012-05-10 Thread Tolga
Hi Markus, On 05/10/2012 09:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300, Tolga wrote: Hi, This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTutorial, I first issued the

Re: HTTP error 400

2012-05-10 Thread Tolga
Thanks. *heads to the Solr list* On 5/10/12 9:42 AM, Markus Jelsma wrote: Hi, On Thu, 10 May 2012 09:10:04 +0300, Tolga wrote: Hi, This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTutorial, I first

HTTP error 400

2012-05-09 Thread Tolga
Hi, This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTutorial, I first issued the command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 Then when I got the message Excepti

Re: HTTP ERROR 400

2012-05-09 Thread Tolga
Hi, It seems you have the same error as me. Did you solve it? If yes, how? Regards, On 05/09/2012 05:04 PM, Stephan Kristyn wrote: Hi, it seems like I forgot to fetch the crawled URLs, as mentioned

Re: CLASSPATH

2012-05-09 Thread Tolga
t 12:35 PM, Tolga wrote: I've read that and done accordingly, I still get that error. On 5/9/12 2:31 PM, Lewis John Mcgibbney wrote: good to hear. please see the tutorial for all required configuration http://wiki.apache.org/nutch/NutchTutorial On Wed, May 9, 2012 at 11:51 AM, Tolga wro

Working!

2012-05-09 Thread Tolga
Dear Lewis, Thanks a lot for your help. Now my crawler is indexing to Solr properly, as requested. What I did was forget about the other tutorial, and follow Nutch FAQ :) Regards,

Re: CLASSPATH

2012-05-09 Thread Tolga
I've read that and done accordingly, I still get that error. On 5/9/12 2:31 PM, Lewis John Mcgibbney wrote: good to hear. please see the tutorial for all required configuration http://wiki.apache.org/nutch/NutchTutorial On Wed, May 9, 2012 at 11:51 AM, Tolga wrote: Dear Lewis, I

Re: CLASSPATH

2012-05-09 Thread Tolga
Dear Lewis, I've done as you said, and it's beginning to work. Except that it's complaining about http.agent.name not having been set. The tutorial I read states I don't need to fill it in, but apparently I do. What should this be? On 5/9/12 1:25 PM, Lewis John Mcgibbney w
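
The value can be any string identifying the crawler; Nutch refuses to fetch until it is non-empty. A minimal sketch for conf/nutch-site.xml (the name shown is only an example):

    <property>
      <name>http.agent.name</name>
      <value>MyUniversityCrawler</value>
    </property>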

Fwd: CLASSPATH

2012-05-09 Thread Tolga
Sorry, there are actually .jar files under the directory, but I still can't figure out what path to export to CLASSPATH Original Message Subject:CLASSPATH Date: Wed, 09 May 2012 10:00:53 +0300 From: Tolga To: user@nutch.apache.org Hi, This is my very

CLASSPATH

2012-05-09 Thread Tolga
Hi, This is my very first post to the list. In fact, I heard of nutch only yesterday. Anyway, I'm trying to figure out what path to export CLASSPATH to. Tutorials tell me it needs to be where my .jar files are. However, there are no .jar files under apache-nutch directory. So, please help me

Please remove me from the mailing list

2011-06-12 Thread Tolga Soyata
Please remove me from the mailing list