Re: [Nutch-general] Nutch and distributed searching (w/ apologies)

2007-08-02 Thread Doğacan Güney
-servers.txt) and that you can bring down a single search server, update the index and pieces, and then bring the single search server back up. This way the entire index is never down. Hope this helps and let me know if you have any questions. Dennis Kubes -- Doğacan Güney

Re: [Nutch-general] Outlinks normalizer

2007-08-02 Thread Doğacan Güney
that, perhaps we can stop creating a ParseUtil instance for every ParseSegment.map [even though it has a smaller overhead]). -- Doğacan Güney - This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find

Re: [Nutch-general] slow generate process

2007-07-31 Thread Doğacan Güney
Luca Rondanini Doğacan Güney wrote: On 7/25/07, Luca Rondanini [EMAIL PROTECTED] wrote: this is my hadoop log(just in case): 2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment: /home/semantix/nutch

Re: [Nutch-general] Really big indexing and timeouts?

2007-07-31 Thread Doğacan Güney
to a slowish indexing filter like language-identifier?) Dennis -- Doğacan Güney

Re: [Nutch-general] eliminating almost duplicate URLs

2007-07-30 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] DownloadingNutch - svn co nutch nightly

2007-07-27 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] SOLVED? Re: NullPointerException fetching some sites with temp redirects

2007-07-26 Thread Doğacan Güney
create patches: http://wiki.apache.org/nutch/HowToContribute ) Cheers, Carl. -- Doğacan Güney

Re: [Nutch-general] SearchApp from Introduction to Nutch, Part 2: Searching

2007-07-25 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] IllegalArgumentException: plugin.folders is not defined

2007-07-25 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] How to use automaton-urlfilter.txt

2007-07-25 Thread Doğacan Güney
reading plugin's source code. Thanks -- Doğacan Güney

Re: [Nutch-general] NullPointerException fetching some sites with temp redirects

2007-07-25 Thread Doğacan Güney
think it will solve your problem but Content.java has changed recently so I am not sure what was in line 146. So, if problem reoccurs with latest trunk I can check exactly which line is failing. Alternatively, you can send that part of Content.java's code. Cheers, Carl. -- Doğacan Güney

Re: [Nutch-general] slow generate process

2007-07-25 Thread Doğacan Güney
(db_redir_perm): 4 CrawlDb statistics: done Luca Rondanini -- Doğacan Güney

Re: [Nutch-general] slow generate process

2007-07-25 Thread Doğacan Güney
Luca Rondanini -- Doğacan Güney -- Doğacan Güney

Re: [Nutch-general] RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

2007-07-25 Thread Doğacan Güney
there is such a difference and is there some way to eliminate part of this overhead? Regards, -- Marc -- Doğacan Güney

Re: [Nutch-general] RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?)

2007-07-25 Thread Doğacan Güney
. -Original Message- From: Doğacan Güney [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 25, 2007 2:44 PM To: [EMAIL PROTECTED] Subject: Re: RE : Nutch overhead to Lucene (or: why is Nutch 4 times slower than Lucene ?) On 7/25/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I

Re: [Nutch-general] RSS link extractor

2007-07-19 Thread Doğacan Güney
commit it. -- Doğacan Güney - This SF.net email is sponsored by DB2 Express Download DB2 Express C - the FREE version of DB2 express and take control of your XML. No limits. Just data. Click to get it now. http://sourceforge.net

Re: [Nutch-general] four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

2007-07-16 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] Search on Date range

2007-07-13 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] Indexing exits with Job Failed

2007-07-09 Thread Doğacan Güney
/hadoop.log or your tasktracker's log files and you should see a more detailed log about your problem. -- Doğacan Güney

Re: [Nutch-general] IOException using feed plugin - NUTCH-444

2007-06-29 Thread Doğacan Güney
- Original Message From: Doğacan Güney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, June 27, 2007 10:59:52 PM Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility On 6/28/07, Kai_testing Middleton [EMAIL PROTECTED

Re: [Nutch-general] Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-28 Thread Doğacan Güney
and it will work too. --Kai Middleton - Original Message From: Doğacan Güney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, June 22, 2007 1:39:12 AM Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility On 6/21/07, Kai_testing

Re: [Nutch-general] Stemming with Nutch

2007-06-28 Thread Doğacan Güney
appreciated :). Thanks Rob -- Doğacan Güney

Re: [Nutch-general] The ranking is wrong

2007-06-27 Thread Doğacan Güney
Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney

Re: [Nutch-general] NUTCH-505 - cannot find symbol: variable URL_VALIDATOR

2007-06-26 Thread Doğacan Güney
$ ant clean ant -- Doğacan Güney

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Doğacan Güney
. Click to get it now. http://sourceforge.net/powerbar/db2/ ___ Nutch-general mailing list Nutch-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/nutch-general -- Doğacan Güney

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Doğacan Güney
On 6/26/07, Sami Siren [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi, On 6/26/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Is this actually planned (addition of SolrIndexer to Nutch)? A search for SolrIndexer in JIRA got no hits. There is NUTCH-442 (one of the most popular

Re: [Nutch-general] Indexer NPE

2007-06-25 Thread Doğacan Güney
command parses the segment. Thanks -- Doğacan Güney

Re: [Nutch-general] search error

2007-06-24 Thread Doğacan Güney
is happening :( when i remove the recommended plugin from there the search.jsp page is displayed normally If you are using tomcat, please start it in 'run' mode(./catalina.sh run) and check if tomcat prints an exception. please help its really urgent -- Doğacan Güney

Re: [Nutch-general] search error

2007-06-24 Thread Doğacan Güney
help On 6/24/07, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/24/07, karan [EMAIL PROTECTED] wrote: hi i just tried to build the recommended plugin that is given in the plugin writing example when i included the plugin in the plugin.includes property the search.jsp nothing

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-24 Thread Doğacan Güney
on/reviewing NUTCH-505 would be a nice place to start :). -- Doğacan Güney

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-24 Thread Doğacan Güney
, with your fetcher patch applied. I will report back with the result when the process is done. - Original Message From: Doğacan Güney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, June 19, 2007 7:12:32 AM Subject: Re: Indexing problems in nutch-nightly Here is the patch

Re: [Nutch-general] Indexer NPE

2007-06-24 Thread Doğacan Güney
(NUTCH-504, rev 550196). See discussion here: http://www.nabble.com/Indexing-problems-in-nutch-nightly-tf3923427.html for why the problem occurs. Conf: 1 single machine Linux 2.6, Java 1.6 nutch nightly + hadoop 0.12.3 Thanks in advance for ur help -- Doğacan Güney

Re: [Nutch-general] Indexer NPE

2007-06-24 Thread Doğacan Güney
for ur help -- Doğacan Güney -- Doğacan Güney

Re: [Nutch-general] search error

2007-06-24 Thread Doğacan Güney
the parameters like the one above. On 6/24/07, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/24/07, karan [EMAIL PROTECTED] wrote: hey... thnx for reply tomcat in run mode does generate exceptions at the terminal :)..and the output shows the plugin is in the registered list of plugins

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-23 Thread Doğacan Güney
On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: These 'urls' most likely come from parse-js plugin. Can you disable it and see if they disappear? To extract links from js code, parse-js uses a heuristic that unfortunately also may extract garbage urls

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-23 Thread Doğacan Güney
On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: These 'urls' most likely come from parse-js plugin. Can you disable it and see if they disappear? To extract links from js code, parse

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-23 Thread Doğacan Güney
On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: On 6/23/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: On 6/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: These 'urls' most likely come from parse-js plugin. Can

Re: [Nutch-general] Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-22 Thread Doğacan Güney
knows. -- Doğacan Güney

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-22 Thread Doğacan Güney
-- Doğacan Güney

Re: [Nutch-general] OR searches possible?

2007-06-22 Thread Doğacan Güney
a little closer to the building of the Lucene query (and allow this behaviour) via a Nutch plugin? Andrzej Bialecki is working on this - see NUTCH-479. Thanks Rob -- Doğacan Güney

Re: [Nutch-general] Distributed index

2007-06-22 Thread Doğacan Güney
server I would love to hear about it. The former suggestions of space and architecture are what we have experienced. Dennis Kubes -- Doğacan Güney

Re: [Nutch-general] Lucene client and nutch index

2007-06-20 Thread Doğacan Güney
and your index size will grow very large. -Brian -- Doğacan Güney

Re: [Nutch-general] Lucene client and nutch index

2007-06-20 Thread Doğacan Güney
to access segmented data. Best regards, Ronny -Opprinnelig melding- Fra: Doğacan Güney [mailto:[EMAIL PROTECTED] Sendt: 20. juni 2007 08:14 Til: [EMAIL PROTECTED] Emne: Re: Lucene client and nutch index On 6/20/07, Naess, Ronny [EMAIL PROTECTED] wrote: I tried your tip Brian

Re: [Nutch-general] Performance: Fetcher2 or Fetcher

2007-06-20 Thread Doğacan Güney
is the best to do to gain in term of performance and to stay enough polite ? That's kind of between you and the server you are fetching but I wouldn't recommend a delay lower than 5 seconds. More tricks to gain performance are welcome E -- Doğacan Güney

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-19 Thread Doğacan Güney
-tf3788992.html I have put up a patchified version here: http://www.ceng.metu.edu.tr/~e1345172/segment_reader_hang.patch Can you retry with this patch? Thanks! -- Doğacan Güney -- Doğacan Güney

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-19 Thread Doğacan Güney
Here is the patch for the fetchers: http://www.ceng.metu.edu.tr/~e1345172/parse_in_fetchers.patch -- Doğacan Güney

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-18 Thread Doğacan Güney
fun:) Anyway, it seems you are running into the problem described here: http://www.nabble.com/bug-in-SegmentReader-tf3788992.html I have put up a patchified version here: http://www.ceng.metu.edu.tr/~e1345172/segment_reader_hang.patch Can you retry with this patch? Thanks! -- Doğacan Güney

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-18 Thread Doğacan Güney
EDT 2007 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 30.0 seconds (3.473E-4 days) Score: 1.0 Signature: c079280b4afb4347372982d5a034d51b Metadata: _ngt_:1181243348572 _pst_:success(1), lastModified=0 - Original Message From: Doğacan Güney

Re: [Nutch-general] Why Nutch is indexing HTTP 302 pages

2007-06-12 Thread Doğacan Güney
pages that return 200. You can fix this by putting status code in Content's Metadata then only parsing pages that have status code 200. (or, nutch stores page's headers in content's metadata. You can check if content's metadata has a location header). -- Doğacan Güney

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Doğacan Güney
is necessary for politeness). So, in your case, you either have very few hosts (of which one has almost 100K urls) or there is a problem with partitioning. Patrik [...snip...] -- Doğacan Güney

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-06-12 Thread Doğacan Güney
=./servlet/cached?%=id%link/a to download it directly. % } % You can get a url's ParseText with bean.getParseText(details). -- Doğacan Güney

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Doğacan Güney
to a value greater than 1 and you have a very impolite fetcher. Please don't run this to fetch a site you don't control :) -- Doğacan Güney

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-08 Thread Doğacan Güney
like this: int h = 0; Iterator<Entry<K,V>> i = entrySet().iterator(); while (i.hasNext()) h += i.next().hashCode(); return h; So if configuration's hashCode changes, CACHE's hashCode also changes. Thanks for the detailed analysis! Enzo -- Doğacan
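
The loop quoted in this snippet sums the hash codes of a map's entries, so the map's own hash code moves whenever an entry is added or changed. A minimal sketch of that behaviour, assuming a plain java.util.HashMap as a stand-in for the configuration object (the class name MapHashDemo and the keys "a" and "c" are illustrative, not taken from Nutch):

```java
import java.util.HashMap;
import java.util.Map;

public class MapHashDemo {
    // One entry: "a".hashCode() ^ "b".hashCode() = 97 ^ 98 = 3.
    static int hashBefore() {
        Map<String, String> conf = new HashMap<>();
        conf.put("a", "b");
        return conf.hashCode();
    }

    // Two entries: 3 + ("c".hashCode() ^ "d".hashCode()) = 3 + (99 ^ 100) = 10.
    static int hashAfter() {
        Map<String, String> conf = new HashMap<>();
        conf.put("a", "b");
        conf.put("c", "d");
        return conf.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(hashBefore()); // prints 3
        System.out.println(hashAfter());  // prints 10
    }
}
```

Any cache that uses such an object as a key will look in a different bucket once the key's contents change, which would explain a cache miss of the kind discussed in the thread.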

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-08 Thread Doğacan Güney
On 6/8/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: - Original Message - From: Doğacan Güney [EMAIL PROTECTED] Sent: Friday, June 08, 2007 3:49 PM [...] Any idea? This will certainly help a lot. If it is not too much trouble, can you add debug outputs for hashCodes of conf

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-08 Thread Doğacan Güney
On 6/8/07, Enzo Michelangeli [EMAIL PROTECTED] wrote: - Original Message - From: Doğacan Güney [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, June 08, 2007 8:27 PM Subject: Re: Loading mechnism of plugin classes and singleton objects [...] This is strange, because, as you

Re: [Nutch-general] Cookie

2007-06-07 Thread Doğacan Güney
to implement the management of cookie in Nutch. Thanks -- Doğacan Güney

Re: [Nutch-general] Cookie

2007-06-07 Thread Doğacan Güney
cookies across fetcher, well I am not sure how to do it:) Perhaps, you can write an extra job that puts the cookie to every datum from that host, then pick it up in fetcher. Or perhaps someone has a better idea :) Thanks -- Doğacan Güney

Re: [Nutch-general] Loading mechnism of plugin classes and singleton objects

2007-06-05 Thread Doğacan Güney
machine. (By the way, what does Until Nutch runtime mean here? Before Nutch runtime, no class whatsoever is supposed to be alive in the JVM, is it?) Enzo -- Doğacan Güney

Re: [Nutch-general] Compression

2007-06-03 Thread Doğacan Güney
Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney

Re: [Nutch-general] Compression

2007-06-03 Thread Doğacan Güney
code(at least for gzip) and do compression. It will just be very slow:). Can you create an OutputFormat in CrawlDbMerger and set compression type to BLOCK manually? You can take a look at ParseOutputFormat's code as an example. Any clues ? -- Doğacan Güney

Re: [Nutch-general] How to enable followRedirects?

2007-06-03 Thread Doğacan Güney
. Plugin protocol-httpclient uses commons-httpclient library, nutch disables redirects in this library because nutch handles redirects itself. -- Doğacan Güney

Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Doğacan Güney
minds are what make reality real -- Doğacan Güney

Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Doğacan Güney
Content-Type: text/html So, I'm lost. On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 6/1/07, Briggs [EMAIL PROTECTED] wrote: So, I have been having huge problems with parsing. It seems that many urls are being ignored because the parser plugins throw

Re: [Nutch-general] How to parse PDF files? Deferred parsing possible?

2007-05-31 Thread Doğacan Güney
(plugin.includes property). How can I make it parse these type of content while crawling? And if I run the fetch in non-parsing mode how can I make it parse them later and update it in crawl folder. Please help. -- Doğacan Güney

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Doğacan Güney
or point me to proper articles or wiki where I can learn this. On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote: On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote: Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Doğacan Güney
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi everyone, Has anyone tried Fetcher2 from latest trunk? On our tests, Fetcher2 is always slower (by a large margin) than Fetcher

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Doğacan Güney
On 5/31/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: I am still not sure about the source of this bug, but I think I found some unnecessary waits in Fetcher2. Even if a url is blocked by robots.txt (or has a crawl delay larger than max.crawl.delay), Fetcher2 still

Re: [Nutch-general] What is parse-oo and why doesn't parsed PDF content show up in cached.jsp ?

2007-05-31 Thread Doğacan Güney
to display the parsed content of the PDF instead of this message? As its name implies, cached content shows url's content:) . What you want to see is its parse text. Nutch doesn't do this but it is simple to change it so that it reads from segment/parse_text instead of segment/content . -- Doğacan Güney

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Doğacan Güney
? Are there other URL filters? If so, in what order are the filters called? -- Doğacan Güney

Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Doğacan Güney
) at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477) -- Doğacan Güney

Re: [Nutch-general] Nutch crawls blocked sites - Why?

2007-05-28 Thread Doğacan Güney
, what do I need to check. Please help. In your case, crawl-urlfilter.txt is not read because you are not running 'crawl' command (as in bin/nutch crawl). You have to update regex-urlfilter.txt or prefix-urlfilter.txt and make sure that you enable them in your conf. -- Doğacan Güney

Re: [Nutch-general] Clustered crawl

2007-05-26 Thread Doğacan Güney
, but could take up to a week to complete. The new cluster was supposed to fix that and make this easier... It looks like your problem is related to https://issues.apache.org/jira/browse/NUTCH-246 . Jeff -Original Message- From: Doğacan Güney [mailto:[EMAIL PROTECTED] Sent: Friday, May 25

Re: [Nutch-general] Clustered crawl

2007-05-25 Thread Doğacan Güney
fetch those urls?) Thanks for the help. Jeff -- Doğacan Güney

Re: [Nutch-general] Fetcher2 slowness?

2007-05-24 Thread Doğacan Güney
] -- Dogacan Güney -- Doğacan Güney

Re: [Nutch-general] Fetcher2 slowness?

2007-05-24 Thread Doğacan Güney
are fetching? I have ~3 urls with ~1000 hosts. Hosts have at most 500 urls and there are 23 hosts that have 500 urls. I generally run Fetcher with 100-200 threads and Fetcher2 with 50 threads. -vishal. [snip] -- Doğacan Güney

Re: [Nutch-general] some pdf's are not parsed

2007-05-23 Thread Doğacan Güney
http.content.limit. And parse-pdf can't parse partial pdf files. -- Doğacan Güney

Re: [Nutch-general] Fetcher2 slowness?

2007-05-23 Thread Doğacan Güney
(Fetcher finished in 1 hour, Fetcher2 in about 2.5). Though I have performed other tests where their performance is similar(and I have no idea why). I am trying to find the cause of problem, but so far, had no luck. Otis [snip] -- Doğacan Güney

Re: [Nutch-general] Fetcher2 slowness?

2007-05-18 Thread Doğacan Güney
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi everyone, Has anyone tried Fetcher2 from latest trunk? On our tests, Fetcher2 is always slower (by a large margin) than Fetcher. For a segment with ~3 urls, we ran Fetcher with 150 threads and Fetcher2

Re: [Nutch-general] readseg bug?

2007-05-17 Thread Doğacan Güney
things) calculate score. There are 6 CrawlDatum fields and all of them are exactly identical. Is this a bug or am I missing something here? Any light on this matter would be greatly appreciated. Thank you. Florent -- Doğacan Güney

Re: [Nutch-general] Type:PDF

2007-05-16 Thread Doğacan Güney
to get any result when you use a query type:pdf with the webapp on your index ? yes. type:pdf something cu *pike -- Doğacan Güney

Re: [Nutch-general] Type:PDF

2007-05-14 Thread Doğacan Güney
of question, I would suggest using Luke (http://www.getopt.org/luke/). You can view each document to check whether type field is indexed correctly, then you can do a search in Luke to see if that works. -- Doğacan Güney

Re: [Nutch-general] Nutch Crawling error

2007-05-14 Thread Doğacan Güney
] wrote: It should look like this but change out domain for your domain. Try this and let me know if it works. 127.0.0.1 dhcppc0.domain.com dhcppc0 localhost.localdomain localhost Dennis Kubes -- Doğacan Güney

Re: [Nutch-general] java.net.MalformedURLException: unknown protocol: s

2007-05-02 Thread Doğacan Güney
to feed these urls to java.net.URL you get this exception. It is not a big deal (computation continues ignoring that url candidate) though it may be a bit annoying. [snip] -- Doğacan Güney
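
The behaviour described in this snippet can be reproduced with java.net.URL alone: a candidate whose scheme has no registered handler is rejected with a MalformedURLException, and the caller can skip it and carry on. A minimal sketch, assuming a hypothetical helper class UrlCandidateCheck and the made-up candidate "s:foo" (neither comes from Nutch):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCandidateCheck {
    // Returns true if java.net.URL accepts the candidate, false if it
    // throws MalformedURLException (for example, for an unknown protocol).
    static boolean isParsable(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // Ignore the bad candidate; the caller moves on to the next one.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsable("http://example.com/"));
        System.out.println(isParsable("s:foo"));
    }
}
```

Swallowing the exception for a single candidate mirrors what the message reports: the computation continues without that url.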

Re: [Nutch-general] Plugin to index categories by url rules

2007-04-25 Thread Doğacan Güney
be appreciated. Thanks! -- View this message in context: http://www.nabble.com/Plugin-to-index-categories-by-url-rules-tf3621139.html#a10112854 Sent from the Nutch - User mailing list archive at Nabble.com. -- Doğacan Güney

Re: [Nutch-general] Nutch 0.8.1 problems

2007-02-21 Thread Doğacan Güney
) will assume that you are accessing /user/username/relative_path. You either have to put your crawldb there or configure nutch to use local fs or change generate's arguments. [snip] -- Doğacan Güney - Take Surveys. Earn Cash

Re: [Nutch-general] Nutch 0.8.1 problems

2007-02-21 Thread Doğacan Güney
what the problem is then. Can you include the output of commands: hadoop dfs -ls /nutch/filesystem/crawl/ hadoop dfs -ls /nutch/filesystem/crawl/crawldb Any other ideas? -- Oleg. -- Doğacan Güney

Re: [Nutch-general] focused crawls -- where to add parse filter

2007-02-18 Thread Doğacan Güney
on a different sort value. The second part can be written with a different scoring plugin. Simply put whatever it is you need in CrawlDatum's metadata then change ScoringFilter.generatorSortValue to look up that value and give a good/bad score. [snip] -- Doğacan Güney

Re: [Nutch-general] focused crawls -- where to add parse filter

2007-02-17 Thread Doğacan Güney
the indexerScore method to give it an even higher boost. -Brian -- Doğacan Güney

Re: [Nutch-general] Need help with deleteduplicates

2006-12-27 Thread Doğacan Güney
sdeck wrote: That sort of gets me there in understanding what is going on. Still not all the way though. So, let's look at the trunk of deleteduplicates: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java No where in there do I see

Re: [Nutch-general] errors with parsing and indexing

2006-12-14 Thread Doğacan Güney
Doğacan Güney wrote: Hi, After hadoop-0.9.1, parsing and indexing doesn't seem to work. If you parse while fetching then it is fine, but if you run parse as a different job, it creates an essentially empty parse_data directory(which has index files, but doesn't have data files). I am looking

Re: [Nutch-general] Getting size and mime type info from Hits

2006-12-07 Thread Doğacan Güney
Daniel López wrote: Hi again, I finally ignored the RTF and MP3 plugins and was able to compile Nutch from scratch and then proceeded to create my own web search application. I get it up and running and I'm now displaying the same information as the demo search pages that come with

Re: [Nutch-general] httpclient fetcher error in hadoop log

2006-08-31 Thread Doğacan Güney
Hi, Feng Ji wrote: hi there, I got the huge percentage of fetching error for httpclient in hadoop log as followings: httpclient.HttpMethodDirector : httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled : I am not sure if this is an error. Plugin