Re: [Nutch-general] Pull out a page from already processed pages, re-parse and replace

2007-07-26 Thread Andrzej Bialecki
to be altered to achieve this? Just remove the following directories from each segment: crawl_parse, parse_text, parse_data, and then run bin/nutch parse on these segments. -- Best regards, Andrzej Bialecki
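
The cleanup step described above could be scripted roughly as follows. This is only a sketch: the function name clear_parse_output and the layout (one sub-directory per segment under a common root, on a local filesystem) are assumptions, not part of Nutch. Running bin/nutch parse on each segment afterwards is left to the operator.

```python
import shutil
from pathlib import Path

# Parse output directories named in the reply above.
PARSE_DIRS = ("crawl_parse", "parse_text", "parse_data")


def clear_parse_output(segments_root):
    """Delete the parse output of every segment under segments_root.

    Assumes a local filesystem with one sub-directory per segment.
    Returns the removed directories so the caller can log them.
    """
    removed = []
    for segment in sorted(Path(segments_root).iterdir()):
        if not segment.is_dir():
            continue
        for name in PARSE_DIRS:
            target = segment / name
            if target.is_dir():
                shutil.rmtree(target)
                removed.append(target)
    return removed
```

After this, each segment still holds its fetched content and can be parsed again with bin/nutch parse.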

Re: [Nutch-general] CrawlDbReader TopN

2007-07-25 Thread Andrzej Bialecki
- it's equivalent to IdentityReducer, which is used implicitly by this job. This class is a leftover from the time, when it contained also some filtering code. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information

Re: [Nutch-general] four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

2007-07-16 Thread Andrzej Bialecki
, and mergesegs to merge segments ;) And a simple merge merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Restricting crawl to a certain topic

2007-07-12 Thread Andrzej Bialecki
Carl Cerecke wrote: Carl Cerecke wrote: Andrzej Bialecki wrote: Carl Cerecke wrote: I've given this a crack and it mostly seems to work, except I'm not sure how to get the score back into the crawldb. After reading the Javadoc, I figured that passScoreAfterParsing() was the method I need

Re: [Nutch-general] incremental growing index

2007-07-12 Thread Andrzej Bialecki
? It would be probably too slow, unless you made a copy of linkdb/crawldb on the local FS-es of each node. But at this point the benefit of this change would be doubtful, because of all the I/O you would need to do to prepare each task's environment ... -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Separating nutch and hadoop configurations.

2007-07-11 Thread Andrzej Bialecki
), it should be enough to put the nutch*.job file in ${hadoop.dir}, and copy bin/nutch (possibly with some minor modifications - my memory is a little vague on this ...). -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Locale for Nutch?

2007-07-09 Thread Andrzej Bialecki
get the index.ja.html page instead of the English page. Please see org.apache.nutch.protocol.httpclient.Http.java:116 - currently this is hardcoded, but it would be easy to turn it into a configuration parameter. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] NUTCH-479 Support for OR queries - what is this about

2007-07-07 Thread Andrzej Bialecki
is the historical genesis of this issue (or is that even relevant)? Nutch webapp doesn't have anything to do with it. The limitations in the query syntax have different roots (see above). -- Best regards, Andrzej Bialecki

Re: [Nutch-general] The ranking is wrong

2007-06-27 Thread Andrzej Bialecki
text blocks by size * drop a certain number (or percentage) of the smallest of the text blocks. * put the blocks back in order, and extract only their text content. This is the main body text. -- Best regards, Andrzej Bialecki
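
The last steps of this heuristic might look like the sketch below. The earlier steps are cut off in this preview, so the input format (a list of text blocks in document order), the function name main_body_text and the drop_fraction parameter are all assumptions made for illustration.

```python
def main_body_text(blocks, drop_fraction=0.5):
    """Keep the largest text blocks and return them in document order.

    blocks: list of strings, in the order they appear on the page.
    drop_fraction: share of the smallest blocks to discard (assumed knob).
    """
    if not blocks:
        return ""
    # Rank block positions by text length, smallest first.
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    n_drop = int(len(blocks) * drop_fraction)
    # Drop the smallest blocks, then put the survivors back in page order.
    kept = sorted(by_size[n_drop:])
    return "\n".join(blocks[i] for i in kept)
```

Short navigation or login fragments tend to fall among the dropped blocks, while long paragraphs survive.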

Re: [Nutch-general] Integrate nutch crawler with Solr index server

2007-06-26 Thread Andrzej Bialecki
if the dates (with this resolution) were stored in a single field. The other method (combining) is already in use in Nutch, and implemented in CommonGrams. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-23 Thread Andrzej Bialecki
Doğacan Güney wrote: On 6/23/07, Andrzej Bialecki wrote: Doğacan Güney wrote: On 6/22/07, Andrzej Bialecki wrote: Doğacan Güney wrote: These 'urls' most likely come from parse-js plugin. Can you disable it and see if they disappear? To extract

Re: [Nutch-general] fetching http://www.variety.com//div/a

2007-06-23 Thread Andrzej Bialecki
Doğacan Güney wrote: On 6/22/07, Andrzej Bialecki wrote: Doğacan Güney wrote: These 'urls' most likely come from parse-js plugin. Can you disable it and see if they disappear? To extract links from js code, parse-js uses a heuristic that unfortunately also may extract

Re: [Nutch-general] Distributed index

2007-06-21 Thread Andrzej Bialecki
response times on most queries. Are you running with a sorted index, and using non-zero searcher.max.hits? If you use a well-defined PR-like scoring, then using this feature could work wonders for performance, and increase the max number of docs per server. -- Best regards, Andrzej

Re: [Nutch-general] Distributed index

2007-06-21 Thread Andrzej Bialecki
queries: http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523 -- Best regards, Andrzej Bialecki

Re: [Nutch-general] doubt about indexing

2007-06-20 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: [Nutch-general] doubt about indexing

2007-06-19 Thread Andrzej Bialecki
this (for performance reasons). Whenever the full text is needed, it's retrieved from Nutch segment data. Please see the logic in o.a.n.s.FetchedSegment for details - this process doesn't use Lucene at all, it simply retrieves records from Hadoop MapFile using URL as document ID. -- Best regards, Andrzej

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-15 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [Nutch-general] Indexing problems in nutch-nightly

2007-06-15 Thread Andrzej Bialecki
- this should be parseData instead of parse. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Any URL filter available for search.jsp?

2007-06-14 Thread Andrzej Bialecki
want him not to see certain sites in the results that have been crawled. How can this be achieved? Anyone solve this problem? I need this filter too. How to do it in the best way in nutch 0.9? Any thoughts? http://issues.apache.org/jira/browse/NUTCH-477 -- Best regards, Andrzej

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Andrzej Bialecki
it in a regex, or you can implement your own URLFilter plugin that does exactly this. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Crawling the web and going into depth

2007-06-12 Thread Andrzej Bialecki
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki Sent: Sunday, June 10, 2007 5:48 PM Enzo Michelangeli wrote: - Original Message - From: Berlin Brown Sent: Sunday, June 10, 2007 11:24 AM Yea, but how do crawl

Re: [Nutch-general] Nutch/Hadoop Fetcher confusion

2007-06-12 Thread Andrzej Bialecki
and you have a very impolite fetcher. Please don't run this to fetch a site you don't control :) .. because it destroys the built-in controls that Nutch uses to avoid making multiple concurrent requests to the same site, or to make them too quickly. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Cookie

2007-06-07 Thread Andrzej Bialecki
, it handles cookies properly without any additional configuration. However, they are not stored anywhere, so they will be valid only for the duration of a single fetch. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] stackoverflow error

2007-06-06 Thread Andrzej Bialecki
the source of DOMContentUtils to artificially limit the level of recursion in getOutlinks to something like 200-300. -- Best regards, Andrzej Bialecki
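
The idea of capping recursion depth can be illustrated with a small, language-neutral sketch. This is not the Nutch code: the dict-based node format and the names collect_links and MAX_DEPTH are hypothetical stand-ins for a DOM tree walk such as getOutlinks.

```python
MAX_DEPTH = 250  # inside the 200-300 range suggested in the reply


def collect_links(node, links=None, depth=0):
    """Walk a tree of nodes and gather 'href' values, up to MAX_DEPTH levels.

    node: dict with optional 'href' and 'children' keys (hypothetical
    stand-in for a DOM node).
    """
    if links is None:
        links = []
    if depth > MAX_DEPTH:
        # Stop descending instead of risking a stack overflow.
        return links
    href = node.get("href")
    if href:
        links.append(href)
    for child in node.get("children", []):
        collect_links(child, links, depth + 1)
    return links
```

Links nested deeper than the cap are silently skipped, which is the trade-off of this workaround.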

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-05 Thread Andrzej Bialecki
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki Sent: Monday, June 04, 2007 2:05 PM Er... I saw it mentioned at http://wiki.apache.org/nutch/FetchOptions , so I thought it was for real... Sorry, this page is wrong and should be corrected

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-04 Thread Andrzej Bialecki
Enzo Michelangeli wrote: - Original Message - From: Andrzej Bialecki Sent: Monday, June 04, 2007 1:31 AM Enzo Michelangeli wrote: In my case (with Nutch 0.8), it seems not: I set it to 500, and the fetcher still saturates the 1.5 Mbit/s link... Is it supposed

Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-03 Thread Andrzej Bialecki
property with such a name ... Is this perhaps a part of your local code base? -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Nutch and faceted search

2007-06-02 Thread Andrzej Bialecki
fast, although they differ in accuracy vs. speed balance. Unfortunately the code is not public - but the task is certainly doable, and doesn't require major changes. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Compression

2007-06-02 Thread Andrzej Bialecki
? You can use the *Merger tools to re-write the data. E.g. CrawlDbMerger for crawldb, giving just a single db as the input argument. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Andrzej Bialecki
/fetcher2_robots.patch Good catch! The patch looks good, too - please go ahead. One question: why did you remove the call to finishFetchItem() around line 505? -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Parallelizing URLFiltering

2007-05-31 Thread Andrzej Bialecki
to increase the number of concurrent requests and the cache size. This was on Linux, though - I have no idea how to do this on Windows. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher2 slowness?

2007-05-31 Thread Andrzej Bialecki
for this, if I am mistaken, just give me a nudge and I will send an updated patch. Indeed, you're right - I should've checked with the base version, not just the patch. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] mergesegs is not functioning properly

2007-05-29 Thread Andrzej Bialecki
should be fine. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] nutch-site.xml vs. nutch-default.xml

2007-05-27 Thread Andrzej Bialecki
which config files are loaded in what order and from what locations. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher2 slowness?

2007-05-18 Thread Andrzej Bialecki
, and queue info logging. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher2 slowness?

2007-05-18 Thread Andrzej Bialecki
Doğacan Güney wrote: On 5/18/07, Andrzej Bialecki wrote: Doğacan Güney wrote: Hi everyone, Has anyone tried Fetcher2 from latest trunk? In our tests, Fetcher2 is always slower (by a large margin) than Fetcher. For a segment with ~3 urls, we ran Fetcher with 150

Re: [Nutch-general] Generic Question about initial seed

2007-05-16 Thread Andrzej Bialecki
junk and spam - unless you tightly control the quality of URLs, using URLFilters, ScoringFilters and other means. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] http content limit not working?

2007-05-11 Thread Andrzej Bialecki
, the request is terminated and Nutch is able to do the right thing. The default protocol-http plugin does not use the apache commons httpclient stuff, and works correctly. Could you please create a JIRA issue, so that your analysis and the possible fix is recorded? Thanks! -- Best regards, Andrzej

Re: [Nutch-general] urlfilter-suffix bug ?

2007-05-05 Thread Andrzej Bialecki
implement the former. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Hardware Crashes and Garbage Collection on Nutch/Hadoop

2007-04-21 Thread Andrzej Bialecki
hitting OS-wide limits of open file handles. In another installation the OS-wide limits were ok, but the limits on this particular account were insufficient. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetching outside the domain ?

2007-04-20 Thread Andrzej Bialecki
about such things should be factored out and encapsulated in a utility class. This is more work than just adding a single line check, which may suggest why it hasn't been done yet. Patches are welcome ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Combining standard Lucene and Nutch

2007-04-11 Thread Andrzej Bialecki
into Nutch queries, and then translated into Lucene queries, using this tool: bin/nutch org.apache.nutch.searcher.Query -- Best regards, Andrzej Bialecki

Re: [Nutch-general] How to recude the tmp disk space usage during linkdb process?

2007-04-11 Thread Andrzej Bialecki
lowering it. * Please try the following modification: somewhere around LinkDb.java:283 add the following line: job.setCombinerClass(LinkDb.class); Recompile and re-run. * Also, as others suggested, you may want to turn on compression. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Removing pages from index immediately

2007-04-05 Thread Andrzej Bialecki
unfetched pages. You can also modify the Generator to completely skip such flagged pages. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Unable to load native-hadoop library

2007-04-04 Thread Andrzej Bialecki
send Nutch-related questions first to Nutch groups). What is your operating system (uname -a)? -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Unable to load native-hadoop library

2007-04-04 Thread Andrzej Bialecki
wangxu wrote: Linux wangxu.com 2.6.8-2-386 #1 Tue Aug 16 12:46:35 UTC 2005 i686 GNU/Linux Andrzej Bialecki wrote: wangxu wrote: when I use nutch-nightly0.9 ,I got this: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable And I echo

Re: [Nutch-general] Crawling + Indexing staging vs. production and URL conflict

2007-03-30 Thread Andrzej Bialecki
thing to do then would be to rewrite absolute outlinks contained in the content, from staging to www - but this can be done in URLNormalizers. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] 0.8.x Crawler compared to 0.7.2 Crawler

2007-03-28 Thread Andrzej Bialecki
, it was completely rewritten - I don't think there's any detailed documentation on this, though... -- Best regards, Andrzej Bialecki

Re: [Nutch-general] 0.8.x Crawler compared to 0.7.2 Crawler

2007-03-27 Thread Andrzej Bialecki
0.7.2 was released but failed to locate any such discussion). Please see above. The answer is yes. ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Splitting segments

2007-03-26 Thread Andrzej Bialecki
Mathijs Homminga wrote: Hi all, Is there a way to split large segments into smaller pieces? Mathijs As the name suggests (not ;) ) use SegmentMerger with the -slice option. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] DummySSLProtocolSocketFactory problem, please help me!!!!

2007-03-14 Thread Andrzej Bialecki
any answers. What helps is when you create a bug issue in JIRA, describe the problem and attach a patch that helped in your case. Thank you for your co-operation. ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] fetch2 very slow - anyone try this??

2007-03-12 Thread Andrzej Bialecki
public URLs, could you please send me your fetchlist? -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetch: java.lang.NullPointerException

2007-03-09 Thread Andrzej Bialecki
provide a descriptive message instead of throwing NPE. Care to provide a patch? ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetch: java.lang.NullPointerException

2007-03-08 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

2007-03-02 Thread Andrzej Bialecki
in the Parse MetaData. The reason is simple - space. Storing additional data consumes space, and if someone just occasionally needs this info from one or two pages it's less costly to re-parse the page again. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

2007-03-02 Thread Andrzej Bialecki
rubdabadub wrote: On 3/2/07, Andrzej Bialecki wrote: Dennis Kubes wrote: Believe it or not I don't think that meta tags are currently stored. I looked through the html parsing code and didn't see anywhere that it could be storing it except in html filters. I see

Re: [Nutch-general] Recovering aborted fetch

2007-02-27 Thread Andrzej Bialecki
by the symbolic name inside the SequenceFile. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Recovering aborted fetch

2007-02-26 Thread Andrzej Bialecki
the javadoc says, so that there's no misunderstanding: if you use DFS and your fetch job is aborted, there is no way in the world to recover the data - it's permanently lost. If you run with a local FS, you can try this tool and hope for the best. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Quick questions - merging/deduping

2007-02-22 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Quick questions - merging/deduping

2007-02-21 Thread Andrzej Bialecki
Lucifersam wrote: Andrzej Bialecki wrote: Lucifersam wrote: Finally - I seem to have a problem with identical pages with different urls - i.e. http://website/ http://website/default.htm I was under the impression that these would be removed by the dedup process, but this does

Re: [Nutch-general] How to limit nutch to fetch, refetch and index just the injected URLs?

2007-02-02 Thread Andrzej Bialecki
not support it, but it's easy to add. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Dedup index error

2007-02-01 Thread Andrzej Bialecki
partition ... I need to check where the problem originates - however, this should not happen if you index more documents than 2 * the number of reduce tasks. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher threads automation

2007-02-01 Thread Andrzej Bialecki
and quickly, rather they make a bunch of requests for resources tied to a single page, then wait a relatively long time, and then make another bunch of requests ... So, the request pattern is still more fair than in the case of a mad crawler. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Dedup index error

2007-01-31 Thread Andrzej Bialecki
or more indexes under crawled/indexes is invalid - nonexistent, incomplete or corrupt. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher threads automation

2007-01-28 Thread Andrzej Bialecki
) - and quite often all requests from such sources get blocked at the firewall level - sometimes, even whole IP classes get blocked. So, t(h)read carefully ... -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Fetcher threads automation

2007-01-28 Thread Andrzej Bialecki
for automating these types of job streams in python but that is not complete yet. Andrzej, do you think this is something we should post to the wiki? Sure, if it's ok for you to release it I'm sure many people would find it useful. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Linking url metadata to nutch search results

2007-01-26 Thread Andrzej Bialecki
function which maps String to Integer, but even in this case you would have a small probability that existing URLs will be re-numbered. The space of int is too small to use random hashing and hope there are no collisions. -- Best regards, Andrzej Bialecki
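
The collision concern can be made concrete with the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where N is the size of the hash space. The sketch below assumes uniformly random 32-bit hashes; the function name collision_probability is illustrative only.

```python
import math


def collision_probability(n_items, space_bits=32):
    """Approximate chance that at least two of n_items random hashes collide.

    Uses the birthday approximation for a space of 2**space_bits values,
    assuming the hashes are uniformly distributed.
    """
    space = 2 ** space_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))
```

Under these assumptions a collision is already more likely than not at around 100,000 URLs, which is tiny compared with a typical crawl.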

Re: [Nutch-general] How to limit nutch to fetch, refetch and index just the injected URLs?

2007-01-26 Thread Andrzej Bialecki
Nicolás Lichtmaier wrote: I'd like to limit nutch to fetch, refetch and index just the injected URLs. Will setting db.max.outlinks.per.page to 0 enable me to do that? If not... how could I achieve what I'm looking for? You need to run updatedb with the -noAdditions switch. -- Best regards, Andrzej

Re: [Nutch-general] Need help with form based authentication

2007-01-26 Thread Andrzej Bialecki
, redirecting, running javascripts, etc. In the end only perhaps 1 out of 50 sites was using a plain form authentication, and even that with different field names on the form ... so I gave up. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Andrzej Bialecki
. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Andrzej Bialecki
of this information is already available on the Nutch Wiki. All I can say is that there is certainly a limit to what you can do using the local mode - if you need to handle large numbers of pages you will need to migrate to the distributed setup. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Andrzej Bialecki
from 0.8 and later, and offers only limited scalability. Still, this workaround should work ok ... -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Problem crawling/fetching using https

2007-01-24 Thread Andrzej Bialecki
. There were also other intermittent problems with this library, so after much deliberation we decided to leave the simpler plugin as the default ... These issues may have been solved in a newer version of httpclient library. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Input directory urls/url-fr.txt in localhost:9000 is invalid with Hadoop 0.4.0patched and Nutch 0.8.1

2007-01-19 Thread Andrzej Bialecki
) at org.apache.nutch.crawl.Crawl.main(Crawl.java:105) .. and that's because urlDir: urls/url-fr.txt is not a directory, but a file. You should give only the urls as the input directory - Nutch will read all text files inside the directory. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Reduce segment size

2007-01-19 Thread Andrzej Bialecki
regards, Andrzej Bialecki

Re: [Nutch-general] DB_unfetched status

2007-01-18 Thread Andrzej Bialecki
of threads accessing a single host, and delay between requests. Look for Exceeded http.max.delays errors in your log. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Nutch 0.8 cannot find all the links on a page

2007-01-18 Thread Andrzej Bialecki
- most likely you have the default rule that discards URLs with special characters. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] How to recover data from filesystem

2007-01-17 Thread Andrzej Bialecki
to physically remove all blocks that are not accounted for in the current fsimage). If it's any consolation - this problem is recognized, and people are actively working on fixing it. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Issue While Creating Inverted Links

2007-01-16 Thread Andrzej Bialecki
! This exception doesn't tell anything except that the job failed... You need to increase the logging level to DEBUG - please check log4j.properties . My guess is that most likely one of these segments is unfetched or corrupted. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] BUG with error: failure closing block of file with Hadoop 0.9.2 and Nutch 0.8.1

2007-01-16 Thread Andrzej Bialecki
) Nutch 0.8.1 doesn't work with any other version of Hadoop than the one it's supplied with - i.e. version 0.4.0-patched. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] checksum error in segment merger

2007-01-16 Thread Andrzej Bialecki
time I see that Hadoop detects non-obvious errors in hardware or connectivity on a cluster - on one hand, it would be nice if it were less susceptible to this kind of error, on the other hand - it makes for a good diagnostic tool ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] checksum error in segment merger

2007-01-15 Thread Andrzej Bialecki
this on an NFS volume, using LocalFileSystem? You aren't running out of disk space by any chance? -- Best regards, Andrzej Bialecki

Re: [Nutch-general] checksum error in segment merger

2007-01-15 Thread Andrzej Bialecki
Brian Whitman wrote: On Jan 15, 2007, at 1:36 PM, Andrzej Bialecki wrote: Brian Whitman wrote: (nutch-nightly, hadoop 0.9.1) The file indicated (bad_files/data.-931801681) is a 255MB binary file -- running strings on it shows a lot of URIs. There's also a 2MB .data.crc-931801681 file, all

Re: [Nutch-general] nutch-0.9 trunk is failing in Indexer

2007-01-11 Thread Andrzej Bialecki
indicate that mapred.speculative.execution is true in your config - make sure it's explicitly set to false. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Filtering URLs in CrawlDB

2007-01-09 Thread Andrzej Bialecki
segment. Indexes contain segment names and document id-s inside, so if you have merged/sliced your segments you have to rebuild the index too. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] LocalFileSystem , LinkDbReader and workingDir

2007-01-09 Thread Andrzej Bialecki
paths for any arguments ... ;) -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Nutch Programmer Wanted

2007-01-07 Thread Andrzej Bialecki
mln urls, if even that many. The main bottleneck was the DB operations, which for any type of hardware would take even days to complete. These limitations have been largely removed in 0.8 and later, due to the Hadoop framework. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Reading Inlinks

2007-01-05 Thread Andrzej Bialecki
to include information from linkdb when it generates new segments, whichever way is more suitable to your requirements. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Google Search on Nutch?

2007-01-03 Thread Andrzej Bialecki
is more than capable of doing this, all it takes is one person familiar with the infrastructure and the nightly build process, and with a day or two to spare ... -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Google Search on Nutch?

2007-01-03 Thread Andrzej Bialecki
knows anymore). No, I meant the apache.org as a person (a committer), who is familiar enough with both Nutch and the local infrastructure at apache.org so that he could set it up. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] re-parse hang?

2007-01-03 Thread Andrzej Bialecki
. Any ideas? In such a case you should always do a full thread dump of this JVM process. On Unix systems this is achieved by running kill -SIGQUIT pid, on Windows by pressing Ctrl-Break. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Error on convert to 0.9 during mergesegs step

2007-01-02 Thread Andrzej Bialecki
/ are incompatible with 0.8.x, and with earlier versions of trunk/ - see note 17 in CHANGES.txt. You should also temporarily increase your logging level to DEBUG to see if there are any problems reported at low level. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] parse-js as a HtmlParseFilter

2006-12-30 Thread Andrzej Bialecki
and consuming 100% CPU. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] pagerank implementation

2006-12-15 Thread Andrzej Bialecki
). This should be trivial to implement as a scoring plugin. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Error on convert to 0.9 during mergesegs step

2006-12-15 Thread Andrzej Bialecki
) ^^ Please set mapred.speculative.execution to false, and repeat. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] Error on convert to 0.9 during mergesegs step

2006-12-15 Thread Andrzej Bialecki
, and Nutch config contains only overrides ... so you need to put this explicitly into your hadoop-site.xml, like this: <property> <name>mapred.speculative.execution</name> <value>false</value> </property> If this fixes your problem, I'll put this property in the public sources. -- Best regards, Andrzej

Re: [Nutch-general] error with trunk: linkdb copied to wrong dir

2006-12-15 Thread Andrzej Bialecki
trouble, and you come up with some patches that improve support for *BSD, I may be able to integrate them back to Hadoop sources. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] error with trunk: linkdb copied to wrong dir

2006-12-14 Thread Andrzej Bialecki
again :) Indeed, this is related to some changes of delete()'s behavior in HDFS - it seems that previously it would just return false on non-existent directories, now it throws an Exception. I fixed this in trunk/ and branch-0.8. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] error with trunk: linkdb copied to wrong dir

2006-12-14 Thread Andrzej Bialecki
in logs. -- Best regards, Andrzej Bialecki

Re: [Nutch-general] error with trunk: linkdb copied to wrong dir

2006-12-14 Thread Andrzej Bialecki
ava:74) The issue is that this constructor, MapFile.Writer(Configuration, FileSystem, String, Class, Class) is present only in Hadoop 0.9, but it wasn't present before ... -- Best regards, Andrzej Bialecki
