Re: full text search for java sources and subversion repository

2010-05-09 Thread Andrzej Bialecki
in your segment (you can dump this with readseg command). It should contain a plain text content of your file. * use Luke (www.getopt.org/luke) to examine your Lucene index. You should be able to retrieve terms coming from your Java documents - use Rec

Re: Wildcard search with nutch distributed search

2010-05-09 Thread Andrzej Bialecki
ire major refactoring) > that could provide this functionality? Use Nutch for crawling and indexing to Solr, and then use Solr directly for searching.

Re: JobTracker gets stuck with DFS problems

2010-05-03 Thread Andrzej Bialecki
ch > crawl" command, that means I will have to code my own .sh for crawling, one > that uses the -noparsing option of the fetcher right ? You can simply set the fetcher.parsing config option to false.

Re: JobTracker gets stuck with DFS problems

2010-05-03 Thread Andrzej Bialecki
you can re-parse again after you fixed the config or the code... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: JobTracker gets stuck with DFS problems

2010-04-30 Thread Andrzej Bialecki
n the politeness crawl delay. > > 3. When it all goes down, is there a way to restart crawling from where the > process stopped ? Unfortunately, no. You should at least crawl without parsing, so tha

Re: Hadoop Disk Error

2010-04-27 Thread Andrzej Bialecki
o the documentation. The problem should be reported to the Hadoop project.

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Andrzej Bialecki
, because it needs the Hadoop >> infrastructure to run). > > I thought ant tar did this? That's what it sez on the release guide [1] and > what I'm familiar with when I did the Nutch 0.9 release. ant tar packs everything, i.e. both sou

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Andrzej Bialecki
e. We may have been too hasty with that, though... What do others think?

ANNOUNCE: Nutch becomes an Apache Top-Level Project (TLP)

2010-04-26 Thread Andrzej Bialecki
s_tlp

Re: How to do faceting on data indexed by Nutch

2010-04-25 Thread Andrzej Bialecki
g backends - the one that is configured by default uses plain Lucene, and it does not support faceting. The other backend uses Solr, and then of course it supports faceting and all other Solr features. So in your case you need to switch to use Solr

Re: About Apache Nutch 1.1 Final Release

2010-04-16 Thread Andrzej Bialecki
On 2010-04-17 05:45, Phil Barnett wrote: > On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote: > >> More details on this (your environment, OS, JDK version) and >> logs/stacktraces would be highly appreciated! You mentioned that you >> have some scripts - if yo

Re: About Apache Nutch 1.1 Final Release

2010-04-10 Thread Andrzej Bialecki
get more specific. More details on this (your environment, OS, JDK version) and logs/stacktraces would be highly appreciated! You mentioned that you have some scripts - if you could extract relevant portions from them (or copy the scripts) it would h

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-09 Thread Andrzej Bialecki
Release the packages as Apache Nutch 1.1. > > [ ] -1 Do not release the packages because... > +1 - tested both local and distributed workflows, all looks good.

[VOTE RESULTS] Nutch to become a top-level project (TLP)

2010-04-08 Thread Andrzej Bialecki
ormal steps to become a TLP.

Re: Nutch segment merge is very slow

2010-04-05 Thread Andrzej Bialecki
ep takes too much time, but still the number of segments is well below a hundred, just don't merge them.

Re: Can't open a nutch 1.0 index with luke

2010-04-01 Thread Andrzej Bialecki
ventDispatchThread.pumpEvents(Unknown Source) > at java.awt.EventDispatchThread.pumpEvents(Unknown Source) > at java.awt.EventDispatchThread.run(Unknown Source) > > Any ideas why this happens and how

Re: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Andrzej Bialecki
icial, but I'm not familiar with maven, so I won't be able to make this change myself...

[VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Andrzej Bialecki

Re: hamid sefrani

2010-03-29 Thread Andrzej Bialecki
On 2010-03-29 17:14, Pedro Bezunartea López wrote: Thanks Andrzej, I was more curious than bothered by these easy to spot spam messages. Can I help? Thanks, not really - I sent an admin unsubscribe and it worked, we'll see if the problem returns ...

Re: hamid sefrani

2010-03-29 Thread Andrzej Bialecki
moderator adds them. It appears that this user slipped through ... I'll try to forcibly unsubscribe him. Sorry!

Re: Nutch Fetch Stuck

2010-03-13 Thread Andrzej Bialecki
, otherwise it's likely to happen again. Are you running this on a cluster? Check the logs of the crashed tasks (in logs/userlogs/ on respective tasktracker nodes).

Re: Nutch Fetch Stuck

2010-03-12 Thread Andrzej Bialecki
strongly recommend that you first fetch, and then run the parsing as a separate step.

Re: Avoid indexing common html to all pages, promoting page titles.

2010-03-12 Thread Andrzej Bialecki
aviour? You can define these weights in the configuration, look for query boost properties.

Re: Where are new linked entries added

2010-03-11 Thread Andrzej Bialecki
fying the code directly in ParseOutputFormat, it's complex and fragile.

Re: form-based authentication? Any progress

2010-03-10 Thread Andrzej Bialecki
ripts generating the response ... it was a total mess. So, if you target 10 sites, you can make it work. If you target 10,000 sites all using slightly different methods, then forget it.

Re: Content of redirected urls empty

2010-03-08 Thread Andrzej Bialecki
really no content for the redirected url?

Re: New version of nutch?

2010-03-03 Thread Andrzej Bialecki
still a few months away.

Re: Update on ignoring menu divs

2010-02-28 Thread Andrzej Bialecki
ogle.com/p/boilerpipe/ .

Re: Nutch v0.4

2010-02-25 Thread Andrzej Bialecki
arently that site no longer exists. Sorry :( However, you can still check out that code from CVS repository at nutch.sf.net .

Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki
On 2010-02-21 12:36, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored even 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 i

Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki
On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 22:45, reinhard schwab wrote: the content of one page is stored even 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 i believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at

Re: SegmentFilter

2010-02-20 Thread Andrzej Bialecki
set of URL params, such as sessionId, print=yes, etc) or completely unrelated (human errors, peculiarities of the content management system, or mirrors). In your case it seems that the same page is available under different values of g2_highlightId.

Re: About HBase Integration

2010-02-09 Thread Andrzej Bialecki
On 2010-02-09 03:08, Hua Su wrote: Thanks. But heritrix is another project, right? Please see this Git repository, it contains the latest work in progress on Nutch+HBase: git://github.com/dogacan/nutchbase.git

Re: merge not working anymore

2010-01-18 Thread Andrzej Bialecki
WARN hdfs.DFSClient - DFS Read: java.io.IOException: Could not obtain block: blk_-6931814167688802826_9735 file=/user/root/crawl/indexed-segments/20100117235244/part-0/_1lr.prx This error is commonly caused by running out of disk space on a datanode.

Re: Post Injecting ?

2010-01-15 Thread Andrzej Bialecki
On 2010-01-15 20:09, MilleBii wrote: Inject is meant to seed the database at the start. But I would like to inject new urls on a production crawldb, I think it works but I was wondering if somebody could confirm that. Yes. New urls are merged with the old ones.

Re: Help Needed with Error: java.lang.StackOverflowError

2010-01-11 Thread Andrzej Bialecki
e urlfilter-automaton, which is slightly less expressive but much much faster.

Re: Adding additional metadata

2010-01-11 Thread Andrzej Bialecki
e in a separate plugin then it might. Another reason is configurability - if you put this code in a separate plugin, you can easily turn it on/off, but if it sits in HtmlParser this would be more difficul

Re: Purging from Nutch after indexing with Solr

2010-01-09 Thread Andrzej Bialecki
which does happen in development & test phases, less in production though. Right. Also, a common practice is to keep the raw data for a while just to make sure that the parsing and indexing went smoothly (in case you need to re-parse the raw content).

Re: Purging from Nutch after indexing with Solr

2010-01-08 Thread Andrzej Bialecki
ks will incrementally merge the existing linkdb with new links from a new segment.

Re: alternatives to PDFBox (was: IOException when parsing PDF files)

2010-01-07 Thread Andrzej Bialecki
efae13d6cf878691 Umm .. if anything that comment suggests that properly handling diverse PDFs is simply a hard thing to do, and PDFBox is not that much to blame.

Re: Dedup remove all duplicates

2010-01-06 Thread Andrzej Bialecki
(2 documents), and if the problem persists please report this in JIRA.

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that i have some fields of the SolrDocument that i want to compute by text analysis (basically i want to do some smart keywords extra

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
ution that you are looking for is an IndexingFilter - this receives a copy of the document with all fields collected just before it's sent to the indexing backend - and you can freely modify the content of NutchDocument, e.g. do additional analysis, add/remove/modify fields, etc.

Re: Large files - nutch failing to fetch

2009-12-22 Thread Andrzej Bialecki
ed in that patch required too much maintenance. On the positive side, it worked well with super-large keys and values (in the order of gigabytes).

Re: Large files - nutch failing to fetch

2009-12-21 Thread Andrzej Bialecki

Re: Nutch Hadoop 0.20 - AlreadyBeingCreatedException

2009-12-17 Thread Andrzej Bialecki
es) - maybe we should commit the change? Thanks for reporting this - could you perhaps try to apply that patch and see if it helps? I hesitated to commit it because it's really a workaround and not a solution ... but if it works for you then it's better than nothing.

Re: OR support

2009-12-14 Thread Andrzej Bialecki
On 2009-12-14 16:05, BrunoWL wrote: Nobody? Please, any answer would be good. Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479 That's the current status, i.e. this functionality is available only as a patch.

Re: Luke reading index in hdfs

2009-12-11 Thread Andrzej Bialecki
t contains part-N partial indexes).

Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki
e to regex-urlnormalizer that changes the matching urls to e.g. always lose the 'www.' part.
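
A minimal sketch of the kind of rule described above, written as plain Java rather than as an actual regex-urlnormalizer rule. The class and method names are hypothetical and not part of Nutch; only the idea (a regex substitution that drops a leading 'www.' from the host) comes from the message.

```java
/** Hypothetical stand-in for a URL normalization rule; not part of Nutch. */
final class WwwNormalizer {
    // Matches the scheme followed by a leading "www." and captures only the scheme.
    private static final String PATTERN = "^(https?://)www\\.";

    /** Returns the URL with a leading "www." removed from the host, if present. */
    static String stripWww(String url) {
        return url.replaceFirst(PATTERN, "$1");
    }

    public static void main(String[] args) {
        // Both variants should collapse to the same URL.
        System.out.println(stripWww("http://www.example.com/index.html"));
        System.out.println(stripWww("http://example.com/index.html"));
    }
}
```

With a rule like this both spellings of the host normalize to one URL, so the crawldb would hold a single entry for the page instead of two.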

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki
that page. Very good explanation, those are exactly the reasons why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters.

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Andrzej Bialecki
g the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's pseudo-distributed, i.e. single node)?

Re: Nutch 1.0 wml plugin

2009-12-07 Thread Andrzej Bialecki
, please create a JIRA issue in Nutch, and attach the patch.

Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki
the priority of URL during generation. See ScoringFilter.generatorSortValue(..), you can modify this method in scoring-opic (or in your own scoring filter) to prioritize certain urls over others.

Re: org.apache.hadoop.util.DiskChecker$DiskErrorExceptio

2009-12-02 Thread Andrzej Bialecki

Re: crawl dates with fetch interval 0

2009-12-02 Thread Andrzej Bialecki
reinhard schwab wrote: this crawl date will be fetched and fetched again with 0 days retry interval. i will open an issue in jira and attach a patch. Thanks for catching this bug - please do so.

Re: odd warnings

2009-12-01 Thread Andrzej Bialecki
d. However, the deduplication process doesn't accept partial indexes, so you need to specify each /part-NNNN dir as an input to dedup.

Re: odd warnings

2009-11-30 Thread Andrzej Bialecki
merged index, and "indexes" for partial indexes), otherwise they won't be found by the NutchBean (the search component in Nutch). So e.g. your Lucene index in index1/ won't be found.

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 5:48 PM, Andrzej Bialecki wrote: Paul Tomblin wrote: -bash-3.2$ jstack -F 32507 Attaching to process ID 32507, please wait... Hm, I can't see anything obviously wrong with that thread dump. What's the CPU and swap usage, and load

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
Paul Tomblin wrote: On Sat, Nov 28, 2009 at 4:45 PM, Andrzej Bialecki wrote: Paul Tomblin wrote: How can I tell what's going on and why it's stopped? Try to generate a thread dump to see what code is being executed. I didn't do any sort of distributed mode because I

Re: Nutch frozen but not exiting

2009-11-28 Thread Andrzej Bialecki
thread dump to see what code is being executed.

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
Next week I will be working on integrating the patches from Julien, and if time permits I could perhaps start working on speed monitoring to lock out slow servers.

Re: 100 fetches per second?

2009-11-27 Thread Andrzej Bialecki
, slow map tasks tend to hang around, but still some of them finish and make space for new tasks. As time goes on, the majority of your tasks become slow tasks, so the overall speed continues to drop.

Re: Encoding the content got from Fetcher

2009-11-27 Thread Andrzej Bialecki
uses ICU4J CharsetDetector plus its own heuristic (in util.EncodingDetector and in HtmlParser) that tries to detect character encoding if it's missing or even if it's wrong - but this is a tricky issue and sometimes results are unpredictable.

Re: Broken segments ?

2009-11-26 Thread Andrzej Bialecki
track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult."

Re: 100 fetches per second?

2009-11-25 Thread Andrzej Bialecki
put, to see how many unique hosts are in the current working set.

Re: Nutch config IOException

2009-11-25 Thread Andrzej Bialecki
innocuous - it helps to debug at which points in the code the Configuration instances are being created. And you wouldn't have seen this if you didn't turn on the DEBUG logging. ;)

Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki
the db in order to update the signatures.

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
d to use a more relaxed Signature implementation, e.g. TextProfileSignature.
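
To illustrate why a relaxed signature helps dedup, here is a small self-contained sketch. It is NOT Nutch's MD5Signature or TextProfileSignature code; the class, the token rules and the minCount threshold are assumptions made for illustration. The exact signature hashes the whole text, while the relaxed one hashes only a profile of frequently occurring words, so small differences between two copies of a page do not change it.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: loosely inspired by the idea of a text-profile signature. */
final class SignatureSketch {
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Exact signature: any single-character change produces a different value. */
    static String exactSignature(String text) {
        return md5Hex(text);
    }

    /** Relaxed signature: hash of the sorted words that occur at least minCount times. */
    static String relaxedSignature(String text, int minCount) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted for a stable profile
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.length() >= 3) {                 // ignore very short tokens
                counts.merge(token, 1, Integer::sum);
            }
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {            // drop rare words, e.g. counters or dates
                profile.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
            }
        }
        return md5Hex(profile.toString());
    }

    public static void main(String[] args) {
        String a = "Nutch crawls pages. Nutch indexes pages. Visitor 1041";
        String b = "Nutch crawls pages. Nutch indexes pages. Visitor 1042";
        System.out.println(exactSignature(a).equals(exactSignature(b)));
        System.out.println(relaxedSignature(a, 2).equals(relaxedSignature(b, 2)));
    }
}
```

The two sample texts differ only in a visitor counter, which is the sort of noise that makes near-identical pages escape dedup when an exact hash is used.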

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
ls in your crawldb.

Re: can you incrementally build an index?

2009-11-24 Thread Andrzej Bialecki
and rebuild it from all per-segment indexes plus that most recent one. And then deduplicate. If this sounds wasteful, please keep in mind that when Lucene merges indexes it needs to re-write the main index anyway, so in terms of disk IO it should be nearly the same.

Re: AbstractFetchSchedule

2009-11-22 Thread Andrzej Bialecki
? Hm, indeed this looks like a bug - we should instead do this: if (datum.getFetchInterval() > maxInterval) { datum.setFetchInterval(maxInterval * 0.9); }
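
The proposed fix can be tried out in isolation with a stand-in class. In the sketch below, Datum is a hypothetical replacement for the real crawl datum and keeps the interval as an int number of seconds, which is an assumption; only the clamping condition and the 0.9 factor come from the message.

```java
/** Self-contained sketch of the clamping logic; Datum is a hypothetical stand-in. */
final class FetchIntervalClamp {
    static final class Datum {
        private int fetchInterval; // seconds (assumed unit)

        Datum(int fetchInterval) { this.fetchInterval = fetchInterval; }

        int getFetchInterval() { return fetchInterval; }

        void setFetchInterval(int fetchInterval) { this.fetchInterval = fetchInterval; }
    }

    /** Pulls an interval that exceeds maxInterval back to 90% of maxInterval. */
    static void clamp(Datum datum, int maxInterval) {
        if (datum.getFetchInterval() > maxInterval) {
            datum.setFetchInterval(Math.round(maxInterval * 0.9f));
        }
    }

    public static void main(String[] args) {
        Datum tooLong = new Datum(5000);
        clamp(tooLong, 1000);
        System.out.println(tooLong.getFetchInterval());
    }
}
```

Setting the value to 90% of the maximum, rather than to the maximum itself, presumably leaves the datum strictly below the limit so that the same check does not fire again on the next pass.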

Re: Nutch upgrade to Hadoop

2009-11-21 Thread Andrzej Bialecki
!

Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki
Dennis Kubes wrote: I would like to get a couple things in this release as well. Let me know if you want help with the upgrade. You mean you want to do the Hadoop upgrade? I won't stand in your way :)

Re: Nutch near future - strategic directions

2009-11-20 Thread Andrzej Bialecki
e current code, but its design is obscured by the ScoringFilter api and the need to maintain its own extended DB-s.

Re: Nutch upgrade to Hadoop

2009-11-20 Thread Andrzej Bialecki
week) - and I agree that we should have a 1.1 release in the near future.

Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki
(and webmasters who are their victims). The source code is there, if you choose you can modify it to bypass these restrictions, just be aware of the consequences (and don't use "Nutch" as your user agent ;) ).

Re: decoding nutch readseg -dump 's output

2009-11-16 Thread Andrzej Bialecki
tform encoding - any characters outside this encoding will be replaced by question marks. If you want to get an exact copy of the raw binary content then please use the SegmentReader API.
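
The question-mark effect is standard Java behaviour and can be reproduced without Nutch. The sketch below assumes US-ASCII as the "platform encoding" purely for illustration; the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Shows how characters outside the target encoding are lost when encoding text. */
final class LossyEncodingDemo {
    /** Encodes the text as US-ASCII and decodes it again. */
    static String roundTrip(String text) {
        // getBytes(Charset) substitutes the charset's replacement byte ('?')
        // for every character the charset cannot represent.
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Bia\u0142ecki")); // prints "Bia?ecki"
    }
}
```

Once the replacement has happened the original character cannot be recovered from the dump, which is why the raw bytes have to be read through an API instead.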

Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki
s of course depends on the "last modified" timestamp being present on the webpage that is being crawled, which I believe is not mandatory. Still those who do set it would benefit. This is already implemented - see the Signature / MD5Signature / TextProfileSignature.

Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki

Re: Nutch Hadoop question

2009-11-13 Thread Andrzej Bialecki
them to use different ports AND different local paths.

Re: Problems with Hadoop source

2009-11-11 Thread Andrzej Bialecki
efines the implementation of the "file://" schema FileSystem. Now you probably forgot to put hadoop-default.xml on your classpath. Go to Build Path and add this file to your classpath, and all should be ok.

Re: changing/addding field in existing index

2009-11-09 Thread Andrzej Bialecki
the index.

Nutch near future - strategic directions

2009-11-09 Thread Andrzej Bialecki
mirrors. Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an attractive platform to develop and experiment with such components. - Briefly ;) that's what comes to my mind when I think about the

Re: Direct Access to Cached Data

2009-11-05 Thread Andrzej Bialecki
adseg), and you can use its API to retrieve either all or individual records from a segment (using URL as key).

Unsubscribe step-by-step (Re: could you unsubscribe me from this mailing list pls. tks)

2009-11-02 Thread Andrzej Bialecki
Andrzej Bialecki wrote: doesn't work, as reported by me and others last week. Thanks, Did you get the message with the subject of "confirm unsubscribe from nutch-user@lucene.apache.org" and did you respond to it from the same email account that you were subscribed from? ..

Re: could you unsubscribe me from this mailing list pls. tks

2009-11-02 Thread Andrzej Bialecki
" and did you respond to it from the same email account that you were subscribed from?

Re: including code between plugins

2009-11-02 Thread Andrzej Bialecki
ntifier code in my plugin code without actually using the language-identifier plugin? You need to add the language-identifier plugin to the section in your plugin.xml, like this: --

Re: updatedb is talking long long time

2009-11-02 Thread Andrzej Bialecki
d re-running the operation. * minor issue - when specifying the path names of segments and crawldb, do NOT append the trailing slash - it's not harmful in this particular case, but you could have a nasty surprise when doing e.g. copy / mv op

Re: unbalanced fetching

2009-10-29 Thread Andrzej Bialecki
the longest is assigned a lot of URLs from a single host. A workaround for this is to limit the max number of URLs per host (in nutch-site.xml) to a more reasonable number, e.g. 100 or 1000, whatever works best for you.
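
The effect of such a per-host limit can be sketched in a few lines of plain Java. This is not the Nutch generator; the class and method names are hypothetical, and in Nutch itself the limit is a configuration property (believed to be generate.max.per.host in the 1.0 line; check nutch-default.xml for the exact name).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative per-host cap on a fetch list; not Nutch code. */
final class PerHostCap {
    /** Keeps at most maxPerHost URLs for each host, preserving input order. */
    static List<String> capPerHost(List<String> urls, int maxPerHost) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (count <= maxPerHost) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://b.example/1");
        // With a cap of 2 the third a.example URL is left for a later round.
        System.out.println(capPerHost(urls, 2));
    }
}
```

With the cap in place no single host can dominate one fetch list, so no map task ends up waiting on one slow server while the others sit idle.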

Re: Nutch indexes less pages, then it fetches

2009-10-28 Thread Andrzej Bialecki

Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki
ment on or reject it by returning null.

Re: Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Andrzej Bialecki
Gora Mohanty wrote: On Mon, 26 Oct 2009 17:26:23 +0100 Andrzej Bialecki wrote: [...] Stale (no longer existing) URLs are marked with STATUS_DB_GONE. They are kept in Nutch crawldb to prevent their re-discovery (through stale links pointing to these URL-s from other pages). If you really want

Re: Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Andrzej Bialecki
URLs directly from CrawlDb (using e.g. CrawlDbReader API) and then uses SolrJ API to send the same delete requests + commit.

Re: Targeting Specific Links

2009-10-23 Thread Andrzej Bialecki
at.MIN_VALUE) { return; }

Re: Accessing an Index from a shared location

2009-10-21 Thread Andrzej Bialecki
Java - you need to mount this location as a local volume.

Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki
. This problem is rare - I think I crawled cumulatively ~500 million pages in various configs and it never happened to me personally. It requires a few things to go wrong (see the issue comments).

Re: Extending HTML Parser to create subpage index documents

2009-10-19 Thread Andrzej Bialecki
le.tar!myfile.txt) and add the original URL in the metadata, to keep track of the parent URL. The rest should be handled automatically, although there are some other complications that need to be handled as well (e.g. don't recraw

Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki
ult is 100 - when crawling filesystems each file in a directory is treated as an outlink, and this limit is then applied.

Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-17 Thread Andrzej Bialecki
valid, and cannot be written to. Are you sure you are running a single datanode process per machine?

Re: Nutch Enterprise

2009-10-17 Thread Andrzej Bialecki
I agree with Dennis - use Nutch if you need to do a larger-scale discovery such as when you crawl the web, but if you already know all target pages in advance then Solr will be a much better (and much easier to handle) platform.
