How can I fetch a site manually?
Hi everyone, I'm writing a search engine for blogs, and I have a problem when collecting information about a blogger:

1. I use a URL filter to make sure Nutch only takes a person's profile page.
2. I generate the dynamic blog/feed links myself and use some Nutch functions to fetch their content.
3. I collect all the information about the person and update the profile's datum.

Everything works, but I also want Nutch to fetch all of the sites I discover while gathering this person's info, so they are searchable later (e.g. the blog pages). Can anyone help me solve this? I've tried the following in the private ParseStatus output() method of Fetcher.java, using my new data (key2, datum2, ...), but still nothing gets fetched:

    output.collect(key2, new FetcherOutput(datum2,
        storingContent ? content2 : null,
        parse2 != null ? new ParseImpl(parse2) : null));

Thanks for any advice.
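One route that stays inside the normal crawl cycle, instead of collecting a second FetcherOutput by hand: merge your generated URLs into the profile page's outlinks, so the next updatedb run adds them to the crawldb and a later generate/fetch pass retrieves them. A minimal sketch, not tested against your tree - blogUrl and feedUrl stand for the links you build yourself, and the two-argument Outlink constructor matches current trunk (older releases also took a Configuration argument):

    import java.net.MalformedURLException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;

    // Inside Fetcher.output(), after a successful parse of the profile page:
    ParseData pd = parse.getData();
    List<Outlink> merged = new ArrayList<Outlink>(Arrays.asList(pd.getOutlinks()));
    for (String u : new String[] { blogUrl, feedUrl }) {  // your generated links
      try {
        merged.add(new Outlink(u, ""));                   // empty anchor text
      } catch (MalformedURLException e) {
        // skip anything that does not parse as a URL
      }
    }
    ParseData withExtras = new ParseData(pd.getStatus(), pd.getTitle(),
        merged.toArray(new Outlink[merged.size()]),
        pd.getContentMeta(), pd.getParseMeta());
    parse = new ParseImpl(parse.getText(), withExtras);

With the outlinks in place, updatedb picks the new URLs up automatically, subject to your URL filters - so make sure the filter from step 1 does not exclude them.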
[jira] Commented: (NUTCH-505) Outlink urls should be validated
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511985 ]

Hudson commented on NUTCH-505:
------------------------------

Integrated in Nutch-Nightly #147 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/])

> Outlink urls should be validated
> --------------------------------
>
>          Key: NUTCH-505
>          URL: https://issues.apache.org/jira/browse/NUTCH-505
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Doğacan Güney
>     Assignee: Doğacan Güney
>     Priority: Minor
>      Fix For: 1.0.0
>
>  Attachments: NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation
> system that tests these urls and filters out garbage.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
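The gist of the change, as a stand-in sketch rather than the committed patch: reject outlink strings that do not parse as syntactically valid URLs before they reach the crawldb. java.net.URI is strict enough to refuse markup characters such as '<' and '>' of the kind discussed in the linked thread:

    import java.net.URI;
    import java.net.URISyntaxException;

    // Stand-in validator; the actual patch may use a different mechanism.
    static boolean isValidOutlink(String toUrl) {
      try {
        new URI(toUrl);   // rejects e.g. "http://www.variety.com</div></a>"
        return true;
      } catch (URISyntaxException e) {
        return false;
      }
    }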
[jira] Commented: (NUTCH-510) IndexMerger delete working dir
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511986 ]

Hudson commented on NUTCH-510:
------------------------------

Integrated in Nutch-Nightly #147 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/147/])

> IndexMerger delete working dir
> ------------------------------
>
>              Key: NUTCH-510
>              URL: https://issues.apache.org/jira/browse/NUTCH-510
>          Project: Nutch
>       Issue Type: Improvement
>       Components: indexer
> Affects Versions: 1.0.0
>         Reporter: Enis Soztutar
>         Assignee: Doğacan Güney
>          Fix For: 1.0.0
>
>      Attachments: index.merger.delete.temp.dirs.patch
>
> IndexMerger does not delete the working dir when an IOException is thrown
> such as No space left on device. Local temporary directories should be
> deleted.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
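The shape of the fix the issue asks for, sketched with illustrative names (merge() and the variables are placeholders, not the attached patch): wrap the work in try/finally so the local working directory is removed even when an IOException such as "No space left on device" aborts the merge.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Path workDir = new Path("indexmerger-" + System.currentTimeMillis());
    FileSystem localFs = FileSystem.getLocal(conf);
    try {
      merge(indexes, outputIndex, workDir);  // hypothetical merge step
    } finally {
      if (localFs.exists(workDir)) {
        localFs.delete(workDir);  // 0.x-era single-argument delete; newer
      }                           // Hadoop wants delete(workDir, true)
    }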
Re: OPIC scoring differences
Doğacan Güney wrote:

> Andrzej, nice to see you working on this. There is one thing that I don't
> understand about your presentation. Assume that page A is the only url in
> our crawldb and it contains n outlinks.
>
> t = 0 - Generate runs, A is generated.
> t = 1 - Page A is fetched and its cash is distributed to its outlinks.
> t = 2 - Generate runs, pages P0-Pn are generated.
> t = 3 - P0-Pn are fetched and their cash is distributed to their outlinks.
>       - At this time, it is possible that page Pk links to page A. So, now
>         page A's cash > 0.
> t = 4 - Generate runs, page A is considered but is not generated (since its
>         next fetch time is later than the current time).
>
> - Won't page A become a temporary sink? Time between subsequent fetches may
>   be as large as 30 days in the default configuration. So, page A will
>   accumulate cash for a long time without distributing it.

Yes. That's why Abiteboul used history (several cycles long) to smooth out temporary imbalances in cash redistribution. The history component described in the paper could be either several cycles long or a specific period of time. In our case I think the history for rarely updated pages should span the db.max.interval period plus some, and for frequently updated pages it should span several cycles.

> - I don't see how we can achieve that, but, IMO, if a page is considered
>   but not generated, Nutch should distribute its cash to the outlinks that
>   are stored in its parse data. (I know that this is incredibly hard (if
>   not impossible) to do.)

Actually we store outlinks in two places - one place is obviously the segments. The other, less obvious, place is the linkdb - the data is there, it just needs to be inverted (again). So, theoretically, we could modify the updatedb process to consider the complete webgraph, i.e. all link information collected so far - but the main attraction of OPIC is that it's incremental, so that you don't have to consider the whole webgraph with small incremental updates.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web / Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
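To make the history idea concrete, a toy simulation - nothing Nutch-specific, a fixed three-page graph with uniform cash splitting and no fetch intervals: each page banks whatever flows through it into a history total, and importance is read from history rather than from the cash a page happens to be sitting on, which is what smooths out temporary sinks.

    // Toy OPIC on a fixed 3-page graph (not Nutch code): each "fetch" moves
    // a page's cash to its outlinks in equal shares; history records the
    // total flow through the page.
    public class OpicToy {
      public static void main(String[] args) {
        int[][] out = { {1, 2}, {2}, {0} };  // hypothetical link graph
        double[] cash = { 1, 1, 1 };
        double[] history = new double[out.length];
        for (int round = 0; round < 20; round++) {
          for (int p = 0; p < out.length; p++) {
            history[p] += cash[p];           // bank the flow through p
            double share = cash[p] / out[p].length;
            cash[p] = 0;                     // the paper's "adjustment"
            for (int q : out[p]) cash[q] += share;
          }
        }
        double total = 0;
        for (double h : history) total += h;
        for (int p = 0; p < out.length; p++)
          System.out.printf("page %d: importance ~ %.3f%n", p, history[p] / total);
      }
    }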
[jira] Closed: (NUTCH-510) IndexMerger delete working dir
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-510.
-------------------------------

Issue resolved and committed.

> IndexMerger delete working dir
> ------------------------------
>
>              Key: NUTCH-510
>              URL: https://issues.apache.org/jira/browse/NUTCH-510
>          Project: Nutch
>       Issue Type: Improvement
>       Components: indexer
> Affects Versions: 1.0.0
>         Reporter: Enis Soztutar
>         Assignee: Doğacan Güney
>          Fix For: 1.0.0
>
>      Attachments: index.merger.delete.temp.dirs.patch
>
> IndexMerger does not delete the working dir when an IOException is thrown
> such as No space left on device. Local temporary directories should be
> deleted.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-510) IndexMerger delete working dir
[ https://issues.apache.org/jira/browse/NUTCH-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-510.
---------------------------------

    Resolution: Fixed
      Assignee: Doğacan Güney

Committed in rev. 555307 with style modifications. I also removed two useless log guards in IndexMerger.

> IndexMerger delete working dir
> ------------------------------
>
>              Key: NUTCH-510
>              URL: https://issues.apache.org/jira/browse/NUTCH-510
>          Project: Nutch
>       Issue Type: Improvement
>       Components: indexer
> Affects Versions: 1.0.0
>         Reporter: Enis Soztutar
>         Assignee: Doğacan Güney
>          Fix For: 1.0.0
>
>      Attachments: index.merger.delete.temp.dirs.patch
>
> IndexMerger does not delete the working dir when an IOException is thrown
> such as No space left on device. Local temporary directories should be
> deleted.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
Re: OPIC scoring differences
On 7/9/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Carl Cerecke wrote:
>> Hi,
>>
>> The docs for the OPICScoringFilter mention that the plugin implements a
>> variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
>> How does the difference affect the scores?
>
> As it is now, the implementation doesn't preserve the total "cash value"
> in the system, and also there is almost no smoothing between the
> iterations (Abiteboul's "history"). As a consequence, scores may (and do)
> vary dramatically between iterations, and they don't converge to stable
> values, i.e. they always increase. For pages that get a lot of score
> contributions from other pages this leads to an explosive increase into
> the range of thousands or eventually millions.
>
> This means that the scores produced by the OPIC plugin exaggerate score
> differences between pages more and more, even if the web graph that you
> crawl is stable. In a sense, to follow the "cash" analogy, our
> implementation of OPIC illustrates a runaway economy - galloping
> inflation, the rich get richer and the poor get poorer ;)
>
>> Also, there's a comment in the code:
>>
>> // XXX (ab) no adjustment? I think this is contrary to the algorithm descr.
>> // XXX in the paper, where page "loses" its score if it's distributed to
>> // XXX linked pages...
>>
>> Is this something that will be looked at eventually, or is the scoring
>> "good enough" at the moment without some "adjustment"?
>
> Yes, I'll start working on it when I get back from vacations. I did some
> simulations that show how to fix it (see
> http://wiki.apache.org/nutch/FixingOpicScoring, bottom of the page).

Andrzej, nice to see you working on this. There is one thing that I don't understand about your presentation. Assume that page A is the only url in our crawldb and it contains n outlinks.

t = 0 - Generate runs, A is generated.
t = 1 - Page A is fetched and its cash is distributed to its outlinks.
t = 2 - Generate runs, pages P0-Pn are generated.
t = 3 - P0-Pn are fetched and their cash is distributed to their outlinks.
      - At this time, it is possible that page Pk links to page A. So, now
        page A's cash > 0.
t = 4 - Generate runs, page A is considered but is not generated (since its
        next fetch time is later than the current time).

- Won't page A become a temporary sink? Time between subsequent fetches may be as large as 30 days in the default configuration. So, page A will accumulate cash for a long time without distributing it.

- I don't see how we can achieve that, but, IMO, if a page is considered but not generated, Nutch should distribute its cash to the outlinks that are stored in its parse data. (I know that this is incredibly hard (if not impossible) to do.)

Or am I missing something here?

--
Doğacan Güney
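For a feel of the runaway-economy effect described above, toy numbers rather than Nutch code: two pages linking to each other, where each cycle credits the receiver without debiting the sender - the missing "adjustment" from the XXX comment.

    double[] score = { 1, 1 };   // two pages, each linking to the other
    for (int round = 1; round <= 5; round++) {
      double s0 = score[0], s1 = score[1];
      score[0] = s0 + s1;        // receiver gains, sender keeps its score
      score[1] = s1 + s0;
      System.out.printf("round %d: total = %.0f%n", round, score[0] + score[1]);
    }
    // The total doubles every round (4, 8, 16, ...). With the paper's
    // adjustment, the sender's cash is zeroed after distribution and the
    // total stays at 2.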
[jira] Updated: (NUTCH-506) Nutch should delegate compression to Hadoop
[ https://issues.apache.org/jira/browse/NUTCH-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-506:
--------------------------------

    Attachment: NUTCH-506.patch

New version. I missed ProtocolStatus and ParseStatus. This patch updates them in a backward-compatible way.

> Nutch should delegate compression to Hadoop
> -------------------------------------------
>
>          Key: NUTCH-506
>          URL: https://issues.apache.org/jira/browse/NUTCH-506
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Doğacan Güney
>      Fix For: 1.0.0
>
>  Attachments: compress.patch, NUTCH-506.patch
>
> Some data structures within nutch (such as Content, ParseText) handle their
> own compression. We should delegate all compressions to Hadoop.
> Also, nutch should respect io.seqfile.compression.type setting. Currently
> even if io.seqfile.compression.type is BLOCK or RECORD, nutch overrides it
> for some structures and sets it to NONE (However, IMO, ParseText should
> always be compressed as RECORD because of performance reasons).

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
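What "delegating to Hadoop" looks like in practice, sketched against the Hadoop API of the time (fs, path, url, and parseText are assumed to be in scope): write the Writable plain and let SequenceFile compress according to io.seqfile.compression.type, instead of deflating inside the Writable's own write() method.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;

    Configuration conf = new Configuration();
    conf.set("io.seqfile.compression.type", "RECORD");  // or BLOCK / NONE
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, ParseText.class,
        SequenceFile.getCompressionType(conf));         // honors the setting
    writer.append(new Text(url), parseText);            // value stays plain
    writer.close();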
[jira] Resolved: (NUTCH-505) Outlink urls should be validated
[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-505.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Committed in rev. 555237.

> Outlink urls should be validated
> --------------------------------
>
>          Key: NUTCH-505
>          URL: https://issues.apache.org/jira/browse/NUTCH-505
>      Project: Nutch
>   Issue Type: Improvement
>     Reporter: Doğacan Güney
>     Assignee: Doğacan Güney
>     Priority: Minor
>      Fix For: 1.0.0
>
>  Attachments: NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation
> system that tests these urls and filters out garbage.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.