Re: Scoring API issues (LONG)

2007-10-18 Thread Andrzej Bialecki

Sami Siren wrote:

> Andrzej Bialecki wrote:
>
>> Hi all,
>>
>> I've been working recently on a custom scoring plugin, and I found out
>> some issues with the scoring API that severely limit the way we can
>> calculate static page scores. I'd like to restart the discussion about
>> this API, and propose some changes. Any comments or suggestions are
>> welcome!
>
> Hi,
>
> In practice I have found that sometimes it's just easier (and even
> more efficient) to write a custom MR job (yes, an additional phase in
> the process) to calculate the scores for URLs.


Same here. E.g. PageRank calc. requires running a separate job. Other 
scoring techniques that use a post-processed linkgraph also require 
running a separate MR job.



> This strategy would give users more freedom in selecting the required
> data (and algorithm), and at the same time keep the other parts of the
> process slimmer.


Right ... except the main (supposed) benefit of OPIC was that it would be 
possible to avoid running an additional analysis step - the scores were 
supposed to be re-calculated online as a part of other steps. It hasn't 
worked out this way, as we know, but this was the main motivation for 
introducing the scoring API ... although it seems more and more that 
this API is just a glorified OPIC, and it's not sufficiently re-usable 
to benefit other scoring algorithms ...



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scoring API issues (LONG)

2007-10-18 Thread Sami Siren
Andrzej Bialecki wrote:
> Hi all,
> 
> I've been working recently on a custom scoring plugin, and I found out
> some issues with the scoring API that severely limit the way we can
> calculate static page scores. I'd like to restart the discussion about
> this API, and propose some changes. Any comments or suggestions are
> welcome!

Hi,

In practice I have found that sometimes it's just easier (and even
more efficient) to write a custom MR job (yes, an additional phase in
the process) to calculate the scores for URLs.

This strategy would give users more freedom in selecting the required
data (and algorithm), and at the same time keep the other parts of the
process slimmer.
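
A minimal sketch of such a separate job's mapper, assuming a simple
inlink-degree score over LinkDb (the class name and the log-damped
formula are purely illustrative, not anything from Nutch):

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.Inlinks;

/** Emits <url, score> pairs from LinkDb; a follow-up join would
 *  fold the scores back into CrawlDb. */
public class InlinkScoreMapper extends MapReduceBase
    implements Mapper<Text, Inlinks, Text, FloatWritable> {
  public void map(Text url, Inlinks inlinks,
      OutputCollector<Text, FloatWritable> output, Reporter reporter)
      throws IOException {
    // Simple degree-based score, log-damped to compress the range.
    float score = (float) Math.log1p(inlinks.size());
    output.collect(url, new FloatWritable(score));
  }
}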

-- 
 Sami Siren


Re: Scoring API issues (LONG)

2007-09-19 Thread Andrzej Bialecki

Doğacan Güney wrote:

> On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> Doğacan Güney wrote:
>>
>>> public void prepareGeneratorConfig(MapWritable crawlDbMeta,
>>> Configuration config);
>>
>> What about the segment's metadata in prepareUpdateConfig? Following your
>> idea, we would have to pass a Map<String segmentName, MapWritable
>> metaData> ...
>
> Yeah, I think it looks good but I guess you disagree?

:) Each variant looks a bit like a hack ... The variant with Path[] 
doesn't require that you retrieve all segments' metadata first, even if 
your plugins don't use it.

On the other hand the variant with Map<String segmentName, MapWritable 
metaData> avoids the pesky issue of doing file I/O in the plugins. So 
perhaps it's a little better ...
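
For illustration, the Map-based variant of the update hook could look 
like this (parameter names are just a sketch, not a settled API):

public void prepareUpdateConfig(MapWritable crawlDbMeta,
    Map<String, MapWritable> segmentsMeta,  // segment name -> metadata
    Configuration config);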



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scoring API issues (LONG)

2007-09-18 Thread Doğacan Güney
On 9/18/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
> >> public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
> >> config);
> >> public void prepareGeneratorConfig(Path crawlDb, Configuration config);
> >> public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[]
> >> segments, Configuration config);
> >> public void prepareUpdateConfig(Path crawlDb, Path[] segments,
> >> Configuration config);
> >
> > Should we really pass Path-s to methods? IMHO, opening a file and
> > reading from it looks a bit cumbersome. I would suggest that the
> > relevant job would read the file then pass the data (MapWritable) to
> > the method. For example, prepareGeneratorConfig would look like this:
> >
> > public void prepareGeneratorConfig(MapWritable crawlDbMeta,
> > Configuration config);
>
> What about the segment's metadata in prepareUpdateConfig? Following your
> idea, we would have to pass a Map<String segmentName, MapWritable
> metaData> ...

Yeah, I think it looks good but I guess you disagree?

>
> --
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney


Re: Scoring API issues (LONG)

2007-09-18 Thread Andrzej Bialecki

Doğacan Güney wrote:


>> public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration
>> config);
>> public void prepareGeneratorConfig(Path crawlDb, Configuration config);
>> public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[]
>> segments, Configuration config);
>> public void prepareUpdateConfig(Path crawlDb, Path[] segments,
>> Configuration config);
>
> Should we really pass Path-s to methods? IMHO, opening a file and
> reading from it looks a bit cumbersome. I would suggest that the
> relevant job would read the file then pass the data (MapWritable) to
> the method. For example, prepareGeneratorConfig would look like this:
>
> public void prepareGeneratorConfig(MapWritable crawlDbMeta,
> Configuration config);

What about the segment's metadata in prepareUpdateConfig? Following your 
idea, we would have to pass a Map<String segmentName, MapWritable 
metaData> ...


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Scoring API issues (LONG)

2007-09-18 Thread Doğacan Güney
Hi,

I think the ideas here are brilliant. A big +1 from me. I have one
minor suggestion that I detail below.

On 9/13/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've been working recently on a custom scoring plugin, and I found out
> some issues with the scoring API that severely limit the way we can
> calculate static page scores. I'd like to restart the discussion about
> this API, and propose some changes. Any comments or suggestions are welcome!
>
>
> Currently we use the ScoringFilters API in several places in the code,
> in order to adjust page scores during different operations (such as
> generating, parsing, updating the db, and indexing). The API can be
> divided into two distinct parts - the part that prioritizes pages to be
> re-fetched, and the part that calculates static page scores that convey
> the idea of query-independent page "importance" or "quality". Of these
> two, the second one is more complex and requires better support in the code.
>
> One general comment: all of the issues below could be handled by
> special-purpose tools that execute several map-reduce jobs to read the
> data, build intermediate databases, and update crawldb / linkdb as
> necessary. However, it would be good to investigate if we can come up
> with an abstraction that is flexible enough to do this for different
> scoring algorithms, and still doesn't require running many additional jobs.
>
> 1. Partial link graph issue
> ---------------------------
> Historically speaking, this API was abstracted from the existing Nutch
> code, which implemented the OPIC-like scoring. As such, it works on the
> premise that it's possible to correctly calculate a static page score
> given only the following information:
>
> * page history recorded in CrawlDb,
> * new page information from the current segment,
> * partial inlink information available in the segments currently being
> processed (in updatedb).
>
> This works well for the OPIC (if even there - the jury's still out ;) ),
> but not for many other scoring algorithms, including the most popular
> ones like PageRank or HITS, or even the inlink-degree algorithm from
> Nutch 0.7. Those algorithms require processing of the complete web graph
> information (i.e. the inlink / outlink info) that's been collected so far.
>
> I propose the following change: let's update the LinkDb in the same step
> as CrawlDb, so that both are up-to-date with the most current
> information after finishing the updatedb step. In terms of the input /
> output data involved, this change does increase the amount of data to be
> processed, by the size of LinkDb. However, the benefit of this change is
> that for each page to be updated we have access to its complete inlink
> information, which enables us to use other scoring algorithms that
> require this data. Also, we don't have to run invertlinks anymore.
>
> So, the updatedb process would look like this:
>
> INPUT:
> - CrawlDb: <url, CrawlDatum>
> - LinkDb: <url, Inlinks>
> - segment/crawl_fetch: <url, CrawlDatum>
> - segment/crawl_parse: <url, CrawlDatum>
>
> MAP: simply collects the input data in records that are able to
> accommodate all this information:
>
> <url, <Inlinks oldInlinks, Inlinks newInlinks,
> CrawlDatum dbDatum, CrawlDatum fetchDatum>>
>
> REDUCE: uses a modified version of CrawlDbReducer, which first collapses
> all incoming records to a single record in the above format, i.e. it
> collects all incoming records and fills in the slots for dbDatum,
> fetchDatum, oldInlinks and newInlinks. Then we pretty much reuse the rest
> of the existing logic in CrawlDbReducer - but at the end of the reduce()
> we can pass the full inlink information to ScoringFilters.updateDbScore.
> Finally, we aggregate all inlink information and output the following
> record:
>
> <url, <CrawlDatum dbDatum, Inlinks inlinks>>
>
> OUTPUT: we use a special OutputFormat that splits output records into
> <url, CrawlDatum> and <url, Inlinks> and creates new versions of both
> CrawlDb and LinkDb.
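
A minimal sketch of the combined record such a job could shuffle
(class and field names here are illustrative, not from any patch):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

/** One record per URL, carrying every slot the reducer needs to fill. */
public class UpdateRecord implements Writable {
  public CrawlDatum dbDatum;     // existing CrawlDb entry (null for new URLs)
  public CrawlDatum fetchDatum;  // from segment/crawl_fetch
  public Inlinks oldInlinks;     // existing LinkDb entry
  public Inlinks newInlinks;     // inverted from segment/crawl_parse

  public void write(DataOutput out) throws IOException {
    // Prefix each slot with a presence flag - any of them may be unset.
    writeOptional(out, dbDatum);
    writeOptional(out, fetchDatum);
    writeOptional(out, oldInlinks);
    writeOptional(out, newInlinks);
  }

  public void readFields(DataInput in) throws IOException {
    dbDatum    = in.readBoolean() ? read(new CrawlDatum(), in) : null;
    fetchDatum = in.readBoolean() ? read(new CrawlDatum(), in) : null;
    oldInlinks = in.readBoolean() ? read(new Inlinks(), in) : null;
    newInlinks = in.readBoolean() ? read(new Inlinks(), in) : null;
  }

  private static void writeOptional(DataOutput out, Writable w)
      throws IOException {
    out.writeBoolean(w != null);
    if (w != null) w.write(out);
  }

  private static <T extends Writable> T read(T w, DataInput in)
      throws IOException {
    w.readFields(in);
    return w;
  }
}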
>
> 2. Lack of global properties
> ----------------------------
> Neither CrawlDb, nor LinkDb, nor segments keep around the most basic
> global statistics about them. Currently, in order to tell how many pages
> we have in the db it's necessary to run a mapred job (readdb -stats) -
> even though this information is static and could've been calculated in
> advance for a given generation of db-s or for each segment. This
> complicates even simple tasks such as readseg -list, and makes it
> difficult to keep around global score-related statistics for db-s and
> segments.
>
> So, I propose to add a metadata file located in a well-known location
> inside the db or segment directory. The file would be based on a single
> MapWritable that contains arbitrary key/value pairs, including predefined
> ones such as the number of records, last update time etc. We would need
> to maintain it for each db and each segment. Each operation that changes
> a db or a segment would update this information.
>
> In practical terms, I propose to add static metho
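
A minimal sketch of what such static helper methods might look like
(the file name "metadata", the class name, and the use of Hadoop's
MapWritable rather than Nutch's own are all assumptions):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;

public class DbMetadata {
  // Well-known file name inside a db or segment dir (assumed name).
  public static final String META_FILE = "metadata";

  /** Rewrite the whole metadata map for a db or segment. */
  public static void write(FileSystem fs, Path dir, MapWritable meta)
      throws IOException {
    FSDataOutputStream out = fs.create(new Path(dir, META_FILE), true);
    try { meta.write(out); } finally { out.close(); }
  }

  /** Read the metadata map; returns an empty map if none exists yet. */
  public static MapWritable read(FileSystem fs, Path dir)
      throws IOException {
    Path file = new Path(dir, META_FILE);
    MapWritable meta = new MapWritable();
    if (!fs.exists(file)) return meta;
    FSDataInputStream in = fs.open(file);
    try { meta.readFields(in); } finally { in.close(); }
    return meta;
  }
}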