Invenio Developers Forum - no forum today

2010-12-20 Thread Tibor Simko
Hello:

There will be no Invenio Developers Forum today.

Best regards
-- 
Tibor Simko


Re: BibClassify with RDF and MySQL store

2010-12-20 Thread Ferran Jorba
Hello,

> On Sat, Dec 18, 2010 at 12:33 PM, Samuele Kaplun wrote:
>> Hi Roman,
>>
>> On Sat, 18/12/2010 at 12.17 +0100, Roman Chyla wrote:
>>> I agree this is cool, but something doesn't fit, at least I don't
>>> understand how this could be used for the task of bibclassify, the
>>> dict is good if you know (more or less) what you are looking for, but
>>> the task of bibclassify is to find entities inside the fulltext - and
>>> to find that out, bibclassify has to search for it - and it is not
>>> exactly the same thing as the spell checking. I must be missing
>>> something, could you explain to me what advantage at all there would
>>> be in using the dict? As a fast cache of single level entries? I could
>>> see how it would be more useful for the cache, citation links etc.,
>>> but not for bibclassify.

I suggested looking at dict for these reasons:

1. It doesn't need 24 GB of RAM to start working, regardless of the size
   of the corpora. ;-)
2. It easily permits shared and reused corpora.
3. The protocol itself is easy to understand, not unlike HTTP.
4. The *meaning* of the returned value is up to the client, not unlike
   the Unix way of doing things.  In dict, you just return data, it is
   up to you to interpret it.  You can tag relations, codes, etc.
5. Integrating a dict client in the Python Invenio code has a small cost:
   dicoclient.py from the dico client (see
   http://packages.debian.org/dico or
   http://puszcza.gnu.org.ua/software/dico/) is only 13 KB and has no
   dependencies other than the standard Python libs.
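
To illustrate points 3 and 4, here is a minimal sketch of what a client
has to do with a DICT (RFC 2229) reply.  It does not use dicoclient.py
and does not open a network connection; the function names, the `hep`
database name and the sample reply are made up for illustration.

```python
def define_command(database, word):
    """Build the DEFINE command line for one lookup (RFC 2229)."""
    # Quote the word so that multi-token entries travel as one argument.
    return 'DEFINE %s "%s"\r\n' % (database, word)


def parse_define_reply(lines):
    """Parse the reply to a DEFINE command.

    `lines` holds the reply lines without their trailing CRLF.
    Returns one string per definition; an empty list means no match.
    """
    definitions = []
    body = None  # None while we are outside a 151 text block
    for line in lines:
        if body is not None:
            if line == ".":  # a lone dot closes the text block
                definitions.append("\n".join(body))
                body = None
            else:
                # A leading dot in the text is doubled on the wire.
                body.append(line[1:] if line.startswith("..") else line)
        elif line.startswith("151 "):  # a definition text follows
            body = []
        elif line.startswith(("250", "552")):  # ok / no match
            break
    return definitions


# Hypothetical reply, as a server might send it for an ontology entry.
sample_reply = [
    "150 1 definitions retrieved",
    '151 "quark" hep "HEP ontology (hypothetical database)"',
    "quark",
    "   broader: elementary particle",
    ".",
    "250 ok",
]
print(parse_define_reply(sample_reply))
```

The returned text is opaque to the protocol: whether "broader:" marks a
relation, a code or something else is entirely the client's convention.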

>> I am not that aware of how BibClassify works right now, but if its final
>> goal is to look for the most frequent keywords (from a controlled set)
>> inside a fulltext, then, postponing the issue of the grammar (plurals,
>> genders, conjugations :-S), I think that it would indeed be possible to
>> use dictd in a way orthogonal to how we currently use ontologies.
>>
>> Currently, for each word in the ontology (correct me if I am wrong), we
>> look at how many times it appears in the text.
>>
>> On the other hand with dict, we might simply take all the words in the
>> text, and filter them against the dictionary (which is built after the
>> ontology), and then sum up the occurrences of repeated words.
>
> OK, I see what you mean - could work, but would work mean 'improved'?
>
> If you take an average of 3000 words times the real time reported for
> lookup above:
>
> 3000 * 0.004 = 12s
> or
> 3000 * 0.006 = 18s
>
> that is two to three times slower than the current bibclassify
> implementation (in case of HEP).
>
> It could be faster for bigger dictionaries, like Eurovoc, because
> bibclassify will slow down -- or if we manage to cut down the lookup
> time (by making it local process?)

With dict you can use a stateless connection (like HTTP), but you can
also reuse an already opened session, so the latency should be better.
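
The filtering idea quoted above can be sketched roughly as follows.
This is only an illustration under assumptions: `is_keyword` is a
hypothetical callable standing in for one dict lookup (it could wrap a
reused session), and the toy vocabulary replaces a real ontology.  The
sketch counts the words first, so each distinct word is looked up only
once instead of once per occurrence.

```python
import re
from collections import Counter


def keyword_counts(text, is_keyword):
    """Count occurrences of controlled-vocabulary words in `text`.

    `is_keyword` is called once per distinct word, not once per token.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if is_keyword(word)}


vocabulary = {"quark", "boson"}  # toy stand-in for the ontology
text = "The quark and the boson: one quark, one boson, one more quark."
print(keyword_counts(text, vocabulary.__contains__))
```

Plurals and other grammar variations are deliberately ignored here, as
is the question of keywords made of several tokens.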

>> The two methods should accomplish the same goal (if I am not wrong on
>> BibClassify algorithm) but the latter should be in principle extremely
>> fast, unless the grammar issue is the bottleneck.
>
> in principle, direct lookups must be replaced by some approximate
> lookups (btw, I think dictd could handle grammar variations better
> than the current regex patterns, so that would be a gain) - but it
> will return more entries in many cases, then it is necessary to choose
> the right one. Might be easy for limited domains - for Eurovoc, you
> will need some sort of disambiguation
>
> Another interesting problem is the single keyword made of several
> tokens, like 'search engine' in the sentence:
>
> Invenio comes with its own search engine implementation?
>
> will you ask for:
> 1. invenio
> 2. comes
> 
> 6. search
> 7. engine
> 8. implementation
>
>  -- somehow combine 6+7 based on the responses?
>
> or create collocations and ask for them (will double the number of
> lookups, and does not skip inserted words)
> Invenio comes
> comes with
> ...
> search engine
>
>
> Don't get me wrong, dictd is cool. I am just saying it is a tiny bit
> more complicated.

Maybe.  I don't know the details of BibClassify, sorry, and I wasn't
advocating rewriting all of BibClassify using dict, of course.  What I'm
suggesting is that the dictionaries, ontologies, etc. that you need for
BibClassify to work could be served and easily reused with dict, more
cheaply, faster and maybe better than with SQL or Solr or whatever other
alternative.
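
For the keywords made of several tokens mentioned in the quoted text,
the collocation approach could look roughly like the sketch below.  The
function name and the `max_len` parameter are made up for illustration;
as noted in the quote, adjacent pairs roughly double the number of
lookups and do not catch keywords with inserted words.

```python
def candidate_terms(tokens, max_len=2):
    """Return the single tokens plus adjacent collocations.

    Collocations are runs of up to `max_len` neighbouring tokens.
    """
    terms = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


sentence = "invenio comes with its own search engine implementation"
tokens = sentence.split()
terms = candidate_terms(tokens)
print(len(tokens), len(terms))    # 8 tokens give 15 candidate lookups
print("search engine" in terms)   # True
```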

Thanks,

Ferran

PS: Eric Lease Morgan did some experiments a while ago using dict to
    serve the LC Authorities Catalog, see
    http://serials.infomotions.com/code4lib/archive/2008/200803/0557.html