On 5 March 2012 15:59, Pablo Mendes <[email protected]> wrote:
>
>
>> I've been using
>> awk -F'\t' '($1>=3){print $0}' < lexic.tsv
>>
>> where lexic.tsv is the input to
>> org.dbpedia.spotlight.util.CreateLexicalizations - I guess now is a
>> good time to find out if I'm doing it wrong :)
>
>
> Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts came
> from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say you're doing
> it right. Do make sure you merge the (uri->sf) entries coming from
> occurrences with the ones coming from titles, redirects and disambiguations
> (TRDs), though.

Aha. I had been missing that step.

Also, while we're on this topic, I notice that things like
'[[las]]ach' are being extracted with the surface form 'las', and not
'lasach', as I'd expected. I guess it's not necessary for the DBpedia
extraction framework, and ISTR that the relevant piece of MediaWiki
was particularly horrible, but it's something that may be worth adding
to a FAQ.
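
For the FAQ, the behaviour I'd expected could be sketched roughly like
this. It is only an illustration, not code from the extraction
framework: the regex and the function name surface_forms are made up
here, and it assumes an English-style link trail (lowercase ASCII
letters straight after the closing brackets), whereas the real rule in
MediaWiki is language-specific.

```python
import re

# Assumption: the link trail is a run of lowercase ASCII letters that
# immediately follows "]]". Real MediaWiki rules vary per language.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]([a-z]*)")

def surface_forms(wikitext):
    """Yield (link target, surface form) pairs with the trail appended."""
    for m in WIKILINK.finditer(wikitext):
        target, label, trail = m.group(1), m.group(2), m.group(3)
        # A link without a pipe displays its target. An empty label
        # (the "pipe trick") is not handled and falls back to the target.
        text = label if label else target
        yield target.strip(), text + trail

if __name__ == "__main__":
    # '[[las]]ach' should give the surface form 'lasach', not 'las'.
    print(list(surface_forms("a [[las]]ach and [[bus|coach]]es")))
```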

> You can choose if you want to do it before or after
> counting. Merging before counting means that you do not give any special
> weight to TRDs. Merging after counting means that you consider TRDs to be a
> special class of mappings that deserve to be included even if they are not
> frequently occurring (e.g. helps with sparsity but may include spurious
> mappings).
>
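
To check my understanding of the before/after distinction, here is a
minimal sketch. The function names and the toy data are invented for
illustration, and the threshold of 3 is just the one from my awk
filter; the real pipeline works on the TSV files rather than on
in-memory lists.

```python
from collections import Counter

MIN_COUNT = 3  # assumption: same threshold as the awk filter quoted above

def merge_before_counting(occ_pairs, trd_pairs, min_count=MIN_COUNT):
    """TRD pairs are counted like any other pair, then thresholded."""
    counts = Counter(occ_pairs)
    counts.update(trd_pairs)
    return {pair for pair, n in counts.items() if n >= min_count}

def merge_after_counting(occ_pairs, trd_pairs, min_count=MIN_COUNT):
    """Only occurrences are thresholded; every TRD pair is kept."""
    counts = Counter(occ_pairs)
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    return frequent | set(trd_pairs)

if __name__ == "__main__":
    # Toy (uri, surface form) pairs, invented for illustration.
    occs = [("Berlin", "Berlin")] * 3 + [("Berlin", "the capital")]
    trds = [("Berlin", "Berlin, Germany")]
    print(sorted(merge_before_counting(occs, trds)))
    print(sorted(merge_after_counting(occs, trds)))
```

With the toy data, the title-derived pair is dropped when merging
before counting (it only occurs once) and kept when merging after.
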
> See (latest revision):
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>
> I do a basic concatenation there. This means that occurrences in Wikipedia
> pointing at redirects and disambiguations would be missed. Best would be to
> extend ExtractCandidateMap to already read in the occs, and do the same job
> we currently do with cut/sort/grep/sed, plus the transitive closure of URIs.
> We would love it if anybody volunteered to send us that patch.
> ( https://sourceforge.net/tracker/?func=detail&aid=3497056&group_id=399595&atid=1657035
> ) Otherwise, whenever I have some time I'll work on it and include it in the
> next release.
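
For whoever picks that up, the transitive-closure part might look
something like the sketch below. It is not what ExtractCandidateMap
does today: the names resolve and remap_occurrences, and the redirects
dict, are hypothetical, and disambiguations (which are one-to-many) are
left out entirely.

```python
def resolve(uri, redirects):
    """Follow redirect -> target links until a non-redirect URI is reached."""
    seen = {uri}
    while uri in redirects:
        uri = redirects[uri]
        if uri in seen:  # redirect loop: stop and return the current URI
            break
        seen.add(uri)
    return uri

def remap_occurrences(occ_pairs, redirects):
    """Rewrite each (uri, surface form) pair to point at the final URI."""
    return [(resolve(uri, redirects), sf) for uri, sf in occ_pairs]

if __name__ == "__main__":
    # Toy redirect chain, invented for illustration.
    redirects = {"NYC": "New_York", "New_York": "New_York_City"}
    occs = [("NYC", "NYC"), ("New_York_City", "New York")]
    print(remap_occurrences(occs, redirects))
```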

Might be worth making a list of project ideas, big and small. "I
wanted to contribute, but I didn't know where to start" is a common
enough reason given for not contributing to open source.

-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
