I've been using
> awk -F'\t' '($1>=3){print $0}' < lexic.tsv
>
> where lexic.tsv is the input to
> org.dbpedia.spotlight.util.CreateLexicalizations - I guess now is a
> good time to find out if I'm doing it wrong :)
>
Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts came
from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say you're
doing it right. Do make sure you merge the (uri->sf) entries coming from
occurrences with the ones coming from titles, redirects and disambiguations
(TRDs), though. You can choose whether to do this before or after
counting. Merging before counting means that you do not give any special
weight to TRDs. Merging after counting means that you consider TRDs to be a
special class of mappings that deserve to be included even if they are not
frequently occurring (e.g. helps with sparsity but may include spurious
mappings).
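To make the two strategies concrete, here is a minimal sketch. The file names
and formats are assumptions for illustration only: occs.tsv holds one
<uri, surfaceForm> pair per Wikipedia occurrence, and trd.tsv holds the pairs
coming from titles, redirects and disambiguations.

```shell
# Tiny demo inputs (hypothetical file names; real data comes from the
# Wikipedia occurrence extraction and the TRD candidate map):
printf 'Berlin\tBerlin\nBerlin\tBerlin\nBerlin\tcapital of Germany\n' > occs.tsv
printf 'Berlin\tBerlin (city)\n' > trd.tsv

# Turn `uniq -c` output ("   2 uri<TAB>sf") into "2<TAB>uri<TAB>sf".
recount='{ c = $1; sub(/^ *[0-9]+ /, ""); print c "\t" $0 }'

# Merge BEFORE counting: TRD pairs are counted like any other occurrence,
# so they receive no special weight.
cat occs.tsv trd.tsv | sort | uniq -c | awk "$recount" > lexic.before.tsv

# Merge AFTER counting: count real occurrences only, then append the TRD
# pairs with a nominal count of 1 so they survive even if they never
# occur in a paragraph.
sort occs.tsv | uniq -c | awk "$recount" > counts.tsv
awk '{ print "1\t" $0 }' trd.tsv | cat counts.tsv - > lexic.after.tsv
```

With the merge-after variant, a frequency filter like the awk one-liner above
would still drop the appended TRD pairs unless you filter before appending
them, or give them a count above your threshold.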
See (latest revision):
https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
I do a basic concatenation there, which means that occurrences in Wikipedia
pointing at redirects and disambiguations will be missed. The best fix would
be to extend ExtractCandidateMap to read in the occurrences directly, do the
same job we currently do with cut/sort/grep/sed, and compute the transitive
closure of URIs. We would love it if anybody volunteered to send us that patch. (
https://sourceforge.net/tracker/?func=detail&aid=3497056&group_id=399595&atid=1657035)
Otherwise, whenever I have some time I'll work on it and include it in
the next release.
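For anyone considering that patch, the transitive-closure step could be
sketched like this in awk. The file name and format are assumptions for
illustration: redirects.tsv maps <sourceUri, targetUri>, and a chain like
A -> B -> C should resolve to A -> C.

```shell
# Demo input (hypothetical format: sourceUri<TAB>targetUri)
printf 'A\tB\nB\tC\nD\tC\n' > redirects.tsv

# Resolve each source URI to its final redirect target by following
# the chain until the target is not itself a redirect source.
awk -F'\t' '
  { to[$1] = $2 }
  END {
    for (src in to) {
      t = to[src]; hops = 0
      # the hop limit guards against redirect cycles in dirty data
      while ((t in to) && hops++ < 100) t = to[t]
      print src "\t" t
    }
  }
' redirects.tsv > resolved.tsv
```

The same resolved map could then be applied to the occurrence URIs before
counting, so occurrences pointing at redirect pages are credited to the
canonical target.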
Cheers,
Pablo
On Mon, Mar 5, 2012 at 4:17 PM, Jimmy O'Regan <[email protected]> wrote:
> On 5 March 2012 11:57, Pablo Mendes <[email protected]> wrote:
> > Hi Reinhard,
> > We've assumed that you would have filtered the URIs before you've created
> > the index, as this seems to be the most space/time efficient solution.
> >
> > On which of the two alternatives below do you intend to filter?
> > 1. c(uri) --number of occurrences of a given URI
> > 2. c(sf,uri) -- number of occurrences of a given sf->uri pair
> >
> > You could easily do c(uri) because that's usually stored in the index.
> > However, c(sf,uri) does not go to the context index anymore. In my dev
> > branch, it goes to the candidate index, though. But that one is built
> from a
> > TSV file, and it would be much easier to filter directly from that.
> >
>
> I've been using
> awk -F'\t' '($1>=3){print $0}' < lexic.tsv
>
> where lexic.tsv is the input to
> org.dbpedia.spotlight.util.CreateLexicalizations - I guess now is a
> good time to find out if I'm doing it wrong :)
>
> --
> <Sefam> Are any of the mentors around?
> <jimregan> yes, they're the ones trolling you
>
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users