Hi Reinhard,
Thanks for catching that.
The real solution, in my opinion, would be to use Apache Commons CLI to read
command-line parameters that specify which column holds the surface form
when the input type is TSV. If you want to contribute that patch, I will
incorporate it.
However, as a quick fix you can also use awk, perl or a similar tool to swap
the two fields in the file:
cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 \
  | perl -F/\\t/ -lane 'print "$F[1]\t$F[0]";' > output/surfaceForms-fromOccs.tsv
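
Note that cut always writes the selected fields in the order they appear in the input, so "-f 3,2" produces the same output as "-f 2,3" and does not swap anything. Below is a minimal sketch that shows this on a made-up sample line and does the swap with awk alone. The sample values and the columns other than 2 (URI) and 3 (surface form) are assumptions for illustration only, not taken from the real file.

```shell
# Hypothetical sample line: column 2 = URI, column 3 = surface form.
# The remaining columns are invented for this illustration.
sample=$(mktemp)
printf 'occ1\tBerlin_Wall\tBerlin Wall\tsome context\t10\n' > "$sample"

# cut emits fields in input order, so "-f 3,2" behaves exactly like "-f 2,3".
cut_out=$(cut -f 3,2 "$sample")

# awk prints fields in whatever order you list them: surface form, then URI.
awk_out=$(awk -F'\t' 'BEGIN { OFS = "\t" } { print $3, $2 }' "$sample")

printf 'cut: %s\nawk: %s\n' "$cut_out" "$awk_out"
rm -f "$sample"

# Applied to the real file, the same awk call would be:
# awk -F'\t' 'BEGIN { OFS = "\t" } { print $3, $2 }' output/occs.uriSorted.tsv \
#   > output/surfaceForms-fromOccs.tsv
```

The awk variant avoids the extra cut and perl steps, but it is equivalent to the pipeline above only if the surface form really is in column 3 of your occs file.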
Cheers,
Pablo
On Sat, Mar 24, 2012 at 12:14 PM, reinhard schwab <[email protected]> wrote:
> On 05.03.2012 16:59, Pablo Mendes wrote:
> > Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts
> > came from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say
> > you're doing it right. Do make sure you merge the (uri->sf) entries
> > coming from occurrences with the ones coming from titles, redirects
> > and disambiguations (TRDs), though. You can choose if you want to do
> > it before or after counting. Merging before counting means that you do
> > not give any special weight to TRDs. Merging after counting means that
> > you consider TRDs to be a special class of mappings that deserve to be
> > included even if they are not frequently occurring (e.g. helps with
> > sparsity but may include spurious mappings).
> >
> > See (latest revision):
> >
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
> >
>
> hi pablo,
>
> i have just discovered a minor problem with this script
>
>
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/getSurfaceFormMapFromOccs.sh
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 >
> output/surfaceForms-fromOccs.tsv
>
>
> IndexLingPipeSpotter expects the surface forms at index 0.
> But this tool here writes the surface form to index 1 and the title to
> index 0.
> Finally i end up with dictionary entries containing underscores (_) when
> combining surface forms from TitRedDis and Occs.
> A very simple fix would be to change the line to
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 3,2 >
> output/surfaceForms-fromOccs.tsv
>
> or?
>
> best regards
> reinhard
>
>
>