hi pablo,
i have consulted the awk documentation; it seems that
awk -F "\t" '{print $2, $1}' OFS='\t'
does reorder the input fields.
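
A minimal sketch of the difference, assuming a tab-separated file whose second column is the URI and whose third column is the surface form. The sample row and its first column are made up for illustration. As far as I can tell from the documentation, cut always emits the selected fields in input order, so listing them as 3,2 changes nothing, while awk rebuilds each record in the order given to print:

```shell
# Hypothetical one-line sample: <placeholder id, uri, surfaceForm>, tab-separated.
tmp=$(mktemp)
printf 'id1\tBerlin_Wall\tBerlin Wall\n' > "$tmp"

# cut keeps the input order of the selected fields, so -f 3,2 acts like -f 2,3.
cut_out=$(cut -f 3,2 "$tmp")

# awk prints the fields in the requested order, joined by OFS (a tab here),
# so the surface form ends up at index 0.
awk_out=$(cut -f 2,3 "$tmp" | awk -F '\t' '{print $2, $1}' OFS='\t')

printf '%s\n%s\n' "$cut_out" "$awk_out"
rm -f "$tmp"
```

In the script itself the same awk call would simply be appended to the existing cut pipeline before the redirect to output/surfaceForms-fromOccs.tsv.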
best regards
reinhard
On 24.03.2012 12:37, Pablo Mendes wrote:
>
> Hi Reinhard,
> Thanks for catching that.
>
> The real solution, in my opinion, would be using Apache CLI to read
> parameters from the command line that tell you which column
> should be used as the surface form when the input type is TSV. If you
> want to contribute that patch, I will incorporate it.
>
> However, as a quick fix you can also use awk, perl or whatever to
> invert the fields in the file:
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 | perl -F/\\t/ -lane 'print "$F[1]\t$F[0]";' > output/surfaceForms-fromOccs.tsv
>
> Cheers,
> Pablo
>
> On Sat, Mar 24, 2012 at 12:14 PM, reinhard schwab
> <[email protected]> wrote:
>
> On 05.03.2012 16:59, Pablo Mendes wrote:
> > Right. If lexic.tsv contains <count,uri,surfaceForm>, and these counts
> > came from the Wikipedia paragraphs (occs.uriSorted.tsv), then I'd say
> > you're doing it right. Do make sure you merge the (uri->sf) entries
> > coming from occurrences with the ones coming from titles, redirects
> > and disambiguations (TRDs), though. You can choose if you want to do
> > it before or after counting. Merging before counting means that you do
> > not give any special weight to TRDs. Merging after counting means that
> > you consider TRDs to be a special class of mappings that deserve to be
> > included even if they are not frequently occurring (e.g. helps with
> > sparsity but may include spurious mappings).
> >
> > See (latest revision):
> > https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>
> hi pablo,
>
> i have just discovered a minor problem with this script
>
>
> https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/getSurfaceFormMapFromOccs.sh
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 > output/surfaceForms-fromOccs.tsv
>
>
> IndexLingPipeSpotter expects the surface forms at index 0,
> but this script writes the surface form to index 1 and the title to
> index 0.
> As a result i end up with dictionary entries containing underscores (_)
> when combining surface forms from TitRedDis and Occs.
> A very simple fix would be to change the line to
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 3,2 > output/surfaceForms-fromOccs.tsv
>
> or?
>
> best regards
> reinhard
>
>
>
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users