hi pablo,

I have consulted the awk documentation, and it seems that

awk -F "\t" '{print $2, $1}' OFS='\t'

reorders the input fields.
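
A minimal sketch to check this on a single line. The sample entry (Berlin_Wall / Berlin Wall) is made up for illustration and not taken from occs.uriSorted.tsv, and OFS is passed with -v here, which should behave the same as the trailing assignment. As far as I can tell, cut writes the selected fields in their input order no matter how -f lists them, so only awk actually swaps them:

```shell
# Hypothetical sample line in <title, surfaceForm> order, tab-separated.
sample="$(printf 'Berlin_Wall\tBerlin Wall')"

# cut emits the selected fields in input order, so -f 2,1 changes nothing.
with_cut="$(printf '%s\n' "$sample" | cut -f 2,1)"

# awk prints the fields in the order named in the print statement,
# joined by OFS (set to a tab here).
with_awk="$(printf '%s\n' "$sample" | awk -F '\t' -v OFS='\t' '{print $2, $1}')"

printf 'cut: %s\nawk: %s\n' "$with_cut" "$with_awk"
```

If that holds, the cut -f 3,2 variant would leave the columns as they are, while the awk line puts the surface form at index 0.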

best regards
reinhard

On 24.03.2012 12:37, Pablo Mendes wrote:
>
> Hi Reinhard,
> Thanks for catching that.
>
> The real solution, in my opinion, would be to use Apache Commons CLI
> to read command-line parameters that tell you which column should be
> used as the surface form when the input type is TSV. If you want to
> contribute that patch, I will incorporate it.
>
> However, as a quick fix you can also use awk, perl or whatever to
> invert the fields in the file:
>
> cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 | perl -F/\\t/
> -lane 'print "$F[1]\t$F[0]";' > output/surfaceForms-fromOccs.tsv
>
> Cheers,
> Pablo
>
> On Sat, Mar 24, 2012 at 12:14 PM, reinhard schwab
> <[email protected]> wrote:
>
>     On 05.03.2012 16:59, Pablo Mendes wrote:
>     > Right. If lexic.tsv contains <count,uri,surfaceForm>, and these
>     > counts came from the Wikipedia paragraphs (occs.uriSorted.tsv),
>     > then I'd say you're doing it right. Do make sure you merge the
>     > (uri->sf) entries coming from occurrences with the ones coming
>     > from titles, redirects and disambiguations (TRDs), though. You
>     > can choose whether to do it before or after counting. Merging
>     > before counting means that you do not give any special weight to
>     > TRDs. Merging after counting means that you consider TRDs to be
>     > a special class of mappings that deserve to be included even if
>     > they are not frequently occurring (this helps with sparsity but
>     > may include spurious mappings).
>     >
>     > See (latest revision):
>     >
>     > https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/index.sh
>     >
>
>     hi pablo,
>
>     I have just discovered a minor problem with this script:
>
>     https://spotlight.svn.sourceforge.net/svnroot/dbp-spotlight/trunk/bin/getSurfaceFormMapFromOccs.sh
>
>     cat output/occs.uriSorted.tsv | cut -d$'\t' -f 2,3 >
>     output/surfaceForms-fromOccs.tsv
>
>
>     IndexLingPipeSpotter expects the surface forms at index 0,
>     but this tool writes the title to index 0 and the surface form to
>     index 1.
>     As a result, I end up with dictionary entries containing
>     underscores (_) when combining surface forms from TitRedDis and Occs.
>     A very simple fix would be to change the line to
>
>     cat output/occs.uriSorted.tsv | cut -d$'\t' -f 3,2 >
>     output/surfaceForms-fromOccs.tsv
>
>     Would that work?
>
>     best regards
>     reinhard
>
>
>

_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users