Hi,

Thanks for the feedback!

The "__NOTOC__"-style tokens are special markers in Wikipedia's markup
(MediaWiki calls them "magic words"). Similar ones include "__TOC__",
"__FORCETOC__", "__STATICREDIRECT__", "__NOEDITSECTION__", etc.

I've updated the filtering script on the wiki page and added a section
listing which types of wiki the tool can handle.
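
For illustration, here is a minimal sketch of how such tokens could be
filtered out. It is not necessarily what the script on the wiki page does:
the function name strip_magic_words and the regular expression are my own
assumptions, and the pattern simply treats any run of uppercase letters
between double underscores as one of these markers.

```python
import re

# Assumption: a marker is two underscores, uppercase letters, two underscores,
# e.g. __NOTOC__ or __NOEDITSECTION__. This may also match unrelated text.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text):
    """Remove __NOTOC__-style tokens and drop lines that held only a token."""
    cleaned = []
    for line in text.splitlines():
        stripped = MAGIC_WORD.sub("", line)
        # Drop a line that became empty because of the removal,
        # but keep lines that were already blank in the input.
        if stripped.strip() == "" and line.strip() != "":
            continue
        cleaned.append(stripped)
    return "\n".join(cleaned)
```

A line containing only "__NOTOC__" is dropped entirely, while a token in
the middle of a sentence is removed and the rest of the line is kept.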

Good luck with your tagger!


Best wishes,
Gang


2013/9/14 Per Tunedal <per.tune...@operamail.com>

>   Hi,
>  the cleaned Danish Wikipedia file contained these unwanted characters:
>
>  __NOTOC__
>
>  on a separate line somewhere in the middle of the text. It ought to be
> discarded by the cleaning script.
>
>  Yours,
>  Per Tunedal
>
>  On Sat, Sep 14, 2013, at 15:18, Per Tunedal wrote:
>
>      Hi,
>  thank you. It works like a charm for Wikipedia, Wikivoyage and Wikibooks,
> as far as I can see.
>
>  But, NO, it doesn't work for the Wiktionary. I get output that looks OK,
> but it doesn't include the full articles. Further, it includes explanations
> for foreign words as well.
>
>  I tried:
>
>  bzcat svwiktionary-20130909-pages-articles-multistream.xml.bz2 | python WikiExtractor.py -o output
>
>  etc.
>
>  And got:
>
>
> jag älskar dig.
> Fras.
> jag älskar dig
>
>
> I love you.
> Fras.
> I love you
>
>
> ich liebe dich.
> Fras.
> ich liebe dich
>
>
> jeg elsker dig.
> Fras.
> jeg elsker dig
>
>  etc.
>
>  cf the original wiktionary:
> http://sv.wiktionary.org/wiki/Wiktionary:Huvudsida
>
>  One more example:
>
>  Look at the word "användbarhet": it's in the Wiktionary, but I cannot find
> it in the extracted file.
>
>  And the last one, look at the word "slå":
>
>  I get:
>  slå.
> Grammatik.
> I talspråk förekommer i supinum istället för "slagit"/"slagits" även
> böjningarna "slått"/"slåtts" och i vissa trakter "slatt"/"slatts" (jämför
> "tatt" istället för "tagit") men dessa brukar undvikas i skrift.
>
>  That's only a tiny fraction of the original article:
>  http://sv.wiktionary.org/wiki/sl%C3%A5
>
>
>  Yours,
>  Per Tunedal
>
>  On Sat, Sep 14, 2013, at 3:22, Gang Chen wrote:
>
>  Hi,
>
>  The script reads from stdin and writes to stdout, so it needs an input
> redirect ("<") from a file and an output redirect (">") to a file. The
> following will do the job:
>
>  python cleanHTML.py < svwiktionary.text >  svwiktionary.filter.text
>
>  Btw, I have only tested it on Wikipedia, so I'm not sure whether it works
> for Wiktionary too.
>  It would be great if you could let me know whether it does, and I'll
> update the wiki page accordingly.
>  Thanks!
>
>
>  2013/9/14 Per Tunedal <per.tune...@operamail.com>
>
>    Hi again,
>  the extractor is already finished.
>
>  I overlooked a line in your instructions (maybe I'm too tired):
>
>  cat output/*/* > svwiktionary.text
>
>  Now I'm running the cleaning script:
>
>  python cleanHTML.py svwiktionary.text
>
>  I will give you a report when it has finished.
>
>  Yours,
>  Per Tunedal
>
>
>
>  On Fri, Sep 13, 2013, at 18:11, Per Tunedal wrote:
>
>   Hi,
>  Thank you! Your Wikipedia Extractor is running right now. I will look for
> the result in an hour.
>
>  How do I use the script for filtering out "<>" tags? I've saved it as a
> Python file. Do I have to run it separately for every single file in the
> output directory? Can't I just process every file in the directory in one
> go, by simply pointing it at the directory?
>
>  Yours,
>  Per Tunedal
>
>
>
>  On Fri, Sep 13, 2013, at 2:54, Gang Chen wrote:
>
>  Hi,
>
> 1) Is it possible to make some kind of Wikipedia dump?
>
> This tool works fine for extracting the main text from Wikipedia,
> http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
>
>
>  Best wishes,
> Gang
>
>
>  2013/9/13 Per Tunedal <per.tune...@operamail.com>
>
> Hi,
> I'm planning to try to train the tagger for the pair sv-da, starting
> with Swedish.
>
> What's an appropriate corpus? Europarl is available but doesn't provide
> much of the everyday language.
>
> 1) Is it possible to make some kind of Wikipedia dump?
> 2) Lars, maybe you could suggest some free books from the Runeberg
> project that have suitable language. I've noticed that some old books
> have old word forms or very odd spelling (e.g. August Strindberg has a
> very peculiar spelling).
>
> Yours,
> Per Tunedal
>
>
> ------------------------------------------------------------------------------
> How ServiceNow helps IT people transform IT departments:
> 1. Consolidate legacy IT systems to a single system of record for IT
> 2. Standardize and globalize service processes across IT
> 3. Implement zero-touch automation to replace manual, redundant tasks
> http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff