Hi,
Thanks for the feedback!
Tokens such as "__NOTOC__" are special MediaWiki markers (behavior switches) used on Wikipedia.
Similar ones include "__TOC__", "__FORCETOC__", "__STATICREDIRECT__",
"__NOEDITSECTION__", etc.
I've updated the filtering script on the wiki page and added a section
listing which types of wiki the tool can handle.
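
For illustration, here is a minimal sketch of how such tokens can be dropped. It is not the script from the wiki page; the names MAGIC_WORD_LINE and drop_magic_word_lines are made up for this example, and it assumes the token sits alone on its line, as in the report below.

```python
import re

# Matches a line that consists solely of a behavior switch such as
# __NOTOC__ or __FORCETOC__ (capital letters between double underscores).
MAGIC_WORD_LINE = re.compile(r"^\s*__[A-Z]+__\s*$")


def drop_magic_word_lines(lines):
    """Yield only the lines that are not a bare behavior switch."""
    for line in lines:
        if not MAGIC_WORD_LINE.match(line):
            yield line


# Made-up sample input: the middle line should disappear.
sample = ["first line\n", "__NOTOC__\n", "second line\n"]
print("".join(drop_magic_word_lines(sample)), end="")
```

A token that appears inside a longer line is left untouched by this sketch; only whole-line tokens are removed.
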
Good luck with your tagger!
Best wishes,
Gang
2013/9/14 Per Tunedal <per.tune...@operamail.com>
> Hi,
> the cleaned Danish Wikipedia file contained these unwanted characters:
>
> __NOTOC__
>
> on a separate line somewhere in the middle of the text. It ought to be
> discarded by the cleaning script.
>
> Yours,
> Per Tunedal
>
> On Sat, Sep 14, 2013, at 15:18, Per Tunedal wrote:
>
> Hi,
> thank you. It works like a charm for Wikipedia, Wikivoyage and Wikibooks,
> as far as I can see.
>
> But, NO, it doesn't work for the Wiktionary. I get output that looks OK,
> but it doesn't include the full articles. Further, it includes explanations
> for foreign words as well.
>
> I tried:
>
> bzcat svwiktionary-20130909-pages-articles-multistream.xml.bz2 | python WikiExtractor.py -o output
>
> etc.
>
> And got:
>
>
> jag älskar dig.
> Fras.
> jag älskar dig
>
>
> I love you.
> Fras.
> I love you
>
>
> ich liebe dich.
> Fras.
> ich liebe dich
>
>
> jeg elsker dig.
> Fras.
> jeg elsker dig
>
> etc.
>
> cf. the original Wiktionary:
> http://sv.wiktionary.org/wiki/Wiktionary:Huvudsida
>
> One more example:
>
> Look at the word "användbarhet": it's in the Wiktionary, but I cannot find
> it in the extracted file.
>
> And the last one, look at the word "slå":
>
> I get:
> slå.
> Grammatik.
> I talspråk förekommer i supinum istället för "slagit"/"slagits" även
> böjningarna "slått"/"slåtts" och i vissa trakter "slatt"/"slatts" (jämför
> "tatt" istället för "tagit") men dessa brukar undvikas i skrift.
>
> That's only a tiny fraction of the original article:
> http://sv.wiktionary.org/wiki/sl%C3%A5
>
>
> Yours,
> Per Tunedal
>
> On Sat, Sep 14, 2013, at 3:22, Gang Chen wrote:
>
> Hi,
>
> The script reads from stdin and writes to stdout, so it needs an input
> redirect ("<") from a file and an output redirect (">") to a file. The
> following will do the job:
>
> python cleanHTML.py < svwiktionary.text > svwiktionary.filter.text
>
> Btw, I have only tested it on Wikipedia, so I'm not sure whether it works
> for Wiktionary too.
> It would be great if you could let me know whether it works for
> Wiktionary, and I'll update the wiki page accordingly.
> Thanks!
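
The redirect-based invocation quoted above can be sketched as a small stdin-to-stdout filter. This is only an illustration under assumptions: the contents of cleanHTML.py are not shown in this thread, the names TAG, strip_tags and filter_stream are made up here, and the pattern simply treats anything between "<" and ">" on one line as a tag.

```python
import io
import re

# Hypothetical tag pattern: "<", then any characters except "<" or ">", then ">".
TAG = re.compile(r"<[^<>]*>")


def strip_tags(line):
    """Remove "<...>" tags from a single line of text."""
    return TAG.sub("", line)


def filter_stream(src, dst):
    """Copy src to dst line by line, stripping tags on the way."""
    for line in src:
        dst.write(strip_tags(line))


# In a real script one would call filter_stream(sys.stdin, sys.stdout), so that
# "python script.py < in.text > out.text" selects the files via shell redirects.
# The demo below uses in-memory streams and made-up input instead.
demo_in = io.StringIO('<doc id="1" title="demo">\nsome text.\n</doc>\n')
demo_out = io.StringIO()
filter_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because the filter never opens a file itself, a file name passed as a plain argument would be ignored, which is why the redirects are needed.
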
>
>
> 2013/9/14 Per Tunedal <per.tune...@operamail.com>
>
> Hi again,
> the extractor is already finished.
>
> I overlooked a line in your instructions (maybe I'm too tired):
>
> cat output/*/* > svwiktionary.text
>
> Now I'm running the cleaning script:
>
> python cleanHTML.py svwiktionary.text
>
> I will give you a report when it has finished.
>
> Yours,
> Per Tunedal
>
>
>
> On Fri, Sep 13, 2013, at 18:11, Per Tunedal wrote:
>
> Hi,
> Thank you! Your Wikipedia Extractor is running right now. I will look for
> the result in an hour.
>
> How do I use the script for filtering out "<>" tags? I've saved it as a
> Python file. Do I have to run it separately for every single file in the
> output directory? Can't I just process every file in the directory in one
> go, by simply specifying the directory?
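
One way to avoid running the filter once per file is to merge the extractor output first, which is what the "cat output/*/*" step quoted earlier in this thread does. A rough Python equivalent is sketched below. The function name concatenate is made up for this example, and the directory and file names in the demo are placeholders, not a statement about what the extractor actually writes.

```python
import glob
import os
import tempfile


def concatenate(pattern, target):
    """Write every file matching the glob pattern into target, in sorted order."""
    with open(target, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            if os.path.isfile(path):
                with open(path, encoding="utf-8") as src:
                    out.write(src.read())


# Demo on a throwaway directory laid out as output/<subdir>/<file>.
with tempfile.TemporaryDirectory() as root:
    for sub, name, text in [("AA", "part_00", "one\n"), ("AA", "part_01", "two\n")]:
        os.makedirs(os.path.join(root, "output", sub), exist_ok=True)
        with open(os.path.join(root, "output", sub, name), "w", encoding="utf-8") as f:
            f.write(text)
    merged = os.path.join(root, "merged.text")
    concatenate(os.path.join(root, "output", "*", "*"), merged)
    with open(merged, encoding="utf-8") as f:
        result = f.read()

print(result, end="")
```

The merged file can then be passed through the filter in a single run.
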
>
> Yours,
> Per Tunedal
>
>
>
> On Fri, Sep 13, 2013, at 2:54, Gang Chen wrote:
>
> Hi,
>
> 1) Is it possible to make some kind of Wikipedia dump?
>
> This tool works fine for extracting the main text from Wikipedia:
> http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
>
>
> Best wishes,
> Gang
>
>
> 2013/9/13 Per Tunedal <per.tune...@operamail.com>
>
> Hi,
> I'm planning to try to train the tagger for the pair sv-da, starting
> with Swedish.
>
> What's an appropriate corpus? Europarl is available but doesn't provide
> much of the everyday language.
>
> 1) Is it possible to make some kind of Wikipedia dump?
> 2) Lars, maybe you could suggest some free books from the Runeberg
> project that are written in suitable language. I've noticed that some old
> books have old word forms or very odd spelling (e.g. August Strindberg has
> a very peculiar spelling).
>
> Yours,
> Per Tunedal
>
>
> ------------------------------------------------------------------------------
> How ServiceNow helps IT people transform IT departments:
> 1. Consolidate legacy IT systems to a single system of record for IT
> 2. Standardize and globalize service processes across IT
> 3. Implement zero-touch automation to replace manual, redundant tasks
> http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
> _______________________________________________
> Apertium-stuff mailing list
> Apertium-stuff@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/apertium-stuff