Hi,
the cleaned Danish Wikipedia file contained these unwanted characters:

__NOTOC__

on a separate line somewhere in the middle of the text. It ought to be
discarded by the cleaning script.
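
A minimal sketch of how such a line could be dropped, assuming the
cleaned file is plain text read line by line. The name
strip_magic_words and the pattern are made up for illustration and are
not part of the existing cleaning script; the pattern only covers
upper-case ASCII switches such as __NOTOC__, so localized aliases, if a
wiki uses any, would have to be added:

```python
import re

# Hypothetical helper, not part of the existing cleaning script.
# MediaWiki behavior switches such as __NOTOC__ or __NOEDITSECTION__ are
# upper-case ASCII letters wrapped in double underscores.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(lines):
    """Yield lines with behavior switches removed.

    A line that contained nothing but a switch is dropped entirely;
    lines that were already empty are passed through unchanged.
    """
    for line in lines:
        had_switch = MAGIC_WORD.search(line) is not None
        cleaned = MAGIC_WORD.sub("", line)
        if had_switch and not cleaned.strip():
            continue  # the line held only a switch, e.g. "__NOTOC__"
        yield cleaned
```

It could be applied to the lines of the cleaned file before they are
written out again.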

Yours,
Per Tunedal

On Sat, Sep 14, 2013, at 15:18, Per Tunedal wrote:

Hi,
thank you. It works like a charm for Wikipedia, Wikivoyage and
Wikibooks, as far as I can see.

But, NO, it doesn't work for the Wiktionary. I get output that looks
OK, but it doesn't include the full articles. Further, it includes
explanations for foreign words as well.

I tried:

bzcat svwiktionary-20130909-pages-articles-multistream.xml.bz2 | python WikiExtractor.py -o output

etc.

And got:

jag älskar dig.
Fras.
jag älskar dig
I love you.
Fras.
I love you
ich liebe dich.
Fras.
ich liebe dich
jeg elsker dig.
Fras.
jeg elsker dig

etc.

Cf. the original Wiktionary:
http://sv.wiktionary.org/wiki/Wiktionary:Huvudsida

One more example:

Look at the word "användbarhet": it's in the Wiktionary, but I cannot
find it in the extracted file.

And the last one, look at the word "slå":

I get:
slå.
Grammatik.
I talspråk förekommer i supinum istället för "slagit"/"slagits" även
böjningarna "slått"/"slåtts" och i vissa trakter "slatt"/"slatts"
(jämför "tatt" istället för "tagit") men dessa brukar undvikas i
skrift.

That's only a tiny fraction of the original article:
http://sv.wiktionary.org/wiki/sl%C3%A5

Yours,
Per Tunedal

On Sat, Sep 14, 2013, at 3:22, Gang Chen wrote:

Hi,

The script reads from stdin and writes to stdout, so it needs an input
redirect ("<") from a file and an output redirect (">") to a file. The
following will do the work:

python cleanHTML.py < svwiktionary.text >  svwiktionary.filter.text
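
To illustrate why the redirects are needed, here is a rough sketch of a
filter written in that stdin-to-stdout style. It is an assumption about
the general shape only: the actual cleanHTML.py may work differently,
and the names strip_tags and filter_stream are hypothetical.

```python
import re
import sys

# Illustrative only; the real cleanHTML.py may be implemented differently.
# Matches anything that looks like a tag: "<", one or more non-">" chars, ">".
# Note that this would also swallow plain text such as "a < b and c > d".
TAG = re.compile(r"<[^>]+>")


def strip_tags(line):
    """Return the line with every <...> span removed."""
    return TAG.sub("", line)


def filter_stream(infile, outfile):
    """Copy infile to outfile line by line, stripping tags on the way."""
    for line in infile:
        outfile.write(strip_tags(line))
```

Calling filter_stream(sys.stdin, sys.stdout) at the end of such a
script is what makes the "<" and ">" redirects do the work: the shell
connects the two files to stdin and stdout, and the script itself never
opens a file by name.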

Btw, I only tested it on Wikipedia, so I'm not sure whether it works
for Wiktionary too.
It would be great if you sent a message about whether it does, and
I'll update the wiki page.
Thanks!
2013/9/14 Per Tunedal <[1]per.tune...@operamail.com>

Hi again,
the extractor is already finished.

I overlooked a line in your instructions (maybe I'm too tired):

cat output/*/* > svwiktionary.text
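
A rough Python equivalent of that shell line, for systems without cat.
This is only a sketch: the function name concatenate is made up, and it
assumes the extractor writes its files exactly two levels below the
output directory, as the output/*/* pattern suggests.

```python
from pathlib import Path


def concatenate(output_dir, target):
    """Append every file found two levels below output_dir to target.

    Mirrors `cat output/*/* > target`; files are taken in sorted path
    order. The target should live outside output_dir.
    """
    with open(target, "wb") as out:
        for path in sorted(Path(output_dir).glob("*/*")):
            if path.is_file():
                # Copy raw bytes so the text encoding is left untouched.
                out.write(path.read_bytes())
```

For example, concatenate("output", "svwiktionary.text") would produce
the single file that the cleaning script expects.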

Now I'm running the cleaning script:

python cleanHTML.py svwiktionary.text

I will give you a report when it has finished.

Yours,
Per Tunedal



On Fri, Sep 13, 2013, at 18:11, Per Tunedal wrote:

Hi,
Thank you! Your Wikipedia Extractor is running right now. I will look
for the result in an hour.

How do I use the script for filtering out "<>" tags? I've saved it as a
Python file. Do I have to run it separately for every single file in the
output directory? Can't I just process every file in the directory in
one go, by just indicating the directory?

Yours,
Per Tunedal



On Fri, Sep 13, 2013, at 2:54, Gang Chen wrote:

Hi,
1) Is it possible to make some kind of Wikipedia dump?

This tool works fine for extracting the main text from Wikipedia,
[2]http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor

Best wishes,
Gang
2013/9/13 Per Tunedal <[3]per.tune...@operamail.com>

  Hi,
  I'm planning to try to train the tagger for the pair sv-da, starting
  with Swedish.
  What's an appropriate corpus? Europarl is available but doesn't
  provide much of the everyday language.
  1) Is it possible to make some kind of Wikipedia dump?
  2) Lars, maybe you could suggest some free books from the Runeberg
  project that have a suitable language. I've noticed that some old
  books have old word forms or very odd spelling (e.g. August
  Strindberg has a very peculiar spelling).
  Yours,
  Per Tunedal
  --------------------------------------------------------------------
  ----------
  How ServiceNow helps IT people transform IT departments:
  1. Consolidate legacy IT systems to a single system of record for IT
  2. Standardize and globalize service processes across IT
  3. Implement zero-touch automation to replace manual, redundant
  tasks
  [4]http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/o
  stg.clktrk
  _______________________________________________
  Apertium-stuff mailing list
  [5]Apertium-stuff@lists.sourceforge.net
  [6]https://lists.sourceforge.net/lists/listinfo/apertium-stuff

-----------------------------------------------------------------------
-------
LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
1,500+ hours of tutorials including VisualStudio 2012, Windows 8,
SharePoint
2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack
includes
Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/22/13.
[19]http://pubads.g.doubleclick.net/gampad/clk?id=64545871&iu=/4140/ost
g.clktrk
_______________________________________________
Apertium-stuff mailing list
[20]Apertium-stuff@lists.sourceforge.net
[21]https://lists.sourceforge.net/lists/listinfo/apertium-stuff

References

1. mailto:per.tune...@operamail.com
2. http://wiki.apertium.org/wiki/User:Gang_Chen/Wikipedia_Extractor
3. mailto:per.tune...@operamail.com
4. http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
5. mailto:Apertium-stuff@lists.sourceforge.net
6. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
7. http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
8. mailto:Apertium-stuff@lists.sourceforge.net
9. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
10. http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
11. mailto:Apertium-stuff@lists.sourceforge.net
12. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
13. http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
14. mailto:Apertium-stuff@lists.sourceforge.net
15. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
16. http://pubads.g.doubleclick.net/gampad/clk?id=51271111&iu=/4140/ostg.clktrk
17. mailto:Apertium-stuff@lists.sourceforge.net
18. https://lists.sourceforge.net/lists/listinfo/apertium-stuff
19. http://pubads.g.doubleclick.net/gampad/clk?id=64545871&iu=/4140/ostg.clktrk
20. mailto:Apertium-stuff@lists.sourceforge.net
21. https://lists.sourceforge.net/lists/listinfo/apertium-stuff