I'm trying to use the WikiParser to determine the category list of a
Wikipedia page.
The category tags are represented as TextNode objects, but when I print
their toWikiText, I get an empty string. Should categories be TextNodes, and
if so, what's the correct way to extract the category name from the wiki page?
Input data:
{{Colonial Colleges}}
[[Category:New York| ]]
[[Category:Former British colonies]]
[[Category:States of the United States]]
{{Link FA|es}}
My code snippet:
val testFile = new java.io.File("src/test/resources/datasource/wikipedia/xml/new_york.xml")
val parser = WikiParser()
val xmlSource = XMLSource.fromFile(testFile)
xmlSource.foreach { wikiPage =>
  val page = parser.apply(wikiPage)
  page.children.foreach { node =>
    node match {
      case template: TemplateNode => {
        println("template:" + template.title + " with " +
          template.children.size + " children")
      }
      case section: SectionNode => println("section:" + section.toWikiText)
      case text: TextNode => {
        println("text:" + text.toWikiText + " line: " + text.line)
      }
      case link: LinkNode => {
        // the label is computed here but never printed
        val label = link.children.map(_.toWikiText).mkString("")
      }
      case x => println("class= " + x.getClass)
    }
  }
}
Output:
text:
line: 939
template:en:Template:Colonial Colleges with 0 children
text:
line: 940
text:
line: 942
text:
line: 943
text:
line: 944
template:en:Template:Link FA with 1 children
text:
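
One thing worth noting about the snippet: the LinkNode case builds a label but
never prints it, so if the parser hands back the [[Category:...]] links as
LinkNode objects they would be invisible in this output, and the TextNodes
shown may just be the line breaks between them. I have not confirmed that
against the parser. As a parser-independent fallback, here is a minimal
sketch that pulls category names straight out of the raw wikitext with a
regex. CategoryNames and extract are names made up for illustration, not part
of the extraction framework, and the pattern assumes the English "Category:"
prefix only.

```scala
object CategoryNames {
  // Matches [[Category:Name]] and [[Category:Name|sort key]].
  // Group 1 captures the name, without the optional sort key.
  private val CategoryLink =
    """\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]""".r

  // Hypothetical helper: returns category names in order of appearance.
  def extract(wikiText: String): List[String] =
    CategoryLink.findAllMatchIn(wikiText).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    // Same input data as quoted above.
    val sample =
      """{{Colonial Colleges}}
        |[[Category:New York| ]]
        |[[Category:Former British colonies]]
        |[[Category:States of the United States]]
        |{{Link FA|es}}""".stripMargin
    extract(sample).foreach(println)
  }
}
```

This ignores localized namespace names and anything inside nowiki or comment
blocks, so it is only a stopgap until the node-based approach works.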
--
@tommychheng
http://tommy.chheng.com
_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion