On 23 April 2013 12:01, Julien Plu <julien....@redaction-developpez.com> wrote:
> Sorry, but I really don't understand how the AST works (or Scala either). I am
> trying to retrieve all the PropertyNode objects contained in a PageNode, so I do:
> override def extract(page: PageNode, subjectUri: String, pageContext:
> PageContext): Seq[Quad] = {
>     if (page.title.namespace != Namespace.Template || page.isRedirect ||
> !page.title.decoded.contains("évolution population")) return Seq.empty

I think it would be good if you could get a picture of the structure
of the tree. It's usually not complicated, but a bit hard to explain
in text. Can you use a debugger? If so, set a breakpoint at the
following line and let the debugger show the page variable. Then click
into it, look at its children, and so on.

We should add a toString() method to Node.scala (and some sub-classes)
that shows the structure.
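
A rough sketch of what such a structure dump could look like. Everything here is an assumption for illustration: the case classes are simplified stand-ins, not the real classes in org.dbpedia.extraction.wikiparser (which have more fields and different constructors), and dump is a hypothetical name.

```scala
// Simplified stand-ins for the wikiparser node classes (hypothetical, illustration only).
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object TreeDump {
  // Renders one line per node, indented by two spaces per tree level.
  def dump(node: Node, depth: Int = 0): String = {
    val label = node match {
      case TextNode(text)               => s"TextNode($text)"
      case PropertyNode(key, _)         => s"PropertyNode($key)"
      case ParserFunctionNode(title, _) => s"ParserFunctionNode($title)"
      case PageNode(title, _)           => s"PageNode($title)"
    }
    val head = ("  " * depth) + label
    // The node's own line first, then the rendered subtrees of its children.
    (head :: node.children.map(dump(_, depth + 1))).mkString("\n")
  }
}
```

Calling println(TreeDump.dump(page)) on a tree built from these stand-ins prints the page at the left margin and each child one level further to the right, which is roughly the picture the debugger gives.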

>     for (node <- page.children) {
>         for (property <- allPropertiesNode(node)) {
>             println(property.toWikiText)
>         }
>     }
> }
> private def allPropertiesNode(node : Node) : List[PropertyNode] = {
>     node match {
>         case propertyNode : PropertyNode => List(propertyNode)
>         case _ = node.children
>    }

This is almost right. If I understand correctly, you want to walk
through the whole tree and collect all property nodes. Change this

    case _ = node.children

(does that even compile? I don't understand how... :-) ) to

    case _ => node.children.flatMap(allPropertiesNode)

(I think that should work, I'm not 100% sure.)

Oh by the way, the method name should be allPropertyNodes. :-) Or
maybe findPropertyNodes is even better.

Once the method works, you can drop the main loop in extract(). Instead of

    for (node <- page.children) {
        for (property <- allPropertiesNode(node)) {

you can just write

    for (property <- findPropertyNodes(page)) {

But that's just cosmetic surgery, it has the same effect.
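
To make the recursion concrete, here is a minimal, self-contained sketch. The node classes are simplified stand-ins invented for this example, not the real wikiparser classes, so take it as an illustration of the pattern rather than code that drops into the extractor unchanged.

```scala
// Simplified stand-ins for the wikiparser node classes (hypothetical, illustration only).
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object PropertySearch {
  // Walks the tree depth-first and collects every PropertyNode it reaches.
  // The search stops at a PropertyNode: properties nested inside another
  // property's children are not visited.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case property: PropertyNode => List(property)
    case _                      => node.children.flatMap(findPropertyNodes)
  }
}
```

With this in place, the single loop over findPropertyNodes(page) visits the same property nodes as the nested loops did. If nested properties matter, the first case would also have to recurse into property.children.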


> }
> And nothing is displayed on my screen :-(
> Any idea what I'm doing wrong?
> Best.
> Julien.
> 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
>> Hi,
>> "param" comes from a bad copy-paste; the correct variable is "pop".
>> By the way, thank you for the hint about the AST. I will take a look at these
>> classes and see how I can use them. I won't hesitate to ask if I get stuck :-)
>> Best.
>> Julien.
>> 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>> Hi Julien,
>>> On 22 April 2013 21:43, Julien Plu <julien....@redaction-developpez.com>
>>> wrote:
>>> > I started the code for the extractor and I have a problem with the regex
>>> > in Scala. The string is:
>>> >
>>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>>> >
>>> > And my regex is : val populationRegex = """|pop=(\d+)""".r
>>> >
>>> > And I use this piece of code :
>>> >
>>> > populationRegex findAllIn  page.children.toString foreach (_ match {
>>> >     case populationRegex (pop) => println(page.title.decoded + " : pop : " + param)
>>> What is param?
>>> But more generally - did you try using the AST (abstract syntax tree)
>>> built by the parser, i.e. the tree whose root node is the PageNode?
>>> I'm not sure how good our parser is at dealing with stuff like
>>> "<includeonly>" and "{{#switch ...}}", but I think it works and
>>> page.children should contain a ParserFunctionNode [1] object for the
>>> #switch, which in turn has a child for each branch, e.g. one child for
>>> an=2010 and one for pop=61793. These children are PropertyNode [2]
>>> objects, which have a key and (who would have thought) more children.
>>> Well, in this case, just one child, which is a TextNode. In a
>>> nutshell: Find the "#switch" node, find children with keys "an" and
>>> "pop", and generate triples for their values.
>>> >     case _ =>
>>> > })
>>> >
>>> > And instead of getting "Données/Antony/évolution population : pop : 61793"
>>> > just once,
>>> >
>>> > I get "Données/Antony/évolution population : pop : null" many times, as many
>>> > as there are lines in the string.
>>> >
>>> > Any idea what I'm doing wrong?
>>> >
>>> > I'm a total beginner in Scala :-( sorry.
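
One likely explanation for the repeated null, offered as a sketch rather than a verified diagnosis: in a regex, an unescaped | means alternation, so the quoted pattern reads as "empty string OR pop=(\d+)". The empty alternative matches at every position and leaves the capture group unset. The sample text below is an assumption about what the template source roughly looks like, and the object and method names are hypothetical.

```scala
object PopulationRegex {
  // As quoted: the leading | makes the empty string a valid first alternative.
  val unescaped = """|pop=(\d+)""".r
  // Escaping the pipe makes it a literal character that must appear before "pop=".
  val escaped = """\|pop=(\d+)""".r

  // Returns the digits captured after each literal "|pop=" in the text.
  def populations(text: String): List[String] =
    escaped.findAllMatchIn(text).map(_.group(1)).toList
}
```

With the escaped pattern, only the real "|pop=..." branch matches and group 1 holds the digits. The AST approach still avoids this kind of pitfall altogether.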
>>> Your code excerpt looks pretty good to me. :-)
>>> The AST is usually much safer and cleaner than regexes. Regexes are
>>> more suitable for unstructured strings, but here you're dealing with
>>> pretty clean structures. So I would suggest you write some code that
>>> walks through the PageNode tree. If you have any questions, don't
>>> hesitate to ask. We're looking forward to your contributions. Thanks!
>>> Cheers,
>>> JC
>>> [1]
>>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
>>> [2]
>>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
>>> >
>>> > Best.
>>> >
>>> > Julien.
>>> >
>>> >
>>> > 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>> >>
>>> >> The templates where data is stored are not used directly in the main
>>> >> pages. It's a complicated process: page Toulouse uses template X, X uses Y,
>>> >> Y uses Z, and Z contains the data. Something like that, I'm not 100% sure,
>>> >> but the details don't matter. This means that wikiPageUsesTemplate and
>>> >> InfoboxExtractor won't help.
>>> >>
>>> >> Generating a separate file is probably the best idea. We could also send
>>> >> these new triples to the main mapping based file, but that might be
>>> >> confusing: first, they're not mapping based; second, new triples about a
>>> >> city would be added in a completely different place in the file. (That's
>>> >> not a big problem though.)
>>> >>
>>> >> Cheers,
>>> >> JC
>>> >
>>> >

