Re: [Dbpedia-discussion] Problem with extracted data

Julien Plu Tue, 23 Apr 2013 06:01:10 -0700

No, I can't use a debugger because I'm coding on a remote machine via ssh.

And even with this code:


override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
    // Only handle non-redirect template pages whose title contains "évolution population".
    if (page.title.namespace != Namespace.Template || page.isRedirect ||
        !page.title.decoded.contains("évolution population")) return Seq.empty

    for (property <- findPropertyNodes(page)) {
        println(property.toWikiText)
    }
    Seq.empty // no quads yet; this version only prints the properties for debugging
}

// Recursively collects every PropertyNode in the subtree rooted at node.
private def findPropertyNodes(node: Node): List[PropertyNode] = {
    node match {
        case propertyNode: PropertyNode => List(propertyNode)
        case _ => node.children.flatMap(findPropertyNodes)
    }
}

Absolutely nothing is displayed, because the list returned by
"findPropertyNodes" is empty, and I don't understand why. I know it's empty
because if I do this:

if (findPropertyNodes(page).isEmpty) {
    println("empty")
} else {
    println("not empty")
}

then "empty" is displayed, whereas if I print "page.children" I get all
the template code. So the "findPropertyNodes" function doesn't find any
property inside this template code :-(
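
Since I can't attach a debugger over ssh, one way to look at the tree would
be to print it. Below is a minimal sketch of that idea. The classes in it
(Node, TextNode, PropertyNode, ParserFunctionNode, PageNode) are simplified,
hypothetical stand-ins, not the real wikiparser classes, and the "dump"
helper is also hypothetical; only the recursion pattern is the point.

```scala
object AstSketch {
  // Toy stand-ins for the framework's node classes (simplified, hypothetical).
  sealed trait Node { def children: List[Node] }
  case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
  case class PropertyNode(key: String, children: List[Node]) extends Node
  case class ParserFunctionNode(title: String, children: List[Node]) extends Node
  case class PageNode(title: String, children: List[Node]) extends Node

  // One line per node; the indentation shows the depth in the tree.
  def dump(node: Node, depth: Int = 0): List[String] = {
    val label = node match {
      case TextNode(text)               => s"TextNode($text)"
      case PropertyNode(key, _)         => s"PropertyNode($key)"
      case ParserFunctionNode(title, _) => s"ParserFunctionNode($title)"
      case PageNode(title, _)           => s"PageNode($title)"
    }
    ("  " * depth + label) :: node.children.flatMap(child => dump(child, depth + 1))
  }

  // Same recursive collection as findPropertyNodes in the extractor above.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case p: PropertyNode => List(p)
    case _               => node.children.flatMap(findPropertyNodes)
  }

  def main(args: Array[String]): Unit = {
    // Hand-built tree shaped like the #switch template described in this thread.
    val page = PageNode("Données/Antony/évolution population", List(
      ParserFunctionNode("#switch", List(
        PropertyNode("an", List(TextNode("2010"))),
        PropertyNode("pop", List(TextNode("61793")))))))
    dump(page).foreach(println)
    println(findPropertyNodes(page).map(_.key))
  }
}
```

On the real PageNode, the same kind of dump would show which node classes
the parser actually produced for the template. If no PropertyNode appears
anywhere in the output, an empty result from "findPropertyNodes" would be
expected, and the problem would be in what the parser builds rather than in
the recursion.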

Best.

Julien.



2013/4/23 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

> On 23 April 2013 12:01, Julien Plu <julien....@redaction-developpez.com>
> wrote:
> > Sorry, but I really don't understand how the AST works (or Scala either).
> > I'm trying to retrieve all the PropertyNode objects contained in a
> > PageNode, so I do:
> >
> >
> > override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
> >     if (page.title.namespace != Namespace.Template || page.isRedirect ||
> > !page.title.decoded.contains("évolution population")) return Seq.empty
> >
>
> I think it would be good if you could get a picture of the structure
> of the tree. It's usually not complicated, but a bit hard to explain
> in text. Can you use a debugger? If so, set a breakpoint at the
> following line and let the debugger show the page variable. Then click
> into it, look at its children, and so on.
>
> We should add a toString() method to Node.scala (and some sub-classes)
> that shows the structure.
>
> >     for (node <- page.children) {
> >         for (property <- allPropertiesNode(node)) {
> >             println(property.toWikiText)
> >         }
> >     }
> > }
> >
> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
> >     node match {
> >         case propertyNode : PropertyNode => List(propertyNode)
> >         case _ = node.children
> >    }
>
> This is almost right. If I understand correctly, you want to walk
> through the whole tree and collect all property nodes. Change this
> line:
>
>     case _ = node.children
>
> (does that even compile? I don't understand how... :-) ) to
>
>     case _ => node.children.flatMap(allPropertiesNode)
>
> (I think that should work, I'm not 100% sure.)
>
> Oh by the way, the method name should be allPropertyNodes. :-) Or
> maybe findPropertyNodes is even better.
>
> Once the method works, you can drop the main loop in extract(). Instead of
>
> for (node <- page.children) {
>     for (property <- allPropertiesNode(node)) {
>         println(property.toWikiText)
>     }
> }
>
> you can just write
>
> for (property <- findPropertyNodes(page)) {
>     println(property.toWikiText)
> }
>
> But that's just cosmetic surgery, it has the same effect.
>
> Cheers,
> JC
>
> > }
> >
> >
> > And nothing is displayed on my screen :-(
> >
> > Any idea what I'm doing wrong?
> >
> > Best.
> >
> > Julien.
> >
> >
> > 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
> >>
> >> Hi,
> >>
> >> "param" comes from a bad copy-paste; the correct variable is "pop".
> >>
> >> By the way, thank you for the hint about the AST. I will take a look at
> >> these classes and see how I can use them. I won't hesitate to ask if I'm
> >> blocked :-)
> >>
> >> Best.
> >>
> >> Julien.
> >>
> >>
> >> 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>>
> >>> Hi Julien,
> >>>
> >>> On 22 April 2013 21:43, Julien Plu <julien....@redaction-developpez.com> wrote:
> >>> > I started the code for the extractor and I have a problem with the
> >>> > regex in Scala. The string is:
> >>> >
> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
> >>> >
> >>> > And my regex is : val populationRegex = """|pop=(\d+)""".r
> >>> >
> >>> > And I use this piece of code :
> >>> >
> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
> >>> >     case populationRegex(pop) => println(page.title.decoded + " : pop : " + param)
> >>>
> >>> What is param?
> >>>
> >>> But more generally - did you try using the AST (abstract syntax tree)
> >>> built by the parser, i.e. the tree whose root node is the PageNode?
> >>> I'm not sure how good our parser is at dealing with stuff like
> >>> "<includeonly>" and "{{#switch ...}}", but I think it works and
> >>> page.children should contain a ParserFunctionNode [1] object for the
> >>> #switch, which in turn has a child for each branch, e.g. one child for
> >>> an=2010 and one for pop=61793. These children are PropertyNode [2]
> >>> objects, which have a key and (who would have thought) more children.
> >>> Well, in this case, just one child, which is a TextNode. In a
> >>> nutshell: Find the "#switch" node, find children with keys "an" and
> >>> "pop", and generate triples for their values.
> >>>
> >>> >     case _ =>
> >>> > })
> >>> >
> >>> > And instead of getting "Données/Antony/évolution population : pop :
> >>> > 61793" just once, I get "Données/Antony/évolution population : pop :
> >>> > null" many times, as many as there are lines in the string.
> >>> >
> >>> > Any idea what I'm doing wrong?
> >>> >
> >>> > I'm a total beginner in Scala :-( sorry.
> >>>
> >>> Your code excerpt looks pretty good to me. :-)
> >>>
> >>> The AST is usually much safer and cleaner than regexes. Regexes are
> >>> more suitable for unstructured strings, but here you're dealing with
> >>> pretty clean structures. So I would suggest you write some code that
> >>> walks through the PageNode tree. If you have any questions, don't
> >>> hesitate to ask. We're looking forward to your contributions. Thanks!
> >>>
> >>> Cheers,
> >>> JC
> >>>
> >>> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
> >>> [2] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
> >>>
> >>> >
> >>> > Best.
> >>> >
> >>> > Julien.
> >>> >
> >>> >
> >>> > 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>> >>
> >>> >> The templates where data is stored are not used directly in the main
> >>> >> pages. It's a complicated process: page Toulouse uses template X, X
> >>> >> uses Y, Y uses Z, and Z contains the data. Something like that, I'm
> >>> >> 100% sure, but the details don't matter. This means that
> >>> >> wikiPageUsesTemplate and InfoboxExtractor won't help.
> >>> >>
> >>> >> Generating a separate file is probably the best idea. We could also
> >>> >> send these new triples to the main mapping based file, but that might
> >>> >> be confusing: first, they're not mapping based; second, new triples
> >>> >> about a city would be added in a completely different place in the
> >>> >> file. (That's not a big problem though.)
> >>> >>
> >>> >> Cheers,
> >>> >> JC
> >>> >
> >>> >
> >>
> >>
> >
>
