Re: [Dbpedia-discussion] Problem with extracted data

Julien Plu Wed, 24 Apr 2013 12:49:58 -0700

OK, I will add your comments tomorrow!

Best.


Julien.


2013/4/24 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

> Cool, thanks! Your code looks good. I peppered your pull request with
> a few comments. None of them are major problems, but if you have time,
> please have a look at them. If you don't have time, please copy the
> comments into TODO comments in the code and we can fix them later.
>
> On 24 April 2013 16:34, Julien Plu <julien....@redaction-developpez.com>
> wrote:
> > You were right, Jona: the problem came from a wrong path in the package.
> >
> > I just sent a new pull request with my extractor :-)
> >
> > Best.
> >
> > Julien.
> >
> >
> > 2013/4/23 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>
> >> Hi Julien,
> >>
> >> On 23 April 2013 23:16, Julien Plu <julien....@redaction-developpez.com>
> >> wrote:
> >> > OK, I'm finished: I now have an extractor that works as we expected :-)
> >> > I don't think what I did is well made, but it works.
> >>
> >> Cool! You can always improve it later. :-)
> >>
> >> >
> >> > Anyway, only one problem remains: if I move my "PopulationExtractor.scala"
> >> > file from the "mappings" folder into an "fr" folder inside "mappings", the
> >> > extraction configuration file fails because it cannot find the
> >> > "PopulationExtractor" class, no matter whether I write
> >> > "fr.PopulationExtractor" or
> >> > "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea
> >> > what's going on?
> >>
> >> Does the package declaration in the class file include the ".fr"?
> >> Scala is less strict than Java here.
> >>
> >> If you send a pull request, we can have a look at your code and merge
> >> it into the main repository, so others can run this extraction as
> >> well.
> >>
> >> https://github.com/dbpedia/extraction-framework/wiki/Contributing
> >>
> >> >
> >> > One last thing: I added a dataset in the file "DBpediaDatasets.scala",
> >> > so that I have my own archive containing only the population information.
> >>
> >> Right, that's one more thing you need to add.
> >>
> >> Thanks!
> >>
> >> JC
> >>
> >>
> >> >
> >> > Best.
> >> >
> >> > Julien.
> >> >
> >> >
> >> > 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
> >> >>
> >> >> Yes, I know IDEs are really useful, but my work machine runs Windows,
> >> >> which I'm really not familiar with. So I use a Linux distribution in a
> >> >> virtual machine, but this VM is too slow for coding in a graphical IDE,
> >> >> so I have to connect to it over SSH and use only the shell.
> >> >>
> >> >> I think I will force myself to use Windows; that will be easier than
> >> >> continuing to work like this :-D
> >> >>
> >> >> By the way, I found the problem in my code. It came from my regex:
> >> >> instead of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I get
> >> >> the value I want :-)
> >> >>
> >> >> Best.
> >> >>
> >> >> Julien.
> >> >>
> >> >>
> >> >> 2013/4/23 Dimitris Kontokostas <jimk...@gmail.com>
> >> >>>
> >> >>> You should use an IDE for this, it will make your life a lot easier ;)
> >> >>> I use the IntelliJ IDEA default debugger and it works pretty well. I
> >> >>> could send you instructions to set it up.
> >> >>>
> >> >>> Best,
> >> >>> Dimitris
> >> >>>
> >> >>>
> >> >>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
> >> >>> <julien....@redaction-developpez.com> wrote:
> >> >>>>
> >> >>>> No, I don't have a debugger, because I'm coding on a remote machine
> >> >>>> via SSH.
> >> >>>>
> >> >>>> And even with this code:
> >> >>>>
> >> >>>>
> >> >>>> override def extract(page: PageNode, subjectUri: String,
> >> >>>>     pageContext: PageContext): Seq[Quad] = {
> >> >>>>     if (page.title.namespace != Namespace.Template ||
> >> >>>>         page.isRedirect ||
> >> >>>>         !page.title.decoded.contains("évolution population"))
> >> >>>>         return Seq.empty
> >> >>>>
> >> >>>>     for (property <- findPropertyNodes(page)) {
> >> >>>>         println(property.toWikiText)
> >> >>>>     }
> >> >>>> }
> >> >>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
> >> >>>>
> >> >>>>     node match {
> >> >>>>         case propertyNode : PropertyNode => List(propertyNode)
> >> >>>>         case _ => node.children.flatMap(findPropertyNodes)
> >> >>>>     }
> >> >>>> }
> >> >>>>
> >> >>>> Absolutely nothing is displayed, because the list returned by
> >> >>>> "findPropertyNodes" is empty, and I don't understand why. I know it
> >> >>>> is empty because of this check:
> >> >>>>
> >> >>>> if (findPropertyNodes(page).isEmpty) {
> >> >>>>     println("empty")
> >> >>>> }
> >> >>>> else {
> >> >>>>     println("no empty")
> >> >>>> }
> >> >>>>
> >> >>>> "empty" is displayed, whereas if I print "page.children" I get all
> >> >>>> the template code. Yet the "findPropertyNodes" function doesn't find
> >> >>>> any property inside this template code :-(
> >> >>>>
> >> >>>> Best.
> >> >>>>
> >> >>>> Julien.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> 2013/4/23 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >> >>>>>
> >> >>>>> On 23 April 2013 12:01, Julien Plu
> >> >>>>> <julien....@redaction-developpez.com> wrote:
> >> >>>>> > Sorry, but I really don't understand how the AST works (nor Scala).
> >> >>>>> > I'm trying to retrieve all the PropertyNode objects contained in a
> >> >>>>> > PageNode, so I do:
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > override def extract(page: PageNode, subjectUri: String,
> >> >>>>> >     pageContext: PageContext): Seq[Quad] = {
> >> >>>>> >     if (page.title.namespace != Namespace.Template ||
> >> >>>>> >         page.isRedirect ||
> >> >>>>> >         !page.title.decoded.contains("évolution population"))
> >> >>>>> >         return Seq.empty
> >> >>>>> >
> >> >>>>>
> >> >>>>> I think it would be good if you could get a picture of the structure
> >> >>>>> of the tree. It's usually not complicated, but a bit hard to explain
> >> >>>>> in text. Can you use a debugger? If so, set a breakpoint at the
> >> >>>>> following line and let the debugger show the page variable. Then
> >> >>>>> click into it, look at its children, and so on.
> >> >>>>>
> >> >>>>> We should add a toString() method to Node.scala (and some
> >> >>>>> sub-classes) that shows the structure.
> >> >>>>>
> >> >>>>> >     for (node <- page.children) {
> >> >>>>> >         for (property <- allPropertiesNode(node)) {
> >> >>>>> >             println(property.toWikiText)
> >> >>>>> >         }
> >> >>>>> >     }
> >> >>>>> > }
> >> >>>>> >
> >> >>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
> >> >>>>> >     node match {
> >> >>>>> >         case propertyNode : PropertyNode => List(propertyNode)
> >> >>>>> >         case _ = node.children
> >> >>>>> >    }
> >> >>>>>
> >> >>>>> This is almost right. If I understand correctly, you want to walk
> >> >>>>> through the whole tree and collect all property nodes. Change this
> >> >>>>> line:
> >> >>>>>
> >> >>>>>     case _ = node.children
> >> >>>>>
> >> >>>>> (does that even compile? I don't understand how... :-) ) to
> >> >>>>>
> >> >>>>>     case _ => node.children.flatMap(allPropertiesNode)
> >> >>>>>
> >> >>>>> (I think that should work, I'm not 100% sure.)
> >> >>>>>
> >> >>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
> >> >>>>> maybe findPropertyNodes is even better.
> >> >>>>>
> >> >>>>> Once the method works, you can drop the main loop in extract().
> >> >>>>> Instead of
> >> >>>>>
> >> >>>>> for (node <- page.children) {
> >> >>>>>     for (property <- allPropertiesNode(node)) {
> >> >>>>>         println(property.toWikiText)
> >> >>>>>     }
> >> >>>>> }
> >> >>>>>
> >> >>>>> you can just write
> >> >>>>>
> >> >>>>> for (property <- findPropertyNodes(page)) {
> >> >>>>>     println(property.toWikiText)
> >> >>>>> }
> >> >>>>>
> >> >>>>> But that's just cosmetic surgery, it has the same effect.
> >> >>>>>
> >> >>>>> Cheers,
> >> >>>>> JC
> >> >>>>>
> >> >>>>> > }
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > And nothing is displayed on my screen :-(
> >> >>>>> >
> >> >>>>> > Any idea what I'm doing wrong?
> >> >>>>> >
> >> >>>>> > Best.
> >> >>>>> >
> >> >>>>> > Julien.
> >> >>>>> >
> >> >>>>> >
> >> >>>>> > 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
> >> >>>>> >>
> >> >>>>> >> Hi,
> >> >>>>> >>
> >> >>>>> >> "param" came from a bad copy-paste; "pop" is the right variable.
> >> >>>>> >>
> >> >>>>> >> By the way, thank you for the hint about the AST. I will take a
> >> >>>>> >> look at these classes and see how I can use them. I won't hesitate
> >> >>>>> >> to ask if I get stuck :-)
> >> >>>>> >>
> >> >>>>> >> Best.
> >> >>>>> >>
> >> >>>>> >> Julien.
> >> >>>>> >>
> >> >>>>> >>
> >> >>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >> >>>>> >>>
> >> >>>>> >>> Hi Julien,
> >> >>>>> >>>
> >> >>>>> >>> On 22 April 2013 21:43, Julien Plu
> >> >>>>> >>> <julien....@redaction-developpez.com>
> >> >>>>> >>> wrote:
> >> >>>>> >>> > I started the code for the extractor and I have a problem with
> >> >>>>> >>> > the regex in Scala. The string is:
> >> >>>>> >>> >
> >> >>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
> >> >>>>> >>> >
> >> >>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
> >> >>>>> >>> >
> >> >>>>> >>> > And I use this piece of code:
> >> >>>>> >>> >
> >> >>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
> >> >>>>> >>> >     case populationRegex(pop) =>
> >> >>>>> >>> >         println(page.title.decoded + " : pop : " + param)
> >> >>>>> >>>
> >> >>>>> >>> What is param?
> >> >>>>> >>>
> >> >>>>> >>> But more generally - did you try using the AST (abstract syntax
> >> >>>>> >>> tree) built by the parser, i.e. the tree whose root node is the
> >> >>>>> >>> PageNode? I'm not sure how good our parser is at dealing with
> >> >>>>> >>> stuff like "<includeonly>" and "{{#switch ...}}", but I think it
> >> >>>>> >>> works and page.children should contain a ParserFunctionNode [1]
> >> >>>>> >>> object for the #switch, which in turn has a child for each
> >> >>>>> >>> branch, e.g. one child for an=2010 and one for pop=61793. These
> >> >>>>> >>> children are PropertyNode [2] objects, which have a key and (who
> >> >>>>> >>> would have thought) more children. Well, in this case, just one
> >> >>>>> >>> child, which is a TextNode. In a nutshell: Find the "#switch"
> >> >>>>> >>> node, find children with keys "an" and "pop", and generate
> >> >>>>> >>> triples for their values.
> >> >>>>> >>>
> >> >>>>> >>> >     case _ =>
> >> >>>>> >>> > })
> >> >>>>> >>> >
> >> >>>>> >>> > And instead of getting "Données/Antony/évolution population :
> >> >>>>> >>> > pop : 61793" just once, I get "Données/Antony/évolution
> >> >>>>> >>> > population : pop : null" many times, as many as there are lines
> >> >>>>> >>> > in the string.
> >> >>>>> >>> >
> >> >>>>> >>> > Any idea what I'm doing wrong?
> >> >>>>> >>> >
> >> >>>>> >>> > I'm a total beginner in Scala, sorry :-(
> >> >>>>> >>>
> >> >>>>> >>> Your code excerpt looks pretty good to me. :-)
> >> >>>>> >>>
> >> >>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes
> >> >>>>> >>> are more suitable for unstructured strings, but here you're
> >> >>>>> >>> dealing with pretty clean structures. So I would suggest you
> >> >>>>> >>> write some code that walks through the PageNode tree. If you
> >> >>>>> >>> have any questions, don't hesitate to ask. We're looking forward
> >> >>>>> >>> to your contributions. Thanks!
> >> >>>>> >>>
> >> >>>>> >>> Cheers,
> >> >>>>> >>> JC
> >> >>>>> >>>
> >> >>>>> >>> [1]
> >> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
> >> >>>>> >>> [2]
> >> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
> >> >>>>> >>>
> >> >>>>> >>> >
> >> >>>>> >>> > Best.
> >> >>>>> >>> >
> >> >>>>> >>> > Julien.
> >> >>>>> >>> >
> >> >>>>> >>> >
> >> >>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >> >>>>> >>> >>
> >> >>>>> >>> >> The templates where data is stored are not used directly in
> >> >>>>> >>> >> the main pages. It's a complicated process: page Toulouse
> >> >>>>> >>> >> uses template X, X uses Y, Y uses Z, and Z contains the data.
> >> >>>>> >>> >> Something like that, I'm not 100% sure, but the details don't
> >> >>>>> >>> >> matter. This means that wikiPageUsesTemplate and
> >> >>>>> >>> >> InfoboxExtractor won't help.
> >> >>>>> >>> >>
> >> >>>>> >>> >> Generating a separate file is probably the best idea. We could
> >> >>>>> >>> >> also send these new triples to the main mapping based file,
> >> >>>>> >>> >> but that might be confusing: first, they're not mapping based;
> >> >>>>> >>> >> second, new triples about a city would be added in a
> >> >>>>> >>> >> completely different place in the file. (That's not a big
> >> >>>>> >>> >> problem though.)
> >> >>>>> >>> >>
> >> >>>>> >>> >> Cheers,
> >> >>>>> >>> >> JC
> >> >>>>> >>> >
> >> >>>>> >>> >
> >> >>>>> >>
> >> >>>>> >>
> >> >>>>> >
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Kontokostas Dimitris
> >> >>
> >> >>
> >> >
> >
> >
>
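
The regex fix reported in the thread (dropping the leading "|") can be
reproduced in isolation. An unescaped "|" is the alternation operator, so
"""|pop=(\d+)""" means "the empty string OR pop=digits"; the empty
alternative is tried first and matches at every position, which leaves the
capture group unset (null). The sketch below is a minimal illustration, not
the extractor's actual code: the object name, the helper "groups" and the
sample wikitext are assumptions made up for the example.

```scala
import scala.util.matching.Regex

object PopRegexDemo {
  // Hypothetical sample text, loosely modelled on a "#switch" branch.
  val sample: String = "{{#switch: {{{1}}} |an=2010 |pop=61793 }}"

  // Unescaped "|": alternation between the empty string and "pop=(\d+)".
  val broken: Regex = """|pop=(\d+)""".r
  // Escaped "\|": matches a literal pipe followed by "pop=digits".
  val escaped: Regex = """\|pop=(\d+)""".r
  // No pipe at all, as in the fix described in the thread.
  val plain: Regex = """pop=(\d+)""".r

  // Collect capture group 1 of every match (null when the group is unset).
  def groups(regex: Regex, text: String): List[String] =
    regex.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    println(groups(broken, sample).forall(_ == null)) // only empty matches
    println(groups(escaped, sample))
    println(groups(plain, sample))
  }
}
```

With this sample, "broken" yields only null groups, while "escaped" and
"plain" both capture 61793. Escaping the pipe is an alternative to removing
it when the literal "|" should be part of the match.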

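The recursive tree walk suggested in the thread (return the node itself if
it is a PropertyNode, otherwise flatMap over its children) can be tried
outside the framework. The sketch below uses simplified stand-in classes
that only mimic the shape described in the messages (page, "#switch" parser
function, property branches, text); they are assumptions for illustration
and not the real org.dbpedia.extraction.wikiparser classes, and the sample
tree is hypothetical.

```scala
object PropertyWalkDemo {
  // Simplified stand-ins for the framework's node classes (assumptions for
  // illustration only; the real classes carry more fields).
  sealed trait Node { def children: List[Node] }
  case class TextNode(text: String) extends Node {
    val children: List[Node] = Nil
  }
  case class PropertyNode(key: String, children: List[Node]) extends Node
  case class ParserFunctionNode(title: String, children: List[Node]) extends Node
  case class PageNode(title: String, children: List[Node]) extends Node

  // Depth-first walk: a PropertyNode is collected as-is; any other node
  // contributes whatever its children contain.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case p: PropertyNode => List(p)
    case _               => node.children.flatMap(findPropertyNodes)
  }

  // Hypothetical tree: page -> #switch -> two branches with one text child.
  val page: PageNode = PageNode("Données/Antony/évolution population", List(
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))
    ))
  ))

  def main(args: Array[String]): Unit =
    for (p <- findPropertyNodes(page)) println(p.key)
}
```

On this sample tree the walk returns the "an" and "pop" nodes in order. Note
that this version stops descending once it reaches a PropertyNode, so
properties nested inside another property would not be collected; whether
that matters depends on the templates being parsed.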

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
