Re: [Dbpedia-discussion] Problem with extracted data

Julien Plu Wed, 24 Apr 2013 07:36:16 -0700

You were right, Jona: the problem came from a wrong path in the package declaration.

I just sent a new pull request with my extractor :-)


Best.

Julien.
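
For readers of the archive: the tree walk discussed in the quoted messages
below can be sketched roughly as follows. This is only a minimal,
self-contained illustration. The node classes here are simplified stand-ins
written for this example (assumptions, not the real classes from
org.dbpedia.extraction.wikiparser), and textOf is a hypothetical helper. The
keys and values "an=2010" and "pop=61793" are the ones mentioned in the thread.

```scala
// Simplified stand-ins for the framework's node classes. These are
// assumptions for illustration; the real wikiparser classes differ.
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object TreeWalkDemo {
  // Collect every PropertyNode in the subtree rooted at `node`, depth-first.
  // As in the thread's version, the walk stops at a PropertyNode, so
  // properties nested inside another property are not returned.
  def findPropertyNodes(node: Node): List[PropertyNode] = node match {
    case p: PropertyNode => List(p)
    case other           => other.children.flatMap(findPropertyNodes)
  }

  // Hypothetical helper: concatenated text of a property's direct TextNode children.
  def textOf(p: PropertyNode): String =
    p.children.collect { case TextNode(t) => t }.mkString.trim

  // A hand-built tree shaped like the "#switch" structure described in the thread.
  val page: Node = PageNode("Données/Antony/évolution population", List(
    ParserFunctionNode("#switch", List(
      PropertyNode("an", List(TextNode("2010"))),
      PropertyNode("pop", List(TextNode("61793")))))))

  def main(args: Array[String]): Unit =
    for (p <- findPropertyNodes(page)) println(p.key + " = " + textOf(p))
}
```

With the real classes, the same pattern match should carry over, but the
exact fields of PropertyNode and TextNode would need to be checked against
the framework sources.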


2013/4/23 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>

> Hi Julien,
>
> On 23 April 2013 23:16, Julien Plu <julien....@redaction-developpez.com>
> wrote:
> > OK, I'm finished: I now have an extractor that works as we expected :-)
> > I don't think what I wrote is well made, but it works.
>
> Cool! You can always improve it later. :-)
>
> >
> > Anyway, only one problem remains: if I move my "PopulationExtractor.scala"
> > file from the "mappings" folder into a "fr" folder inside "mappings", the
> > extraction configuration file fails because it can't find the
> > "PopulationExtractor" class, no matter whether I write
> > "fr.PopulationExtractor" or
> > "org.dbpedia.extraction.mappings.fr.PopulationExtractor". Any idea what's
> > going on?
>
> Does the package declaration in the class file include the ".fr"?
> Scala is less strict than Java here.
>
> If you send a pull request, we can have a look at your code and merge
> it into the main repository, so others can run this extraction as
> well.
>
> https://github.com/dbpedia/extraction-framework/wiki/Contributing
>
> >
> > One last thing: I added a dataset to the file "DBpediaDatasets.scala" so
> > that I get my own archive containing only the population information.
>
> Right, that's one more thing you need to add.
>
> Thanks!
>
> JC
>
>
> >
> > Best.
> >
> > Julien.
> >
> >
> > 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
> >>
> >> Yes, I know IDEs are really useful, but my work machine runs Windows and
> >> I'm really not familiar with it. So I use a Linux distribution in a virtual
> >> machine, but this virtual machine is too slow for coding in a graphical
> >> IDE, so I have to connect to the VM over SSH and use only the shell.
> >>
> >> I think I will force myself to use Windows; that will be easier than
> >> continuing to work like this :-D
> >>
> >> By the way, I found the problem in my code. It came from my regex: instead
> >> of """|pop=(\d+)""".r I now use """pop=(\d+)""".r, and I get the value
> >> I want :-)
> >>
> >> Best.
> >>
> >> Julien.
> >>
> >>
> >> 2013/4/23 Dimitris Kontokostas <jimk...@gmail.com>
> >>>
> >>> You should use an IDE for this, it will make your life a lot easier ;)
> >>> I use the IntelliJ IDEA default debugger and it works pretty well. I could
> >>> send you instructions to set it up.
> >>>
> >>> Best,
> >>> Dimitris
> >>>
> >>>
> >>> On Tue, Apr 23, 2013 at 3:59 PM, Julien Plu
> >>> <julien....@redaction-developpez.com> wrote:
> >>>>
> >>>> No, I don't have a debugger because I'm coding on a remote machine via
> >>>> ssh.
> >>>>
> >>>> And even with this code:
> >>>>
> >>>>
> >>>> override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
> >>>>     if (page.title.namespace != Namespace.Template || page.isRedirect
> >>>>         || !page.title.decoded.contains("évolution population")) return Seq.empty
> >>>>
> >>>>     for (property <- findPropertyNodes(page)) {
> >>>>         println(property.toWikiText)
> >>>>     }
> >>>>     Seq.empty // debugging only: no quads are generated yet
> >>>> }
> >>>>
> >>>> private def findPropertyNodes(node : Node) : List[PropertyNode] = {
> >>>>     node match {
> >>>>         case propertyNode : PropertyNode => List(propertyNode)
> >>>>         case _ => node.children.flatMap(findPropertyNodes)
> >>>>     }
> >>>> }
> >>>>
> >>>> Absolutely nothing is displayed, because the list returned by
> >>>> "findPropertyNodes" is empty and I don't understand why. I know it's
> >>>> empty because of this check:
> >>>>
> >>>> if (findPropertyNodes(page).isEmpty) {
> >>>>     println("empty")
> >>>> }
> >>>> else {
> >>>>     println("not empty")
> >>>> }
> >>>>
> >>>> It prints "empty", whereas if I display "page.children" I get all
> >>>> the template code; the "findPropertyNodes" function just doesn't find
> >>>> any property inside this template code :-(
> >>>>
> >>>> Best.
> >>>>
> >>>> Julien.
> >>>>
> >>>>
> >>>>
> >>>> 2013/4/23 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>>>>
> >>>>> On 23 April 2013 12:01, Julien Plu
> >>>>> <julien....@redaction-developpez.com> wrote:
> >>>>> > Sorry, but I really don't understand how the AST works (nor Scala).
> >>>>> > I am trying to retrieve all the PropertyNodes contained in a
> >>>>> > PageNode, so I do:
> >>>>> >
> >>>>> >
> >>>>> > override def extract(page: PageNode, subjectUri: String, pageContext: PageContext): Seq[Quad] = {
> >>>>> >     if (page.title.namespace != Namespace.Template || page.isRedirect
> >>>>> >         || !page.title.decoded.contains("évolution population")) return Seq.empty
> >>>>> >
> >>>>> >
> >>>>>
> >>>>> I think it would be good if you could get a picture of the structure
> >>>>> of the tree. It's usually not complicated, but a bit hard to explain
> >>>>> in text. Can you use a debugger? If so, set a breakpoint at the
> >>>>> following line and let the debugger show the page variable. Then click
> >>>>> into it, look at its children, and so on.
> >>>>>
> >>>>> We should add a toString() method to Node.scala (and some sub-classes)
> >>>>> that shows the structure.
> >>>>>
> >>>>> >     for (node <- page.children) {
> >>>>> >         for (property <- allPropertiesNode(node)) {
> >>>>> >             println(property.toWikiText)
> >>>>> >         }
> >>>>> >     }
> >>>>> > }
> >>>>> >
> >>>>> > private def allPropertiesNode(node : Node) : List[PropertyNode] = {
> >>>>> >     node match {
> >>>>> >         case propertyNode : PropertyNode => List(propertyNode)
> >>>>> >         case _ = node.children
> >>>>> >    }
> >>>>>
> >>>>> This is almost right. If I understand correctly, you want to walk
> >>>>> through the whole tree and collect all property nodes. Change this
> >>>>> line:
> >>>>>
> >>>>>     case _ = node.children
> >>>>>
> >>>>> (does that even compile? I don't understand how... :-) ) to
> >>>>>
> >>>>>     case _ => node.children.flatMap(allPropertiesNode)
> >>>>>
> >>>>> (I think that should work, I'm not 100% sure.)
> >>>>>
> >>>>> Oh by the way, the method name should be allPropertyNodes. :-) Or
> >>>>> maybe findPropertyNodes is even better.
> >>>>>
> >>>>> Once the method works, you can drop the main loop in extract(). Instead
> >>>>> of
> >>>>>
> >>>>> for (node <- page.children) {
> >>>>>     for (property <- allPropertiesNode(node)) {
> >>>>>         println(property.toWikiText)
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> you can just write
> >>>>>
> >>>>> for (property <- findPropertyNodes(page)) {
> >>>>>     println(property.toWikiText)
> >>>>> }
> >>>>>
> >>>>> But that's just cosmetic surgery, it has the same effect.
> >>>>>
> >>>>> Cheers,
> >>>>> JC
> >>>>>
> >>>>> > }
> >>>>> >
> >>>>> >
> >>>>> > And nothing is displayed on my screen :-(
> >>>>> >
> >>>>> > Any idea what I'm doing wrong?
> >>>>> >
> >>>>> > Best.
> >>>>> >
> >>>>> > Julien.
> >>>>> >
> >>>>> >
> >>>>> > 2013/4/23 Julien Plu <julien....@redaction-developpez.com>
> >>>>> >>
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> "param" comes from a bad copy-paste; the correct variable is "pop".
> >>>>> >>
> >>>>> >> By the way, thank you for the hint about the AST. I will take a look
> >>>>> >> at these classes and see how I can use them. I won't hesitate to ask
> >>>>> >> if I'm stuck :-)
> >>>>> >>
> >>>>> >> Best.
> >>>>> >>
> >>>>> >> Julien.
> >>>>> >>
> >>>>> >>
> >>>>> >> 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>>>> >>>
> >>>>> >>> Hi Julien,
> >>>>> >>>
> >>>>> >>> On 22 April 2013 21:43, Julien Plu
> >>>>> >>> <julien....@redaction-developpez.com>
> >>>>> >>> wrote:
> >>>>> >>> > I started the code for the extractor and I have a problem with
> >>>>> >>> > the regex in Scala. The string is:
> >>>>> >>> >
> >>>>> >>> >
> >>>>> >>> > http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
> >>>>> >>> >
> >>>>> >>> > And my regex is: val populationRegex = """|pop=(\d+)""".r
> >>>>> >>> >
> >>>>> >>> > And I use this piece of code:
> >>>>> >>> >
> >>>>> >>> > populationRegex findAllIn page.children.toString foreach (_ match {
> >>>>> >>> >     case populationRegex (pop) => println(page.title.decoded + " : pop : " + param)
> >>>>> >>>
> >>>>> >>> What is param?
> >>>>> >>>
> >>>>> >>> But more generally - did you try using the AST (abstract syntax
> >>>>> >>> tree) built by the parser, i.e. the tree whose root node is the
> >>>>> >>> PageNode? I'm not sure how good our parser is at dealing with stuff
> >>>>> >>> like "<includeonly>" and "{{#switch ...}}", but I think it works and
> >>>>> >>> page.children should contain a ParserFunctionNode [1] object for
> >>>>> >>> the #switch, which in turn has a child for each branch, e.g. one
> >>>>> >>> child for an=2010 and one for pop=61793. These children are
> >>>>> >>> PropertyNode [2] objects, which have a key and (who would have
> >>>>> >>> thought) more children. Well, in this case, just one child, which
> >>>>> >>> is a TextNode. In a nutshell: Find the "#switch" node, find children
> >>>>> >>> with keys "an" and "pop", and generate triples for their values.
> >>>>> >>>
> >>>>> >>> >     case _ =>
> >>>>> >>> > })
> >>>>> >>> >
> >>>>> >>> > And instead of getting "Données/Antony/évolution population :
> >>>>> >>> > pop : 61793" just once,
> >>>>> >>> >
> >>>>> >>> > I get "Données/Antony/évolution population : pop : null" many
> >>>>> >>> > times, as many as there are lines in the string.
> >>>>> >>> >
> >>>>> >>> > Any idea what I'm doing wrong?
> >>>>> >>> >
> >>>>> >>> > I'm a total beginner in Scala :-( sorry.
> >>>>> >>>
> >>>>> >>> Your code excerpt looks pretty good to me. :-)
> >>>>> >>>
> >>>>> >>> The AST is usually much safer and cleaner than regexes. Regexes are
> >>>>> >>> more suitable for unstructured strings, but here you're dealing
> >>>>> >>> with pretty clean structures. So I would suggest you write some code
> >>>>> >>> that walks through the PageNode tree. If you have any questions,
> >>>>> >>> don't hesitate to ask. We're looking forward to your contributions.
> >>>>> >>> Thanks!
> >>>>> >>>
> >>>>> >>> Cheers,
> >>>>> >>> JC
> >>>>> >>>
> >>>>> >>> [1]
> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
> >>>>> >>> [2]
> >>>>> >>> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala
> >>>>> >>>
> >>>>> >>> >
> >>>>> >>> > Best.
> >>>>> >>> >
> >>>>> >>> > Julien.
> >>>>> >>> >
> >>>>> >>> >
> >>>>> >>> > 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
> >>>>> >>> >>
> >>>>> >>> >> The templates where data is stored are not used directly in the
> >>>>> >>> >> main pages. It's a complicated process: page Toulouse uses
> >>>>> >>> >> template X, X uses Y, Y uses Z, and Z contains the data. Something
> >>>>> >>> >> like that, I'm not 100% sure, but the details don't matter. This
> >>>>> >>> >> means that wikiPageUsesTemplate and InfoboxExtractor won't help.
> >>>>> >>> >>
> >>>>> >>> >> Generating a separate file is probably the best idea. We could
> >>>>> >>> >> also send these new triples to the main mapping based file, but
> >>>>> >>> >> that might be confusing: first, they're not mapping based; second,
> >>>>> >>> >> new triples about a city would be added in a completely different
> >>>>> >>> >> place in the file. (That's not a big problem though.)
> >>>>> >>> >>
> >>>>> >>> >> Cheers,
> >>>>> >>> >> JC
> >>>>> >>> >
> >>>>> >>> >
> >>>>> >>
> >>>>> >>
> >>>>> >
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Kontokostas Dimitris
> >>
> >>
> >
>
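
A note on the regex fix quoted above. A likely explanation (not confirmed in
the thread) is that an unescaped "|" is the alternation operator, so
"""|pop=(\d+)""" means "the empty string, or pop=(\d+)". The empty alternative
succeeds at every position, which would leave the capture group null. The
sketch below compares the unescaped and escaped forms. The sample string and
the helper "groups" are assumptions made up for this illustration.

```scala
import scala.util.matching.Regex

object PopRegexDemo {
  // Hypothetical one-line sample in the style of the template from the thread.
  val sample: String = "|an=2010|pop=61793"

  // Unescaped '|' is alternation: "empty string OR pop=(\d+)".
  val unescaped: Regex = """|pop=(\d+)""".r

  // Escaping the pipe makes it a literal character.
  val escaped: Regex = """\|pop=(\d+)""".r

  // First capture group of every match; a group that did not take part
  // in the match is null and becomes None here.
  def groups(regex: Regex, text: String): List[Option[String]] =
    regex.findAllMatchIn(text).map(m => Option(m.group(1))).toList

  def main(args: Array[String]): Unit = {
    println(groups(unescaped, sample)) // only empty matches, every group is None
    println(groups(escaped, sample))   // a single match that captures the digits
  }
}
```

Dropping the pipe entirely, as in the thread, also works. Escaping it keeps
the match anchored to the start of a template parameter.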


_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
