Re: [Dbpedia-discussion] Problem with extracted data

Jona Christopher Sahnwaldt Mon, 22 Apr 2013 14:35:44 -0700

Hi Julien,

On 22 April 2013 21:43, Julien Plu <julien....@redaction-developpez.com> wrote:
> I started the code for the extractor and I have a problem with the regex in
> Scala. The string is:
> http://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es/Antony/%C3%A9volution_population&action=edit
>
> And my regex is: val populationRegex = """|pop=(\d+)""".r
>
> And I use this piece of code:
>
> populationRegex findAllIn  page.children.toString foreach (_ match {
>     case populationRegex (pop) => println(page.title.decoded + " : pop : " +
> param)


What is param?

But more generally - did you try using the AST (abstract syntax tree)
built by the parser, i.e. the tree whose root node is the PageNode?
I'm not sure how good our parser is at dealing with stuff like
"<includeonly>" and "{{#switch ...}}", but I think it works and
page.children should contain a ParserFunctionNode [1] object for the
#switch, which in turn has a child for each branch, e.g. one child for
an=2010 and one for pop=61793. These children are PropertyNode [2]
objects, which have a key and (who would have thought) more children.
Well, in this case, just one child, which is a TextNode. In a
nutshell: Find the "#switch" node, find children with keys "an" and
"pop", and generate triples for their values.

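The walk described above could look roughly like the sketch below. The
node classes here are simplified stand-ins that I define myself so the
example runs on its own; the real classes in the framework take more
constructor arguments, and I'm assuming the parser function's title is
"#switch", so check both against the source files in [1] and [2].

```scala
// Simplified stand-ins for the framework's node classes.
// ASSUMPTION: these are NOT the real signatures, only enough to show the walk.
sealed trait Node { def children: List[Node] }
case class TextNode(text: String) extends Node { val children: List[Node] = Nil }
case class PropertyNode(key: String, children: List[Node]) extends Node
case class ParserFunctionNode(title: String, children: List[Node]) extends Node
case class PageNode(title: String, children: List[Node]) extends Node

object SwitchWalker {
  // Depth-first search for every parser function with the given title.
  def findParserFunctions(node: Node, title: String): List[ParserFunctionNode] = {
    val here = node match {
      case pf: ParserFunctionNode if pf.title == title => List(pf)
      case _ => Nil
    }
    here ++ node.children.flatMap(findParserFunctions(_, title))
  }

  // Text of the first property with the given key, if there is one.
  def propertyText(pf: ParserFunctionNode, key: String): Option[String] =
    pf.children.collectFirst {
      case PropertyNode(k, kids) if k.trim == key =>
        kids.collect { case TextNode(t) => t }.mkString.trim
    }

  def main(args: Array[String]): Unit = {
    // Hand-built tree standing in for what the parser would produce.
    val page = PageNode("Données/Antony/évolution population", List(
      ParserFunctionNode("#switch", List(
        PropertyNode("an", List(TextNode("2010"))),
        PropertyNode("pop", List(TextNode("61793")))))))

    for {
      sw  <- findParserFunctions(page, "#switch")
      an  <- propertyText(sw, "an")
      pop <- propertyText(sw, "pop")
    } println(s"${page.title} : an=$an pop=$pop")
  }
}
```

In the extractor you would generate triples where this sketch calls
println.
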
>     case _ =>
> })
>
> And instead of getting "Données/Antony/évolution population : pop : 61793"
> just once,
>
> I get "Données/Antony/évolution population : pop : null" many times, as many
> as there are lines in the string.
>
> Any idea what I'm doing wrong?
>
> I'm a total beginner in Scala :-( sorry.

Your code excerpt looks pretty good to me. :-)
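
One detail that may explain the nulls, though: in a regex, an unescaped
"|" means alternation, so """|pop=(\d+)""" reads as "empty string OR
pop=digits". The empty alternative is tried first and matches
everywhere, with the capture group unset (null). A minimal sketch of the
difference, assuming the wikitext contains a literal "|pop=61793" (the
sample string and the object and method names are made up for
illustration):

```scala
object PopRegexDemo {
  // Original pattern: the leading '|' is alternation, so the first
  // alternative is empty and matches at every position, group 1 unset.
  val unescaped = """|pop=(\d+)""".r

  // Escaping the pipe makes it a literal character.
  val escaped = """\|pop=(\d+)""".r

  // Return every captured population value found in the text.
  def extractPops(text: String): List[String] =
    escaped.findAllMatchIn(text).map(_.group(1)).toList

  def main(args: Array[String]): Unit = {
    // Hypothetical sample, not copied from the actual template page.
    val sample = "{{#switch: {{{1}}}\n|an=2010\n|pop=61793\n}}"

    // Every match of the unescaped pattern is empty and has a null group.
    println(unescaped.findAllMatchIn(sample).forall(_.group(1) == null))

    // The escaped pattern captures the digits.
    println(extractPops(sample))
  }
}
```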

The AST is usually much safer and cleaner than regexes. Regexes are
more suitable for unstructured strings, but here you're dealing with
pretty clean structures. So I would suggest you write some code that
walks through the PageNode tree. If you have any questions, don't
hesitate to ask. We're looking forward to your contributions. Thanks!

Cheers,
JC

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/ParserFunctionNode.scala
[2] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/wikiparser/PropertyNode.scala

>
> Best.
>
> Julien.
>
>
> 2013/4/22 Jona Christopher Sahnwaldt <j...@sahnwaldt.de>
>>
>> The templates where data is stored are not used directly in the main
>> pages. It's a complicated process: page Toulouse uses template X, X uses Y,
>> Y uses Z, and Z contains the data. Something like that, I'm not 100% sure, but
>> the details don't matter. This means that wikiPageUsesTemplate and
>> InfoboxExtractor won't help.
>>
>> Generating a separate file is probably the best idea. We could also send
>> these new triples to the main mapping based file, but that might be
>> confusing: first, they're not mapping based; second, new triples about a
>> city would be added in a completely different place in the file. (That's not
>> a big problem though.)
>>
>> Cheers,
>> JC
>
>
