working example/cmd? I'm not sure we are talking about the same thing.
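
For reference, the suggestion quoted below (grep the dump for Wikipedia URLs, then pipe to sort -u) might look roughly like the following. This is only a sketch: the function name extract_wikipedia_urls and the output path wikipedia/urls are made up for illustration, and it assumes the dump is N-Triples with each URL wrapped in angle brackets.

```shell
# Hypothetical helper (name is illustrative, not from the thread):
# reads N-Triples on stdin and prints each English Wikipedia URL once.
extract_wikipedia_urls() {
    # -o prints only the matching part of each line, one match per output line
    grep -o '<http://en\.wikipedia\.org/wiki/[^>]*>' |
        tr -d '<>' |          # strip the angle brackets around the URL
        LC_ALL=C sort -u      # sort bytewise and drop duplicate lines
}

# Possible usage on the file from the original post (output path is assumed):
# bzcat wikipedia_links_en.nt.bz2 | extract_wikipedia_urls > wikipedia/urls
```

Since sort -u works on the stream rather than inside the parser, the parser itself would not need to remember which URLs it has already seen; GNU sort spills to temporary files when the input does not fit in memory.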

On Sat, Mar 19, 2011 at 3:36 PM, Dimitris Kontokostas <jimk...@gmail.com> wrote:

> Hi,
> You can grep the output for http://en.wikipedia.org and pipe it to
> sort -u
>
> Cheers,
> Dimitris
>
> On Sat, Mar 19, 2011 at 3:47 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>
>>
>>
>> On Sat, Mar 19, 2011 at 2:13 PM, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:
>>
>>> Hello,
>>>
>>> I've downloaded this DBpedia file
>>> <http://downloads.dbpedia.org/3.6/en/wikipedia_links_en.nt.bz2> and
>>> written a simple parser to extract Wikipedia URLs from it, as shown below.
>>> I find the result unsatisfactory since it contains many duplicates.
>>> Adding logic to the parser to avoid them (by remembering URLs already seen)
>>> also seems very expensive, since the uncompressed file is 3GB. Is there a
>>> better approach to getting Wikipedia URLs, like the one used with dmoz in
>>>
>>> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
>>> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
>>>
>>>
>>>
>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>> http://dbpedia.org/resource/AfghanistanGeography
>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>> n"@e
>>> http://dbpedia.org/resource/AfghanistanGeography
>>> http://en.wikipedia.org/wiki/AfghanistanGeography
>>> http://en.wikipedia.org/wiki/Anarchism
>>> http://dbpedia.org/resource/Anarchism
>>> http://en.wikipedia.org/wiki/Anarchism
>>> n"@e
>>> http://dbpedia.org/resource/Anarchism
>>> http://en.wikipedia.org/wiki/Anarchism
>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>> http://dbpedia.org/resource/AccessibleComputing
>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>> n"@e
>>> http://dbpedia.org/resource/AccessibleComputing
>>> http://en.wikipedia.org/wiki/AccessibleComputing
>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>> http://dbpedia.org/resource/AfghanistanHistory
>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>> n"@e
>>> http://dbpedia.org/resource/AfghanistanHistory
>>> http://en.wikipedia.org/wiki/AfghanistanHistory
>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>> http://dbpedia.org/resource/AfghanistanPeople
>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>> n"@e
>>> http://dbpedia.org/resource/AfghanistanPeople
>>> http://en.wikipedia.org/wiki/AfghanistanPeople
>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>> http://dbpedia.org/resource/AfghanistanTransportations
>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>> n"@e
>>> http://dbpedia.org/resource/AfghanistanTransportations
>>> http://en.wikipedia.org/wiki/AfghanistanTransportations
>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>> http://dbpedia.org/resource/AfghanistanCommunications
>>> http://en.wikipedia.org/wiki/AfghanistanCommunications
>>>
>>>
>>> --
>>> Regards,
>>> K. Gabriele
>>>
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>>
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A valid
>>> code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>>
>>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Colocation vs. Managed Hosting
>> A question and answer guide to determining the best fit
>> for your organization - today and in the future.
>> http://p.sf.net/sfu/internap-sfd2d
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> Dbpedia-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>
>>
>
>
> --
> Kontokostas Dimitris
>



-- 
Regards,
K. Gabriele

