[Wikitech-l] NLP using Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
- http://www.mediawiki.org/wiki/API%3aMain_page 5- http://jwbf.sourceforge.net/ I'd appreciate any suggestions. Regards Khalida Ben Sidi Ahmed
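As a rough illustration of the MediaWiki API option listed above, the sketch below fetches one article's raw wikitext from api.php using only the Java standard library. The article title ("Petroleum") and the lack of XML parsing are placeholders, not something from the original message.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ApiFetch {
    public static void main(String[] args) throws Exception {
        // Fetch the raw wikitext of one article through the MediaWiki API.
        // "Petroleum" is only a placeholder title.
        String title = URLEncoder.encode("Petroleum", "UTF-8");
        URL url = new URL("https://en.wikipedia.org/w/api.php"
                + "?action=query&prop=revisions&rvprop=content"
                + "&format=xml&titles=" + title);

        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
        }
        // The wikitext sits inside the <rev> element of this XML response.
        System.out.println(response);
    }
}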

[Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
Hello, I need an HTML dump of Wikipedia but the link http://static.wikipedia.org/ does not work. I'd appreciate any explanation or suggestion. Regards Ben Sidi Ahmed

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
I need static HTML dumps. On the web page you mentioned, when I click on the static HTML link, it is not accessible. Truly yours

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
I need an HTML dump of Wikipedia because I have written Java code which extracts text from HTML content, and I would like to apply it to this dump. In fact I need to extract the first sentence of a list of articles (200) and I don't know how to do it with other dumps. If you have any idea of other

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
Currently, I'm using the online version with the Java API Jsoup. It does not work perfectly: after extracting fewer than 10 articles, my project throws a set of exceptions. Could you please give me the approximate number of articles I can get with these tools? If you just need a few
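If the exceptions are timeouts, two things worth trying are a longer timeout and a short pause between requests. The sketch below uses the standard Jsoup connect/userAgent/timeout/get calls; the article URLs, the user-agent string and the one-second delay are placeholders chosen for illustration.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchArticles {
    public static void main(String[] args) throws Exception {
        // Placeholder list of article URLs; in practice this would be the ~200 titles.
        String[] urls = {
            "https://en.wikipedia.org/wiki/Petroleum",
            "https://en.wikipedia.org/wiki/Natural_gas"
        };
        for (String url : urls) {
            Document doc = Jsoup.connect(url)
                    .userAgent("MyResearchBot/0.1 (contact@example.org)") // identify the client
                    .timeout(30 * 1000)   // 30 s instead of Jsoup's short default
                    .get();
            System.out.println(doc.title());
            Thread.sleep(1000);           // pause between requests to stay polite to the servers
        }
    }
}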

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
I just wonder if the problem can be due to the speed of my connection. The text of the exception is: Grave: null java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:150) at
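A read timeout like this usually means the response took longer than Jsoup's default limit (3 seconds in versions of that era), so raising the timeout and retrying a couple of times often gets past transient slowness. The retry count and backoff below are arbitrary choices for the sketch, not something recommended in the thread.

import java.net.SocketTimeoutException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryFetch {
    // Try the same URL a few times before giving up.
    static Document fetchWithRetry(String url, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return Jsoup.connect(url).timeout(30 * 1000).get();
            } catch (SocketTimeoutException e) {
                if (attempt >= maxAttempts) {
                    throw e;                    // give up after the last attempt
                }
                Thread.sleep(2000L * attempt);  // back off a little more each time
            }
        }
    }
}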

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
Hi Kaminski, I appreciate your help, thank you very much indeed. I will try the options that were given to me today. If my attempts fail, I will contact you for help. Many thanks to Hoehrmann: I'll immediately see if I can succeed with curl or wget. Regards Khalida Ben Sidi Ahmed

Re: [Wikitech-l] Html dump for Wikipedia

2011-12-02 Thread Khalida BEN SIDI AHMED
Another important question I have been seeking an answer to for days: if I download WikiTaxi and have Wikipedia offline, can I query this offline version using Java?

[Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
Hello! I don't know if the subject of this question belongs to the scope of this group. Anyway, I would be pleased to find an answer to my question. I'm writing some Java code in order to perform NLP tasks on texts using Wikipedia. What can I do in order to extract the first paragraph of a

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
I have already read the responses given in this post. I want to extract the first paragraph (or the first sentence) for a list of 100 articles. I could not use JWPL because I don't have enough hard disk space to create the DB. I am trying to use Jsoup but I need examples.
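A minimal Jsoup sketch for the first-paragraph case: connect to one article page and take the first <p> element inside the body content. The #bodyContent selector is an assumption about the page structure and may need adjusting to the skin in use; the article URL is a placeholder.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstParagraph {
    public static void main(String[] args) throws Exception {
        // Placeholder article; the selector assumes the article text sits inside #bodyContent.
        Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Petroleum")
                .timeout(30 * 1000)
                .get();
        Element firstPara = doc.select("#bodyContent p").first();
        if (firstPara != null) {
            System.out.println(firstPara.text()); // plain text with the markup stripped
        }
    }
}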

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
I'm developing my project in Java; I'm not a good PHP developer. JWPL first needs to create a database whose size is 158 GB, and at least 2 GB of RAM is necessary. I have neither a big hard disk nor that much RAM. In addition, creating such a big database just to extract the first

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
The list of the articles I will need is not known from the beginning. Through my project, I will find a list of words (50). I will try to find definitions for them in Wikipedia. After that I will extract the hypernym of each word. I will then have a new list for which I retrieve the respective
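The loop described here could be structured roughly as in the sketch below. lookUpDefinition and extractHypernym are hypothetical stubs standing in for the Wikipedia lookup and the hypernym-extraction step, and the seed terms are placeholders.

import java.util.ArrayList;
import java.util.List;

public class HypernymPipeline {
    public static void main(String[] args) {
        // Seed terms found earlier in the project (placeholders).
        List<String> terms = new ArrayList<String>();
        terms.add("drilling rig");
        terms.add("wellhead");

        List<String> nextRound = new ArrayList<String>();
        for (String term : terms) {
            String definition = lookUpDefinition(term);    // hypothetical: first sentence from Wikipedia
            String hypernym = extractHypernym(definition); // hypothetical: parse "X is a <hypernym> ..."
            if (hypernym != null) {
                nextRound.add(hypernym);                   // the new list to look up in the next round
            }
        }
        System.out.println(nextRound);
    }

    static String lookUpDefinition(String term) { return "..."; }      // stub
    static String extractHypernym(String definition) { return null; }  // stub
}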

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
The words I use belong to a specialized domain (the oil and gas industry), so WordNet and even Wiktionary are not useful enough (they are general-purpose resources). Thank you very much indeed.

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
Thank you Hoehrmann. I will try the options you've mentioned. However, if someone can help me with Jsoup, their ideas are welcome.

[Wikitech-l] Html code

2011-11-27 Thread Khalida BEN SIDI AHMED
Hello! In the HTML code of a Wikipedia article, how can I recognise the *first* sentence of the article?

Re: [Wikitech-l] Html code

2011-11-27 Thread Khalida BEN SIDI AHMED
Thank you very much. That's exactly what I wanted to know. 2011/11/27 Bjoern Hoehrmann derhoe...@gmx.net * Khalida BEN SIDI AHMED wrote: In the HTML code of a Wikipedia article how to recognise the *first* sentence of this article? It's not marked up and probably differs among language
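Since the first sentence is not marked up, one workaround is to take the text of the first paragraph and split it with java.text.BreakIterator, as in the sketch below. Sentence boundary detection of this kind is only an approximation (abbreviations and parenthesised pronunciations can confuse it).

import java.text.BreakIterator;
import java.util.Locale;

public class FirstSentence {
    // Return the first sentence of a paragraph of plain text.
    static String firstSentence(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        int end = it.next();
        return (end == BreakIterator.DONE) ? paragraph : paragraph.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(firstSentence(
                "Petroleum is a naturally occurring liquid. It is found beneath the Earth's surface."));
    }
}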

Re: [Wikitech-l] Extracting text from Wikipedia

2011-11-27 Thread Khalida BEN SIDI AHMED
Hello, This is the answer that was given to my question: http://stackoverflow.com/questions/8286786/wikipedia-first-paragraph It works perfectly, and the code may be useful for you. Truly yours Khalida Ben Sidi Ahmed
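One API-based way to get the same first-paragraph text is a TextExtracts query (prop=extracts with exintro and explaintext). This is a sketch under the assumption that the TextExtracts extension is available on the wiki, and it is not necessarily the code from the linked answer; the title is a placeholder.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class IntroExtract {
    public static void main(String[] args) throws Exception {
        // Ask the API for the plain-text introduction of one article (placeholder title).
        String title = URLEncoder.encode("Petroleum", "UTF-8");
        URL url = new URL("https://en.wikipedia.org/w/api.php"
                + "?action=query&prop=extracts&exintro&explaintext"
                + "&format=xml&titles=" + title);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // the <extract> element holds the introduction text
            }
        }
    }
}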

[Wikitech-l] Forbidden access

2011-11-26 Thread Khalida BEN SIDI AHMED
For my research I need to download 3 files: - [LANGCODE]wiki-[DATE]-pages-articles.xml.bz2 *OR* [LANGCODE]wiki-[DATE]-pages-meta-current.xml.bz2 - [LANGCODE]wiki-[DATE]-pagelinks.sql.gz - [LANGCODE]wiki-[DATE]-categorylinks.sql.gz I downloaded the first two. Now I cannot have an

Re: [Wikitech-l] Forbidden access

2011-11-26 Thread Khalida BEN SIDI AHMED
://dumps.wikimedia.org/enwiki/2015/ and they are accessible. Can you give a couple of specific links that did not work? Ariel On 26-11-2011, Sat, at 20:54 +0100, Khalida BEN SIDI AHMED wrote: For my research I need to download 3 files: - [LANGCODE]wiki-[DATE]-pages

Re: [Wikitech-l] Forbidden access

2011-11-26 Thread Khalida BEN SIDI AHMED
Even http://dumps.wikimedia.org/enwiki/2015/ doesn't work. The text that the browser shows is always the same: 403 - Forbidden

Re: [Wikitech-l] Forbidden access

2011-11-26 Thread Khalida BEN SIDI AHMED
In fact, I'm downloading enwiki-latest-pagelinks.sql.gz right now on my laptop (wifi connection). All the Wikipedia links are forbidden both on my laptop and on my PC (wired internet connection). I stopped the download and the links became accessible again.

Re: [Wikitech-l] Forbidden access

2011-11-26 Thread Khalida BEN SIDI AHMED
happening simultaneously. Thank you very much indeed for your responses. Truly yours Khalida Ben Sidi Ahmed