Re: [Foundation-l] excluding Wikipedia clones from searching
On 10/12/2010 23:51, John Doe wrote:
> I'm in the process of creating a cleanup tool that checks archive.org and webcitation.org. If a URL is not archived, it checks to see if it is live; if it is, I request that WebCitation archive it on demand, and it fills in the archiveurl parameter of cite templates.

What is the point of doing that? If a URL goes missing, the information should be refound from another source. If it can't be re-referenced, then perhaps it wasn't quite as reliable as one first thought, and if URLs aren't stable on a particular site, then maybe one should re-examine the reliability of the originating source.

Most dead URLs that I see, that can't be refound, come from references to online articles about minor events in BLPs. Simply put, the event was recorded on Monday and was fish-and-chip wrapping by Thursday. Or to put it another way: non-notable in the grand scheme of things. In some cases the original source may also have removed the content because it was untrue and could not be substantiated. Stuffing URLs across to archive.org or webcitation.org simply perpetuates unsubstantiated gossip. One really ought to examine one's motives for doing that.
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/9/2010 11:06:30 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> Google does it, archive.org (the Wayback Machine) does it; we can copy them for caching and searching, I assume. We are not changing the license, just preventing the information from disappearing on us.

You are thinking of refs which are out of copyright. Google Books only gives snippet views of some books still under copyright for which they've not gotten permission to show an entire page at a time (which is preview mode). archive.org likewise has copies of works that are out of copyright (or otherwise in the public domain).

Your original statement was that we should copy refs. Many or most of our refs are still under copyright. We would not be able to do what you suggest, imho.

W
Re: [Foundation-l] excluding Wikipedia clones from searching
I mean Google has copies, caches of items for searching. How can Google cache this? Archive.org has copyrighted materials as well. We should be able to save backups of this material as well.

mike

On Fri, Dec 10, 2010 at 5:16 PM, wjhon...@aol.com wrote:
> In a message dated 12/9/2010 11:06:30 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> Google does it, archive.org (the Wayback Machine) does it; we can copy them for caching and searching, I assume. We are not changing the license, just preventing the information from disappearing on us.
> You are thinking of refs which are out of copyright. Google Books only gives snippet views of some books still under copyright for which they've not gotten permission to show an entire page at a time (which is preview mode). archive.org likewise has copies of works that are out of copyright (or otherwise in the public domain). Your original statement was that we should copy refs. Many or most of our refs are still under copyright. We would not be able to do what you suggest, imho.
> W

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
Re: [Foundation-l] excluding Wikipedia clones from searching
I am not talking about books, just webpages. Let's take ladygaga.com as an example.

Wayback engine: http://web.archive.org/web/*/http://www.ladygaga.com
Google cache: http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladygaga.com/+lady+gaga&cd=1&hl=de&ct=clnk&gl=de&client=firefox-a

Here are two copies of copyrighted materials. We should make sure that our referenced webpages are in archive.org or mirrored on some server. Ideally we would have our own search engine and cache.

mike

On Fri, Dec 10, 2010 at 9:00 PM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 11:55:21 AM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> I mean Google has copies, caches of items for searching. How can Google cache this? Archive.org has copyrighted materials as well. We should be able to save backups of this material as well. mike
> Mike, I believe your statement lacks evidence. I don't think either of these has available full copies of anything under copyright. If you can give an example, please do so, so I can look at your specific example. Google Books has copies, not Google. The full readable copies are all in the public domain. The snippet views are not. The preview views mean that they actually received *permission* from the copyright holder to do a preview view. That's why it's very rare to find a preview view for any book that predates the internet! You either get snippet or full. Probably the author is actually dead, and they can't find who holds the copyright easily today. Or it's too much trouble for a book that fifteen people look at.
> W

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
Re: [Foundation-l] excluding Wikipedia clones from searching
On Fri, Dec 10, 2010 at 9:54 PM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 12:48:31 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> I am not talking about books, just webpages. Let's take ladygaga.com as an example. Wayback engine: http://web.archive.org/web/*/http://www.ladygaga.com Google cache: http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladygaga.com/+lady+gaga&cd=1&hl=de&ct=clnk&gl=de&client=firefox-a Here are two copies of copyrighted materials. We should make sure that our referenced webpages are in archive.org or mirrored on some server. Ideally we would have our own search engine and cache. mike
> I have no problem with the idea of pointing refs to a page on archive.org; however, you must understand that even previously archived pages *may* be removed from archive.org at the owner's request, or even at the request of a robots.txt entry. The only advantage I see of using archive.org instead of a plain link is the ability to see what a page *looked* like in the past. I'm not sure that's a great advantage. Why do you think it is? If a page comes down, should we not err on the side of assuming the owner no longer wants it public? And if the owner doesn't want it public, are we to make sure it stays public by caching it against their will? Both Google and Archive.org (much to my utter dismay) obey certain rules set up by web page owners to not index certain pages, or to remove them from caching history entirely (even old copies). Are you suggesting we disregard those rules? If not, then I see no advantage in our caching pages which are available in caches already.

My point is we should index them ourselves. We should have the pages used as references listed in an easy-to-use manner, and if possible we should cache them. If they are not cacheable because of some restrictions, the references should be marked somehow as not as good, and people might find better references. In the end, as with CiteSeer, you will find that pages that are available, open and cacheable will be cited and used more than pages that are not. Right now, I don't know of a simple way to even get this list of references from WP. There is a lot of work to do, and if we do this, it will benefit Wikipedia. Another thing to do is to translate the pages referenced.

mike
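As a rough illustration of the "mark pages that refuse caching" idea, here is a minimal Python sketch that checks robots.txt and looks for a noarchive directive. The user-agent string is made up, and a real tool would need a far more careful check:

    import urllib.parse
    import urllib.request
    import urllib.robotparser

    def cacheable(url, user_agent="ref-indexer-sketch"):
        """Return False if robots.txt disallows fetching or the page opts out of cached copies."""
        parts = urllib.parse.urlsplit(url)
        robots = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()
        if not robots.can_fetch(user_agent, url):
            return False
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=30) as resp:
            head = resp.read(200_000).decode("utf-8", errors="replace")
        # <meta name="robots" content="noarchive"> is the usual opt-out from cached copies.
        return "noarchive" not in head.lower()

References for which this returns False could then be flagged as "not cacheable" in the proposed index.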
Re: [Foundation-l] excluding Wikipedia clones from searching
I know all about the aspects of programming and copyright; I thought I answered the questions. Of course I can program this myself, and we can use open source indexing tools for that. The translations, of course, are a separate issue; they would be under the same restrictions as the source page.

If we prefer pages that can be cached and translated, and mark the others that cannot, then by natural selection we will in the long term replace the pages that are not allowed to be cached with ones that can be. My suggestion is for a Wikipedia project, something to be supported and run on the Toolserver or similar.

mike

On Fri, Dec 10, 2010 at 10:19 PM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 1:10:26 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> My point is we should index them ourselves. We should have the pages used as references listed in an easy-to-use manner, and if possible we should cache them. If they are not cacheable because of some restrictions, the references should be marked somehow as not as good, and people might find better references. In the end, as with CiteSeer, you will find that pages that are available, open and cacheable will be cited and used more than pages that are not. Right now, I don't know of a simple way to even get this list of references from WP. There is a lot of work to do, and if we do this, it will benefit Wikipedia. Another thing to do is to translate the pages referenced. mike
> I understand your point, but you're avoiding answering the points I raised. They are archived at archive.org by permission. You tell archive.org to archive your site, and they do. You tell them to stop, and they do. What advantage would we have in repeating the caching that archive.org is already doing? You haven't answered that. No matter what occurs, you're going to have trouble retrieving the list of refs from a WP page (or any web page) without knowing some programming language like PHP. Using PHP it's a fairly trivial parsing request. If that's your only problem, I can write you a script to do it, for twenty bucks. You cannot translate a work which is under copyright protection without violating its copyright. Copyright extends to any effort that substantially mimics the underlying work. A translation is found to violate copyright. You could however make a parody :)
> W

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
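The "fairly trivial parsing" mentioned in the quoted reply might look something like the following, sketched in Python rather than PHP. The regular expression is a deliberate simplification that only catches bare URLs inside <ref>...</ref> blocks, and the article title is just an example:

    import re
    import urllib.parse
    import urllib.request

    def reference_urls(title, wiki="https://en.wikipedia.org"):
        """Fetch an article's wikitext and pull out the URLs used inside its <ref> tags."""
        raw = wiki + "/w/index.php?" + urllib.parse.urlencode({"title": title, "action": "raw"})
        req = urllib.request.Request(raw, headers={"User-Agent": "ref-list-sketch"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            wikitext = resp.read().decode("utf-8")
        urls = []
        for ref in re.findall(r"<ref[^>/]*>(.*?)</ref>", wikitext, flags=re.S):
            urls.extend(re.findall(r"https?://[^\s|\]}<]+", ref))
        return sorted(set(urls))

    print("\n".join(reference_urls("Mary, Queen of Scots")))

A production version would use a real wikitext parser and handle cite templates, named refs and protocol-relative links; this only shows that the extraction itself is not the hard part.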
Re: [Foundation-l] excluding Wikipedia clones from searching
Well, let's backtrack. The original question was: how can we exclude Wikipedia clones from the search? My idea was to create a search engine that includes only refs from Wikipedia in it. Then the idea was to make our own engine instead of only using Google. Let's agree that we first need a list of references, and we can talk about the details of the searching later.

thanks,
mike

On Fri, Dec 10, 2010 at 11:02 PM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 1:31:20 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> If we prefer pages that can be cached and translated, and mark the others that cannot, then by natural selection we will in the long term replace the pages that are not allowed to be cached with ones that can be. My suggestion is for a Wikipedia project, something to be supported and run on the Toolserver or similar.
> I think if you were to propose that we should prefer pages that can be cached and translated, you'd get a firestorm of opposition. The majority of our refs, imho, are still under copyright. This is because the majority of our refs are either web pages created by various authors who do not specify a free license (and which therefore, under U.S. law, automatically enjoy copyright protection), or refs to works which are relatively current and are cited, for example, in Google Books preview mode or at Amazon look-inside pages. I still cannot see any reason why we would want to cache anything like this. You haven't addressed what benefit it gives us to cache refs. My last question here is not about whether we can, or how, but how does it help the project? How?
> W

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
Re: [Foundation-l] excluding Wikipedia clones from searching
On Fri, Dec 10, 2010 at 11:16 PM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 2:12:44 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> Well, let's backtrack. The original question was: how can we exclude Wikipedia clones from the search? My idea was to create a search engine that includes only refs from Wikipedia in it. Then the idea was to make our own engine instead of only using Google. Let's agree that we first need a list of references, and we can talk about the details of the searching later. thanks, mike
> I search for Mary Queen of Scots and I want to exclude Wikipedia clones from my results, because I'm really only interested in... how many times she appears in various Wikipedia pages. Why would I not just use the Wikipedia internal search engine then?

My idea was that you will want to search pages that are already referenced by Wikipedia. In my work on Kosovo it would be very helpful, because there are lots of bad results on Google, and it would be nice to use it also to see how many times certain names occur. That is why we also need our own indexing engine: I would like to count the occurrences of each term and the pages they occur on, and to cross-reference that against names on Wikipedia. Wanted pages could also be assisted like this: what are the most wanted pages that match the most common terms in the new ref index, or in existing pages? These are the things that I would like to do.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
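A minimal sketch of the term-counting idea: fetch each referenced page and build an inverted index mapping terms to the pages and frequencies they occur with. The fetching and tokenising here are deliberately naive, and the example URLs are placeholders:

    import collections
    import re
    import urllib.request

    def page_words(url):
        """Download a referenced page and reduce it to lower-case words (very naive)."""
        req = urllib.request.Request(url, headers={"User-Agent": "ref-index-sketch"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        text = re.sub(r"<[^>]+>", " ", html)              # strip markup
        return re.findall(r"[a-z][a-z'-]+", text.lower())

    def build_index(ref_urls):
        """term -> {url: occurrence count} over all referenced pages."""
        index = collections.defaultdict(collections.Counter)
        for url in ref_urls:
            for word in page_words(url):
                index[word][url] += 1
        return index

    # Cross-reference a Wikipedia name against the index of referenced pages.
    index = build_index(["http://example.com/a", "http://example.com/b"])
    for term in "mary queen of scots".split():
        print(term, dict(index.get(term, {})))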
Re: [Foundation-l] excluding Wikipedia clones from searching
On Sat, Dec 11, 2010 at 12:02 AM, wjhon...@aol.com wrote:
> In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> My idea was that you will want to search pages that are already referenced by Wikipedia. In my work on Kosovo it would be very helpful, because there are lots of bad results on Google, and it would be nice to use it also to see how many times certain names occur. That is why we also need our own indexing engine: I would like to count the occurrences of each term and the pages they occur on, and to cross-reference that against names on Wikipedia. Wanted pages could also be assisted like this: what are the most wanted pages that match the most common terms in the new ref index, or in existing pages?
> Well then all you would need to do is cross-reference the refs themselves. You don't need to cache the underlying pages to which they refer.

Well, I was hoping to look at all the pages that Wikipedia considers valuable enough to be referenced, and to find new information on those pages for other articles. I don't think it is enough to just look at the references on Wikipedia itself; we should resolve them and look at those pages, and also build a list of sites for possible full indexing, or at least some spidering.

> So in your new search engine, when you search for Mary, Queen of Scots you really are saying: show me those external references which are mentioned, in connection with Mary Queen of Scots, by Wikipedia.

Not really: find all pages referenced anywhere by Wikipedia that contain the term Mary, Queen of Scots. Maybe someone added a site to an article on King Henry that contains the text Mary, Queen of Scots but has not been referenced from her article yet. Show me the occurrences of the word, the frequency, maybe the sentence or paragraph it occurs in, and a link to the page, plus the ability to see the cached version if the site is down. It can also be cached on another site, if it is the same version.

> That doesn't require caching the pages to which refs refer. It only requires indexing those refs which currently are used in-world.

Well, indexing normally means caching as well, public or private. You need to copy the pages into the memory of a computer to index them; best is to store them on disk. The first step will be to collect all references, of course, but the second step will be to resolve them. This is also good for checking for dead references and marking them as such.
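The "resolve them and mark dead references" step could start as simply as the sketch below, which records an HTTP status (or DEAD) for each collected reference URL. The output file name, the user-agent and the idea of treating any network failure as dead are working assumptions:

    import csv
    import urllib.request

    def resolve(url):
        """Resolve a reference URL; return its HTTP status, or 'DEAD' if it cannot be fetched."""
        try:
            req = urllib.request.Request(url, method="HEAD",
                                         headers={"User-Agent": "ref-checker-sketch"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.status
        except Exception:
            return "DEAD"

    def mark_dead(ref_urls, out_path="ref_status.csv"):
        """Write url,status rows so dead links can later be flagged in the articles."""
        with open(out_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            for url in ref_urls:
                writer.writerow([url, resolve(url)])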
Re: [Foundation-l] excluding Wikipedia clones from searching
I'm in the process of creating a cleanup tool that checks archive.org and webcitation.org. If a URL is not archived, it checks to see if it is live; if it is, I request that WebCitation archive it on demand, and it fills in the archiveurl parameter of cite templates.

John
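For what it's worth, the check John describes might look roughly like the Python sketch below. The Wayback Machine availability API used here is real; the WebCitation archive-on-demand endpoint, its parameters and the example addresses are assumptions rather than a confirmed interface:

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest archived copy of `url` from the Wayback Machine, or None."""
        api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
        with urllib.request.urlopen(api, timeout=30) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    def is_live(url):
        """Crude liveness check: does the URL answer an HTTP HEAD request?"""
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=30):
                return True
        except Exception:
            return False

    def request_webcitation_archive(url, email):
        """Ask WebCitation to archive `url` on demand (endpoint and params are an assumption)."""
        form = urllib.parse.urlencode({"url": url, "email": email})
        urllib.request.urlopen("https://www.webcitation.org/archive?" + form, timeout=60)

    if __name__ == "__main__":
        ref = "http://example.com/some-cited-article"        # hypothetical cited URL
        archived = wayback_snapshot(ref)
        if archived:
            print("archiveurl =", archived)                   # value for the cite template's archiveurl
        elif is_live(ref):
            request_webcitation_archive(ref, "someone@example.org")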
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/10/2010 2:12:44 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> Well, let's backtrack. The original question was: how can we exclude Wikipedia clones from the search? My idea was to create a search engine that includes only refs from Wikipedia in it. Then the idea was to make our own engine instead of only using Google. Let's agree that we first need a list of references, and we can talk about the details of the searching later. thanks, mike

I search for Mary Queen of Scots and I want to exclude Wikipedia clones from my results, because I'm really only interested in... how many times she appears in various Wikipedia pages. Why would I not just use the Wikipedia internal search engine then?
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/10/2010 1:31:20 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> If we prefer pages that can be cached and translated, and mark the others that cannot, then by natural selection we will in the long term replace the pages that are not allowed to be cached with ones that can be. My suggestion is for a Wikipedia project, something to be supported and run on the Toolserver or similar.

I think if you were to propose that we should prefer pages that can be cached and translated, you'd get a firestorm of opposition. The majority of our refs, imho, are still under copyright. This is because the majority of our refs are either web pages created by various authors who do not specify a free license (and which therefore, under U.S. law, automatically enjoy copyright protection), or refs to works which are relatively current and are cited, for example, in Google Books preview mode or at Amazon look-inside pages.

I still cannot see any reason why we would want to cache anything like this. You haven't addressed what benefit it gives us to cache refs. My last question here is not about whether we can, or how, but how does it help the project? How?

W
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/10/2010 1:10:26 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> My point is we should index them ourselves. We should have the pages used as references listed in an easy-to-use manner, and if possible we should cache them. If they are not cacheable because of some restrictions, the references should be marked somehow as not as good, and people might find better references. In the end, as with CiteSeer, you will find that pages that are available, open and cacheable will be cited and used more than pages that are not. Right now, I don't know of a simple way to even get this list of references from WP. There is a lot of work to do, and if we do this, it will benefit Wikipedia. Another thing to do is to translate the pages referenced. mike

I understand your point, but you're avoiding answering the points I raised. They are archived at archive.org by permission. You tell archive.org to archive your site, and they do. You tell them to stop, and they do. What advantage would we have in repeating the caching that archive.org is already doing? You haven't answered that.

No matter what occurs, you're going to have trouble retrieving the list of refs from a WP page (or any web page) without knowing some programming language like PHP. Using PHP it's a fairly trivial parsing request. If that's your only problem, I can write you a script to do it, for twenty bucks.

You cannot translate a work which is under copyright protection without violating its copyright. Copyright extends to any effort that substantially mimics the underlying work. A translation is found to violate copyright. You could however make a parody :)

W
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> My idea was that you will want to search pages that are already referenced by Wikipedia. In my work on Kosovo it would be very helpful, because there are lots of bad results on Google, and it would be nice to use it also to see how many times certain names occur. That is why we also need our own indexing engine: I would like to count the occurrences of each term and the pages they occur on, and to cross-reference that against names on Wikipedia. Wanted pages could also be assisted like this: what are the most wanted pages that match the most common terms in the new ref index, or in existing pages?

Well then all you would need to do is cross-reference the refs themselves. You don't need to cache the underlying pages to which they refer.

So in your new search engine, when you search for Mary, Queen of Scots you really are saying: show me those external references which are mentioned, in connection with Mary Queen of Scots, by Wikipedia. That doesn't require caching the pages to which refs refer. It only requires indexing those refs which currently are used in-world.

W
Re: [Foundation-l] excluding Wikipedia clones from searching
Hello,

Could you change the URL for Wikiwix, just removing lang=fr, since currently the search results are French and not ml as expected.

Regards,
Pascal Martin
06 13 89 77 32
02 32 40 23 69

----- Original Message -----
From: Mike Dupont jamesmikedup...@googlemail.com
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org
Sent: Wednesday, December 08, 2010 7:58 PM
Subject: Re: [Foundation-l] excluding Wikipedia clones from searching

> I thought about this more. It would be to extract a list of all pages that are included as refs in WP. We would use this for the search engine. We should also make sure that all referenced pages (not linked ones) are stored in archive.org or someplace permanent. I wonder if there is some API to extract this list easily?
> mike
>
> On Wed, Dec 8, 2010 at 6:49 PM, praveenp me.prav...@gmail.com wrote:
>> On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
>>> I know that some Wikipedias customized Special:Search, adding other search engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia added an ability to search using Google (or Bing, or Yahoo, or any other search engine) excluding Wikipedia clones. Does anyone know whether it's possible to build such a thing? And maybe it already exists and I didn't search well enough?
>> http://ml.wikipedia.org/w/index.php?title=Special%3ASearch — not excluding other sites, but only including results from ml.wikipedia.org, using site:ml.wikipedia.org in the query.
>
> --
> James Michael DuPont
> Member of Free Libre Open Source Software Kosova and Albania
> flossk.org flossal.org
Re: [Foundation-l] excluding Wikipedia clones from searching
On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:
> Sounds like we need to have a notable search engine that includes only approved and allowed sources; that would be nice to have.

Sounds like a great community project, Wiki Search!

Domas
Re: [Foundation-l] excluding Wikipedia clones from searching
On Thu, Dec 9, 2010 at 9:55 AM, Domas Mituzas midom.li...@gmail.com wrote:
> On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:
>> Sounds like we need to have a notable search engine that includes only approved and allowed sources; that would be nice to have.
> Sounds like a great community project, Wiki Search!

Yes, it would be great. As I said, it could just include all pages listed as ref pages, and that would allow people to review the results and find pages that should not belong. We also need to cache all these pages, ideally with a revision history. It should be similar to, or use, archive.org. The searching could also use Lucene or some other project; it does not have to be Google.

On this note, I would really like to see a word index for OpenStreetMap as well; there is a huge amount of information in OSM that could be relevant and should be easier to use in WP.

mike
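On the "cache with a revision history" point, the simplest possible approach is to store every fetch of a reference under a timestamped file, so successive snapshots of the same URL sit side by side. A minimal sketch; the directory layout, names and user-agent below are purely illustrative:

    import hashlib
    import pathlib
    import time
    import urllib.request

    CACHE = pathlib.Path("ref-cache")                     # hypothetical cache directory

    def snapshot(url):
        """Save the current content of `url`; one file per fetch gives a crude revision history."""
        req = urllib.request.Request(url, headers={"User-Agent": "ref-cache-sketch"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
        key = hashlib.sha1(url.encode("utf-8")).hexdigest()
        directory = CACHE / key
        directory.mkdir(parents=True, exist_ok=True)
        stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
        path = directory / (stamp + ".html")
        path.write_bytes(body)
        (directory / "url.txt").write_text(url + "\n")    # remember which URL this key belongs to
        return path

A real system would deduplicate identical fetches and feed the stored files to whatever indexer (Lucene or otherwise) is chosen.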
Re: [Foundation-l] excluding Wikipedia clones from searching
> On Thu, Dec 9, 2010 at 9:55 AM, Domas Mituzas midom.li...@gmail.com wrote:
>> On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:
>>> Sounds like we need to have a notable search engine that includes only approved and allowed sources; that would be nice to have.
>> Sounds like a great community project, Wiki Search!
> Yes, it would be great. As I said, it could just include all pages listed as ref pages, and that would allow people to review the results and find pages that should not belong. We also need to cache all these pages, ideally with a revision history. It should be similar to, or use, archive.org. The searching could also use Lucene or some other project; it does not have to be Google. On this note, I would really like to see a word index for OpenStreetMap as well; there is a huge amount of information in OSM that could be relevant and should be easier to use in WP.
> mike

OpenStreetMap is a wiki still in the Wild West phase. Words cannot express the nonsense it hosts.

Fred
User:Fred Bauder
Re: [Foundation-l] excluding Wikipedia clones from searching
On Thu, Dec 9, 2010 at 12:52 PM, Fred Bauder fredb...@fairpoint.net wrote:
>> Yes, it would be great. As I said, it could just include all pages listed as ref pages, and that would allow people to review the results and find pages that should not belong. We also need to cache all these pages, ideally with a revision history. It should be similar to, or use, archive.org. The searching could also use Lucene or some other project; it does not have to be Google. On this note, I would really like to see a word index for OpenStreetMap as well; there is a huge amount of information in OSM that could be relevant and should be easier to use in WP.
>> mike
> OpenStreetMap is a wiki still in the Wild West phase. Words cannot express the nonsense it hosts.

If you are looking for a place named X, or for a location for some article, then it would be nice to have a better search engine for that content; Wikipedia can help. Of course the WP articles are of a higher standard than a lot of the OSM data, but OSM has greater coverage. There are a lot of articles with no coordinates that could be fixed, or assisted, by editors having a faster and better index of the OSM data, no doubt.

mike
Re: [Foundation-l] excluding Wikipedia clones from searching
In a message dated 12/9/2010 2:51:39 AM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
> Yes, it would be great. As I said, it could just include all pages listed as ref pages, and that would allow people to review the results and find pages that should not belong. We also need to cache all these pages, ideally with a revision history. It should be similar to, or use, archive.org.

We would not be able to do that for copyright reasons. Some if not most of the refs are still under copyright; we cannot make copies of those pages.
Re: [Foundation-l] excluding Wikipedia clones from searching
On Thu, Dec 9, 2010 at 6:02 PM, wjhon...@aol.com wrote:
> In a message dated 12/9/2010 2:51:39 AM Pacific Standard Time, jamesmikedup...@googlemail.com writes:
>> Yes, it would be great. As I said, it could just include all pages listed as ref pages, and that would allow people to review the results and find pages that should not belong. We also need to cache all these pages, ideally with a revision history. It should be similar to, or use, archive.org.
> We would not be able to do that for copyright reasons. Some if not most of the refs are still under copyright; we cannot make copies of those pages.

Google does it, archive.org (the Wayback Machine) does it; we can copy them for caching and searching, I assume. We are not changing the license, just preventing the information from disappearing on us.

mike
[Foundation-l] excluding Wikipedia clones from searching
The Google test used to be a tool for checking the notability of a subject or for finding sources about it. For some languages it may also be used for other purposes - for example in Hebrew, the spelling of which is not established so well, it is very frequently used for finding the most common spelling, especially for article titles. It was never the ultimate tool, of course, but it was useful.

With the proliferation of sites that indiscriminately copy Wikipedia content it is becoming less and less useful. For some time I used to fight this problem by adding -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a wall: Google limits the search string to 32 words, and today there are many more than 32 sites that clone Wikipedia, so this trick is also becoming useless.

I know that some Wikipedias customized Special:Search, adding other search engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia added an ability to search using Google (or Bing, or Yahoo, or any other search engine) while excluding Wikipedia clones. Does anyone know whether it's possible to build such a thing? And maybe it already exists and I didn't search well enough?

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
We're living in pieces, I want to live in peace. - T. Moore
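For illustration, the kind of query-building that the 32-word limit breaks: a tiny helper that appends -site: clauses from a clone list and stops before the limit. Whether each -site: clause counts as exactly one word is an assumption, and the clone list is just the examples from the message above:

    CLONES = ["wikipedia.org", "wapedia.mobi", "miniwiki.org"]   # extend with known clone sites

    def google_query(terms, clones=CLONES, max_words=32):
        """Build a query like 'foo bar -site:wikipedia.org ...' within the assumed word limit."""
        words = list(terms)
        for host in clones:
            if len(words) + 1 > max_words:
                break                                            # the wall described above
            words.append("-site:" + host)
        return " ".join(words)

    print(google_query(["Mary", "Queen", "of", "Scots"]))

With far more than 32 clones, most of the list never makes it into the query, which is exactly the problem being described.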
Re: [Foundation-l] excluding Wikipedia clones from searching
On 12/08/2010 12:46 PM, Amir E. Aharoni wrote:
> The Google test used to be a tool for checking the notability of a subject or for finding sources about it. For some languages it may also be used for other purposes - for example in Hebrew, the spelling of which is not established so well, it is very frequently used for finding the most common spelling, especially for article titles. It was never the ultimate tool, of course, but it was useful.
> With the proliferation of sites that indiscriminately copy Wikipedia content it is becoming less and less useful. For some time I used to fight this problem by adding -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a wall: Google limits the search string to 32 words, and today there are many more than 32 sites that clone Wikipedia, so this trick is also becoming useless.

You may try -wikipedia -ויקיפדיה to narrow it down further, but I don't think there is any full solution.
Re: [Foundation-l] excluding Wikipedia clones from searching
On Wed, Dec 8, 2010 at 10:46 PM, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:
> For some time I used to fight this problem by adding -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a wall: Google limits the search string to 32 words, and today there are many more than 32 sites that clone Wikipedia, so this trick is also becoming useless.

If you have Firefox there's an addon that will let you filter out mirrors (among other things). See: http://meta.wikimedia.org/wiki/Mirror_filter

--
Stephen Bain
stephen.b...@gmail.com
Re: [Foundation-l] excluding Wikipedia clones from searching
If the copyright license has been followed, -wikipedia should exclude all clones. However, often, material is copied without crediting it to Wikipedia.

Fred
User:Fred Bauder

> The Google test used to be a tool for checking the notability of a subject or for finding sources about it. For some languages it may also be used for other purposes - for example in Hebrew, the spelling of which is not established so well, it is very frequently used for finding the most common spelling, especially for article titles. It was never the ultimate tool, of course, but it was useful. With the proliferation of sites that indiscriminately copy Wikipedia content it is becoming less and less useful. For some time I used to fight this problem by adding -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a wall: Google limits the search string to 32 words, and today there are many more than 32 sites that clone Wikipedia, so this trick is also becoming useless. I know that some Wikipedias customized Special:Search, adding other search engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia added an ability to search using Google (or Bing, or Yahoo, or any other search engine) while excluding Wikipedia clones. Does anyone know whether it's possible to build such a thing? And maybe it already exists and I didn't search well enough?
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> We're living in pieces, I want to live in peace. - T. Moore
Re: [Foundation-l] excluding Wikipedia clones from searching
On Wed, Dec 8, 2010 at 15:42, Fred Bauder fredb...@fairpoint.net wrote:
> If the copyright license has been followed, -wikipedia should exclude all clones. However, often, material is copied without crediting it to Wikipedia.

Yes, but that may also exclude sites that are useful and original, but happen to mention Wikipedia.
Re: [Foundation-l] excluding Wikipedia clones from searching
On 8 December 2010 15:26, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:
> Yes, but that may also exclude sites that are useful and original, but happen to mention Wikipedia.

Add -"quoted sentence from the article intro" to the search?

- d.
Re: [Foundation-l] excluding Wikipedia clones from searching
Sounds like we need to have a notable search engine that includes only approved and allowed sources; that would be nice to have.

On Wed, Dec 8, 2010 at 5:08 PM, David Gerard dger...@gmail.com wrote:
> On 8 December 2010 15:26, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:
>> Yes, but that may also exclude sites that are useful and original, but happen to mention Wikipedia.
> Add -"quoted sentence from the article intro" to the search?
> - d.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
Re: [Foundation-l] excluding Wikipedia clones from searching
On 8 December 2010 11:46, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:
> For some time I used to fight this problem by adding -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a wall: Google limits the search string to 32 words, and today there are many more than 32 sites that clone Wikipedia, so this trick is also becoming useless.

As noted above you can use -wikipedia; alternatively, keywords common on mirrors, such as -mediawiki or -gfdl, could be worth trying.

--
- Andrew Gray
andrew.g...@dunelm.org.uk
Re: [Foundation-l] excluding Wikipedia clones from searching
On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
> I know that some Wikipedias customized Special:Search, adding other search engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia added an ability to search using Google (or Bing, or Yahoo, or any other search engine) while excluding Wikipedia clones. Does anyone know whether it's possible to build such a thing? And maybe it already exists and I didn't search well enough?

http://ml.wikipedia.org/w/index.php?title=Special%3ASearch — not excluding other sites, but only including results from ml.wikipedia.org, using site:ml.wikipedia.org in the query.
Re: [Foundation-l] excluding Wikipedia clones from searching
I thought about this more. It would be to extract a list of all pages that are included as refs in WP. We would use this for the search engine. We should also make sure that all referenced pages (not linked ones) are stored in archive.org or someplace permanent. I wonder if there is some API to extract this list easily?

mike

On Wed, Dec 8, 2010 at 6:49 PM, praveenp me.prav...@gmail.com wrote:
> On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
>> I know that some Wikipedias customized Special:Search, adding other search engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia added an ability to search using Google (or Bing, or Yahoo, or any other search engine) while excluding Wikipedia clones. Does anyone know whether it's possible to build such a thing? And maybe it already exists and I didn't search well enough?
> http://ml.wikipedia.org/w/index.php?title=Special%3ASearch — not excluding other sites, but only including results from ml.wikipedia.org, using site:ml.wikipedia.org in the query.

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
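On the question of an API: the standard MediaWiki web API can already list an article's external links, which is probably the closest thing to the list of referenced pages asked for here. A minimal sketch (prop=extlinks is a real API module; the article title and user-agent are just examples):

    import json
    import urllib.parse
    import urllib.request

    def external_links(title, api="https://en.wikipedia.org/w/api.php"):
        """Yield the external link targets recorded for one article, following API continuation."""
        params = {"action": "query", "prop": "extlinks", "ellimit": "max",
                  "titles": title, "format": "json"}
        while True:
            req = urllib.request.Request(api + "?" + urllib.parse.urlencode(params),
                                         headers={"User-Agent": "ref-list-sketch"})
            with urllib.request.urlopen(req, timeout=30) as resp:
                data = json.load(resp)
            for page in data["query"]["pages"].values():
                for link in page.get("extlinks", []):
                    yield link["*"]
            if "continue" not in data:
                break
            params.update(data["continue"])

    for url in external_links("Kosovo"):
        print(url)

Running this per article (or over a dump) would give the raw list of referenced URLs that the proposed search engine and archiving checks could start from.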