Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-11 Thread ????
On 10/12/2010 23:51, John Doe wrote:
 I'm in the process of creating a cleanup tool that checks archive.org and
 webcitation.org. If a URL is not archived, it checks to see if it is live;
 if it is, I request that WebCitation archive it on demand, and the tool fills in
 the archiveurl parameter of cite templates.


What is the point of doing that? If a URL goes missing, the information 
should be re-found from another source. If it can't be re-referenced, then 
perhaps it wasn't quite as reliable as one first thought, and if URLs 
aren't stable on any particular site, then maybe one should re-examine 
the reliability of the originating source.

Most dead URLs that I see, the ones that can't be re-found, come from references 
to online articles about minor events in BLPs. Simply put, the event was 
recorded on Monday and was fish-and-chip wrapping by Thursday. Or, to put it 
another way, non-notable in the grand scheme of things. In some cases 
the original source may also have removed the content because it was 
untrue and could not be substantiated.

Stuffing URLs across to archive.org or webcitation.org simply 
perpetuates unsubstantiated gossip. One really ought to examine one's 
motives for doing that.



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread WJhonson
In a message dated 12/9/2010 11:06:30 PM Pacific Standard Time, 
jamesmikedup...@googlemail.com writes:


 Google does it, archive.org (wayback machine) does it, we can copy
 them for caching and searching i assume. we are not changing the
 license, but just preventing the information from disappearing on us. 
 

You are thinking of refs which are out of copyright.
Google Books only gives snippet views of some books that are still under copyright 
and for which they've not gotten permission to show an entire page at a time 
(which is preview mode).

archive.org likewise has copies of works that are out of copyright (or otherwise in 
the public domain).

Your original statement was that we should copy refs.  Many or most of our 
refs are still under copyright.
We would not be able to do what you suggest, imho.

W


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
I mean that Google has copies, caches of items, for searching.
How can Google cache this?
Archive.org has copyrighted materials as well.
We should be able to save backups of this material as well.
mike

On Fri, Dec 10, 2010 at 5:16 PM,  wjhon...@aol.com wrote:
 In a message dated 12/9/2010 11:06:30 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 Google does it, archive.org (wayback machine) does it, we can copy
 them for caching and searching i assume. we are not changing the
 license, but just preventing the information from disappearing on us. 


 You are thinking of refs which are out-of-copyright.
 Google books only gives snippet views of some books still under copyright
 for which they've not gotten permission to show an entire page at a time
 (which is preview mode).

 archive.org as well has copies of works out-of-copyright (or otherwise in
 the public domain)

 Your original statement was that we should copy refs.  Many or most of our
 refs are under copyright still.
 We would not be able to do what you suggest imho.

 W




-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
I am not talking about books, just webpages.

Let's take ladygaga.com as an example.

Wayback Machine:
http://web.archive.org/web/*/http://www.ladygaga.com

Google cache:
http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladygaga.com/+lady+gaga&cd=1&hl=de&ct=clnk&gl=de&client=firefox-a

Here are two copies of copyrighted materials. We should make sure that
our referenced webpages are in archive.org or mirrored on some server.
Ideally we would have our own search engine and cache.

mike

On Fri, Dec 10, 2010 at 9:00 PM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 11:55:21 AM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 i mean google has copies, caches of items for searching.
 How can google cache this?
 Archive.org has copyrighted materials as well.
 We should be able to save backups of this material as well.
 mike



 Mike, I believe your statement lacks evidence.
 I don't think either of these has available full copies of anything under
 copyright.
 If you can give an example, please do so, so I can look at your specific
 example.

 Google Books has copies, not Google.  The full readable copies are all in the
 public domain.
 The snippet views are not.  The preview views mean that they actually
 received *permission* from the copyright holder to do a preview view.

 That's why it's very rare to find a preview view for any book that predates
 the internet!  You either get snippet or full.
 Probably the author is actually dead, and they can't easily find who holds the
 copyright today.  Or it's too much trouble for a book that fifteen
 people look at.

 W



-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
On Fri, Dec 10, 2010 at 9:54 PM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 12:48:31 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 I am not talking about books, just webpages.

 lets take ladygaga.com as example

 Wayback engine :
 http://web.archive.org/web/*/http://www.ladygaga.com

 Google cache:
 http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladygaga.com/+lady+gaga&cd=1&hl=de&ct=clnk&gl=de&client=firefox-a

 here are two copies of copyrighted materials, we should make sure that
 our referenced webpages are in archive.org or mirrored on some server.
 Ideally we would have our own search engine and cache.

 mike


 I have no problem with the idea of pointing refs to a page on archive.org;
 however, you must understand that even previously archived pages *may* be
 removed from archive.org at the owner's request, or even at the request of a
 robots.txt entry.

 The only advantage I see to using archive.org instead of a plain link is
 the ability to see what a page *looked* like in the past.  I'm not sure
 that's a great advantage.  Why do you think it is?  If a page comes down,
 should we not err on the side of assuming the owner no longer wants it
 public?  And if the owner doesn't want it public, are we to make sure it stays
 public by caching it against their will?

 Both Google and Archive.org (much to my utter dismay) obey certain rules set
 up by web page owners to not index certain pages, or to remove them from
 caching history entirely (even old copies).  Are you suggesting we disregard
 those rules?  If not, then I see no advantage in our caching pages which are
 available in caches already.

My point is that we should index them ourselves. We should first have the pages
used as references listed in an easy-to-use manner, and if
possible we should cache them. If they are not cacheable because of
some restrictions, the references should be marked somehow as not as
good, and people might find better references. In the end, as with
CiteSeer, you will find that pages that are available, open and
cacheable will be cited and used more than pages that are not.

Right now, I don't know of a simple way to even get this list of
references from WP. There is a lot of work to do, and if we do this, it
will benefit Wikipedia. Another thing to do is to translate the
pages referenced.

mike
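
One way to get that list without screen-scraping is the MediaWiki API's prop=extlinks module, which enumerates the external links recorded for a page. A minimal Python sketch, assuming the English Wikipedia endpoint and today's JSON continuation format; note that extlinks returns every external link on the page, not only those inside <ref> tags:

import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def external_links(title):
    # Yield every external link recorded for one article via prop=extlinks,
    # following the API's continuation until the list is exhausted.
    params = {"action": "query", "prop": "extlinks", "titles": title,
              "ellimit": "max", "format": "json"}
    while True:
        url = API + "?" + urllib.parse.urlencode(params)
        req = urllib.request.Request(url, headers={"User-Agent": "ref-list-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        for page in data.get("query", {}).get("pages", {}).values():
            for link in page.get("extlinks", []):
                yield link["*"]            # the URL itself
        if "continue" not in data:
            break
        params.update(data["continue"])

for url in external_links("Mary, Queen of Scots"):
    print(url)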



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
I know all about the programming and copyright aspects; I thought I
had answered the questions.
Of course I can program this myself, and we can use open-source
indexing tools for that. The translations, of course, are a separate
issue; they would be under the same restrictions as the source page.

If we prefer pages that can be cached and translated, and mark the
others that cannot, then by natural selection we will in the long term
replace the pages that are not allowed to be cached with ones that
can be.

My suggestion is for a Wikipedia project, something to be supported
and run on the Toolserver or similar.

mike

On Fri, Dec 10, 2010 at 10:19 PM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 1:10:26 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 My point is we should index them ourselves. We should have the pages
 used as references first listed in an easy to use manner and if
 possible we should cache them. If they are not cacheable because of
 some restrictions, the references should be marked somehow as not as
 good and people might find better references. In the end, like
 citeseer you will find that pages that are available and open and
 cachable will be cited and used more than pages that are not.

 Right now, I dont know of a simple way to even get this list of
 references from wp. There is alot of work to do, and if we do this, it
 will benefit the wikipedia. Another thing to do is to translate the
 pages referenced.

 mike


 I understand your point, but you're avoiding answering the points I raised.
 They are archived at archive.org by permission.  You tell archive.org to
 archive your site, and they do.  You tell them to stop, and they do.
 What advantage would we have to repeat the caching yet again that
 archive.org is already doing?  You haven't answered that.

 No matter what occurs, you're going to have trouble retrieving the list of
 refs from a WP page (or any web page), without knowing some programming
 language like PHP.  Using PHP it's a fairly trivial parsing request.  It's
 that's your only problem, I can write you a script to do it, for twenty
 bucks.

 You cannot translate a work, which is under copyright protection, without
 violating their copyright.  Copyright extends to any effort that
 substantially mimics the underlying work.  A translation is found to violate
 copyright.  You could however make a parody :)

 W




-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
Well, let's backtrack.
The original question was: how can we exclude Wikipedia clones from the search?
My idea was to create a search engine that includes only refs from
Wikipedia.
Then the idea was to make our own engine instead of only using Google.
Let's agree that we first need a list of references, and we can talk
about the details of the searching later.
thanks,
mike

On Fri, Dec 10, 2010 at 11:02 PM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 1:31:20 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 If we prefer pages that can be cached and translated, and mark the
 others that cannot, then by natural selection we will in long term
 replaces the pages that are not allowed to be cached with ones that
 can be.

 My suggestion is for a wikipedia project, something to be supported
 and run on the toolserver or similar.


 I think if you were to propose that we should prefer pages that can be
 cached and translated you'd get a firestorm of opposition.
 The majority of our refs, imho, are still under copyright.  This is because
 the majority of our refs are either web pages created by various authors who
 do not specify a free license (and therefore under U.S. law automatically
 enjoy copyright protection).  Or they are refs to works which are relatively
 current, and are cited, for example in Google Books Preview mode, or at
 Amazon look-inside pages.

 I still cannot see any reason why we would want to cache anything like
 this.  You haven't addressed what benefit it gives us, to cache refs.
 My last question here is not about whether we can or how, but how does it
 help the project?

 How?

 W




-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
On Fri, Dec 10, 2010 at 11:16 PM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 2:12:44 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 Well, lets backtrack.
 The original question was, how can we exclude wikipedia clones from the
 search.
 my idea was to create a search engine that includes only refs from
 wikipedia in it.
 then the idea was to make our own engine instead of only using google.
 lets agree that we need first a list of references and we can talk
 about the details of the searching later.
 thanks,
 mike


 I search for Mary Queen of Scots and I want to exclude Wikipedia clones
 from my results, because I'm really only interested in... how many times she
 appears in various Wikipedia pages.  Why would I not just use the Wikipedia
 internal search engine then?

My idea was that you will want to search pages that are already referenced by
Wikipedia. In my work on Kosovo it would be very helpful,
because there are lots of bad results on Google, and it would be nice
to use it also to see how many times certain names occur.
That is why we also need our own indexing engine: I would like to
count the occurrences of each term and what page they occur on, and to
cross-reference that against names on Wikipedia. Wanted pages could also
be assisted like this: what are the most wanted pages that match
against the most common terms in the new ref index or in existing
pages?

These are the things that I would like to do.
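
A toy sketch of that term counting, in Python: fetch each referenced page, strip the markup very crudely, and build an inverted index mapping each term to per-page occurrence counts. The fetching and tokenisation here are deliberately naive placeholders, not a proposal for the real crawler:

import collections
import re
import urllib.request

def fetch_text(url):
    # Download a referenced page and strip the HTML tags very crudely.
    req = urllib.request.Request(url, headers={"User-Agent": "ref-index-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.sub(r"<[^>]+>", " ", html)

def build_index(urls):
    # Map each lower-cased term to a Counter of {url: occurrence count}.
    index = collections.defaultdict(collections.Counter)
    for url in urls:
        for term in re.findall(r"\w+", fetch_text(url).lower()):
            index[term][url] += 1
    return index

index = build_index(["http://www.example.com/"])
print(index["example"].most_common(5))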

-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread Mike Dupont
On Sat, Dec 11, 2010 at 12:02 AM,  wjhon...@aol.com wrote:
 In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 my idea was that you will want to search pages that are referenced by
 wikipedia already, in my work on kosovo, it would be very helpful
 because there are lots of bad results on google, and it would be nice
 to use that also to see how many times certain names occur.
 That is why we need also our own indexing engine, I would like to
 count the occurances of each term and what page they occur on, and to
 xref that to names on wikipedia against them. Wanted pages could also
 be assisted like this, what are the most wanted pages that match
 against the most common terms in the new refindex or also existing
 pages.



 Well then all you would need to do is cross-reference the refs themselves.
 You don't need to cache the underlying pages to which they refer.

Well, I was hoping to look at all the pages that Wikipedia considers
valuable enough to be referenced, and to find new information on
those pages for other articles. I don't think it is enough to just look
at the references on the Wikipedia itself; we should resolve them and
look at those pages, and also build a list of sites for possible
full indexing, or at least some spidering.


 So in your new search engine, when you search for Mary, Queen of Scots you
 really are saying, show me those external references, which are mentioned,
 in connection with Mary Queen of Scots, by Wikipedia.

Not really: find all pages referenced in total by the Wikipedia that
contain the term Mary, Queen of Scots. Maybe someone added a site to
an article on King Henry that contains the text Mary, Queen of Scots
but that has not been referenced yet.

Show me the occurrences of the word, the frequency, maybe the
sentence or paragraph it occurs in, and a link to the page, with the
ability to see the cached version if the site is down. It can also be
cached on another site as well, if it is the same version.

 That doesn't require caching the pages to which refs refer.  It only
 requires indexing those refs which currently are used in-world.

Well, indexing normally means caching as well, public or private. You
need to copy the pages into the memory of a computer to index them;
best is to store them on disk.

The first step will be to collect all references, of course, but the
second step will be to resolve them. This is also a good way to check for
dead references and mark them as such.
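
A minimal sketch of that resolve-and-store step, in Python: each referenced URL is fetched and written to disk under a hash of the URL, and anything that no longer responds is reported as dead. The cache directory name and the crude error handling are assumptions for illustration only:

import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("refcache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_reference(url):
    # Fetch one referenced URL; return the cache file on success, None if dead.
    target = CACHE_DIR / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
    req = urllib.request.Request(url, headers={"User-Agent": "ref-cache-sketch/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            target.write_bytes(resp.read())
        return target
    except Exception:
        return None  # the reference would be flagged as dead in the report

urls = ["http://www.example.com/"]          # placeholder list of collected refs
dead = [u for u in urls if cache_reference(u) is None]
print("dead references:", dead)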



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread John Doe
I'm in the process of creating a cleanup tool that checks archive.org and
webcitation.org. If a URL is not archived, it checks to see if it is live;
if it is, I request that WebCitation archive it on demand, and the tool fills in
the archiveurl parameter of cite templates.

John
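
A minimal sketch of the kind of check described above, in Python, assuming the Wayback Machine availability endpoint (archive.org/wayback/available) and WebCite's on-demand archiving form; both endpoints and the final template-editing step are assumptions for illustration, not the actual tool:

import json
import urllib.parse
import urllib.request

def wayback_snapshot(url):
    # Return the closest Wayback Machine snapshot URL, or None if not archived.
    api = "http://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api, timeout=15) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

def is_live(url):
    # Rough liveness check: the page answers without raising an HTTP error.
    try:
        with urllib.request.urlopen(url, timeout=10):
            return True
    except Exception:
        return False

def request_webcite_archive(url, email):
    # Ask WebCite to archive the page on demand (assumed endpoint and parameters).
    query = urllib.parse.urlencode({"url": url, "email": email})
    with urllib.request.urlopen("http://www.webcitation.org/archive?" + query, timeout=30) as resp:
        return resp.status == 200

# The tool would then write the result back into the citation, e.g.
# {{cite web |url=... |archiveurl=... |archivedate=...}}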


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread WJhonson
In a message dated 12/10/2010 2:12:44 PM Pacific Standard Time, 
jamesmikedup...@googlemail.com writes:


 Well, lets backtrack.
 The original question was, how can we exclude wikipedia clones from the 
 search.
 my idea was to create a search engine that includes only refs from
 wikipedia in it.
 then the idea was to make our own engine instead of only using google.
 lets agree that we need first a list of references and we can talk
 about the details of the searching later.
 thanks,
 mike
 

I search for Mary Queen of Scots and I want to exclude Wikipedia clones 
from my results, because I'm really only interested in... how many times she 
appears in various Wikipedia pages.  Why would I not just use the Wikipedia 
internal search engine then?


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread WJhonson
In a message dated 12/10/2010 1:31:20 PM Pacific Standard Time, 
jamesmikedup...@googlemail.com writes:


 If we prefer pages that can be cached and translated, and mark the
 others that cannot, then by natural selection we will in long term
 replaces the pages that are not allowed to be cached with ones that
 can be.
 
 My suggestion is for a wikipedia project, something to be supported
 and run on the toolserver or similar.
 

I think if you were to propose that we should prefer pages that can be 
cached and translated, you'd get a firestorm of opposition.
The majority of our refs, imho, are still under copyright.  This is because 
the majority of our refs are either web pages created by various authors 
who do not specify a free license (and which therefore automatically enjoy 
copyright protection under U.S. law), or refs to works which are relatively 
current and are cited, for example, in Google Books preview mode or on 
Amazon look-inside pages.

I still cannot see any reason why we would want to cache anything like 
this.  You haven't addressed what benefit caching refs gives us.
My last question here is not about whether we can, or how, but how does it 
help the project?

How?

W


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread WJhonson
In a message dated 12/10/2010 1:10:26 PM Pacific Standard Time, 
jamesmikedup...@googlemail.com writes:


 My point is we should index them ourselves. We should have the pages
 used as references first listed in an easy to use manner and if
 possible we should cache them. If they are not cacheable because of
 some restrictions, the references should be marked somehow as not as
 good and people might find better references. In the end, like
 citeseer you will find that pages that are available and open and
 cachable will be cited and used more than pages that are not.
 
 Right now, I dont know of a simple way to even get this list of
 references from wp. There is alot of work to do, and if we do this, it
 will benefit the wikipedia. Another thing to do is to translate the
 pages referenced.
 
 mike
 

I understand your point, but you're avoiding answering the points I raised.
They are archived at archive.org by permission.  You tell archive.org to 
archive your site, and they do.  You tell them to stop, and they do.
What advantage would we gain by repeating the caching that 
archive.org is already doing?  You haven't answered that.

No matter what occurs, you're going to have trouble retrieving the list of 
refs from a WP page (or any web page) without knowing some programming 
language like PHP.  Using PHP it's a fairly trivial parsing request.  If 
that's your only problem, I can write you a script to do it, for twenty bucks.

You cannot translate a work which is under copyright protection without 
violating its copyright.  Copyright extends to any effort that 
substantially mimics the underlying work.  A translation has been found to 
violate copyright.
You could however make a parody :)

W


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-10 Thread WJhonson
In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time,  
jamesmikedup...@googlemail.com writes:


 my idea was that you will want to search pages that are referenced by
 wikipedia already, in my work on kosovo, it would be very helpful
 because there are lots of bad results on google, and it would be nice
 to use that also to see how many times certain names occur.
 That is why we need also our own indexing engine, I would like to
 count the occurances of each term and what page they occur on, and to
 xref that to names on wikipedia against them. Wanted pages could also
 be assisted like this, what are the most wanted pages that match
 against the most common terms in the new refindex or also existing
 pages.
 


Well then all you would need to do is cross-reference the refs themselves.  
You don't need to cache the underlying pages to which they refer.

So in your new search engine, when you search for Mary, Queen of Scots, 
you really are saying: show me those external references which are mentioned, 
in connection with Mary, Queen of Scots, by Wikipedia.

That doesn't require caching the pages to which refs refer.  It only 
requires indexing those refs which currently are used in-world.

W


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Pascal Martin
Hello,

Could you change the URL for Wikiwix, just removing lang=fr, since currently 
the search results are French and not ml as expected.

Regards
Pascal Martin
06 13 89 77 32
02 32 40 23 69


- Original Message - 
From: Mike Dupont jamesmikedup...@googlemail.com
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org
Sent: Wednesday, December 08, 2010 7:58 PM
Subject: Re: [Foundation-l] excluding Wikipedia clones from searching


I thought about this more,
 It would be to extract a list of all pages that are included as ref
 in the WP. We would use this for the search engine.
 we should also make sure that all referenced pages (not linked ones)
 are stored in archive.org or someplace permanent.
 I wonder if there is some API to extract this list easily?
 mike

 On Wed, Dec 8, 2010 at 6:49 PM, praveenp me.prav...@gmail.com wrote:
 On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
 I know that some Wikipedias customized Special:Search, adding other 
 search
 engines except Wikipedias built-in one. I tried to see whether any 
 Wikipedia
 added an ability to search using Google (or Bing, or Yahoo, or any other
 search engine) excluding Wikipedia clones. Does anyone know whether it's
 possible to build such a thing? And maybe it already exists and i didn't
 search well enough?

 http://ml.wikipedia.org/w/index.php?title=Special%3ASearch

 not excluding other sites, but only including results from
 ml.wikipedia.org using site:ml.wikipedia.org in query





 -- 
 James Michael DuPont
 Member of Free Libre Open Source Software Kosova and Albania
 flossk.org flossal.org





Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Domas Mituzas

On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:

 Sounds like we need to have a notable search engine that includes only
 approved and allowed sources, that would be nice to have.

Sounds like a great community project, Wiki Search!

Domas



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Mike Dupont
On Thu, Dec 9, 2010 at 9:55 AM, Domas Mituzas midom.li...@gmail.com wrote:

 On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:

 Sounds like we need to have a notable search engine that includes only
 approved and allowed sources, that would be nice to have.

 Sounds like a great community project, Wiki Search!

Yes, it would be great. As I said, it could just include all pages
listed as REF pages, and that would allow people to review the results
and find pages that do not belong.

We also need to cache all these pages, ideally with a revision
history. It should be similar to, or use, archive.org.

The searching could also use Lucene or some other project. It does not
have to be Google.

On this note, I would really like to see a word index for OpenStreetMap
as well; there is a huge amount of information in OSM that could be
relevant and should be easier to use in WP.

mike



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Fred Bauder
 On Thu, Dec 9, 2010 at 9:55 AM, Domas Mituzas midom.li...@gmail.com
 wrote:

 On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:

 Sounds like we need to have a notable search engine that includes only
 approved and allowed sources, that would be nice to have.

 Sounds like a great community project, Wiki Search!

 yes it would be great. As i said, it could just include all pages
 listed as REF pages and that would allow people to review the results
 and find pages that should not belong.

 We also need to cache all these pages, best would be with a revision
 history. It should be similar to or using archive.org.

 The searching could also use lucene or some other project. It does not
 have to be google.

 On this note, I would really like to see a wordindex for openstreetmap
 as well, there is a huge amount of information that could be relevant
 in osm that should be easier to use in WP.

 mike

OpenStreetMap is a wiki still in the Wild West phase. Words cannot
express the nonsense it hosts.

Fred

User:Fred Bauder





Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Mike Dupont
On Thu, Dec 9, 2010 at 12:52 PM, Fred Bauder fredb...@fairpoint.net wrote:
 On Thu, Dec 9, 2010 at 9:55 AM, Domas Mituzas midom.li...@gmail.com
 wrote:

 On Dec 8, 2010, at 6:21 PM, Mike Dupont wrote:

 Sounds like we need to have a notable search engine that includes only
 approved and allowed sources, that would be nice to have.

 Sounds like a great community project, Wiki Search!

 yes it would be great. As i said, it could just include all pages
 listed as REF pages and that would allow people to review the results
 and find pages that should not belong.

 We also need to cache all these pages, best would be with a revision
 history. It should be similar to or using archive.org.

 The searching could also use lucene or some other project. It does not
 have to be google.

 On this note, I would really like to see a wordindex for openstreetmap
 as well, there is a huge amount of information that could be relevant
 in osm that should be easier to use in WP.

 mike

 Openstreetmap is a wiki still in the Wild West phase. Words cannot
 express the nonsense it hosts.

If you are looking for a place named X, or a location for some
article, then it would be nice to have a better search engine for that
content. Wikipedia can help. Of course the WP articles are of a higher
standard than a lot of OSM data, but OSM has greater coverage. There
are a lot of articles with no coords that could be fixed, or assisted, by
editors having a faster and better index to the OSM data, no doubt.
mike



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread WJhonson
In a message dated 12/9/2010 2:51:39 AM Pacific Standard Time, 
jamesmikedup...@googlemail.com writes:


 yes it would be great. As i said, it could just include all pages
 listed as REF pages and that would allow people to review the results
 and find pages that should not belong.
 
 We also need to cache all these pages, best would be with a revision
 history. It should be similar to or using archive.org.
 

We would not be able to do that for copyright reasons.
Some, if not most, of the refs are still under copyright; we cannot make 
copies of those pages.


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-09 Thread Mike Dupont
On Thu, Dec 9, 2010 at 6:02 PM,  wjhon...@aol.com wrote:
 In a message dated 12/9/2010 2:51:39 AM Pacific Standard Time,
 jamesmikedup...@googlemail.com writes:


 yes it would be great. As i said, it could just include all pages
 listed as REF pages and that would allow people to review the results
 and find pages that should not belong.

 We also need to cache all these pages, best would be with a revision
 history. It should be similar to or using archive.org.


 We would not be able to do that for copyright reasons.
 Some if not most of the refs are still under copyright, we cannot make
 copies of those pages.

Google does it, and archive.org (the Wayback Machine) does it, so I assume we
can copy them for caching and searching. We are not changing the
license, just preventing the information from disappearing on us.

mike



[Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Amir E. Aharoni
The Google test used to be a tool for checking the notability of a subject
or for finding sources about it. For some languages it may also be used for
other purposes - for example in Hebrew, the spelling of which is not so
well established, it is very frequently used for finding the most common
spelling, especially for article titles. It was never the ultimate tool, of
course, but it was useful. With the proliferation of sites that
indiscriminately copy Wikipedia content it is becoming less and less useful.

For some time I used to fight this problem by adding -site:wikipedia.org
-site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but I hit a
wall: Google limits the search string to 32 words, and today there are many
more than 32 sites that clone Wikipedia, so this trick is also becoming
useless.
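
The exclusion list can be generated mechanically from a list of known mirrors, which also makes the 32-term ceiling easy to check. A tiny Python illustration; the mirror list here is a made-up sample, not a real inventory of clones:

# Build a Google query that excludes known Wikipedia mirrors.
mirrors = ["wikipedia.org", "wapedia.mobi", "miniwiki.org"]   # sample only
exclusions = " ".join("-site:" + host for host in mirrors)
query = "Mary Queen of Scots " + exclusions
print(query)
print("terms in query:", len(query.split()), "- Google stops honouring terms past 32")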

I know that some Wikipedias have customized Special:Search, adding other search
engines besides Wikipedia's built-in one. I tried to see whether any Wikipedia
has added an ability to search using Google (or Bing, or Yahoo, or any other
search engine) while excluding Wikipedia clones. Does anyone know whether it's
possible to build such a thing? And maybe it already exists and I didn't
search well enough?

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
We're living in pieces,
 I want to live in peace. - T. Moore


Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Nikola Smolenski
On 12/08/2010 12:46 PM, Amir E. Aharoni wrote:
 The Google test used to be a tool for checking the notability of a subject
 or to find sources about it. For some languages it may be also used for
 other purposes - for example in Hebrew, the spelling of which is not
 established so well, it is very frequently used for finding the most common
 spelling, especially for article titles. It was never the ultimate tool, of
 course, but it was useful. With the proliferation of sites that
 indiscriminately copy Wikipedia content it is becoming less and less useful.

 For some time i used to fight this problem by adding 
 -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but i hit a
 wall: Google limits the search string to 32 words, and today there are many
 more than 32 sites that clone Wikipedia, so this trick is also becoming
 useless.

You may try -wikipedia -ויקיפדיה to narrow it down further, but I 
don't think there is any full solution.



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Stephen Bain
On Wed, Dec 8, 2010 at 10:46 PM, Amir E. Aharoni
amir.ahar...@mail.huji.ac.il wrote:

 For some time i used to fight this problem by adding 
 -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but i hit a
 wall: Google limits the search string to 32 words, and today there are many
 more than 32 sites that clone Wikipedia, so this trick is also becoming
 useless.

If you have Firefox there's an addon that will let you filter out
mirrors (among other things). See:

http://meta.wikimedia.org/wiki/Mirror_filter

-- 
Stephen Bain
stephen.b...@gmail.com



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Fred Bauder
If the copyright license has been followed, -wikipedia should exclude all
clones. However, material is often copied without crediting it to
Wikipedia.

Fred

User:Fred Bauder

 The Google test used to be a tool for checking the notability of a
 subject
 or to find sources about it. For some languages it may be also used for
 other purposes - for example in Hebrew, the spelling of which is not
 established so well, it is very frequently used for finding the most
 common
 spelling, especially for article titles. It was never the ultimate tool,
 of
 course, but it was useful. With the proliferation of sites that
 indiscriminately copy Wikipedia content it is becoming less and less
 useful.

 For some time i used to fight this problem by adding
 -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but i hit a
 wall: Google limits the search string to 32 words, and today there are
 many
 more than 32 sites that clone Wikipedia, so this trick is also becoming
 useless.

 I know that some Wikipedias customized Special:Search, adding other
 search
 engines except Wikipedias built-in one. I tried to see whether any
 Wikipedia
 added an ability to search using Google (or Bing, or Yahoo, or any other
 search engine) excluding Wikipedia clones. Does anyone know whether it's
 possible to build such a thing? And maybe it already exists and i didn't
 search well enough?

 --
 Amir Elisha Aharoni · אָמִיר אֱלִישָׁע
 אַהֲרוֹנִי
 http://aharoni.wordpress.com
 We're living in pieces,
  I want to live in peace. - T. Moore






Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Amir E. Aharoni
On Wed, Dec 8, 2010 at 15:42, Fred Bauder fredb...@fairpoint.net wrote:

 If the copyright license has been followed -wikipedia should exclude all
 clones. However, often, material is copied without crediting it to
 Wikipedia.

Yes, but that may also exclude sites that are useful and original and
merely happen to mention Wikipedia.



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread David Gerard
On 8 December 2010 15:26, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:

 Yes, but that may also exclude sites that are useful and original, but
 happen to mention Wikipedia.

Add -"quoted sentence from article intro" to the search?


- d.



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Mike Dupont
Sounds like we need to have a notable search engine that includes only
approved and allowed sources; that would be nice to have.

On Wed, Dec 8, 2010 at 5:08 PM, David Gerard dger...@gmail.com wrote:
 On 8 December 2010 15:26, Amir E. Aharoni amir.ahar...@mail.huji.ac.il 
 wrote:

 Yes, but that may also exclude sites that are useful and original, but
 happen to mention Wikipedia.

 Add -quoted sentence from article intro to the search?


 - d.





-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Andrew Gray
On 8 December 2010 11:46, Amir E. Aharoni amir.ahar...@mail.huji.ac.il wrote:

 For some time i used to fight this problem by adding 
 -site:wikipedia.org -site:wapedia.mobi -site:miniwiki.org etc. to my search queries, but i hit a
 wall: Google limits the search string to 32 words, and today there are many
 more than 32 sites that clone Wikipedia, so this trick is also becoming
 useless.

As noted above you can use -wikipedia; alternatively, keywords common on
mirrors, such as -mediawiki or -gfdl, could be worth trying.

-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread praveenp
On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
 I know that some Wikipedias customized Special:Search, adding other search
 engines except Wikipedias built-in one. I tried to see whether any Wikipedia
 added an ability to search using Google (or Bing, or Yahoo, or any other
 search engine) excluding Wikipedia clones. Does anyone know whether it's
 possible to build such a thing? And maybe it already exists and i didn't
 search well enough?

http://ml.wikipedia.org/w/index.php?title=Special%3ASearch

Not excluding other sites, but only including results from 
ml.wikipedia.org, using site:ml.wikipedia.org in the query.



Re: [Foundation-l] excluding Wikipedia clones from searching

2010-12-08 Thread Mike Dupont
I thought about this more.
It would be to extract a list of all pages that are included as refs
in the WP; we would use this for the search engine.
We should also make sure that all referenced pages (not merely linked ones)
are stored in archive.org or someplace permanent.
I wonder if there is some API to extract this list easily?
mike

On Wed, Dec 8, 2010 at 6:49 PM, praveenp me.prav...@gmail.com wrote:
 On Wednesday 08 December 2010 05:16 PM, Amir E. Aharoni wrote:
 I know that some Wikipedias customized Special:Search, adding other search
 engines except Wikipedias built-in one. I tried to see whether any Wikipedia
 added an ability to search using Google (or Bing, or Yahoo, or any other
 search engine) excluding Wikipedia clones. Does anyone know whether it's
 possible to build such a thing? And maybe it already exists and i didn't
 search well enough?

 http://ml.wikipedia.org/w/index.php?title=Special%3ASearch

 not excluding other sites, but only including results from
 ml.wikipedia.org using site:ml.wikipedia.org in query





-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova and Albania
flossk.org flossal.org
