“Cowen has sufficient credentials to be treated as a reliable expert”

Maybe not for much longer.

Cheers, P.

 

From: The Cunctator [mailto:cuncta...@gmail.com] 
Sent: 17 March 2023 17:49
To: Wikimedia Mailing List
Subject: [Wikimedia-l] Re: Bing-ChatGPT

 

This is an important development for editors to be aware of - we're going to 
have to be increasingly on the lookout for sources using ML-generated bullshit. 
Here are two instances I'm aware of this week:

 

https://www.thenation.com/article/culture/internet-archive-publishers-lawsuit-chatbot/
In late February, Tyler Cowen, a libertarian economics professor at George 
Mason University, published a blog post 
<https://web.archive.org/web/20230305055906/https:/marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html> 
titled “Who was the most important critic of the printing press in the 17th 
century?” Cowen’s post contended that the polymath and statesman Francis Bacon 
was an “important” critic of the printing press; unfortunately, the post 
contains long, fake quotes attributed to Bacon’s The Advancement of Learning 
(1605), complete with false chapter and section numbers.
Tech writer Mathew Ingram drew attention to the fabrications 
<https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/> 
a few days later, noting that Cowen has been writing approvingly 
<https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html> 
about the AI chatbot ChatGPT for some time now; several commenters on Cowen’s 
post assumed the fake quotes must be the handiwork of ChatGPT. (Cowen did not 
reply to e-mailed questions regarding the post by press time, and later removed 
the post entirely, with no explanation whatsoever. However, a copy remains at 
the Internet Archive’s Wayback Machine.)

 

 
https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-article-claiming-misinformation-in-navalny-doc
An article claiming to identify misinformation in an Oscar-winning documentary 
about imprisoned Russian dissident Alexei Navalny is itself full of 
misinformation, thanks to the author using AI. 
Investigative news outlet The Grayzone recently published an article 
<https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/> 
that included AI-generated text as a source for its information. The piece 
<http://web.archive.org/web/20230314131551/https:/thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>, 
“Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy 
Komisar, included hyperlinks to PDFs 
<http://web.archive.org/web/20230314121144/https:/www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf> 
uploaded to the author’s personal website that appear to be screenshots of 
conversations she had with ChatSonic, a free generative AI chatbot that 
advertises itself as a ChatGPT alternative that can “write factual trending 
content” using Google search results.

That said, I don't think this is anything to be too stressed about; the 
Grayzone is already a deprecated source and blogs like Marginal Revolution are 
treated with caution, though Cowen has sufficient credentials to be treated as 
a reliable expert.

 

On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen <kimmo.virta...@wikimedia.fi> 
wrote:

Hi,

 

The development of open-source large language models is moving forward. GPT-4 
was released, and it seems that it passed the bar exam and tried to hire humans 
to solve CAPTCHAs that were too complex for it. However, development on the 
open-source and hacking side has been pretty fast, and it seems that all the 
pieces are there for running LLM models on personal hardware (and in web 
browsers). The biggest missing piece is fine-tuning of open-source models such 
as NeoX for English. For multilingual and multimodal use (for example, 
images + text), such a model is also needed.

 

So this is a link dump of things relevant to building an open-source LLM model 
and service, and also a recap of where the hacker community is now.

 

1.) Creation of an initial unaligned model (see the sketch after this list).

·         Possible models

·          <https://github.com/EleutherAI/gpt-neox> 20b Neo(X) by EleutherAI 
(Apache 2.0)

·          <https://huggingface.co/KoboldAI/fairseq-dense-13B> Fairseq Dense by 
Facebook (MIT license)

·          <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> 
LLaMa by Facebook (custom license, leaked research use only)

·          <https://huggingface.co/bigscience/bloom> Bloom by BigScience 
(<https://huggingface.co/spaces/bigscience/license> custom license: open, 
non-commercial)
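
For concreteness, a minimal Python sketch of loading one of these base models 
with the Hugging Face transformers library and sampling from it; the model 
name, device settings, and generation parameters below are illustrative 
assumptions, not a recommendation:

# Minimal sketch: load an open base model (here GPT-NeoX-20B) and sample from it.
# Assumes the `transformers` and `accelerate` packages and enough GPU/CPU memory;
# all settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # let accelerate spread layers across available devices
    torch_dtype="auto",   # keep the checkpoint's native precision
)

inputs = tokenizer("The printing press was", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))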

 

2.) Fine-tuning / alignment (see the sketch after this list)

·         Example: Stanford Alpaca is LLaMA fine-tuned to follow instructions, 
ChatGPT-style

·          <https://crfm.stanford.edu/2023/03/13/alpaca.html> Alpaca: A Strong, 
Replicable Instruction-Following Model

·          <https://replicate.com/blog/replicate-alpaca> Train and run Stanford 
Alpaca on your own machine

·          <https://github.com/tloen/alpaca-lora> Github: Alpaca-LoRA: Low-Rank 
LLaMA Instruct-Tuning
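
A minimal Python sketch of the LoRA idea behind Alpaca-LoRA, using the Hugging 
Face peft library; the base model, target module names, and hyperparameters are 
assumptions for illustration, not the actual Alpaca training recipe:

# Minimal sketch: attach low-rank (LoRA) adapters to a base causal LM so that
# only the small adapter matrices are trained during instruction tuning.
# Assumes `transformers` and `peft`; all values are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", device_map="auto"
)

lora_config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # NeoX attention projection layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here, training proceeds as usual (e.g. transformers.Trainer) on tokenized
# instruction/response pairs; the adapter weights are saved separately and are
# only a small fraction of the base model's size.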

 

3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements 
(see the sketch after this list)

·          <https://til.simonwillison.net/llms/llama-7b-m2> Running LLaMA 7B 
and 13B on a 64GB M2 MacBook Pro with llama.cpp

·         Github:  <https://github.com/NouamaneTazi/bloomz.cpp> bloomz.cpp &  
<https://github.com/ggerganov/llama.cpp> llama.cpp (C++ only versions)

·          
<https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and> Int-4 
LLaMa is not enough - Int-3 and beyond

·          <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible> How 
is LLaMa.cpp possible?
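
A rough sense of why this matters: 20B parameters at fp16 is about 40 GB of 
weights, int8 about 20 GB, and 4-bit roughly 10 GB. Below is a minimal Python 
sketch of 8-bit loading via transformers + bitsandbytes; llama.cpp's 4-/3-bit 
GGML path is a separate, CPU-oriented toolchain, and this is just an assumed 
illustration of the same idea:

# Minimal sketch: load a model with int8 weights to roughly halve memory use
# versus fp16. Assumes `bitsandbytes` and `accelerate` are installed and a CUDA
# GPU is available; the model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,   # quantize linear-layer weights to int8 at load time
)

inputs = tokenizer("Running a 20B model at home", return_tensors="pt").to(model.device)
print(tokenizer.decode(
    model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True
))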

 

4.) Easy-to-use interfaces (see the sketch after this list)

·          <https://xenova.github.io/transformers.js/> Transformers.js 
(WebAssembly libraries to run LLM models in the browser)

·          <https://github.com/cocktailpeanut/dalai> Dalai (run LLaMA and 
Alpaca on your own computer as a Node.js web service)

·          <https://github.com/mlc-ai/web-stable-diffusion> 
web-stable-diffusion (Stable Diffusion image generation in the browser)
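
Dalai and Transformers.js are JavaScript tools; as a rough Python analogue of 
the same "wrap a local model in a small web service" idea, here is a minimal 
sketch using FastAPI and a tiny placeholder model. This is an assumption for 
illustration, not how either project actually works:

# Minimal sketch: serve a local model over HTTP, loosely analogous to what
# Dalai does with Node.js. Assumes `fastapi`, `uvicorn`, and `transformers`;
# "gpt2" is just a small placeholder model.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 50

@app.post("/generate")
def generate(prompt: Prompt):
    result = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"completion": result[0]["generated_text"]}

# Run with:  uvicorn server:app --port 8000
# Then:      curl -X POST localhost:8000/generate \
#                 -H 'Content-Type: application/json' -d '{"text": "Hello"}'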

 

Br,

-- Kimmo Virtanen

 


 

On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.wall...@gmail.com> wrote:

 

 

On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <l...@lu.is> wrote:

On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipe...@gmail.com> wrote:

Luis,

OpenAI researchers have released some info about data sources that
trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165

See section 2.2, starting on page 8 of the PDF.

The full text of English Wikipedia is one of five sources, the others
being CommonCrawl, a smaller subset of scraped websites based on
upvoted reddit links, and two unrevealed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other dataset. With the extra weighting, they say Wikipedia accounts
for 3% of the total training.


Thanks, Sage. It turns out Facebook’s recently released LLaMA also shares some 
of its training sources, with a similar weighting for Wikipedia: only 4.5% of 
the training text, but weighted more heavily than most other sources:

https://twitter.com/GuillaumeLample/status/1629151234597740550

 

Those stats are undercounting, since the top source (CommonCrawl) itself 
includes Wikipedia as its third-largest source.

 

https://commoncrawl.github.io/cc-crawl-statistics/plots/domains

 


 


 

 

_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/R5XGX25WRYRN3XDFO2TNYCVGNUMHO24V/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
