Re: [Cloud] protobufs?

2019-11-25 Thread Roy Smith
Hi Andrew,

Thanks for that info.  I've never heard of JSON Schema before.  I've done a bit 
of reading on it.  As far as I can tell, it's pretty much a 1:1 mapping to 
proto specification language.  What's not clear to me is how you actually use 
JSON Schema in real life.

I get that it provides documentation of the schema.  That's pretty obvious.

It's not clear how the validation part works.  Does a production data consumer 
validate every incoming JSON object it receives?  Do producers validate every 
object they send?

Beyond that, what else do you do with JSON Schema?  Is there some sort of code 
generation aspect to it, as with the proto compiler?

Over the past few days, I've been playing with some code a bit, and thinking a 
lot.  I'm slowly coming to the conclusion that JSON is the way for me to go, 
for pretty much the reasons you outlined in T198256.  The biggest advantage I 
can see to protos (outside of the immersive Google infrastructure) is 
efficiency.  If I need better performance later, swapping out one for the other 
doesn't seem like it would be a major problem.
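
A minimal sketch of how that swap could be kept cheap, assuming the wire format is hidden behind a small interface.  Every name here (EditEvent, Codec, JsonCodec, round_trip) is hypothetical and invented for illustration; nothing below comes from an existing WMF codebase.

```python
import json
from dataclasses import dataclass, asdict
from typing import Protocol


@dataclass
class EditEvent:
    # Hypothetical record; the fields are made up for illustration.
    user: str
    page: str
    timestamp: str


class Codec(Protocol):
    """Anything that can turn an EditEvent into bytes and back."""

    def encode(self, event: EditEvent) -> bytes: ...

    def decode(self, data: bytes) -> EditEvent: ...


class JsonCodec:
    """JSON implementation of the Codec interface."""

    def encode(self, event: EditEvent) -> bytes:
        return json.dumps(asdict(event), sort_keys=True).encode("utf-8")

    def decode(self, data: bytes) -> EditEvent:
        return EditEvent(**json.loads(data.decode("utf-8")))


def round_trip(codec: Codec, event: EditEvent) -> EditEvent:
    # Application code depends only on the Codec interface, so a
    # protobuf-backed codec could be passed in here without other changes.
    return codec.decode(codec.encode(event))
```

With this shape, moving to protos later would mean writing one more class with the same two methods; the code that produces and consumes EditEvent objects would not need to change.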

On a vaguely related note, I saw in the recent Cloud Services Survey, a 
question that mentioned MongoDB.  Is mongo used inside of WMF?  It seems like 
it would be a natural in a JSON shop.  I don't see it running on the bastion 
hosts.

> On Nov 25, 2019, at 11:14 AM, Andrew Otto wrote:
> 
> Hi Roy,
> 
> We had to evaluate data formats for event streaming systems as part of WMF's 
> Modern Event Platform program.  The event streaming world mostly uses Avro, 
> but there is plenty of Protobufs around too.
> 
> We ultimately decided to use JSON with JSONSchema as our transport format.  
> While lacking some advantages of the other binary options, JSON is just more 
> ubiquitous and easier to work with in a distributed and open source focused 
> developer community.  (You don't need the schema to read the data.)
> 
> More reading:
> - Choose Schema Tech RFC 
> - An old JSON justification blog post 
> 
> 
> Our choice of JSONSchema and JSON is mostly around canonical data schemas for 
> in-flight data transport and protocols.  For data at rest, it might make more 
> sense to serialize into something completely different (we use Parquet in 
> Hadoop for most data there).  You can read some WIP documentation about how 
> we use JSONSchema here.
> 
> On Fri, Nov 22, 2019 at 10:14 PM Roy Smith wrote:
> I'm starting to look at some machine learning projects I've wanted to do for 
> a while (ex: sock-puppet detection).  This quickly leads to having to make 
> decisions about data storage formats, i.e. csv, json, protobufs, etc.  Left 
> to my own devices, I'd probably use protos, but I don't want to be swimming 
> upstream.
> 
> Are there any standards in wiki-land for how people store data?  If there's 
> some common way that "everybody does it", that's how I want to do it too.  
> Or, does every project just do their own thing?
> ___
> Wikimedia Cloud Services mailing list
> Cloud@lists.wikimedia.org (formerly lab...@lists.wikimedia.org)
> https://lists.wikimedia.org/mailman/listinfo/cloud

Re: [Cloud] protobufs?

2019-11-25 Thread Andrew Otto
Hi Roy,

We had to evaluate data formats for event streaming systems as part of
WMF's Modern Event Platform program.  The event streaming world mostly uses
Avro, but there is plenty of Protobufs around too.

We ultimately decided to use JSON with JSONSchema as our transport format.
While lacking some advantages of the other binary options, JSON is just
more ubiquitous and easier to work with in a distributed and open source
focused developer community.  (You don't need the schema to read the data.)
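
To illustrate the parenthetical above: a short sketch showing that a JSON payload can be parsed, and its field names and basic types discovered, without any schema file.  The payload and its field names are made up for this example and are not an actual WMF event.

```python
import json

# A made-up event payload, as a consumer might receive it off the wire.
raw = (
    b'{"meta": {"stream": "example.edit", "dt": "2019-11-25T16:14:00Z"}, '
    b'"page_title": "Main_Page", "rev_id": 12345}'
)


def describe(payload: bytes) -> dict:
    """Parse a JSON payload and report each top-level field's Python type."""
    event = json.loads(payload)
    return {name: type(value).__name__ for name, value in event.items()}


fields = describe(raw)
print(fields)
# prints {'meta': 'dict', 'page_title': 'str', 'rev_id': 'int'}
```

By contrast, a Protobuf message on the wire carries field numbers rather than field names, so a reader needs the .proto definition to know what each value means.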

More reading:
- Choose Schema Tech RFC 
- An old JSON justification blog post


Our choice of JSONSchema and JSON is mostly around canonical data schemas
for in-flight data transport and protocols.  For data at rest, it might
make more sense to serialize into something completely different (we use
Parquet in Hadoop for most data there).  You can read some WIP
documentation about how we use JSONSchema here.

On Fri, Nov 22, 2019 at 10:14 PM Roy Smith wrote:

> I'm starting to look at some machine learning projects I've wanted to do
> for a while (ex: sock-puppet detection).  This quickly leads to having to
> make decisions about data storage formats, i.e. csv, json, protobufs, etc.
> Left to my own devices, I'd probably use protos, but I don't want to be
> swimming upstream.
>
> Are there any standards in wiki-land for how people store data?  If
> there's some common way that "everybody does it", that's how I want to do
> it too.  Or, does every project just do their own thing?

Re: [Cloud] [Cloud-announce] Cloud VPS users, please claim your projects -- one week left

2019-11-25 Thread Andrew Bogott
Many thanks to all of you who have acted on this already!  There are now 
17 unclaimed projects -- these will be shut down next week if they 
remain unclaimed.  They are:


butterfly
design
etcd
hat-imagescalers
indico
lewton-test
mcr-dev
orig
queryrapi
social-tools
structurednavigation
test-twemproxy
visualeditor
wikifactmine
wikilabels
wmf-research-tools
wpx


On 9/30/19 11:24 AM, Andrew Bogott wrote:
Every year or so the Cloud Services team tries to identify and clean 
up unused projects and VMs.  We do this via an opt-in process: anyone 
can mark a project as 'in use,' and that project will be preserved for 
another year.


I've created a wiki page that lists all existing projects, here:

https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2019_Purge

If you are a VPS user, please visit that page and mark any projects 
that you use as {{Used}}.  Note that it's not necessary for you to be 
a project admin to mark something -- if you know that you're currently 
using a resource and want to keep using it, go ahead and mark it 
accordingly.  If you /are/ a project admin, please take a moment to 
mark which VMs are or aren't used in your projects.


When December arrives, I will shut down and begin the process of 
reclaiming resources from unused projects.


If you think you use a VPS project but aren't sure which, I encourage 
you to poke around on https://tools.wmflabs.org/openstack-browser/ to 
see what looks familiar.  Worst case, just email 
cloud@lists.wikimedia.org with a description of your use case and 
we'll sort it out there.


Toolforge-only users are free to ignore this task.

Thank you!

-Andrew and WMCS team




___
Wikimedia Cloud Services announce mailing list
cloud-annou...@lists.wikimedia.org (formerly labs-annou...@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud-announce