[Wikitech-l] forking media files

2011-08-15 Thread Peter Gervai
Let me retitle one of the topics nobody seems to touch.

On Fri, Aug 12, 2011 at 13:44, Brion Vibber  wrote:

> * media files -- these are freely copiable but I'm not sure the state of
> easily obtaing them in bulk. As the data set moved into TB it became
> impractical to just build .tar dumps. There are batch downloader tools
> available, and the metadata's all in dumps and api.

Right now it is basically locked: there is no way to bulk copy the
media files, including doing simply a backup of one wikipedia, or
commons. I've tried, I've asked, and the answer was basically to
contact a dev and arrange it, which obviously could be done (I know
many of the folks) but that isn't the point.

Some explanations were mentioned, mostly mentioning that media and its
metadata is quite detached, and thus it's hard to enforce licensing
quirks like attribution, special licenses and such. I can guess this
is a relevant comment since the text corpus is uniformly licensed
under CC/GFDL while the media files are at best non-homogeneous (like
commons, where everything's free in a way) and completely chaos at its
worst (individual wikipedias, where there may be anything from
leftover fair use to copyrighted by various entities to images to be
deleted "soon").

Still, I do not believe it's a good method to make it close to
impossible to bulk copy the data. I am not sure which technical means
is best, as there are many competing ones.

We could, for example, open up an API which would serve media file
with its metadata together, possibly supporting mass operations.
Still, it's pretty ineffective.

Or we could support zsync, rsync and such (and I again recommend
examining zsync's several interesting abilities to offload the work to
the client), but there ought to be some pointers to image metadata, at
least an oneliner file with every image linking to the license page.

Or we could connect the bulk way to established editor accounts, so we
could have at least a bit of an assurance that s/he knows what s/he's
doing.

-- 
 byte-byte,
    grin

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Russell N. Nelson - rnnelson
The problem is that 1) the files are bulky, 2) there are many of them, 3) they 
are in constant flux, and 4) it's likely that your connection would close for 
whatever reason part-way through the download.. Even taking a snapshot of the 
filenames is dicey. By the time you finish, it's likely that there will be new 
ones, and possible that some will be deleted. Probably the best way to make 
this work is to 1) make a snapshot of files periodically, 2) create an API 
which returns a tarball using the snapshot of files that also implements Range 
requests. The snapshot of filenames would have to include file sizes so the 
server would know where to restart. Once the snapshot had not been accessed in 
a week, it would be deleted. As a snapshot got older and older it would be less 
and less accurate, but hey, life is tough that way.

Of course, this would result in a 12-terabyte file on the recipient's host. 
That wouldn't work very well. I'm pretty sure that the recipient would need an 
http client which would 1) keep track of the place in the bytestream and 2) 
split out files and write them to disk as separate files. It's possible that a 
program like getbot already implements this.


From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] on behalf of Peter Gervai 
[grin...@gmail.com]
Sent: Monday, August 15, 2011 4:45 AM
To: Wikimedia developers
Subject: [Wikitech-l] forking media files

Let me retitle one of the topics nobody seems to touch.

On Fri, Aug 12, 2011 at 13:44, Brion Vibber  wrote:

> * media files -- these are freely copiable but I'm not sure the state of
> easily obtaing them in bulk. As the data set moved into TB it became
> impractical to just build .tar dumps. There are batch downloader tools
> available, and the metadata's all in dumps and api.

Right now it is basically locked: there is no way to bulk copy the
media files, including doing simply a backup of one wikipedia, or
commons. I've tried, I've asked, and the answer was basically to
contact a dev and arrange it, which obviously could be done (I know
many of the folks) but that isn't the point.

Some explanations were mentioned, mostly mentioning that media and its
metadata is quite detached, and thus it's hard to enforce licensing
quirks like attribution, special licenses and such. I can guess this
is a relevant comment since the text corpus is uniformly licensed
under CC/GFDL while the media files are at best non-homogeneous (like
commons, where everything's free in a way) and completely chaos at its
worst (individual wikipedias, where there may be anything from
leftover fair use to copyrighted by various entities to images to be
deleted "soon").

Still, I do not believe it's a good method to make it close to
impossible to bulk copy the data. I am not sure which technical means
is best, as there are many competing ones.

We could, for example, open up an API which would serve media file
with its metadata together, possibly supporting mass operations.
Still, it's pretty ineffective.

Or we could support zsync, rsync and such (and I again recommend
examining zsync's several interesting abilities to offload the work to
the client), but there ought to be some pointers to image metadata, at
least an oneliner file with every image linking to the license page.

Or we could connect the bulk way to established editor accounts, so we
could have at least a bit of an assurance that s/he knows what s/he's
doing.

--
 byte-byte,
grin

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Peter Gervai
On Mon, Aug 15, 2011 at 18:40, Russell N. Nelson - rnnelson
 wrote:
> The problem is that 1) the files are bulky,

That's expected. :-)

> 2) there are many of them, 3) they are in constant flux,

That is not really a problem: since there are many of them
statistically they are not in flux.

> and 4) it's likely that your connection would close for whatever reason 
> part-way through the download..

I seem not to forgot to mention zsync/rsync. ;-)

> Even taking a snapshot of the filenames is dicey. By the time you finish, 
> it's likely that there will be new ones, and possible that some will be 
> deleted. Probably the best way to make this work is to 1) make a snapshot of 
> files periodically,

Since I've been told they're backed up it naturally should exist.

> 2) create an API which returns a tarball using the snapshot of files that 
> also implements Range requests.

I would very much prefer ready-to-use format instead of a tarball, not
to mention it's pretty resource consuming to create a tarball just for
that.

> Of course, this would result in a 12-terabyte file on the recipient's host. 
> That wouldn't work very well. I'm pretty sure that the recipient would need 
> an http client which would 1) keep track of the place in the bytestream and 
> 2) split out files and write them to disk as separate files. It's possible 
> that a program like getbot already implements this.

I'd make a snapshot without tar especially because partial transfers
aren't possible that way.

-- 
 byte-byte,
    grin

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Russell N. Nelson - rnnelson
I hate this email client. Hate, hate, hate. Thank you, Microsoft, for making my 
life that little bit worse. Anyway, you can't rely on the media files being 
stored in a filesystem. They could be stored in a database or an object 
storage. So *sync is not available. I don't know how the media files are backed 
up. If you only want the originals, that's a lot less than 12TB (or whatever is 
the current number for thumbs+origs). If you just want to fetch a tarball, wget 
or curl will automaticallly restart a connection and supply a range parameter 
if the server supports it. If you want a ready-to-use format, then you're going 
to need a client which can write individual files. But it's not particularly 
efficient to stream 120B files over a separate TCP connection. You'd have to 
have a client which can do TCP session reuse. No matter how you cut it, you're 
looking at a custom client. But there's no need to invent a new download 
protocol or stream format. That's why I suggest tarball and range. Standards 
... they're not just for breakfast.


From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] on behalf of Peter Gervai 
[grin...@gmail.com]
Sent: Monday, August 15, 2011 5:40 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files

On Mon, Aug 15, 2011 at 18:40, Russell N. Nelson - rnnelson
 wrote:
> The problem is that 1) the files are bulky,

That's expected. :-)

> 2) there are many of them, 3) they are in constant flux,

That is not really a problem: since there are many of them
statistically they are not in flux.

> and 4) it's likely that your connection would close for whatever reason 
> part-way through the download..

I seem not to forgot to mention zsync/rsync. ;-)

> Even taking a snapshot of the filenames is dicey. By the time you finish, 
> it's likely that there will be new ones, and possible that some will be 
> deleted. Probably the best way to make this work is to 1) make a snapshot of 
> files periodically,

Since I've been told they're backed up it naturally should exist.

> 2) create an API which returns a tarball using the snapshot of files that 
> also implements Range requests.

I would very much prefer ready-to-use format instead of a tarball, not
to mention it's pretty resource consuming to create a tarball just for
that.

> Of course, this would result in a 12-terabyte file on the recipient's host. 
> That wouldn't work very well. I'm pretty sure that the recipient would need 
> an http client which would 1) keep track of the place in the bytestream and 
> 2) split out files and write them to disk as separate files. It's possible 
> that a program like getbot already implements this.

I'd make a snapshot without tar especially because partial transfers
aren't possible that way.

--
 byte-byte,
grin

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Brion Vibber
On Mon, Aug 15, 2011 at 3:14 PM, Russell N. Nelson - rnnelson <
rnnel...@clarkson.edu> wrote:

> Anyway, you can't rely on the media files being stored in a filesystem.
> They could be stored in a database or an object storage. So *sync is not
> available.


Note that at this moment, upload file storage on the Wikimedia sites is
still 'some big directories on an NFS server', but at some point is planned
to migrate to a Swift cluster backend:
http://www.mediawiki.org/wiki/Extension:SwiftMedia

A file dump / bulk-fetching intermediary would probably need to speak to
MediaWiki to get lists of available files and then bump back via the backend
to actually obtain them.

There's no reason why this couldn't speak something like the rsync protocol,
of course.

I don't know how the media files are backed up. If you only want the
> originals, that's a lot less than 12TB (or whatever is the current number
> for thumbs+origs). If you just want to fetch a tarball, wget or curl will
> automaticallly restart a connection and supply a range parameter if the
> server supports it. If you want a ready-to-use format, then you're going to
> need a client which can write individual files. But it's not particularly
> efficient to stream 120B files over a separate TCP connection. You'd have to
> have a client which can do TCP session reuse. No matter how you cut it,
> you're looking at a custom client. But there's no need to invent a new
> download protocol or stream format. That's why I suggest tarball and range.
> Standards ... they're not just for breakfast.
>

Range on a tarball assumes that you have a static tarball file -- or else
predictable, unchanging snapshot of its contents that can be used to
simulate one:

1) every filename in the data set, in order
2) every file's exact size and version
3) every other bit of file metadata that might go into constructing that
tarball

or else actually generating and storing a giant tarball, and then keeping it
around long enough for all clients to download the whole thing -- obviously
not very attractive.

Since every tiny change (*any* new file, *any* changed file, *any* deleted
file) would alter the generated tarball and shift terabytes of data around,
this doesn't seem like it would be a big win for anything other than initial
downloads of the full data set (or else batching up specifically-requested
files).


Anything that involves updating your mirror/copy/fork/backup needs to work
in a more live fashion, that only needs to transfer new data for things that
have changed. rsync can check for differences but still needs to go over the
full file list (and so still takes a Long Time and lots of bandwidth just to
do that).

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Russell N. Nelson - rnnelson
Exactly what I propose. Keep a list of files and their sizes, so that when 
somebody asks for a range, you can skip files up until you get to the range 
they've requested. 
Not worrying about new or already-downloaded changed files, or deleted files. 
You're not getting a "current" copy of the files, you're getting a copy of the 
files that were available when you started your download. Minus the deleted 
files, which by policy we shouldn't be handing out anyway.

rsync doesn't have the MW database to consult for changes.


From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] on behalf of Brion Vibber 
[br...@pobox.com]
Sent: Monday, August 15, 2011 6:31 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files

On Mon, Aug 15, 2011 at 3:14 PM, Russell N. Nelson - rnnelson <
rnnel...@clarkson.edu> wrote:

> 
> download protocol or stream format. That's why I suggest tarball and range.
> Standards ... they're not just for breakfast.
>

Range on a tarball assumes that you have a static tarball file -- or else
predictable, unchanging snapshot of its contents that can be used to
simulate one:

1) every filename in the data set, in order
2) every file's exact size and version
3) every other bit of file metadata that might go into constructing that
tarball

or else actually generating and storing a giant tarball, and then keeping it
around long enough for all clients to download the whole thing -- obviously
not very attractive.

Since every tiny change (*any* new file, *any* changed file, *any* deleted
file) would alter the generated tarball and shift terabytes of data around,
this doesn't seem like it would be a big win for anything other than initial
downloads of the full data set (or else batching up specifically-requested
files).


Anything that involves updating your mirror/copy/fork/backup needs to work
in a more live fashion, that only needs to transfer new data for things that
have changed. rsync can check for differences but still needs to go over the
full file list (and so still takes a Long Time and lots of bandwidth just to
do that).

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Brion Vibber
On Mon, Aug 15, 2011 at 4:16 PM, Russell N. Nelson - rnnelson <
rnnel...@clarkson.edu> wrote:

> Exactly what I propose. Keep a list of files and their sizes, so that when
> somebody asks for a range, you can skip files up until you get to the range
> they've requested.
> Not worrying about new or already-downloaded changed files, or deleted
> files. You're not getting a "current" copy of the files, you're getting a
> copy of the files that were available when you started your download. Minus
> the deleted files, which by policy we shouldn't be handing out anyway.
>

Except the ones that weren't deleted when you started your download, I
presume? Otherwise you've now got an inconsistent data set. And of course
anything that has changed, you'll want to make sure you can access the
original version, not the new version, or else the size or contents will be
wrong and you'll end up sending bad info.



> rsync doesn't have the MW database to consult for changes.
>

That's an implementation detail, isn't it? GNU tar doesn't either.

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Russell N. Nelson - rnnelson
Wasn't suggesting that GNU tar or PDtar or any suchlike would be usable. I'm 
pretty sure that whatever protocol is used, you probably can't have a standard 
client or server simply because of the size of the data to be transfered. Maybe 
rsync would work with a custom rsyncd? Not so familiar with that protocol. 
Doesn't it compute an md5 for all files and ship it around?
-russ

From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] on behalf of Brion Vibber 
[br...@pobox.com]
Sent: Monday, August 15, 2011 7:22 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files

On Mon, Aug 15, 2011 at 4:16 PM, Russell N. Nelson - rnnelson <
rnnel...@clarkson.edu> wrote:

> rsync doesn't have the MW database to consult for changes.
>

That's an implementation detail, isn't it? GNU tar doesn't either.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Brion Vibber
On Mon, Aug 15, 2011 at 4:30 PM, Russell N. Nelson - rnnelson <
rnnel...@clarkson.edu> wrote:

> Wasn't suggesting that GNU tar or PDtar or any suchlike would be usable.
> I'm pretty sure that whatever protocol is used, you probably can't have a
> standard client or server simply because of the size of the data to be
> transfered. Maybe rsync would work with a custom rsyncd? Not so familiar
> with that protocol. Doesn't it compute an md5 for all files and ship it
> around?
>

rsync's wire protocol isn't very well documented, but roughly speaking, it
builds a "file list" of every file that may need to be transferred, and the
server and client compare notes to see which ones will actually need to get
transferred (and then which pieces of the files need to be transferred, if
they exist in both places).

Since rsync 3 the file list can be built and sent incrementally, which made
it possible to do batch rsyncs of Wikimedia uploads to/from a couple of
ad-hoc off-site servers (I think Greg Maxwell ran one for a while? I do not
know whether any of these are still in place -- other people manage these
servers now and I just haven't paid attention).

Older versions of rsync would build and transfer the entire file list first,
which was impractical when a complete file list for millions and millions of
files would be bigger than RAM and take hours just to generate. :)


A custom rsync daemon could certainly speak to regular rsync clients to
manage doing the file listings and pulling up the appropriate backend file.
Simply re-starting the transfer can handle pulling updates or continuing a
broken transfer with no additional trouble.

For ideal incremental updates & recoveries you'd want to avoid having to
transfer data about unchanged files -- rsync will still have to send that
file list over so it can check if files need to be updated.

A more customized protocol might end up better at that; offhand I'm not sure
if rsync 3's protocol can be super convenient at that or whether something
else would be needed.

(For the most part we don't need rsync's ability to transfer pieces of large
individual files, though it's a win if a transfer gets interrupted on a
large video file; usually we just want to find *new* files or files that
need to be deleted. It may be possible to optimize this on the existing
protocol with timestamp limitations.)

-- brion
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] forking media files

2011-08-15 Thread Russell N. Nelson - rnnelson
I see several paths forward on this:

1) Make an existing protocol and client work. Whether ftp or rsync or http or 
scp, they think they're copying a tree of filles.
  a) Give people access to something that looks like a tree of folders, and 
just let them recurse as needed using "wget -m". Doesn't quite make me want to 
barf.
2) Make an existing protocol work, even if a new client is needed for optimal 
use. E.g. wget -m with an extra parameter that only shows the client new files 
since the date of the last sync.
3) Devise a new protocol. Call it "BCD" for "Big Copy of Data".
  a) I'm thinking that the client should have the capability of asking for 
files with timestamps in a given range.
  b) The client would then be able to keep a record of the timestamp ranges for 
which it is currently accurate.
  c) A file deletion event would have a timestamp. Once deleted, the file would 
be unavailable even if its timestamp was requested.
  d) Any change in filename becomes an edit event.
  e) The idea is that a client would never have to re-ask for a timestamp range 
again.
__
From: wikitech-l-boun...@lists.wikimedia.org 
[wikitech-l-boun...@lists.wikimedia.org] on behalf of Brion Vibber 
[br...@pobox.com]
Sent: Monday, August 15, 2011 8:06 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files


A more customized protocol might end up better at that; offhand I'm not sure
if rsync 3's protocol can be super convenient at that or whether something
else would be needed.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l