Hi!
There was no response here. I made the following issue instead:
https://phabricator.wikimedia.org/T360859
Mitar
On Sat, Mar 2, 2024 at 7:24 PM Mitar wrote:
>
> Hi!
>
> Recently, a timestamp with calendarmodel
> https://www.wikidata.org/wiki/Q12138 has been introduced
Mitar created this task.
Mitar added a project: Wikidata.
Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION
Recently, a timestamp with calendarmodel
https://www.wikidata.org/wiki/Q12138 has been introduced into
Wikidata:
https://www.wikidata.org/w/index.php?title=Q105958428&oldid=2004936527
How is this possible? I thought that the only allowed values are
Q1985727 and Q1985786?
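For anyone who wants to check which calendar models an item's time values
actually use, something like this works (a sketch, assuming curl and jq are
installed):
$ curl -s https://www.wikidata.org/wiki/Special:EntityData/Q105958428.json \
>     | jq '[.. | .calendarmodel? // empty] | unique'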
Mitar
Mitar added a comment.
Awesome! Thanks. This looks really amazing. I am not too convinced that we
should introduce a different dump format, but changing the compression really
seems to be low-hanging fruit.
TASK DETAIL
https://phabricator.wikimedia.org/T222985
Mitar added a comment.
I think it would be useful to have a benchmark with more options: JSON with
gzip, bzip2 (decompressed with lbzip2), and zstd. And then the same for
QuickStatements. Could you do that?
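Something along these lines would do (a rough sketch; the file names are
placeholders and gzip, lbzip2, and zstd are assumed to be installed):
$ for f in dump.json.gz dump.json.bz2 dump.json.zst; do
>     echo "== $f =="
>     case "$f" in
>         *.gz)  time gzip   -dc "$f" > /dev/null ;;
>         *.bz2) time lbzip2 -dc "$f" > /dev/null ;;
>         *.zst) time zstd   -dc "$f" > /dev/null ;;
>     esac
> done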
TASK DETAIL
https://phabricator.wikimedia.org/T222985
Mitar closed this task as "Resolved".
Mitar claimed this task.
TASK DETAIL
https://phabricator.wikimedia.org/T278031
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Mitar
Cc: ImreSamu, Addshore, Mitar, Aklapper, Busfault, Ast
Mitar added a comment.
I checked `wikidata-20220620-all.json.bz2` and it now contains the `modified`
field (alongside other fields that are present in the API).
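A quick way to spot-check that on the dump itself (a sketch; assumes lbzip2
and jq, and accounts for the dump being one big JSON array with one entity
per line, each ending with a comma):
$ lbzip2 -dc wikidata-20220620-all.json.bz2 \
>     | sed -n '2,4p' | sed 's/,$//' | jq .modified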
TASK DETAIL
https://phabricator.wikimedia.org/T278031
Mitar added a comment.
I would vote for simply including hashes in dumps. They would make dumps
bigger, but they would be consistent with the output of `EntityData`, which
currently includes hashes for all snaks.
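To see what that output looks like, one can count the hashes `EntityData`
returns for an item (a sketch, assuming curl and jq; Q42 is just an example
item):
$ curl -s https://www.wikidata.org/wiki/Special:EntityData/Q42.json \
>     | jq '[.. | .hash? // empty] | length'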
TASK DETAIL
https://phabricator.wikimedia.org/T174029
Mitar added a comment.
Just a followup from somebody coming to Wikidata dumps in 2021: it is really
confusing that dumps do not include hashes, especially because `EntityData`
seems to show them now for all snaks (main, qualifiers, references). I can
provide a whole list if you need that.
Mitar
That would be great. Of course, even better would be to prevent insertion
(because in 99% of cases it means somebody is blindly inserting a default
zero value).
[1]
https://www.wikidata.org/w/index.php?title=Special:Contributions/Mitar&offset=&limit=500&target=Mitar
Mitar
On Mon, Jan 10, 2022 at 4:50 PM Lydia Pintscher wrote:
> https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html
Are they information? Can they be safely ignored? Should those claims be
updated in Wikidata to remove those fields?
I can provide a list of those if anyone is interested.
[1] https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html
Mitar
Mitar added a comment.
I learned today that Wikipedia has a nice approach with a multistream bz2
archive <https://dumps.wikimedia.org/enwiki/> and an additional file with an
index, which tells you the offset into the bz2 archive you have to decompress
as a chunk to access a particular page.
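For illustration, pulling out a single page would look roughly like this (a
sketch; the file names, OFFSET, and COUNT are placeholders, GNU dd is
assumed, and COUNT is the next stream's offset minus this stream's offset):
$ # each index line is offset:pageid:title
$ grep ':Douglas Adams$' enwiki-pages-articles-multistream-index.txt
$ dd if=enwiki-pages-articles-multistream.xml.bz2 \
>     skip=OFFSET count=COUNT iflag=skip_bytes,count_bytes 2>/dev/null \
>     | bzip2 -dc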
Mitar added a comment.
In fact, this is not a problem, see
https://phabricator.wikimedia.org/T222985#7164507
pbzip2 is problematic and cannot decompress files in parallel unless they
were compressed with pbzip2. But lbzip2 can. So using lbzip2 makes
decompression of single-file dumps fast.
Mitar added a comment.
OK, so it seems the problem is in pbzip2. It is not able to decompress in
parallel unless the compression was also done with pbzip2. But lbzip2 can
decompress all of them in parallel.
See:
$ time bunzip2 -c -k latest-lexemes.json.bz2 > /dev/null
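The comparison that shows the difference is roughly (a sketch; timings
obviously depend on the machine):
$ time pbzip2 -dc latest-lexemes.json.bz2 > /dev/null  # single-threaded on files not made by pbzip2
$ time lbzip2 -dc latest-lexemes.json.bz2 > /dev/null  # parallel even on single-stream bz2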
Mitar added a comment.
Are you saying that the existing Wikidata JSON dumps can be decompressed in
parallel with lbzip2, but not with pbzip2?
TASK DETAIL
https://phabricator.wikimedia.org/T222985
Mitar added a comment.
I am realizing that maybe the problem is just that bzip2 compression is not
multistream but single-stream. Moreover, using newer compression algorithms
like zstd might decrease decompression time even further, removing the need
for multiple files altogether.
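For a rough comparison one would first have to recompress, since Wikimedia
does not publish zstd dumps today (a sketch; the .zst file name is
hypothetical):
$ lbzip2 -dc latest-all.json.bz2 | zstd -T0 -19 -o latest-all.json.zst
$ time zstd -dc latest-all.json.zst > /dev/null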
Mitar added a comment.
As a reference, see also this discussion
<https://www.wikidata.org/wiki/Wikidata_talk:Database_download#Dumps_cannot_be_decompressed_in_parallel>.
I think the problem with bzip2 is that it is currently single-stream so one
cannot really decompress it in parallel.
Mitar added a comment.
Are you sure `lastrevid` works like that for the whole dump? I think the
dump is made from multiple shards, so it might be that `lastrevid` is not
consistent across all items?
TASK DETAIL
https://phabricator.wikimedia.org/T209390
Mitar added a comment.
Thank you for redirecting me to this issue. As I mentioned in T278204
<https://phabricator.wikimedia.org/T278204>, my main motivation is in fact
not downloading in parallel, but processing in parallel. Just decompressing
that large file takes half a day on my machine.
Mitar added a comment.
I realized I have exactly the same need as the poster on StackOverflow: get
a dump and then use a real-time feed to keep it updated. But you have to
know where to start with the real-time feed through EventStreams, using
historical consumption.
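Resuming from a known point works roughly like this (a sketch; `since`
accepts a timestamp, and recentchange is just one of the available streams):
$ curl -s -H 'Accept: text/event-stream' \
>     'https://stream.wikimedia.org/v2/stream/recentchange?since=2021-06-20T00:00:00Z'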
Mitar updated the task description.
TASK DETAIL
https://phabricator.wikimedia.org/T278204
Mitar created this task.
Mitar added projects: Wikidata, Dumps-Generation.
Restricted Application added a project: wdwb-tech.
TASK DESCRIPTION
My understanding is that dumps are currently in fact already produced by
multiple shards and then combined into one file. I wonder why multiple files
are not simply published instead.
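If the shards were published as-is, processing could be parallelized
trivially (a sketch; the shard file names and ./process-entities are
hypothetical):
$ ls wikidata-shard-*.json.bz2 \
>     | xargs -P "$(nproc)" -I{} sh -c 'lbzip2 -dc {} | ./process-entities'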
Mitar added a comment.
I see that the API does return the `modified` field:
https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q1
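One can confirm it directly (a sketch, assuming curl and jq):
$ curl -s 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q1' \
>     | jq '.entities.Q1.modified'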
TASK DETAIL
https://phabricator.wikimedia.org/T278031
Mitar added a comment.
Personally, I would love to have, for each item in the dump, a timestamp of
when it was created and a timestamp of when it was last modified.
Related: https://phabricator.wikimedia.org/T278031
TASK DETAIL
https://phabricator.wikimedia.org/T209390