[Wikidata-tech] Re: Timestamps with calendarmodel other than Q1985727 and Q1985786

2024-03-24 Thread Mitar
Hi! There was no response here. I made the following issue instead: https://phabricator.wikimedia.org/T360859 Mitar On Sat, Mar 2, 2024 at 7:24 PM Mitar wrote: > > Hi! > > Recently, a timestamp with calendarmodel > https://www.wikidata.org/wiki/Q12138 has been introduced

[Wikidata-bugs] [Maniphest] T360859: Timestamps with calendarmodel other than Q1985727 and Q1985786

2024-03-24 Thread Mitar
Mitar created this task. Mitar added a project: Wikidata. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Recently, a timestamp with calendarmodel https://www.wikidata.org/wiki/Q12138 has been introduced into Wikidata: https://www.wikidata.org/w/index.php?title

[Wikidata-tech] Timestamps with calendarmodel other than Q1985727 and Q1985786

2024-03-02 Thread Mitar
Hi! Recently, a timestamp with calendarmodel https://www.wikidata.org/wiki/Q12138 has been introduced into Wikidata: https://www.wikidata.org/w/index.php?title=Q105958428&oldid=2004936527 How is this possible? I thought that the only allowed values are Q1985727 and Q1985786? Mitar
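
A minimal sketch of the check being discussed, assuming a line-per-entity Wikidata JSON dump parsed into Python dicts (the function name and structure are illustrative): it flags time values whose calendarmodel is neither the proleptic Gregorian calendar (Q1985727) nor the proleptic Julian calendar (Q1985786).

    ALLOWED_CALENDARS = {
        "http://www.wikidata.org/entity/Q1985727",  # proleptic Gregorian
        "http://www.wikidata.org/entity/Q1985786",  # proleptic Julian
    }

    def unexpected_calendars(entity):
        """Yield (property, time, calendarmodel) for non-standard calendars."""
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if datavalue.get("type") != "time":
                    continue
                model = datavalue["value"]["calendarmodel"]
                if model not in ALLOWED_CALENDARS:
                    yield prop, datavalue["value"]["time"], model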

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-13 Thread Mitar
Mitar added a comment. Awesome! Thanks. This looks really amazing. I am not too convinced that we should introduce a different dump format, but changing compression really seems to be low-hanging fruit. TASK DETAIL https://phabricator.wikimedia.org/T222985

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-08 Thread Mitar
Mitar added a comment. I think it would be useful to have a benchmark with more options: JSON with gzip, bzip2 (decompressed with lbzip2), and zstd. And then the same for QuickStatements. Could you do that? TASK DETAIL https://phabricator.wikimedia.org/T222985
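
A rough benchmarking sketch along these lines (the file names, and the assumption that gzip, lbzip2, and zstd are installed, are mine): time streaming decompression of the same dump in each format.

    import subprocess
    import time

    COMMANDS = {
        "gzip":  ["gzip", "-dc", "dump.json.gz"],
        "bzip2": ["lbzip2", "-dc", "dump.json.bz2"],
        "zstd":  ["zstd", "-dc", "dump.json.zst"],
    }

    for name, command in COMMANDS.items():
        start = time.monotonic()
        # Discard output so only decompression itself is measured.
        subprocess.run(command, stdout=subprocess.DEVNULL, check=True)
        print(f"{name}: {time.monotonic() - start:.1f}s")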

[Wikidata-bugs] [Maniphest] T278031: Wikibase canonical JSON format is missing "modified" in Wikidata JSON dumps

2022-06-24 Thread Mitar
Mitar closed this task as "Resolved". Mitar claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T278031

[Wikidata-bugs] [Maniphest] T278031: Wikibase canonical JSON format is missing "modified" in Wikidata JSON dumps

2022-06-24 Thread Mitar
Mitar added a comment. I checked `wikidata-20220620-all.json.bz2` and it now contains the `modified` field (alongside other fields that are present in the API). TASK DETAIL https://phabricator.wikimedia.org/T278031
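
A minimal sketch of that kind of check (the dump file name is taken from the comment; the one-entity-per-line array layout is the documented dump format): stream the bz2 dump and report any entity missing a top-level "modified" field.

    import bz2
    import json

    with bz2.open("wikidata-20220620-all.json.bz2", "rt") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # the dump is one large JSON array, one entity per line
            entity = json.loads(line)
            if "modified" not in entity:
                print("missing modified:", entity["id"])
                break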

[Wikidata-bugs] [Maniphest] T174029: Two kinds of JSON dumps?

2022-01-26 Thread Mitar
Mitar added a comment. I would vote for simply including hashes in dumps. They would make dumps bigger, but they would be consistent with the output of `EntityData`, which currently includes hashes for all snaks. TASK DETAIL https://phabricator.wikimedia.org/T174029

[Wikidata-bugs] [Maniphest] T171607: Main snak and reference snaks do not include hash in JSON output

2022-01-26 Thread Mitar
Mitar added a comment. Just a follow-up from somebody coming to Wikidata dumps in 2021: it is really confusing that dumps do not include hashes, especially because `EntityData` seems to show them now for all snaks (main, qualifiers, references). So when one is debugging this, using
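
For illustration, a small sketch of where those hashes do appear (Special:EntityData and Q42 are real; the rest is illustrative): fetch an entity through `EntityData` and print the hash carried by each main snak, which the dumps omit.

    import json
    import urllib.request

    ENTITY = "Q42"
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{ENTITY}.json"
    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    for prop, statements in data["entities"][ENTITY]["claims"].items():
        for statement in statements:
            # Main snaks in EntityData output carry a "hash" key.
            print(prop, statement["mainsnak"].get("hash"))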

[Wikidata] Re: Timezone, before, and after fields in JSON dump

2022-01-10 Thread Mitar
…whole list if you need that. Mitar

[Wikidata] Re: +0000-00-00T00:00:00Z in JSON dump

2022-01-10 Thread Mitar
that would be great. Of course, even better would be to prevent insertion (because in 99% of cases it means somebody is blindly inserting a default zero value). [1] https://www.wikidata.org/w/index.php?title=Special:Contributions/Mitar&offset=&limit=500&target=Mitar Mitar On Mon, Jan 10, 2022 at 4:50 PM Lydia Pintscher wrote:

[Wikidata] +0000-00-00T00:00:00Z in JSON dump

2022-01-09 Thread Mitar
https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html Mitar
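
A minimal sketch of a scan for the value in question (the function is illustrative; the entity dict is one parsed dump line): flag main snaks holding the all-zero timestamp, which usually means a default value was inserted blindly.

    ZERO_TIME = "+0000-00-00T00:00:00Z"

    def zero_timestamps(entity):
        """Yield properties whose main snak holds the all-zero timestamp."""
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if (datavalue.get("type") == "time"
                        and datavalue["value"]["time"] == ZERO_TIME):
                    yield prop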

[Wikidata] Timezone, before, and after fields in JSON dump

2022-01-09 Thread Mitar
? Are they information? Can they be safely ignored? Should those claims be updated in Wikidata to remove those fields? I can provide a list of those if anyone is interested. [1] https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html Mitar
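
One way to produce such a list, sketched over one parsed dump entity (the function is illustrative): report time values whose timezone, before, or after fields are nonzero, since those are the fields being asked about.

    def nondefault_time_fields(entity):
        """Yield (property, field, value) for nonzero timezone/before/after."""
        for prop, statements in entity.get("claims", {}).items():
            for statement in statements:
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                if datavalue.get("type") != "time":
                    continue
                for field in ("timezone", "before", "after"):
                    if datavalue["value"].get(field, 0) != 0:
                        yield prop, field, datavalue["value"][field]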

[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-12-31 Thread Mitar
Mitar added a comment. I learned today that Wikipedia has a nice approach with a multistream bz2 archive <https://dumps.wikimedia.org/enwiki/> and an additional index file, which tells you the offset into the bz2 archive you have to decompress as a chunk to access a particular
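
A minimal sketch of how that multistream layout is consumed (the file name and offset are placeholders; the "offset:page_id:title" index format matches the enwiki multistream index): seek to the offset from the index and decompress only that one bz2 stream.

    import bz2

    def read_stream(archive_path, offset):
        """Decompress the single bz2 stream starting at byte `offset`."""
        with open(archive_path, "rb") as archive:
            archive.seek(offset)
            decompressor = bz2.BZ2Decompressor()
            chunks = []
            while not decompressor.eof:
                data = archive.read(64 * 1024)
                if not data:
                    break  # truncated archive
                chunks.append(decompressor.decompress(data))
            return b"".join(chunks)

    # Each index line looks like "offset:page_id:title", so for example:
    # chunk = read_stream("enwiki-pages-articles-multistream.xml.bz2", 616)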

[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-06-20 Thread Mitar
Mitar added a comment. In fact, this is not a problem, see https://phabricator.wikimedia.org/T222985#7164507 pbzip2 is problematic and cannot decompress files in parallel unless they were compressed with pbzip2. But lbzip2 can. So using lbzip2 makes decompression of single-file dumps fast. So

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment. OK, so it seems the problem is in pbzip2. It is not able to decompress in parallel unless the compression was done with pbzip2, too. But lbzip2 can decompress all of them in parallel. See: $ time bunzip2 -c -k latest-lexemes.json.bz2 > /dev/null r
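
A minimal sketch of using that from a pipeline (the dump file name is from the command above; lbzip2 must be installed): stream the dump through lbzip2 and consume it line by line, so decompression runs in parallel while the consumer processes the output.

    import subprocess

    process = subprocess.Popen(
        ["lbzip2", "-dc", "latest-lexemes.json.bz2"],
        stdout=subprocess.PIPE,
    )
    for line in process.stdout:
        pass  # parse each entity line here
    process.wait()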

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment. Are you saying that existing Wikidata JSON dumps can be decompressed in parallel when using lbzip2, but not pbzip2? TASK DETAIL https://phabricator.wikimedia.org/T222985

[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-06-19 Thread Mitar
Mitar added a comment. I am realizing that maybe the problem is just that the bzip2 compression is not multistream but singlestream. Moreover, using newer compression algorithms like zstd might decrease decompression time even further, removing the need for multiple files altogether. See https

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-19 Thread Mitar
Mitar added a comment. As a reference see also this discussion <https://www.wikidata.org/wiki/Wikidata_talk:Database_download#Dumps_cannot_be_decompressed_in_parallel>. I think the problem with bzip2 is that it is currently singlestream, so one cannot really decompress it in parallel

[Wikidata-bugs] [Maniphest] T209390: Output some meta data about the wikidata JSON dump

2021-04-28 Thread Mitar
Mitar added a comment. Are you sure `lastrevid` works like that for the whole dump? I think the dump is made from multiple shards, so it might be that `lastrevid` is not consistent across all items? TASK DETAIL https://phabricator.wikimedia.org/T209390

[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-04-03 Thread Mitar
Mitar added a comment. Thank you for redirecting me to this issue. As I mentioned in T278204 <https://phabricator.wikimedia.org/T278204>, my main motivation is in fact not downloading in parallel, but processing in parallel. Just decompressing that large file takes half a day on my machine

[Wikidata-bugs] [Maniphest] T209390: Output some meta data about the wikidata JSON dump

2021-03-23 Thread Mitar
Mitar added a comment. I realized I have exactly the same need as a poster on StackOverflow: get a dump and then use the real-time feed to keep it updated. But you have to know where to start with the real-time feed through EventStreams, using historical consumption <ht
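
A minimal sketch of historical consumption (assuming the third-party sseclient package; the timestamp is a placeholder): open the EventStreams recentchange stream with a `since` parameter matching the dump snapshot, so the feed replays everything from that point on.

    import json
    from sseclient import SSEClient  # pip install sseclient

    SINCE = "2021-03-01T00:00:00Z"  # hypothetical dump snapshot time
    url = f"https://stream.wikimedia.org/v2/stream/recentchange?since={SINCE}"

    for event in SSEClient(url):
        if not event.data:
            continue
        change = json.loads(event.data)
        if change.get("wiki") == "wikidatawiki":
            print(change["title"], change.get("revision", {}).get("new"))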

[Wikidata-bugs] [Maniphest] T278204: Provide Wikidata dumps as multiple files

2021-03-23 Thread Mitar
Mitar updated the task description. TASK DETAIL https://phabricator.wikimedia.org/T278204

[Wikidata-bugs] [Maniphest] T278204: Provide Wikidata dumps as multiple files

2021-03-22 Thread Mitar
Mitar created this task. Mitar added projects: Wikidata, Dumps-Generation. Restricted Application added a project: wdwb-tech. TASK DESCRIPTION My understanding is that dumps are in fact already produced by multiple shards and then combined into one file. I wonder why simply multiple

[Wikidata-bugs] [Maniphest] T278031: Wikibase canonical JSON format is missing "modified" in Wikidata JSON dumps

2021-03-21 Thread Mitar
Mitar added a comment. I see that the API does return the `modified` field: https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids=Q1 TASK DETAIL https://phabricator.wikimedia.org/T278031
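
A minimal sketch confirming that (the endpoint and Q1 are from the link above; the `format` and `ids` parameter names are my reconstruction of the garbled URL): fetch the entity and print its `modified` timestamp.

    import json
    import urllib.request

    url = ("https://www.wikidata.org/w/api.php"
           "?action=wbgetentities&format=json&ids=Q1")
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    print(data["entities"]["Q1"]["modified"])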

[Wikidata-bugs] [Maniphest] T209390: Output some meta data about the wikidata JSON dump

2021-03-21 Thread Mitar
Mitar added a comment. Personally, I would love each item in the dump to have a timestamp of when it was created and a timestamp of when it was last modified. Related: https://phabricator.wikimedia.org/T278031 TASK DETAIL https://phabricator.wikimedia.org/T209390

[Wikidata-bugs] [Maniphest] T209390: Output some meta data about the wikidata JSON dump

2021-03-21 Thread Mitar
Restricted Application added a project: wdwb-tech. TASK DETAIL https://phabricator.wikimedia.org/T209390