[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-12-31 Thread Mitar
Mitar added a comment.


  I learned today that Wikipedia has a nice approach: a multistream bz2 
archive accompanied by an index file, which tells you the byte offset into the 
bz2 archive you have to decompress as a chunk to access a particular page. 
Wikidata could do the same, just for items and properties. This would allow one 
to extract only those entities they care about. Multistream also enables one to 
decompress parts of the file in parallel on multiple machines, by distributing 
offsets among them. Wikipedia also provides the same multistream archive as 
multiple files, so that one can distribute the whole dump over multiple 
machines even more easily. I like that approach.
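  A minimal sketch of the index-based random access described above, assuming 
the index supplies byte offsets of consecutive bz2 streams (the file name and 
function are illustrative, not part of any Wikimedia tooling):

```python
import bz2

def read_stream_at(archive_path, offset, next_offset):
    """Decompress one bz2 stream from a multistream archive.

    A multistream bz2 file is just several independent bz2 streams
    concatenated; given the byte offset of a stream (taken from the
    index file) and the offset of the stream after it, we can seek
    directly to it and decompress only that chunk, skipping the rest
    of the dump.
    """
    with open(archive_path, "rb") as f:
        f.seek(offset)
        chunk = f.read(next_offset - offset)  # exactly one bz2 stream
    return bz2.decompress(chunk)
```

  Distributing (offset, next_offset) pairs from the index across machines is 
what makes parallel extraction possible.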

TASK DETAIL
  https://phabricator.wikimedia.org/T115223

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Mitar
Cc: Addshore, Mitar, abian, JanZerebecki, Hydriz, hoo, Halfak, NealMcB, 
Aklapper, Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Svick, Mbch331, jeremyb
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-06-20 Thread Mitar
Mitar added a comment.


  In fact, this is not a problem, see 
https://phabricator.wikimedia.org/T222985#7164507
  
  pbzip2 is problematic: it cannot decompress in parallel files that were not 
compressed with pbzip2. But lbzip2 can. So using lbzip2 makes decompression of 
single-file dumps fast, and I am not sure multiple files would be any faster.
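  To illustrate the point above, a small helper that prefers lbzip2 for 
decompression and falls back to the stock bzip2 otherwise (lbzip2 and its 
`-n` thread flag are real; the helper itself is just a sketch):

```python
import shutil

def bz2_decompress_cmd(path, threads=12):
    """Build a command that streams a decompressed dump to stdout.

    lbzip2 can decompress any bz2 file in parallel by splitting it
    into independent work units, while pbzip2 only parallelizes
    archives it produced itself. When lbzip2 is not installed, fall
    back to single-threaded bzip2.
    """
    if shutil.which("lbzip2"):
        return ["lbzip2", f"-n{threads}", "-d", "-c", path]
    return ["bzip2", "-d", "-c", path]
```

  The resulting command can be handed to `subprocess.Popen` and its stdout 
consumed line by line.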



[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-06-19 Thread Mitar
Mitar added a comment.


  I am realizing that maybe the problem is just that the bzip2 compression is 
singlestream rather than multistream. Moreover, using newer compression 
algorithms like zstd might decrease decompression time even further, removing 
the need for multiple files altogether. See 
https://phabricator.wikimedia.org/T222985#7163885



[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-04-03 Thread Mitar
Mitar added a comment.


  Thank you for redirecting me to this issue. As I mentioned in T278204, 
my main motivation is in fact not downloading in parallel, but processing in 
parallel. Just decompressing that large file takes half a day on my machine. If 
I can instead use 12 machines on 12 splits, for example, I can do that 
decompression (or some other processing) in one hour instead.
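  The speedup arithmetic above (12 splits on 12 machines, roughly 12 hours 
down to 1 hour) amounts to distributing split files across workers; a 
hypothetical sketch, with made-up file names:

```python
def assign_splits(split_files, machine_count):
    """Round-robin dump splits across machines.

    With as many machines as splits, each machine handles exactly one
    split, so wall-clock time drops roughly by the machine count
    (ignoring download and coordination overhead).
    """
    assignment = [[] for _ in range(machine_count)]
    for i, name in enumerate(split_files):
        assignment[i % machine_count].append(name)
    return assignment
```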



[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-04-03 Thread NealMcB
NealMcB added a comment.


  The recommended download format continues to be JSON as discussed at
   https://www.wikidata.org/wiki/Wikidata:Database_download
  
  Since this was reported in 2015, the smallest version of the "latest-all" 
database has grown more than tenfold from 5.4 GB to 64 GB in size, making the 
usage challenges far greater. From 
https://dumps.wikimedia.org/wikidatawiki/entities/:
  
  latest-all.json.bz2        31-Mar-2021 17:03        64697800080
  
  Others are running across these issues, motivating the duplicate issue 
T278204, which was recently merged. They note that
  
  > dumps are currently in fact already produced by multiple shards and then 
combined into one file
  
  and
  
  > There are already no guarantees on the order of documents in dumps
  
  making it seem all the more reasonable to provide them as multiple files 
rather than a single file.
  
  What would it take to resolve this issue? How can we help?



[Wikidata-bugs] [Maniphest] T115223: Provide wikidata downloads as multiple files to make access more robust and efficient

2021-04-03 Thread Bugreporter
Bugreporter merged a task: T278204: Provide Wikidata dumps as multiple files.
Bugreporter added subscribers: Mitar, Addshore.
