Fae added a comment.

  Sticking the upload script inside an infinite loop, so that an attempt which 
breaks on its first API error is simply retried, is a practical but crude 
//brute-force// workaround. However, it is incredibly slow, wasteful of 
processing time and bandwidth, and not a solution for the vast majority of 
Commons contributors.
  
  Here's what I found:
  
  1. The most common pattern is an indefinite series of retries which, after 
a little over an hour, gives up with `Maximum retries attempted without 
success.` In actuality, the file may well have been uploaded during the 
retries, but the API failed to return a success message, and then also failed 
to return a 'duplicate' message when re-uploading the file it had just 
accepted.
  2. Second most likely is a successful upload on the first run, though the 
likelihood decreases with the size of the file. In the example of 
"catalogofcopyri12libr" there have been around 10 attempts to upload the 185MB 
file, and it has yet to succeed (most of these attempts taking an hour to 
time out).
  3. Third most likely is that, after a series of retries, the API returns a 
message like `duplicate: Uploaded file is a duplicate of 
[u'Catalog_of_Copyright_Entries_1977_Books_and_Pamphlets_Jan-June.pdf'].` 
Though technically correct, this message is a symptom of the earlier failure 
to return a successful upload message.
  4. Lastly there is an `http-curl-error`, which appears to be the 
InternetArchive falling over under the load of these mass repeated requests.
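  Given pattern 3, one mitigation is to treat a 'duplicate' error as evidence that an earlier attempt actually succeeded. A sketch of that idea (the `upload` callable and the message matching are assumptions for illustration, not pywikibot's real error handling):

```python
def upload_or_duplicate(upload):
    """Treat a 'duplicate' API error as a successful upload.

    `upload` is a hypothetical zero-argument callable that raises an
    exception whose message contains 'duplicate' when the file already
    exists on the wiki.
    """
    try:
        upload()
        return "uploaded"
    except Exception as exc:
        if "duplicate" in str(exc).lower():
            # An earlier retry evidently landed the file; count it a success.
            return "already-uploaded"
        raise  # a genuine failure, e.g. http-curl-error
```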
  
  Here's an example of one of the parallel looping tasks we are running. This 
report of InternetArchive idents was taken after the second attempt, so the 
files in bold have timed out twice, while the others succeeded on the first or 
second attempt:
  
  - **//2 catalogofcopyri12libr 189.8M//**
  - **//9 catalogofcopyrig17libr 173.9M//**
  - **//10 catalogofco11libr 217.7M//**
  - **//11 catalogofcop13libr 136.6M//**
  - **//13 catalogofcopyr11libr 199.4M//**
  - 17 1977booksandpamp33112libr 143.1M
  - 18 1977musicjanjune33152libr 106.8M
  - **//19 1977booksandpamp33112library 128.1M//**
  - 29 1977booksandpamphle33111libr 128.8M
  - 30 1976worksofartja330711libr 83.3M
  - 33 catalogofcopyrig33051libr 91.6M
  - 37 catalogofcopyrig33012library 134.4M
  - 38 catalogofcopyrig33011library 122.1M
  - 39 catalogofcopyrig33012libr 139.5M
  - 40 catalogofcopyrig33011libr 128.2M
  - 41 catalogofcopyrig33051library 88.3M
  - 43 catalogofcopyrig33052libr 114.5M
  - 44 catalogofcopyrig33052library 109.9M
  - 45 catalogofcopyrig32912lib 131.4M
  - 49 catalogofcopyrig32952libr 103.3M
  
    One conclusion would be that PDFs over 125MB are highly unlikely to be 
successfully uploaded by ordinary volunteers, and significantly larger PDFs 
cannot be batch uploaded in any practical way by anyone at this time.

TASK DETAIL
  https://phabricator.wikimedia.org/T254459

_______________________________________________
pywikibot-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot-bugs