Re: [OSM-talk] Some measure to prevent duplicate uploads of same data ...
MP wrote:
> While API 0.6 has implemented object versioning, preventing
> accidental overwriting of someone else's changes, with the
> introduction of atomic uploads I now see many problems with
> duplicate data.
>
> These often come with imports of data, or generally when someone
> uploads new data without modifying any existing data (for example,
> when someone traces hundreds of buildings from an orthophoto or the
> like).
>
> Since atomic upload is the default method in JOSM (and possibly in
> other tools), the user presses an "upload" button and within a few
> seconds all the changes are uploaded to the server, which then starts
> processing them (this can take some time for larger changes). Once it
> is finished, it sends the new node IDs back to the editor.
>
> Unfortunately, sometimes the connection times out while waiting for
> the server to process the uploaded data, so the user sees an error
> message. Thinking the upload failed, he presses "upload" again,
> pushing a new copy of all the objects to the server. Later, the
> server wants to return the IDs from the first upload, but nobody is
> listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data; sometimes it is thousands of duplicate nodes and ways.
>
> Suggestion for one possible countermeasure:
> after the server receives a complete, successful atomic upload from a
> user, compute a SHA1, MD5, or some other checksum of the uploaded
> XML. Store it, and if the user tries uploading exactly the same thing
> again (because he thinks the upload has failed, which is not true),
> send him just an error message instead, like: "You have already
> uploaded this data".
>
> Or alternatively, send the user whatever result the last upload
> produced (either the new set of IDs, or an error message if the
> previous upload failed because of some error).
>
> I think perhaps the last 2 or 3 checksums could be stored in case
> someone has multiple parallel uploads in multiple editors.
>
> Martin

This is really a problem, especially for large data imports. The
solution might not be so easy: JOSM offers to upload the data in chunks
of different sizes, or even each object separately. If the upload fails
(due to a timeout), the user might vary these parameters, so the
checksums become useless.

There was a discussion on that topic in the JOSM trac:
http://josm.openstreetmap.de/ticket/4401
It was suggested that a final handshake should be required after the
diff is sent from the server. If the client does not respond, the
upload is discarded.

It would be nice to have a solution for this in API 0.7, but in the
meantime the editors should learn to handle this in a better way. The
user should be informed that the dataset is in a dirty state and be
offered the option of downloading the changeset. The new objects of the
current dataset should then be matched heuristically (by their
coordinates and tags) with the objects in the changeset.

__
Sebastian

___
talk mailing list
talk@openstreetmap.org
http://lists.openstreetmap.org/listinfo/talk
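The heuristic matching by coordinates and tags suggested above could look roughly like the following. This is only a sketch for nodes, not how JOSM does or would do it: the dict layout ('id', 'lat', 'lon', 'tags'), the function name match_new_nodes and the rounding to 7 decimals are all assumptions made for illustration.

```python
def match_new_nodes(local_nodes, changeset_nodes, precision=7):
    """Pair local placeholder nodes with nodes found in the downloaded changeset.

    Both arguments are lists of dicts with the keys 'id', 'lat', 'lon' and
    'tags' (an illustrative structure, not JOSM's data model).
    Returns (matches, unmatched): a dict mapping local placeholder id to
    server id, and a list of local ids without a counterpart.
    """
    def key(node):
        # Two nodes are considered the same if position and tags agree.
        return (round(node["lat"], precision),
                round(node["lon"], precision),
                tuple(sorted(node["tags"].items())))

    # Index the server-side nodes by their position/tag key.
    candidates = {}
    for node in changeset_nodes:
        candidates.setdefault(key(node), []).append(node["id"])

    matches, unmatched = {}, []
    for node in local_nodes:
        ids = candidates.get(key(node))
        if ids:
            matches[node["id"]] = ids.pop(0)  # each server node is used once
        else:
            unmatched.append(node["id"])
    return matches, unmatched
```

Objects left in the unmatched list were presumably never stored and would still need uploading. Ways and relations would need a similar comparison on their member lists after the nodes have been matched.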
Re: [OSM-talk] Some measure to prevent duplicate uploads of same data ...
Hi,

On 6 March 2010 00:16, MP wrote:
> While API 0.6 has implemented object versioning, preventing
> accidental overwriting of someone else's changes, with the
> introduction of atomic uploads I now see many problems with
> duplicate data.
>
> These often come with imports of data, or generally when someone
> uploads new data without modifying any existing data (for example,
> when someone traces hundreds of buildings from an orthophoto or the
> like).
>
> Since atomic upload is the default method in JOSM (and possibly in
> other tools), the user presses an "upload" button and within a few
> seconds all the changes are uploaded to the server, which then starts
> processing them (this can take some time for larger changes). Once it
> is finished, it sends the new node IDs back to the editor.
>
> Unfortunately, sometimes the connection times out while waiting for
> the server to process the uploaded data, so the user sees an error
> message. Thinking the upload failed, he presses "upload" again,
> pushing a new copy of all the objects to the server. Later, the
> server wants to return the IDs from the first upload, but nobody is
> listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data; sometimes it is thousands of duplicate nodes and ways.
>
> Suggestion for one possible countermeasure:
> after the server receives a complete, successful atomic upload from a
> user, compute a SHA1, MD5, or some other checksum of the uploaded
> XML. Store it, and if the user tries uploading exactly the same thing
> again (because he thinks the upload has failed, which is not true),
> send him just an error message instead, like: "You have already
> uploaded this data".

This sounds like a good idea to me. Perhaps it should only be employed
for diff uploads with only <create>'s; in all other cases a re-upload
will fail with a conflict. An identical measure can be implemented in a
client such as JOSM. Only the uploads with solely new objects need to
be extra cautious, but even for other uploads JOSM could admittedly be
better at treating network errors, for example by looking at the last
open changeset and retrieving the new IDs and versions of objects which
should have been in the server response.

I have a very experimental script that generates the server response
based on the content uploaded and the corresponding changeset as
downloaded from the API, which I use for bulk uploads, at
http://svn.openstreetmap.org/applications/utils/import/bulkupload/change2diff2.py
It only works if the changeset contains only the single diff, and it
makes other significant assumptions.

Generally, if you're not uploading through a proxy and the diff is not
in conflict with existing data (for example because it only creates new
objects), I notice that it will always hit the database if 100% of the
XML is uploaded, i.e. once the last byte has been sent out the API
never cancels the commit. If, on the contrary, not all bytes were sent
out, the API will not be able to parse it as XML. So it's
deterministic.

Cheers
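To illustrate the idea of regenerating the lost server response from the changeset download, here is a rough sketch. It is not the change2diff2.py script linked above and makes its own strong assumptions: the upload contains only created objects, the changeset contains nothing but this one diff, and the downloaded changeset lists the objects of each type in the same order in which they were uploaded. The function name rebuild_diff_result is made up for this example.

```python
import xml.etree.ElementTree as ET


def rebuild_diff_result(uploaded_xml, downloaded_xml):
    """Build a diffResult element from an uploaded osmChange document and
    the osmChange document downloaded for the same changeset.

    Assumption: objects of each type appear in the same order in both.
    """
    uploaded = ET.fromstring(uploaded_xml)
    downloaded = ET.fromstring(downloaded_xml)
    result = ET.Element("diffResult")
    for kind in ("node", "way", "relation"):
        sent = uploaded.findall("./create/" + kind)
        stored = downloaded.findall("./create/" + kind)
        if len(sent) != len(stored):
            # The changeset holds something other than exactly this diff.
            raise ValueError("changeset does not match the upload for %ss" % kind)
        for old, new in zip(sent, stored):
            # Pair the placeholder id with the id and version the server assigned.
            ET.SubElement(result, kind,
                          old_id=old.get("id"),
                          new_id=new.get("id"),
                          new_version=new.get("version"))
    return result
```

Because the pairing relies on order alone, a check on coordinates or tags before trusting the result would be a sensible addition.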
[OSM-talk] Some measure to prevent duplicate uploads of same data ...
While API 0.6 has implemented object versioning, preventing accidental
overwriting of someone else's changes, with the introduction of atomic
uploads I now see many problems with duplicate data.

These often come with imports of data, or generally when someone
uploads new data without modifying any existing data (for example, when
someone traces hundreds of buildings from an orthophoto or the like).

Since atomic upload is the default method in JOSM (and possibly in
other tools), the user presses an "upload" button and within a few
seconds all the changes are uploaded to the server, which then starts
processing them (this can take some time for larger changes). Once it
is finished, it sends the new node IDs back to the editor.

Unfortunately, sometimes the connection times out while waiting for the
server to process the uploaded data, so the user sees an error message.
Thinking the upload failed, he presses "upload" again, pushing a new
copy of all the objects to the server. Later, the server wants to
return the IDs from the first upload, but nobody is listening on the
other end anymore.

The ultimate result is sometimes 2 to 4 identical copies of some data;
sometimes it is thousands of duplicate nodes and ways.

Suggestion for one possible countermeasure:
after the server receives a complete, successful atomic upload from a
user, compute a SHA1, MD5, or some other checksum of the uploaded XML.
Store it, and if the user tries uploading exactly the same thing again
(because he thinks the upload has failed, which is not true), send him
just an error message instead, like: "You have already uploaded this
data".

Or alternatively, send the user whatever result the last upload
produced (either the new set of IDs, or an error message if the
previous upload failed because of some error).

I think perhaps the last 2 or 3 checksums could be stored in case
someone has multiple parallel uploads in multiple editors.

Martin
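The checksum countermeasure proposed above might be sketched as follows. This is a minimal illustration and not part of the actual API server: the class name UploadDeduplicator, the per-user in-memory store and the process callback are all hypothetical. It implements the second variant (replaying the stored result) and keeps the last 3 checksums per user.

```python
import hashlib
from collections import OrderedDict, defaultdict


class UploadDeduplicator:
    """Remembers the checksums of the last few uploads of each user
    (hypothetical helper, for illustration only)."""

    def __init__(self, keep=3):
        self.keep = keep
        # user id -> OrderedDict mapping checksum -> stored server response
        self._recent = defaultdict(OrderedDict)

    def handle(self, user_id, xml_bytes, process):
        """Return (response, was_duplicate) for an uploaded XML payload.

        `process` stands for whatever actually applies the diff and
        returns the result to send back to the editor.
        """
        digest = hashlib.sha1(xml_bytes).hexdigest()
        recent = self._recent[user_id]
        if digest in recent:
            # Same payload seen before: replay the stored result instead
            # of creating all the objects a second time.
            return recent[digest], True
        response = process(xml_bytes)
        recent[digest] = response
        while len(recent) > self.keep:
            recent.popitem(last=False)  # forget the oldest checksum
        return response, False
```

Comparing raw bytes only catches exact re-uploads. If the editor splits the data into different chunks on the second attempt, the checksums will differ and the duplicate goes undetected.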