Re: [OSM-talk] Some measure to prevent duplicate uploads of same data ...

2010-03-05 Thread Sebastian Klein
MP wrote:
> While API 0.6 has implemented object versioning, preventing users from
> accidentally overwriting someone else's changes, with the introduction of
> atomic uploads I now see many problems with duplicate data.
>
> These often come with data imports, or generally whenever someone uploads
> new data without modifying any existing data (for example if someone just
> traces hundreds of buildings from an orthophoto, or the like).
> 
> Since in JOSM (and possibly in other tools) the atomic upload is the
> default method, the user presses the "upload" button and within a few
> seconds all the changes are uploaded to the server, which then starts
> processing them (this can take some time for larger changes); once it
> is finished, it sends the new node IDs back to the editor.
>
> Unfortunately, sometimes while waiting for the server to process the
> uploaded data, the connection will time out, so the user sees an
> error message - thinking the upload failed, he presses "upload"
> again, starting to push a new copy of all the objects to the server.
> Later, the server wants to return the IDs from the first upload, but
> nobody is listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data, sometimes thousands of duplicate nodes and ways.
> 
> Suggestion for one possible countermeasure:
>  after the server receives a complete, successful atomic upload from a
> user, compute a SHA1, MD5, or some other checksum of the uploaded XML.
> Store it, and if the user tries uploading exactly the same thing again
> (because he thinks the upload failed, which is not true), just send him
> an error message instead, like: "You have already uploaded this
> data".
>
> Or alternatively, send the user whatever result there was from the
> last upload (either the new set of IDs, or an error message in case
> the previous upload failed because of some error).
>
> I think perhaps the last 2 or 3 checksums could be stored in case
> someone has multiple parallel uploads in multiple editors.
> 
> Martin

This is a real problem, especially for large data imports, and the 
solution might not be so easy:

JOSM offers to upload the data in chunks of different sizes, or even 
each object separately. If the upload fails (e.g. due to a timeout), the 
user might vary these parameters, so the checksums become useless.

There was a discussion on this topic in the JOSM Trac:

http://josm.openstreetmap.de/ticket/4401

It was suggested that a final handshake should be required after the 
diff result is sent back by the server. If the client does not acknowledge 
it, the upload is discarded.
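
Roughly, the flow could look like this from the client's side. This is 
only a sketch of the idea; the /confirm call and the 0.7 URL are 
hypothetical and exist nowhere in the current API:

import requests

API = "https://api.openstreetmap.org/api/0.7"  # hypothetical future version

def upload_with_handshake(changeset_id, osmchange_xml, auth):
    # Phase 1: send the diff; the server would hold the result as pending
    # until the client confirms it has received the ID mapping.
    r = requests.post(f"{API}/changeset/{changeset_id}/upload",
                      data=osmchange_xml, auth=auth, timeout=300)
    r.raise_for_status()
    diff_result = r.text  # placeholder IDs mapped to new IDs/versions

    # Phase 2: acknowledge receipt; without this the server would discard
    # (roll back) the pending upload after some grace period.
    requests.post(f"{API}/changeset/{changeset_id}/upload/confirm",
                  auth=auth, timeout=60).raise_for_status()
    return diff_result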

It would be nice to have a solution for this in API 0.7, but in the 
meantime, the editors should learn to handle this in a better way.

The user should be informed that the dataset is in a dirty state and be 
offered the option of downloading the changeset. The new objects in the 
current dataset should then be matched heuristically (by their coordinates 
and tags) against the objects in the changeset, roughly as sketched below.
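
A minimal sketch of that matching step in Python (the node representation 
and the coordinate tolerance are my own assumptions, not anything JOSM 
actually does):

def matches(local_node, remote_node, tolerance=1e-7):
    # Two nodes are considered the same object if they sit at (almost)
    # the same position and carry identical tags.
    return (abs(local_node["lat"] - remote_node["lat"]) < tolerance
            and abs(local_node["lon"] - remote_node["lon"]) < tolerance
            and local_node["tags"] == remote_node["tags"])

def reconcile(new_local_nodes, changeset_nodes):
    # Map each still-unsaved local node to the server ID it already got in
    # the lost upload, so the editor can adopt that ID instead of creating
    # a duplicate on the next upload.
    assigned = {}
    taken = set()
    for local in new_local_nodes:
        for remote in changeset_nodes:
            if remote["id"] not in taken and matches(local, remote):
                assigned[local["temp_id"]] = remote["id"]
                taken.add(remote["id"])
                break
    return assigned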

--

Sebastian



Re: [OSM-talk] Some measure to prevent duplicate uploads of same data ...

2010-03-05 Thread andrzej zaborowski
Hi,

On 6 March 2010 00:16, MP wrote:
> While API 0.6 has implemented object versioning, preventing users from
> accidentally overwriting someone else's changes, with the introduction of
> atomic uploads I now see many problems with duplicate data.
>
> These often come with data imports, or generally whenever someone uploads
> new data without modifying any existing data (for example if someone just
> traces hundreds of buildings from an orthophoto, or the like).
>
> Since in JOSM (and possibly in other tools) the atomic upload is the
> default method, the user presses the "upload" button and within a few
> seconds all the changes are uploaded to the server, which then starts
> processing them (this can take some time for larger changes); once it
> is finished, it sends the new node IDs back to the editor.
>
> Unfortunately, sometimes while waiting for the server to process the
> uploaded data, the connection will time out, so the user sees an
> error message - thinking the upload failed, he presses "upload"
> again, starting to push a new copy of all the objects to the server.
> Later, the server wants to return the IDs from the first upload, but
> nobody is listening on the other end anymore.
>
> The ultimate result is sometimes 2 to 4 identical copies of some
> data, sometimes thousands of duplicate nodes and ways.
>
> Suggestion for one possible countermeasure:
>  after the server receives a complete, successful atomic upload from a
> user, compute a SHA1, MD5, or some other checksum of the uploaded XML.
> Store it, and if the user tries uploading exactly the same thing again
> (because he thinks the upload failed, which is not true), just send him
> an error message instead, like: "You have already uploaded this
> data".

This sounds like a good idea to me.  Perhaps it should only be
employed for diff uploads containing only <create> elements; in all other
cases a re-upload will fail with a conflict.  An identical measure can be
implemented in a client such as JOSM.  Only uploads consisting solely of
new objects need this extra caution, but even for other uploads JOSM
could admittedly handle network errors better, for example by looking at
the last open changeset and retrieving the new IDs and versions of the
objects that should have been in the server response.
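
For illustration, the created objects can be recovered from the last
changeset with the existing API 0.6 changeset download call; the helper
name below is made up and the matching against the local data is left out:

import requests
import xml.etree.ElementTree as ET

API = "https://api.openstreetmap.org/api/0.6"

def created_objects_in_changeset(changeset_id):
    # The osmChange document lists what actually reached the database,
    # including the IDs and versions the editor never got to see.
    r = requests.get(f"{API}/changeset/{changeset_id}/download", timeout=60)
    r.raise_for_status()
    root = ET.fromstring(r.content)
    created = []
    for block in root.findall("create"):
        for elem in block:  # node, way or relation elements
            created.append((elem.tag, elem.get("id"), elem.get("version")))
    return created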

I have a very experimental script that generates the server response
based on the uploaded content and the corresponding changeset as
downloaded from the API, which I use for bulk uploads:
http://svn.openstreetmap.org/applications/utils/import/bulkupload/change2diff2.py
It only works if the changeset contains just that single diff, and it
makes other significant assumptions.

Generally, if you're not uploading through a proxy and the diff does not
conflict with existing data (for example because it only creates new
objects), I notice that it will always hit the database if 100% of the
XML is uploaded, i.e. once the last byte has been sent out the API never
cancels the commit; if, on the contrary, not all bytes were sent out,
the API will not be able to parse the body as XML. So the outcome is
deterministic.

Cheers



[OSM-talk] Some measure to prevent duplicate uploads of same data ...

2010-03-05 Thread MP
While API 0.6 has implemented object versioning, preventing users from
accidentally overwriting someone else's changes, with the introduction of
atomic uploads I now see many problems with duplicate data.

These often come with data imports, or generally whenever someone uploads
new data without modifying any existing data (for example if someone just
traces hundreds of buildings from an orthophoto, or the like).

Since in JOSM (and possibly in other tools) the atomic upload is the
default method, the user presses the "upload" button and within a few
seconds all the changes are uploaded to the server, which then starts
processing them (this can take some time for larger changes); once it
is finished, it sends the new node IDs back to the editor.

Unfortunately, sometimes while waiting for the server to process the
uploaded data, the connection will time out, so the user sees an
error message - thinking the upload failed, he presses "upload"
again, starting to push a new copy of all the objects to the server.
Later, the server wants to return the IDs from the first upload, but
nobody is listening on the other end anymore.

The ultimate result is sometimes 2 to 4 identical copies of some
data, sometimes thousands of duplicate nodes and ways.

Suggestion for one possible countermeasure:
 after the server receives a complete, successful atomic upload from a
user, compute a SHA1, MD5, or some other checksum of the uploaded XML.
Store it, and if the user tries uploading exactly the same thing again
(because he thinks the upload failed, which is not true), just send him
an error message instead, like: "You have already uploaded this
data".

Or alternatively, send the user whatever result there was from the
last upload (either the new set of IDs, or an error message in case
the previous upload failed because of some error).

I think perhaps the last 2 or 3 checksums could be stored in case
someone has multiple parallel uploads in multiple editors.
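
A minimal sketch of what such a server-side check could look like (the
in-memory store and the function names are only for illustration, not the
actual API code; the real thing would keep the checksums per user in the
database):

import hashlib

# Last few upload checksums per user, so parallel uploads from several
# editors don't immediately evict each other.
MAX_REMEMBERED = 3
recent_uploads = {}  # user_id -> list of (sha1 digest, stored diff result)

def handle_diff_upload(user_id, changeset_id, payload_xml, apply_upload):
    # payload_xml: raw request body as bytes; apply_upload: the existing
    # code path that actually writes the diff and builds the response.
    digest = hashlib.sha1(payload_xml).hexdigest()
    for sha1, previous_result in recent_uploads.get(user_id, []):
        if sha1 == digest:
            # Identical payload seen before: replay the stored diff result
            # (or error) instead of creating the objects a second time.
            return previous_result
    result = apply_upload(changeset_id, payload_xml)
    history = recent_uploads.setdefault(user_id, [])
    history.append((digest, result))
    del history[:-MAX_REMEMBERED]  # keep only the most recent checksums
    return result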

Martin
