[Talk-us] Duplicates in data uploads (using JOSM) -- was: Re: [Imports] Uploads to City of Salisbury, MD

2012-03-22 Thread Jaakko Helleranta.com
"With previous large uploads I have experience the same behaviour resulting
in massive dupes. So I guess it is not a conversion issue."

I don't have experience with conversions or (mass) imports -- but I _have_
had "massive dupes" problems a number of times when uploading larger
amounts of data with JOSM over a bad connection. The problem has always
been related to the combination of large uploads and bad connections where
(if I understand right) the JOSM data upload connection gets a hiccup at
some point and isn't able to finish the job -- and doesn't leave a note for
itself about where it left off. Then, for reasons I don't _exactly_
understand, data gets duplicated on the next upload attempt(s).

My vague understanding is that this is due at least in part to the fact
that JOSM uploads nodes first and only after that the information about
ways (i.e. which nodes belong to which ways). Then, when it hasn't gotten a
confirmation of successful uploads (or hasn't recorded that in its data
file(?)), it considers the uploaded nodes to still be new at the next
upload attempt(s).
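To make the suspected mechanism concrete, here is a toy model in Python. It is only a sketch of my understanding above, not JOSM or OSM API code: ToyServer, upload and count_duplicates are hypothetical names, and the assumption is simply that the server stores each node as it arrives while the client learns the new ids only if the whole upload completes.

```python
# Toy model of the suspected failure mode. Nothing here is JOSM or
# OSM API code; every name is hypothetical and for illustration only.

class ToyServer:
    """Stores every created node and hands out a fresh id each time."""

    def __init__(self):
        self.nodes = {}      # server id -> (lat, lon)
        self.next_id = 1

    def create_node(self, lat, lon):
        node_id = self.next_id
        self.next_id += 1
        self.nodes[node_id] = (lat, lon)
        return node_id


def upload(server, local_nodes, fail_after=None):
    """Upload every local node that has no server id yet.

    If fail_after is set, the connection "drops" after that many nodes
    have been stored on the server, before any id is recorded locally.
    """
    pending = [n for n in local_nodes if n["id"] is None]
    assigned = []
    for count, node in enumerate(pending, start=1):
        new_id = server.create_node(node["lat"], node["lon"])
        assigned.append((node, new_id))
        if fail_after is not None and count == fail_after:
            raise ConnectionError("response lost")
    # Only reached on success: the client finally learns the server ids.
    for node, new_id in assigned:
        node["id"] = new_id


def count_duplicates(server):
    """Number of stored nodes that repeat an already stored position."""
    return len(server.nodes) - len(set(server.nodes.values()))
```

With five local nodes, a first attempt that drops after three nodes (fail_after=3) and a plain retry, the toy server ends up holding eight nodes, three of which are duplicates: the client never learned that the first three had already been stored.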

I feel that duplication sometimes also happens with partial uploads where
the ways have uploaded, too, resulting in duplicate uploaded ways, but I
haven't documented this well enough to say so with confidence.

If you have a bad connection, or feel that this may be your problem, it is a
good idea to tweak the JOSM Advanced upload settings (Upload > Advanced
tab: "Upload data in chunks of objects. Chunk size: N", where N is
your number of objects per chunk). I use 200 with my Haitian connection.
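As a rough illustration of why a smaller chunk size helps, the Python sketch below splits a list of objects into chunks of 200. It is not how JOSM implements chunking; the chunked helper is a hypothetical name. The point is only that if each chunk is confirmed separately, a dropped connection leaves at most one chunk's worth of objects in doubt instead of the whole upload.

```python
def chunked(objects, chunk_size=200):
    """Yield consecutive slices of objects, each at most chunk_size long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(objects), chunk_size):
        yield objects[start:start + chunk_size]


# Example: 1050 objects in chunks of 200 give five full chunks and a
# final chunk holding the remaining 50 objects.
chunks = list(chunked(list(range(1050)), 200))
```

The trade-off is more round trips, so a smaller chunk size makes the upload slower but limits how much can go wrong per request.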

Cheers,
-Jaakko
http://osm.org/user/jaakkoh
--
jaa...@helleranta.com * Skype: jhelleranta * Mobile: +509-37-269154  *
http://go.hel.cc/MyProfile



On Thu, Mar 22, 2012 at 8:28 AM, Marc Zoss wrote:

> Nick and Josh
>
> thanks for the clarification on your upload strategy. With previous large
> uploads I have experience the same behaviour resulting in massive dupes. So
> I guess it is not a conversion issue.
>
> If you want me to commit the remove duplicates changeset, I can do so. But
> you will have to go through the data subsequently and check if the issues
> are resolved and no new ones emerged.
>
> M
>
> On 22.03.2012, at 14:12, Nick Chamberlain wrote:
>
> > Josh and Marc,
> >
> > Thank you!  I apologize that I'm unable to speak the OSM language as
> > well as everyone, I'm working on it :)  I posted on the Salisbury,
> > Maryland Import page that Josh created to give more detail about my
> > uploads.
> >
> > I didn't really think that I created so many duplicates, because I did a
> > lot of things in JOSM before I actually chose to upload.  One thing I
> > know for sure is that I didn't upload until I was actually able to - I
> > was getting a proxy error and the uploads were timing out when I
> > attempted to upload the entire batch.  I assumed that these attempts
> > were unsuccessful, which I might be wrong about and might have resulted
> > in duplication.
> >
> > I assumed that my successful attempts started, maybe @ 10901673, when I
> > realized I needed to break the original shapefile up tabularly into
> > percentiles and upload 10 segments of the building footprint dataset,
> > one after the other.  These were all definitely successful, and were
> > only done once per percentile.
> >
> > Josh, where are you finding the list of changesets in the format you
> > posted?  I can only figure out how to list them in my editor profile
> > with my comments.
> >
> > If you believe that the method you mention that removes the 71,000 nodes
> > is the best approach, please feel free to do so.  I will also gladly
> > manually fix the inner ring tagging issue as the data gets fixed.
> > Please let me know what I can do to help.  I am also willing to share
> > the .osm files and/or shapefiles if that will help.  Thanks.
> >
> > - Nick
> >
> > -Original Message-
> > From: joshthephysic...@gmail.com [mailto:joshthephysic...@gmail.com] On
> > Behalf Of Josh Doe
> > Sent: Thursday, March 22, 2012 8:51 AM
> > To: Marc Zoss
> > Cc: impo...@openstreetmap.org; talk-us@openstreetmap.org; Nick
> > Chamberlain
> > Subject: Re: [Imports] [Talk-us] Uploads to City of Salisbury, MD
> >
> > On Thu, Mar 22, 2012 at 8:04 AM, Marc Zoss wrote:
> >> I briefly downloaded all sby:bldgtype-tagged ways and relations of
> >> Maryland through the overpass-api. Then removed the ones having only a
> >> sby:bldgtype tag, ran the validator and deleted the duplicated nodes and
> >> ways.
> >> This would result in a changeset to remove the roughly 71'000
> >> duplicate nodes and ways.
> >>
> >> If the area was edited since the import and reverting gets tricky,
> >> this might be the option to go, at least the result looks ok at the
> >> first glance.
> >>
> >> Please also note that the conversion step seems to add a building=yes
> >> tag on the inner ring of building polygons () which is certainly bad
> >> tagging, despite the correct rendering (52 occurrences, so cou

Re: [Talk-us] Duplicates in data uploads (using JOSM) -- was: Re: [Imports] Uploads to City of Salisbury, MD

2012-03-22 Thread Nick Chamberlain
Jaakko,

Thank you for the explanation.  I will tweak my chunk sizes further next
time.  I did so before, but they were still fairly large and took a few
hours per upload.  Reducing them might take longer, but if that fixes
duplication I will do that.  Thanks.

- Nick


From: Jaakko Helleranta.com [mailto:jaa...@helleranta.com] 
Sent: Thursday, March 22, 2012 10:47 AM
To: Marc Zoss
Cc: Nick Chamberlain; Josh Doe; impo...@openstreetmap.org;
talk-us@openstreetmap.org
Subject: Duplicates in data uploads (using JOSM) -- was: Re: [Imports]
[Talk-us] Uploads to City of Salisbury, MD