Re: [OSM-talk-be] Importing Villo! API data

2017-11-05 Thread CedB12

Hi Glenn,

I will respond to some of your points because they are relevant to my
contributions in this thread. At the end of this email I also comment on
a survey I made today of six stations in order to evaluate the quality
of the API data.

As far as I can tell from my survey, the station names returned by the
Villo! API in the "name" field are exactly what shows up at the
stations' locations. (On the other hand the website only shows the
"address" field, which contains a name that often matches the "name"
field, but not always.) The station names are not printed on the
infrastructure: they only show up on the dynamic displays. (Only the
reference number is physically printed on the station, along with
"bonus" if it is a bonus station.)

The full official name of (most) stations, as reported by the "name" API
field, follows the format of Yves' example: "076 - PLACE VAN MEENEN/VAN
MEENENPLEIN". Of course, in OSM we want to split that into two (or
three) components: ref and name (or ref, name:fr and name:nl). Note
however that this cannot be straightforwardly automated, unlike with the
Antwerp Velo API data. There are multiple reasons for this.

First of all, names are in all-caps and (partially) stripped of
accents, and turning that into properly capitalized names with no
missing accents is nontrivial. Second, many station names are misspelled
or don't follow the standard OSM practice of expanding abbreviations
(e.g. Place St Jean -> Place Saint-Jean). Third, there is the problem of
bilingual names: Dutch names are sometimes missing while a STIB/MIVB
station nearby (or some street, or some building) has the exact same
French name and an available Dutch translation. Moreover, in a couple of
instances it is not so easy to split the French and Dutch names, for
example "255 - SACRE-COEUR DE/HEILIGE HART VAN GANSHOREN". Finally,
names are limited to 50 characters (longer ones are cut off), and we
probably don't want to encode the cut-off form as-is even if that is the
official name, for example "257 - PL MARGHERITE D'AUTRICHE / MARGARETHA
VAN OO".
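
The problem cases above can be caught by a small pre-check that
separates names that look safe to split from names that need a human.
This is only a sketch: the function name review_flags, the wording of
the flags and the DANGLING word list are assumptions made for
illustration, and the 50-character limit is simply the one observed
above.

```python
MAX_LEN = 50  # observed length limit of the API "name" field
# Assumption: a French half ending in one of these words probably shares
# its last word(s) with the Dutch half (e.g. "... DE/... VAN GANSHOREN").
DANGLING = {"DE", "DU", "DES"}


def review_flags(api_name: str) -> list[str]:
    """Return the reasons why an API "name" value needs manual review."""
    flags = []
    # Separate the "NNN - " prefix from the rest of the name.
    _, sep, label = api_name.partition(" - ")
    if not sep:
        label = api_name
        flags.append("no station number prefix")
    if "/" not in label:
        flags.append("single language only")
    elif label.count("/") > 1:
        flags.append("more than one slash")
    else:
        french_words = label.split("/")[0].split()
        if french_words and french_words[-1] in DANGLING:
            flags.append("halves share a trailing word")
    if len(api_name) >= MAX_LEN:
        flags.append("possibly truncated")
    return flags


for name in [
    "076 - PLACE VAN MEENEN/VAN MEENENPLEIN",
    "255 - SACRE-COEUR DE/HEILIGE HART VAN GANSHOREN",
    "257 - PL MARGHERITE D'AUTRICHE / MARGARETHA VAN OO",
]:
    print(name, "->", review_flags(name) or "ok")
```

An empty list does not prove a name is correct (misspellings and missing
accents are not detected); it only means none of these particular
heuristics fired.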


When I saw all those issues I decided to go through the list of station
names and clean them up myself. I did a first pass using a dictionary I
built from OSM street names to translate all-caps words to
properly-capitalized words with accents. Then I went through the list by
hand to fix conversion mistakes, misspellings, and provide Dutch
translations when they were missing. The results are in my github
repository (see my previous message in this thread), and that is what I
propose we use in name, name:fr and name:nl tags.
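
For readers who want to reproduce the first pass, here is a minimal
sketch of a dictionary-based conversion. It is not the script used for
the repository: the function names (fold, build_dictionary,
recapitalize) and the sample street names are assumptions for
illustration, and a real run would feed in the full list of OSM street
names.

```python
import unicodedata


def fold(word: str) -> str:
    """Strip accents and upper-case a word, mimicking the API spelling."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.upper()


def build_dictionary(street_names):
    """Map each folded word to the first spelling seen in the street names."""
    mapping = {}
    for street in street_names:
        for word in street.split():
            mapping.setdefault(fold(word), word)
    return mapping


def recapitalize(api_label: str, mapping: dict) -> str:
    """Replace each all-caps word by its dictionary form.

    Words missing from the dictionary fall back to simple capitalization
    and should be checked by hand.
    """
    return " ".join(mapping.get(w, w.capitalize()) for w in api_label.split())


# Sample input only; a real dictionary would be built from an OSM extract.
mapping = build_dictionary(["Place Van Meenen", "Rue de la Réforme"])
print(recapitalize("PLACE DE LA REFORME", mapping))
```

Because the first spelling seen wins, a word that exists in several
forms (with and without an accent, or with different capitalization)
can come out wrong, which is why a manual pass is still needed.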

I don't know how we can do QA on name tags given the quality of the
source data, but at the very least we can store the official name (in
all caps, maybe with the station number stripped off) in the
official_name tag. That way we can easily compare that field against the
API in the event that it changes. Sometimes the Villo! operators change
the name to include a notice that the station is closed for works, but
this can be filtered out, either by removing all text in parentheses or
ignoring name discrepancies on stations which are marked as "closed"
(which is another field in the API).
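
A comparison along those lines could look like the sketch below. The
helper names are hypothetical, the "(TRAVAUX)" notice is an invented
example, and the assumption that closed stations are reported as
status == "CLOSED" should be checked against the actual API output
before relying on it.

```python
import re


def normalize_official_name(api_name: str) -> str:
    """Reduce an API "name" value to the form stored in official_name."""
    # Drop notices in parentheses, e.g. a temporary closure message.
    text = re.sub(r"\s*\([^)]*\)", "", api_name)
    # Drop the leading "NNN - " station number.
    text = re.sub(r"^\d+\s*-\s*", "", text)
    # Collapse any leftover runs of whitespace.
    return " ".join(text.split())


def name_changed(osm_official_name: str, api_record: dict) -> bool:
    """True if the API name differs from the official_name tag in OSM."""
    # Assumption: closed stations carry status == "CLOSED".
    if api_record.get("status") == "CLOSED":
        return False  # ignore discrepancies on closed stations
    return normalize_official_name(api_record["name"]) != osm_official_name


record = {"name": "076 - PLACE VAN MEENEN/VAN MEENENPLEIN (TRAVAUX)",
          "status": "OPEN"}
print(name_changed("PLACE VAN MEENEN/VAN MEENENPLEIN", record))
```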


Given that the API names are the same as the names displayed
on-location, we can reliably use them for armchair mapping, so I
wouldn't say the API "just sucks and we shouldn't use it". The API also
reports station capacity and the possibility of card payment, which is
also useful.



I did a quick survey of six stations in Auderghem to compare the API
data to reality. Three stations had wrong coordinates (wrong street
block). I suppose they must have been correct at some point in the past,
but the stations have been moved since. However, for two of the three
wrongly located stations, the API "address" field gave the correct
house numbers. The third station was not in front of a house, so its
"address" field only gave the street name.

I checked the "banking", "bonus" and "bike_stands" fields, which all
matched reality, as well as the sum of "available_bike_stands" and
"available_bikes". Note that sometimes this sum is not equal to
"bike_stands". I checked one of those stations (311 - Delta), where
bike_stands is 22 but available stands+bikes is 21. This is explained by
the fact that one of the stands is out of service, as indicated by a red
light on the stand. Strangely, last time I checked, one station in the
API (003 - Porte de Flandre / Vlaamsepoort) reported four more available
bikes plus stands than "bike_stands", which makes no sense unless the
station was upgraded without updating the API field "bike_stands". I did
not survey that station.
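
The consistency check described above is easy to script. In this sketch
the three field names are the ones quoted from the API; the function
names are hypothetical and the sample record is illustrative (only the
totals 22 and 21 come from the survey, the split between bikes and free
stands is invented).

```python
def stand_discrepancy(station: dict) -> int:
    """bike_stands minus (available_bike_stands + available_bikes)."""
    in_use = station["available_bike_stands"] + station["available_bikes"]
    return station["bike_stands"] - in_use


def interpret(diff: int) -> str:
    """Suggest an explanation for a discrepancy; a guess, not a diagnosis."""
    if diff == 0:
        return "consistent"
    if diff > 0:
        return f"{diff} stand(s) possibly out of service"
    return f"bike_stands possibly outdated by {-diff}"


# Illustrative record: totals as observed at station 311, split invented.
delta = {"bike_stands": 22, "available_bike_stands": 9, "available_bikes": 12}
print(interpret(stand_discrepancy(delta)))
```

A positive difference fits the out-of-service stand seen at Delta, while
a negative one (as for station 003) suggests that "bike_stands" itself
is stale.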

As far as I could tell, the data reported on the interactive displays on
the stations matches the API data exactly (including the wrong
locations).

In conclusion, I think the "name" API field is perfectly OK to use after
cleanup. Columns "banking" and "bonus" matched in the six stations
surveyed. The "bike_stands" field seems to be static data, unlike

Re: [OSM-talk-be] Importing Villo! API data

2017-11-04 Thread CedB12

Hello again,

Sorry for taking so much time after my last message. I guess I can now
share a concrete proposal for name tag changes to continue the
discussion.

I have put everything in a few different formats in the following
repository:

  https://github.com/cedb12/villo-names

My main contribution of data is in the CSV file, which cleans up the
Villo! API data and provides Dutch translations when missing.

Yves suggested basing those translations on nearby "official" names
such as streets, buildings or MIVB public transport stations. The data
in the repository now follows this advice. There are also a few
additional translation candidates which I marked as "nl:own translation"
in the "note" column of the CSV file. I include them in my tagging
proposal for now because they seem reasonable to me, but we can discuss
this.

There are only two very small details which I am not too sure about:

"Parvis de Saint-Gilles": should we name it "Sint-Gillis Voorplein"
after the MIVB station, or use the spelling "Sint-Gillisvoorplein",
which (I assume) is used on street name plates?

"Vieux Tilleul" is named after "Square du Vieux Tilleul -
Oude-Lindesquare". I put "Oude Linde" in the CSV file because I am not
sure why there is a hyphen in the street name. Should we use
"Oude-Linde" instead?


The git repository also contains comparisons of this cleaned up API data
with existing OSM data. I created three GeoJSON files (one for each tag:
name, name:fr and name:nl) highlighting the differences (see also the
README file). The readme file includes links to geojson.io to visualize
the data.

I also formatted those differences in a markdown file (conflation.md)
which may be more convenient to review.


If those proposed names seem fine to all of you, I can proceed with the
merging of those name tags into our existing OSM nodes.

My proposal for official_name still stands. It would be the same as the
"name" column in the CSV file in the GitHub repo linked above, which I
automatically derived from the Villo! API.


Independently of naming issues, since I saw no objection to my use of
this API data (regarding license and automated edits), I went ahead and
fixed 26 'ref' tags of existing stations based on distance. I reviewed
the changed nodes individually to make sure I made no mistake in
matching OSM nodes to API records. The changeset is here:

  https://www.openstreetmap.org/changeset/53509501
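
For reference, distance-based matching of this kind can be sketched as
follows. This is not the script behind the changeset: the record layout
(flat dictionaries with "lat", "lon"/"lng", "ref" and "number" keys),
the function names and the 50 m threshold are all assumptions for
illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def ref_mismatches(osm_nodes, api_records, max_dist_m=50.0):
    """List (node id, OSM ref, API number, distance in m) for nodes whose
    ref differs from the number of the nearest API station."""
    mismatches = []
    for node in osm_nodes:
        dist, record = min(
            ((haversine_m(node["lat"], node["lon"], r["lat"], r["lng"]), r)
             for r in api_records),
            key=lambda pair: pair[0],
        )
        if dist > max_dist_m:
            continue  # no API station close enough: leave it for a survey
        if node.get("ref") != str(record["number"]):
            mismatches.append(
                (node["id"], node.get("ref"), record["number"], round(dist, 1)))
    return mismatches
```

The output is only a list of candidates. Since the API coordinates
themselves need checking, each proposed change still deserves an
individual look before upload.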

Cheers,

Cédric

___
Talk-be mailing list
Talk-be@openstreetmap.org
https://lists.openstreetmap.org/listinfo/talk-be


Re: [OSM-talk-be] Importing Villo! API data

2017-10-17 Thread CedB12

Hello Yves,

Thank you for sharing your thoughts.

I agree with you on the problem of importing locations in bulk. Still, I
think it is safe to use the API data to clean up the names and reference
numbers of the stations we have already mapped.

As far as I know, the API "name" values match exactly the names shown on
the stations' interactive displays, while the API "address" values match
the information reported on the Villo website. I don't know about the
app or other data sources.

So when you say that "what they present as the name" is unreliable, are
you talking about the data shown on the website, i.e., the "address"
value reported in the API? Then I agree that extracting address
information from those values is difficult because of the inconsistent
formatting.

However, as far as the name tags are concerned (name, name:fr, name:nl,
official_name), I think those are supposed to reflect the data as it is
visible on location. This means that the "name" API values should be our
source for that information. Those values are formatted more
consistently, too: with very few exceptions, the value is either "NNN -
NAME" or "NNN - FRENCH_NAME/DUTCH_NAME" where NNN is the station number,
and outliers are easy to spot. Is there any reason why we should not be
using that (or the same with the number removed) as official_name?
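
To make the format concrete, here is a minimal sketch of a parser for
the two patterns above. The function name, the returned keys and the
choice to treat anything that does not start with a three-digit number
as an outlier are assumptions for illustration; the values stay in
all-caps and still need the cleanup discussed in this thread.

```python
import re

# "NNN - NAME" or "NNN - FRENCH_NAME/DUTCH_NAME"
NAME_RE = re.compile(r"^(?P<number>\d{3}) - (?P<label>.+)$")


def parse_api_name(api_name: str):
    """Split an API "name" value; return None for outliers."""
    match = NAME_RE.match(api_name.strip())
    if match is None:
        return None
    label = match.group("label").strip()
    # Split on the first slash only; without a slash, dutch is empty.
    french, _, dutch = (part.strip() for part in label.partition("/"))
    return {
        "ref": str(int(match.group("number"))),  # "076" -> "76"
        "official_name": label,                  # number removed
        "fr": french,
        "nl": dutch or None,
    }


names = [
    "076 - PLACE VAN MEENEN/VAN MEENENPLEIN",
    "342 - MAISON COMMUNALE DE BERCHEM ST AGHATE",
]
outliers = [n for n in names if parse_api_name(n) is None]
print(parse_api_name(names[0]), outliers)
```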

I actually have a spreadsheet where I converted all the "name" values
reported by the API to a properly-capitalized form and tried to fix all
the typos I could find. I will share it later.

Cédric

On 10/15/2017 06:57 PM, Yves bxl-forever wrote:

Hello,

In the past weeks I have also wanted to do some cleanup on Villo! stations and 
it’s a fact that there is still quite a lot of work to be done.

Just a few thoughts about the idea of bulk data imports because this is what gave us 
really "ugly" nodes sometimes.

The name itself is a problem because what they present as the name is actually a string 
that concatenates the ID of the station, the name and its address.  This is why tagging 
this as "official_name" does not seem to make any sense.

Their JSON dataset usually looks like this:

"name":"076 - PLACE VAN MEENEN/VAN MEENENPLEIN",
"address":"PLACE VAN MEENEN/VAN MEENENPLEIN - AV PAUL DEJAER (FACE 35 - 39) / PAUL 
DEJAERLAAN (TEGENOVER 35 - 39)"


And we must translate it as such in our OSM nodes:

ref=76
name="Place Van Meenen - Van Meenenplein"
name:fr="Place Van Meenen"
name:nl="Van Meenenplein"
addr:street="Place Maurice Van Meenen - Maurice Van Meenenplein"
addr:housenumber="35-39"


It’s probably feasible to parse the fields automatically and make something 
that looks clean.
But I am not sure that the street name will always match (see example here, official name 
has "Maurice" somewhere and our parsing script will not guess it unless you 
feed it with a list of all streets).

About missing names in one language, this is tricky: normally we should stick 
with the official name given by the operator.  But another approach would be 
that if we know of an official translation (because it is the same name as the 
street or even a bus stop nearby, or a building) it should be used.  And I 
agree that we should fix typos without asking, like in your example.

Another problem is that the longitude and latitude fields must be checked to 
avoid putting stations in the middle of an intersection or inside a building.


In summary, I would recommend a safer approach, i.e. extracting a list of 
missing stations and adding them one by one manually, after checking whether the 
data looks fine.
But it would be nice to hear the thoughts of other members of the community.

Have a nice day.

Yves




[OSM-talk-be] Importing Villo! API data

2017-10-15 Thread CedB12

Hello all,

Lately I have been looking at the Villo! dataset from the JCDecaux API
at [1], which is released under the Etalab Open License (see also [2]).
I want to consult the community about the use of this data to improve
the tagging of the stations we have already mapped. I would also like to
discuss a potential import of the hundred or so stations that are
reported in the API but have not been mapped yet in OSM.

My priority is to fix the tagging of station names and reference
numbers, which are often wrong or missing in the already-mapped
stations. I am aware of a few quality issues in the names reported by
the API (which, as far as I know, are actually the names reported at the
stations themselves), so this cannot be a fully automated process. As
far as ref tags are concerned, only 25 existing station nodes do not
match the API. I have not pushed any change yet in case this thread
brings up an objection to the use of this API data.

More importantly, given the quality issues in the API names, we would
need to discuss how exactly we want to tag names vs. what the "official"
names are.

To give you a quick example of what kind of problems we can find in the
API, consider that one station is named "342 - MAISON COMMUNALE DE
BERCHEM ST AGHATE". Like all other stations, the name is in all-caps.
This one in particular contains a misspelling: the commune is actually
spelled "Berchem-Ste-Agathe". Also, unlike other stations, this one has
no official Dutch name, and it is not clear to me whether we should
provide our own translation in the name and name:nl tags.

I actually got a little bit ahead of myself and had prepared a diary
entry draft as well as a more detailed and specific email for this
mailing list, but I now realize that unloading all of this at once might
have felt a bit forceful. So before I go into the details of all the
quirks in the API data and formulate a general proposal for tagging, I
wanted to take a more open-ended approach and ask if anyone had anything
to share regarding our mapping and tagging of Villo! stations. I am also
interested in your thoughts on how we should tag the station I gave
above as an example (in terms of name, name:fr, name:nl, and maybe other
kinds of name tags like official_name).

But before that, I would like to make sure that it is OK to import
Etalab-licensed data, because otherwise this effort will be pointless. I
assume it must be fine because the license states that it is compatible with
"any licence which requires at least the attribution of the «
Information »" [3], including the Open Government License which is in
turn listed on the OSM wiki page on ODbL compatibility [4]. How are the
requirements of the license (attribution by source name + date + URL)
handled, though?

Also, does an operation of this scale (tagging a subset of 200 existing
nodes and possibly importing another 100) require that I follow the
import guidelines?

Thanks,

Cédric


[1] https://developer.jcdecaux.com/
[2] http://opendatastore.brussels/en/dataset/villo
[3] https://developer.jcdecaux.com/files/Open-Licence-en.pdf
[4] https://wiki.openstreetmap.org/wiki/Import/ODbL_Compatibility
