[gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Jukka Rahkonen
Hi,

I wonder if GDAL could have some simple and relatively user-friendly way of
defining a schema for GeoJSON data. The GeoJSON driver seems to guess the
data types of attributes in some undocumented way, but users may have
better knowledge of the desired schema.

I know I can control the data type by using OGR SQL and CAST as in
ogrinfo -sql "select cast(EMPLOYED as float) from OGRGeojson" states.json -so

However, perhaps GeoJSON is popular enough to deserve an easier way of
writing a schema. At first I thought that it would be enough to copy the "csvt"
text file mechanism from the GDAL CSV driver
(http://www.gdal.org/drv_csv.html). However, the csvt file is a plain list of
types which are applied to the attributes in the same order as they
appear in the text file:
"Integer(5)","Real(10.7)","String(15)"

For GeoJSON it would feel more user friendly to include the attribute names
in the list, somehow like
 "population;Integer(5)","area;Real(10.7)","name;String(15)".

This would make it easier for users to write a valid "jsont" file. A list
with attribute names could perhaps also help GDAL, because the
features in a GeoJSON file do not necessarily have the same attributes.
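
To make the proposal concrete, here is a minimal Python sketch of how such a line could be read. The "jsont" format does not exist in GDAL; the "name;Type(width.precision)" syntax is only the one suggested above, and the function name parse_jsont is made up for illustration.

```python
import csv
import re

# Hypothetical "jsont" line in the form proposed above: "name;Type(width.precision)".
JSONT_LINE = '"population;Integer(5)","area;Real(10.7)","name;String(15)"'

# Type name, optionally followed by (width) or (width.precision).
TYPE_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)(?:\((?P<width>\d+)(?:\.(?P<precision>\d+))?\))?$"
)


def parse_jsont(line):
    """Return {attribute name: (type, width, precision)} for one jsont line."""
    schema = {}
    # csv.reader takes care of the quoted, comma-separated entries.
    for entry in next(csv.reader([line])):
        name, _, type_spec = entry.partition(";")
        match = TYPE_RE.match(type_spec.strip())
        if not name.strip() or match is None:
            raise ValueError(f"malformed jsont entry: {entry!r}")
        width = int(match["width"]) if match["width"] else None
        precision = int(match["precision"]) if match["precision"] else None
        schema[name.strip()] = (match["type"], width, precision)
    return schema


schema = parse_jsont(JSONT_LINE)
print(schema)
```

Because each entry carries its attribute name, the order of the entries no longer matters, which is the main difference from csvt.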

As an example this is the right schema for a WFS feature type which is
captured from
http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=describefeaturetype&typename=topp:states


the_geom: gml:MultiPolygonPropertyType
STATE_NAME: xsd:string
STATE_FIPS: xsd:string
SUB_REGION: xsd:string
STATE_ABBR: xsd:string
LAND_KM: xsd:double
WATER_KM: xsd:double
PERSONS: xsd:double
FAMILIES: xsd:double
HOUSHOLD: xsd:double
MALE: xsd:double
FEMALE: xsd:double
WORKERS: xsd:double
DRVALONE: xsd:double
CARPOOL: xsd:double
PUBTRANS: xsd:double
EMPLOYED: xsd:double
UNEMPLOY: xsd:double
SERVICE: xsd:double
MANUAL: xsd:double
P_MALE: xsd:double
P_FEMALE: xsd:double
SAMP_POP: xsd:double


This is what GDAL is guessing:
STATE_NAME: String (0.0)
STATE_FIPS: String (0.0)
SUB_REGION: String (0.0)
STATE_ABBR: String (0.0)
LAND_KM: Real (0.0)
WATER_KM: Real (0.0)
PERSONS: Real (0.0)
FAMILIES: Integer (0.0)
HOUSHOLD: Real (0.0)
MALE: Real (0.0)
FEMALE: Real (0.0)
WORKERS: Real (0.0)
DRVALONE: Integer (0.0)
CARPOOL: Integer (0.0)
PUBTRANS: Integer (0.0)
EMPLOYED: Real (0.0)
UNEMPLOY: Integer (0.0)
SERVICE: Integer (0.0)
MANUAL: Integer (0.0)
P_MALE: Real (0.0)
P_FEMALE: Real (0.0)
SAMP_POP: Integer (0.0)
bbox: RealList (0.0)

-Jukka Rahkonen-

___
gdal-dev mailing list
gdal-dev@lists.osgeo.org
http://lists.osgeo.org/mailman/listinfo/gdal-dev


Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Even Rouault
Jukka,

The data type guessing implemented in the OGR GeoJSON driver is, hopefully,
quite natural.
The whole GeoJSON file is scanned and the following rules are applied:
- if an attribute has integer-only content --> Integer
- if an attribute has an array of integer-only content --> IntegerList
- if an attribute has integer or floating point content --> Real
- if an attribute has an array of integer or floating point content --> RealList
- if an attribute has an array with any other content --> StringList
- otherwise --> String
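
The six rules above can be sketched in Python roughly as follows. This is not the driver's code, only an illustration of the rules as stated. The handling of nulls (ignored), booleans (treated as String), empty arrays (IntegerList) and attributes that mix scalars with arrays (String) is an assumption, and the attribute names and values are made up.

```python
def classify(value):
    """Classify a single JSON value according to the rules listed above."""
    if isinstance(value, bool):
        return "String"  # assumption: bool is checked first because it subclasses int
    if isinstance(value, int):
        return "Integer"
    if isinstance(value, float):
        return "Real"
    if isinstance(value, list):
        kinds = {classify(item) for item in value}
        if kinds <= {"Integer"}:
            return "IntegerList"  # assumption: an empty array also lands here
        if kinds <= {"Integer", "Real"}:
            return "RealList"
        return "StringList"
    return "String"


def merge(old, new):
    """Combine the type seen so far with the type of the next value."""
    if old == new:
        return old
    pair = {old, new}
    if pair == {"Integer", "Real"}:
        return "Real"
    if pair == {"IntegerList", "RealList"}:
        return "RealList"
    if pair <= {"IntegerList", "RealList", "StringList"}:
        return "StringList"
    return "String"  # assumption: scalar/array mixes fall back to String


def guess_schema(features):
    """Scan all features and return {attribute name: guessed type}."""
    schema = {}
    for feature in features:
        for name, value in (feature.get("properties") or {}).items():
            if value is None:
                continue  # assumption: nulls do not influence the guess
            kind = classify(value)
            schema[name] = merge(schema[name], kind) if name in schema else kind
    return schema


# Made-up features, only to exercise the rules.
features = [
    {"properties": {"count": 7, "area": 403970.143, "code": "06", "levels": [1, 2]}},
    {"properties": {"count": 12, "area": 20023, "code": "08", "levels": [1.5]}},
]
print(guess_schema(features))
```

Note that "area" ends up as Real even though the second feature holds an integer, because the decision is made over the whole file rather than per feature.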

With RFC 50 and other pending improvements in the driver:
- if an attribute has boolean-only content --> Integer(Boolean)
- if an attribute has an array of boolean-only content --> IntegerList(Boolean)
- if an attribute has date-only content --> Date
- if an attribute has time-only content --> Time
- if an attribute has datetime or date content --> DateTime

I'm not sure we want to invent a .jsont format, but if you download
http://svn.osgeo.org/gdal/trunk/gdal/swig/python/samples/ogr2vrt.py

and run:

python ogr2vrt.py "http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" test.vrt

This will create a VRT with the default schema, which you can easily edit.
Note: as with OGR SQL CAST, this is post-processing. So if the guess made by
the GeoJSON driver leads to a loss of information, you cannot recover it.
Hopefully the implemented rules will not lead to information loss.
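
For readers who have not seen an OGR VRT, the following Python sketch writes a small one by hand, with the field types spelled out explicitly. It does not reproduce the exact output of ogr2vrt.py; the source file name states.json, the layer name OGRGeoJSON and the two fields are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_vrt(src_datasource, src_layer, fields):
    """Return OGR VRT XML declaring the given (name, OGR type) fields."""
    root = ET.Element("OGRVRTDataSource")
    layer = ET.SubElement(root, "OGRVRTLayer", name=src_layer)
    ET.SubElement(layer, "SrcDataSource").text = src_datasource
    ET.SubElement(layer, "SrcLayer").text = src_layer
    for name, ogr_type in fields:
        # "src" names the source attribute, "type" is the type to expose.
        ET.SubElement(layer, "Field", name=name, type=ogr_type, src=name)
    return ET.tostring(root, encoding="unicode")


# Assumed inputs: a local states.json whose layer is named OGRGeoJSON.
vrt_text = build_vrt(
    "states.json",
    "OGRGeoJSON",
    [("STATE_FIPS", "String"), ("FAMILIES", "Real")],
)
print(vrt_text)
```

Editing the type attribute of a Field element is the kind of change meant by "easily edit" above. Fields that are not listed may be dropped from the VRT layer, so a real VRT would normally list every attribute.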

A better approach would be to have the schema embedded as JSON in the
GeoJSON file itself.
That could be an evolution of the format, but I'm not sure it would be really
popular, given that JSON/GeoJSON is heavily used by NoSQL approaches...

Hmm, doing a quick search, I just found http://json-schema.org/, which appears
to be an IETF draft.
It doesn't look like the schema is embedded in the data file itself.

There's also GeoJSON-LD that might be a bit related : 
https://github.com/geojson/geojson-ld

CC'ing Sean in case he has thoughts on this.

Even


Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Rahkonen Jukka (Tike)
Hi,

I have no use for this feature myself, but by reading various mailing lists and
forums I have learned that many people consider it always a good idea to
read data, for example from WFS services, as GeoJSON instead of GML. I can easily
imagine that there will be trouble with the guess-by-data method if they
make subsequent requests to the service. For example, strings which consist only
of digits but which may contain leading zeroes end up as either integers or
strings, if the leading zeroes are interpreted correctly at all. The same goes
for floats which do not always contain decimals, and for list attributes which
sometimes have only zero or one member.
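
A small Python sketch of the "floats which do not always contain decimals" case: two responses from the same hypothetical service are guessed independently and end up with different types for the same attribute. The attribute name, the values and the simplified guessing function are all made up for illustration.

```python
import json


def guessed_types(geojson_text):
    """Guess Integer/Real/String per attribute from one response (simplified)."""
    types = {}
    for feature in json.loads(geojson_text)["features"]:
        for name, value in feature["properties"].items():
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                kind = "String"
            elif isinstance(value, int):
                kind = "Integer"
            else:
                kind = "Real"
            previous = types.setdefault(name, kind)
            if previous != kind:
                both_numeric = {previous, kind} == {"Integer", "Real"}
                types[name] = "Real" if both_numeric else "String"
    return types


# Two made-up responses, as if from two subsequent requests.
first = '{"features": [{"properties": {"area": 12}}, {"properties": {"area": 40}}]}'
second = '{"features": [{"properties": {"area": 12.5}}]}'

print(guessed_types(first))
print(guessed_types(second))
```

The first response yields Integer for "area" and the second yields Real, so appending both results to the same target table would need a cast somewhere.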

An embedded schema feels optimal because it would always travel together with
the data, and we have probably all lost .tfw or .prj files at some point.

-Jukka-


Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Even Rouault
On Friday 21 November 2014 15:35:43, Rahkonen Jukka (Tike) wrote:
> Hi,
> 
> I have no use for this feature myself but by reading various mailing lists
> and forums I have learned that many people consider it is always a good
> idea to read data for example from WFS services as GeoJSON instead of GML.

Because it consumes less bandwidth?

For the record, if you try the following, it will use the GML schema for the
user-exposed layer and will do an on-the-fly transform from the hidden GeoJSON
layer schema to the GML schema, similar to the one you could do with a CAST/VRT.

$ ogrinfo "WFS:http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" -ro -al -where "STATE_NAME = 'California'"

Layer name: topp:states
Geometry: Multi Polygon
Feature Count: 1
Extent: (-124.391472, 32.535725) - (-114.124451, 42.002346)
Layer SRS WKT:
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,
AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4326"]]
gml_id: String (0.0)
STATE_NAME: String (0.0)
STATE_FIPS: String (0.0)
SUB_REGION: String (0.0)
STATE_ABBR: String (0.0)
LAND_KM: Real (0.0)
WATER_KM: Real (0.0)
PERSONS: Real (0.0)
FAMILIES: Real (0.0)
HOUSHOLD: Real (0.0)
MALE: Real (0.0)
FEMALE: Real (0.0)
WORKERS: Real (0.0)
DRVALONE: Real (0.0)
CARPOOL: Real (0.0)
PUBTRANS: Real (0.0)
EMPLOYED: Real (0.0)
UNEMPLOY: Real (0.0)
SERVICE: Real (0.0)
MANUAL: Real (0.0)
P_MALE: Real (0.0)
P_FEMALE: Real (0.0)
SAMP_POP: Real (0.0)
OGRFeature(topp:states):0
  gml_id (String) = (null)
  STATE_NAME (String) = California
  STATE_FIPS (String) = 06
  SUB_REGION (String) = Pacific
  STATE_ABBR (String) = CA
  LAND_KM (Real) = 403970.143
  WATER_KM (Real) = 20023.368
  PERSONS (Real) = 29760021
  FAMILIES (Real) = 7139394
  HOUSHOLD (Real) = 10381206
  MALE (Real) = 14897627
  FEMALE (Real) = 14862394
  WORKERS (Real) = 11306576
  DRVALONE (Real) = 9982242
  CARPOOL (Real) = 2036025
  PUBTRANS (Real) = 685797
  EMPLOYED (Real) = 13996309
  UNEMPLOY (Real) = 996502
  SERVICE (Real) = 3664771
  MANUAL (Real) = 1798201
  P_MALE (Real) = 0.501
  P_FEMALE (Real) = 0.499
  SAMP_POP (Real) = 3792553
  MULTIPOLYGON ((()))
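
The effect of such a transform can be pictured with a few lines of Python. This is only a sketch of the idea of casting guessed values to a declared target schema, not the way the WFS driver implements it. The target schema below is a hand-picked subset of the fields listed above, and the function name is hypothetical.

```python
# Declared target types for a few of the fields above (String -> str, Real -> float).
TARGET_SCHEMA = {"STATE_FIPS": str, "FAMILIES": float, "P_MALE": float}


def cast_properties(properties, schema):
    """Return the properties cast to the declared types; missing values stay None."""
    cast = {}
    for name, caster in schema.items():
        value = properties.get(name)
        cast[name] = None if value is None else caster(value)
    return cast


# FAMILIES arrives as a JSON integer but is declared as a double in the schema.
row = cast_properties(
    {"STATE_FIPS": "06", "FAMILIES": 7139394, "P_MALE": 0.501},
    TARGET_SCHEMA,
)
print(row)
```

As with CAST and VRT, this happens after parsing, so it can widen an integer to a real but cannot restore anything the parser has already discarded.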

> I can easily imagine that there will be troubles with guess-by-data method
> if they are making subsequent requests from the service. For example
> strings which are all numbers but which may contain leading zeroes are
> saved either to integers or strings  if leading zeroes are interpreted
> right at all. 

In JSON, "00123" and 00123 are different objects. So a string with leading
zeros should be serialized as "00123" and not 00123. If it is serialized as
"00123", the GeoJSON driver will interpret it as a string.
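
The distinction can be checked with Python's standard json module, as a minimal sketch. The JSON grammar does not allow leading zeros in numbers, so a strict parser such as this one rejects the unquoted form outright; other parsers may be more lenient, and the attribute name is only an example.

```python
import json

# Quoted: a JSON string, so the leading zero survives.
quoted = json.loads('{"STATE_FIPS": "06"}')["STATE_FIPS"]

# Unquoted with a leading zero: not valid JSON for a strict parser.
try:
    json.loads('{"STATE_FIPS": 06}')
    unquoted_accepted = True
except json.JSONDecodeError:
    unquoted_accepted = False

# Unquoted without the zero: a JSON number, the zero is already gone.
number = json.loads('{"STATE_FIPS": 6}')["STATE_FIPS"]

print(repr(quoted), unquoted_accepted, repr(number))
```

So the information is only lost if the producer writes the value as a number in the first place.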

> Or floats which do not always contain decimals, or list
> attributes which sometimes have only zero or one member.

Yes, those cases could cause issues.


Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Andreas Oxenstierna

Hi

The usual reason to choose GeoJSON for geoweb applications is that JSON
is parsed directly by the web browser, i.e. you get JavaScript objects
directly digestible by your JavaScript code. This may also be
considerably faster than parsing XML.

Bandwidth is more or less irrelevant in comparison.



Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Rahkonen Jukka (Tike)
Even Rouault wrote:

> On Friday 21 November 2014 15:35:43, Rahkonen Jukka (Tike) wrote:
> > Hi,
> >
> > I have no use for this feature myself but by reading various mailing
> > lists and forums I have learned that many people consider it is always
> > a good idea to read data for example from WFS services as GeoJSON instead
> > of GML.
>
> Because it consumes less bandwidth?


I suppose rather that they generalize the good experiences with GeoJSON in
browsers to mean that GML is poor for everything. I found an interesting site:
http://jsperf.com/openlayers-format-reading-speed/2

I think I can indeed see a meaningful speed difference there.

-Jukka-



Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Sean Gillies
Hi Even, Jukka,

While the OGC service architecture is heavily dependent on schemas, OGR
type schemas are not *generally* useful for GeoJSON. Consider the following
abbreviated feature collection:

  "features": [
    {"properties": {"a": 0, "b": "lol"}, ...},
    {"properties": {"c": "2014-11-21", "d": "wut"}, ...}
  ]

It has two features and they are distinctly different types. A schema that
says these features have 4 fields would be nonsensical.
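
To see what a single flat schema would do to this collection, here is a short Python sketch that takes the union of the property names and fills in the gaps. It only illustrates the point and is not a description of what OGR does; the elided parts of the features ("...") are simply left out.

```python
features = [
    {"properties": {"a": 0, "b": "lol"}},
    {"properties": {"c": "2014-11-21", "d": "wut"}},
]

# Union of property names, in order of first appearance.
field_names = []
for feature in features:
    for name in feature["properties"]:
        if name not in field_names:
            field_names.append(name)

# One row per feature, with None where the feature lacks the field.
rows = [[feature["properties"].get(name) for name in field_names] for feature in features]

null_cells = sum(value is None for row in rows for value in row)
null_share = null_cells / (len(rows) * len(field_names))

print(field_names)
print(rows)
print(null_share)
```

Half of the cells in the flattened table are empty, and no feature has a value in more than two of the four fields.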

There are a bunch of different JSON schema approaches and none of them seem
to have any traction. https://github.com/json-schema/json-schema for
example looks to be stalled. I think the lack of traction reflects some
deeper reality: that XML and JSON have very different strengths and use
cases and that attempts to XML-ize JSON by adding schemas will always
eventually run out of steam.

For OGR to write schemas into GeoJSON would be a mistake. They could be
misleading and because there will never (as far as I can tell) be consensus
in the JSON community on the right form of schema, anything OGR implemented
would end up being a "loser".



Re: [gdal-dev] Simple schema support for GeoJSON

2014-11-21 Thread Rahkonen Jukka (Tike)
Hi,


As I wrote, I was motivated to send my first mail because I have seen that
people quite often use GeoJSON for delivering geospatial data as data, to
be saved on disk and used like shapefiles, GML etc. As a result you get stuff
like this:

http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=application/json


You wrote, and I agree, "that XML and JSON have very different strengths
and use cases". However, people do what they want, and I do feel that GeoJSON
will be used for use cases where XML could be stronger, for example as the only
supported format in some download services.


About the nonsensical 4-field schema: it is a little bit violent, but it is just
what almost everybody who uses OpenStreetMap data does all the time. OSM
features are pushed into the traditional simple feature model and a set of tags
is converted to attributes in a fixed schema. There are lots of null fields in
the data, and even though that is in a way nonsensical, it is also practical
because it makes it possible to use osm2pgsql, PostGIS and Mapnik for rendering.


I am so fixated on consuming data that I was not thinking at all about how to
write GeoJSON with GDAL. I was just thinking about how, if some data are only
available as GeoJSON, users could convert them to PostGIS etc. so that the
data types of the attributes are the same as in the original data.


Because GeoJSON does not carry the data types as a payload, I suppose that the
current guess-the-datatype approach is the best starting point. The workaround
of using a VRT, as Even suggested, is good for fine tuning, and casting with SQL
works as well. The correct data types may still be somewhat uncertain, but
perhaps those who maintain such services will announce the structure of their
data on their web pages if they feel that it is important, for example because
they expect data updates from users. When it comes to WFS, it seems to be an
easy case because the XML schema can be reused as a "GeoJSON schema".
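
A rough Python sketch of that last idea: read the element declarations from a DescribeFeatureType response and translate the XSD types into OGR field types. The sample document is a hand-written, abbreviated stand-in and not the actual server response. The type mapping is an assumed one for illustration, not GDAL's own table, and skipping everything with a "gml" prefix as geometry is a simplification.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Assumed mapping from XSD simple types to OGR field types.
XSD_TO_OGR = {
    "string": "String",
    "double": "Real",
    "float": "Real",
    "int": "Integer",
    "integer": "Integer",
    "date": "Date",
    "dateTime": "DateTime",
}

# Hand-written, abbreviated stand-in for a DescribeFeatureType response.
SAMPLE_XSD = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:gml="http://www.opengis.net/gml">
  <xsd:element name="the_geom" type="gml:MultiPolygonPropertyType"/>
  <xsd:element name="STATE_NAME" type="xsd:string"/>
  <xsd:element name="LAND_KM" type="xsd:double"/>
</xsd:schema>
"""


def ogr_schema_from_xsd(xsd_text):
    """Return {attribute name: OGR type} for the non-geometry elements."""
    schema = {}
    for element in ET.fromstring(xsd_text).iter(f"{{{XSD_NS}}}element"):
        name, xsd_type = element.get("name"), element.get("type")
        if not name or not xsd_type:
            continue
        prefix, _, local = xsd_type.rpartition(":")
        if prefix == "gml":
            continue  # simplification: treat gml:* types as the geometry
        schema[name] = XSD_TO_OGR.get(local, "String")
    return schema


wfs_schema = ogr_schema_from_xsd(SAMPLE_XSD)
print(wfs_schema)
```

The resulting dictionary could then drive a VRT or a set of SQL casts, so that the GeoJSON output of the service gets the same field types as its GML output.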


-Jukka Rahkonen-



