Yes, there is a lot to consider when processing large blobs in Spark. Here
is what I have learned:
 - Do the spatial join (points against the geotiff extents) with as few
   columns as possible, ideally only an id for each geotiff. After that
   join you can join the geotiff content back in using the id.
 - Aggregate the points into an array of points per geotiff. Your getValue
   udf should take an array of points and return an array of values. That
   way each geotiff is only loaded once.
 - Parquet in Spark is not very good at handling large blobs. If reading
   parquet files that contain geotiffs is slow, you can repartition() with
   a very large number of partitions to force smaller row groups when
   writing, or use Avro instead.
https://www.uber.com/en-SE/blog/hdfs-file-format-apache-spark/
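
To make the first two points concrete, here is a minimal sketch in plain
Python, without Spark. It only shows the shape of the logic. The names
group_points_by_raster and get_values are hypothetical, the nested lists
stand in for an already decoded raster band, and the rasters are assumed
to be north-up with no rotation. In a real job the grouping would be the
spatial join plus an aggregation, and the band would be decoded once per
geotiff (for example with rasterio) inside the udf.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]                 # (x, y) in world coordinates
Extent = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def group_points_by_raster(
    extents: Dict[str, Extent], points: Sequence[Point]
) -> Dict[str, List[Point]]:
    """Stand-in for the narrow join: only ids and extents take part."""
    grouped: Dict[str, List[Point]] = defaultdict(list)
    for x, y in points:
        for raster_id, (min_x, min_y, max_x, max_y) in extents.items():
            # Half-open on the east and south edges, matching the pixel grid.
            if min_x <= x < max_x and min_y < y <= max_y:
                grouped[raster_id].append((x, y))
    return dict(grouped)


def get_values(
    band: Sequence[Sequence[int]], extent: Extent, points: Sequence[Point]
) -> List[Optional[int]]:
    """Look up all points against one decoded band; None if outside."""
    min_x, min_y, max_x, max_y = extent
    n_rows, n_cols = len(band), len(band[0])
    pixel_width = (max_x - min_x) / n_cols
    pixel_height = (max_y - min_y) / n_rows
    values: List[Optional[int]] = []
    for x, y in points:
        # World -> grid: columns grow eastwards, rows grow southwards.
        col = math.floor((x - min_x) / pixel_width)
        row = math.floor((max_y - y) / pixel_height)
        if 0 <= row < n_rows and 0 <= col < n_cols:
            values.append(band[row][col])
        else:
            values.append(None)
    return values


# Two made-up 2x2 rasters side by side, keyed by a hypothetical id.
extents = {"a": (0.0, 0.0, 2.0, 2.0), "b": (2.0, 0.0, 4.0, 2.0)}
bands = {"a": [[1, 2], [3, 4]], "b": [[5, 6], [7, 8]]}
points = [(0.5, 1.5), (1.5, 0.5), (2.5, 1.5), (9.0, 9.0)]

grouped = group_points_by_raster(extents, points)
# Each band is touched once, with all of its points in a single call.
results = {rid: get_values(bands[rid], extents[rid], pts)
           for rid, pts in grouped.items()}
print(results)  # {'a': [1, 4], 'b': [5]}
```

The point (9.0, 9.0) matches no extent, so it never reaches get_values.
That is the benefit of joining on the extents first.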

Good luck!

Br,
Martin Andersson


On Fri, 20 Jan 2023 at 13:08, Pedro Mano Fernandes <
pedromor...@gmail.com> wrote:

> Thanks Martin, it sounds promising. I'll actually give it a try before
> going with geotiff conversions.
>
> I'm foreseeing some concerns, though:
>
>    - I'm afraid it won't be optimal for a big geotiff - I may have to
>    split the geotiff into smaller geotiffs
>    - I wonder how the spatial partitioning optimization will behave in
>    such an approach - I may have to load smaller geotiffs and use their
>    geometry to join (my coordinates against envelope boundaries) before
>    calculating the getValue for my coordinates
>
> Best,
>
> On Fri, 20 Jan 2023 at 08:49, Martin Andersson <
> u.martin.anders...@gmail.com> wrote:
>
>> I would read the geotiff files as binary:
>> https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
>>
>> Then you can define a udf to extract values directly from the geotiffs.
>> If you're on Python you can use rasterio to do that.
>>
>> In Java it would look something like this:
>>
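>>   // Imports assume a GeoTools release that still ships the org.opengis packages.
>>   import java.awt.image.Raster;
>>   import java.io.ByteArrayInputStream;
>>   import java.io.IOException;
>>   import org.geotools.coverage.grid.GridCoordinates2D;
>>   import org.geotools.coverage.grid.GridCoverage2D;
>>   import org.geotools.coverage.grid.GridGeometry2D;
>>   import org.geotools.gce.geotiff.GeoTiffReader;
>>   import org.geotools.geometry.DirectPosition2D;
>>   import org.opengis.referencing.operation.TransformException;
>>
>>   // Returns the first-band value at world coordinate (x, y), or null when
>>   // the point falls outside the raster.
>>   Integer getValue(byte[] geotiff, double x, double y)
>>       throws IOException, TransformException {
>>     try (ByteArrayInputStream inputStream = new ByteArrayInputStream(geotiff)) {
>>       GeoTiffReader geoTiffReader = new GeoTiffReader(inputStream);
>>       GridCoverage2D grid = geoTiffReader.read(null);
>>       Raster raster = grid.getRenderedImage().getData();
>>       GridGeometry2D gridGeometry = grid.getGridGeometry();
>>
>>       // Convert the world coordinate to pixel (grid) coordinates.
>>       DirectPosition2D directPosition2D = new DirectPosition2D(x, y);
>>       GridCoordinates2D gridCoordinates2D =
>>           gridGeometry.worldToGrid(directPosition2D);
>>       try {
>>         int[] pixel = raster.getPixel(
>>             gridCoordinates2D.x, gridCoordinates2D.y, new int[1]);
>>         return pixel[0];
>>       } catch (ArrayIndexOutOfBoundsException exc) {
>>         // The point is outside the raster extent.
>>         return null;
>>       }
>>     }
>>   }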
>>
>> Br,
>> Martin Andersson
>>
>> On Wed, 18 Jan 2023 at 17:59, Pedro Mano Fernandes <
>> pedromor...@gmail.com> wrote:
>>
>>> Thanks for the update, guys.
>>>
>>> I'm not ready to contribute yet.
>>>
>>> In the meantime, the solution could perhaps be to convert GeoTiff to
>>> another format supported by Sedona. If anyone has had this use case before
>>> or has any idea, please share.
>>>
>>> Best,
>>>
>>> On Wed, 18 Jan 2023 at 09:47, Martin Andersson <
>>> u.martin.anders...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think you are looking for something like this:
>>>> https://postgis.net/docs/RT_ST_Value.html
>>>>
>>>> The raster support in Sedona is very limited at the moment. The lack of
>>>> a proper raster type makes implementing st_value impossible. We had a brief
>>>> discussion about that recently.
>>>> https://lists.apache.org/thread/qdfcvxl6z5pb7m7ky5zsksyytyxqwv8c
>>>>
>>>> If you want to make a contribution and need some guidance, please let
>>>> me know!
>>>>
>>>> Br,
>>>> Martin Andersson
>>>>
>>>> On Wed, 18 Jan 2023 at 05:45, Jia Yu <ji...@apache.org> wrote:
>>>>
>>>>> Hi Pedro,
>>>>>
>>>>> I got your point. Unfortunately, we don't have this function yet in
>>>>> Sedona.
>>>>> But we welcome anyone who wants to contribute this to Sedona!
>>>>>
>>>>> Thanks,
>>>>> Jia
>>>>>
>>>>> On Tue, Jan 17, 2023 at 9:11 AM Pedro Mano Fernandes <
>>>>> pedromor...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> > Hi all,
>>>>> >
>>>>> > Any clue? Or any documentation I can refer to?
>>>>> >
>>>>> > Here goes a dummy example to better explain myself: in QGIS I can
>>>>> > click a point (coordinates) of the geotiff and get the value in that
>>>>> > point (in this case 231 of Band 1).
>>>>> >
>>>>> > [image: image.png]
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > On Sun, 15 Jan 2023 at 16:17, Pedro Mano Fernandes <
>>>>> pedromor...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> Hi Jia,
>>>>> >>
>>>>> >> Thanks for the fast response.
>>>>> >>
>>>>> >> With the regular spatial join I’ll get the array of data of the
>>>>> >> whole geotiff polygon. I was hoping to get the data element for
>>>>> >> specific coordinates inside that polygon. In other words: I guess the
>>>>> >> array of data corresponds to all the positions in the polygon, but I
>>>>> >> want to fetch specific positions.
>>>>> >>
>>>>> >> Thanks,
>>>>> >>
>>>>> >> On Sun, 15 Jan 2023 at 01:09, Jia Yu <ji...@apache.org> wrote:
>>>>> >>
>>>>> >>> Hi Pedro,
>>>>> >>>
>>>>> >>> Once you use the Sedona geotiff reader to read those geotiffs, you
>>>>> >>> will get a dataframe with the following schema:
>>>>> >>>
>>>>> >>>  |-- image: struct (nullable = true)
>>>>> >>>  |    |-- origin: string (nullable = true)
>>>>> >>>  |    |-- Geometry: string (nullable = true)
>>>>> >>>  |    |-- height: integer (nullable = true)
>>>>> >>>  |    |-- width: integer (nullable = true)
>>>>> >>>  |    |-- nBands: integer (nullable = true)
>>>>> >>>  |    |-- data: array (nullable = true)
>>>>> >>>  |    |    |-- element: double (containsNull = true)
>>>>> >>>
>>>>> >>>
>>>>> >>> You can fetch the geometry column and perform the spatial join in
>>>>> >>> the following way:
>>>>> >>>
>>>>> >>> geotiffDF = geotiffDF.selectExpr(
>>>>> >>>     "image.origin as origin",
>>>>> >>>     "ST_GeomFromWkt(image.geometry) as Geom",
>>>>> >>>     "image.height as height",
>>>>> >>>     "image.width as width",
>>>>> >>>     "image.data as data",
>>>>> >>>     "image.nBands as bands")
>>>>> >>> geotiffDF.createOrReplaceTempView("GeotiffDataframe")
>>>>> >>> geotiffDF.show()
>>>>> >>>
>>>>> >>> More info can be found here:
>>>>> >>> https://sedona.apache.org/1.3.1-incubating/api/sql/Raster-loader/#geotiff-dataframe-loader
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> Jia
>>>>> >>>
>>>>> >>> On Sat, Jan 14, 2023 at 9:10 AM Pedro Mano Fernandes <
>>>>> >>> pedromor...@gmail.com> wrote:
>>>>> >>>
>>>>> >>>> Hi everyone!
>>>>> >>>>
>>>>> >>>> I'm trying to use elevation data in GeoTiff format. I understand
>>>>> >>>> how to load the dataset, as described in
>>>>> >>>> https://sedona.staged.apache.org/api/sql/Raster-loader/#geotiff-dataframe-loader
>>>>> >>>>
>>>>> >>>> Now I'm wondering how to join this dataframe with another one that
>>>>> >>>> contains coordinates, in order to get the elevation data for those
>>>>> >>>> coordinates.
>>>>> >>>>
>>>>> >>>> Something along these lines:
>>>>> >>>>
>>>>> >>>> pointsDF
>>>>> >>>>   .join(geotiffDF, ...)
>>>>> >>>>   .select("lon", "lat", "geotiff_data")
>>>>> >>>>
>>>>> >>>> Are there any examples or documentation I can follow to accomplish
>>>>> >>>> this?
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>> Pedro Mano Fernandes
>>>>> >>>>
>>>>> >>> --
>>>>> >> Pedro Mano Fernandes
>>>>> >>
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Pedro Mano Fernandes
>>>>> >
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Martin
>>>>
>>>
>>>
>>> --
>>> Pedro Mano Fernandes
>>>
>>
>
> --
> Pedro Mano Fernandes
>
