Re: [PR] [GH-2360] Support fetching libpostal model data from HDFS/object store [sedona]

via GitHub Wed, 11 Feb 2026 13:05:27 -0800


jiayuasu commented on code in PR #2637:
URL: https://github.com/apache/sedona/pull/2637#discussion_r2795526441



##########
docs/api/sql/Function.md:
##########
@@ -19,13 +19,19 @@
 
 ## ExpandAddress
 
-Introduction: Returns an array of expanded forms of the input address string. 
This is backed by the [libpostal](https://github.com/openvenues/libpostal) 
library's address expanding functionality.
+Introduction: Returns an array of expanded forms of the input address string. 
This is backed by the [libpostal](https://github.com/openvenues/libpostal) 
library's address expanding functionality. Jpostal requires at least 2 GB of 
free disk space to store the data files used for address parsing and expanding. 
By default, the data files are downloaded automatically to a temporary 
directory (`<java.io.tmpdir>/libpostal/`, e.g. `/tmp/libpostal/` on 
Linux/macOS) when the function is called for the first time. The version of 
jpostal installed with this package only supports Linux and macOS. If you are 
using Windows, you will need to install libjpostal and libpostal manually and 
ensure that they are available in your `java.library.path`.
 
-!!!Note
-    Jpostal requires at least 2 GB of free disk space to store the data files 
used for address parsing and expanding. The data files are downloaded 
automatically when the function is called for the first time.
+The data directory can be configured via `spark.sedona.libpostal.dataDir`. You 
can point it to a remote filesystem path (HDFS, S3, GCS, ABFS, etc.) such as 
`hdfs:///data/libpostal/` or `s3a://my-bucket/libpostal/`. When using a remote 
path, you must distribute the data to all executors before running queries by 
calling `sc.addFile("hdfs:///data/libpostal/", recursive=True)` (PySpark) or 
`sc.addFile("hdfs:///data/libpostal/", recursive = true)` (Scala). In this 
remote-URI mode, the automatic internet download performed by jpostal is 
disabled, so the remote directory must already contain the libpostal model 
files. For local filesystem paths, jpostal's download-if-needed behavior 
remains enabled.

Review Comment:
   `sc.addFile` must be called from the driver before any tasks run on the
   executors. It requires access to the SparkContext, which isn't readily
   available from within a UDF/expression.
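
A minimal sketch of the driver-side call this implies, assuming the user performs the distribution step explicitly before issuing any query. The helper name `distribute_libpostal_data` is hypothetical and not part of Sedona; the only Spark call used is `SparkContext.addFile(path, recursive=True)`, as quoted in the docs change above.

```python
def distribute_libpostal_data(sc, data_dir):
    """Driver-side helper (hypothetical name): ship data_dir to every executor.

    `sc` is expected to be a pyspark SparkContext; only its addFile
    method is used, so this must run on the driver, not inside a UDF.
    """
    # recursive=True because data_dir is a directory, not a single file.
    sc.addFile(data_dir, recursive=True)
    return data_dir


# Intended call order on the driver (needs a live Spark session, so shown as comments):
#   1. set spark.sedona.libpostal.dataDir to the remote path in the session config
#   2. distribute_libpostal_data(spark.sparkContext, "hdfs:///data/libpostal/")
#   3. only then run queries that call ExpandAddress
```

Keeping this as an explicit driver-side step avoids needing a SparkContext inside the expression itself, which is the constraint raised above.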



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
