Re: Installing Apache Sedona and its dependencies
Can't comment on the runtime, but there was a bug that prevented the global index from being used in many cases (see https://github.com/apache/incubator-sedona/pull/511), including any join attempted directly from SQL.

The non-indexed join is very memory-inefficient right now (it loads all points/inner objects into memory at once), which is likely what caused the OOM error. The DynamicIndex join is the most memory-efficient, but you need to use the RDD API directly. Not sure when the next release will be, but until then big joins can't really be done from SQL.

Adam

--
Adam Binford
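The RDD-API route Adam recommends could look roughly like this. This is only a sketch based on the Sedona 1.0.x Python docs; `Adapter.toSpatialRdd`, the `"geometry"` column name, and the exact `SpatialJoinQueryFlat` signature are assumptions to verify against your installed version:

```python
# Sketch of a spatial join via the Sedona RDD API with spatial
# partitioning, instead of Sedona SQL. Module paths and signatures
# follow the 1.0.x Python docs; treat them as assumptions.
def run_rdd_join(spark, points_df, polygons_df):
    """Join two Sedona DataFrames (each with a "geometry" column)
    through the RDD API, partitioning both sides on a KDB-tree grid."""
    from sedona.core.enums import GridType
    from sedona.core.spatialOperator import JoinQuery
    from sedona.utils.adapter import Adapter

    # Convert the DataFrames to SpatialRDDs
    point_rdd = Adapter.toSpatialRdd(points_df, "geometry")
    polygon_rdd = Adapter.toSpatialRdd(polygons_df, "geometry")

    # Collect envelope/count statistics needed for partitioning
    point_rdd.analyze()
    polygon_rdd.analyze()

    # Partition both RDDs on the same KDB-tree grid so matching
    # geometries land in the same partitions
    point_rdd.spatialPartitioning(GridType.KDBTREE)
    polygon_rdd.spatialPartitioning(point_rdd.getPartitioner())

    # useIndex=True without a prebuilt index should take the on-the-fly
    # (dynamic) index path Adam mentions -- worth confirming against
    # your version's docs.
    return JoinQuery.SpatialJoinQueryFlat(
        point_rdd, polygon_rdd, True, True
    )
```

The function only touches Sedona inside its body, so it can be defined without a running cluster and called later with a live SparkSession.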
Re: Installing Apache Sedona and its dependencies
Hi Jia,

Thanks very much for your help! The setup worked, I managed to run Apache Sedona in a Jupyter Notebook :)

However, another problem has occurred. I have two cases:
1. small join: gdf1 contains POLYGONs, shape: (250 rows, 3 columns); gdf2 contains POINTs, shape: (2+ million rows, 5 columns)
2. big join: sdf1 contains POLYGONs, shape: (250 rows, 3 columns); sdf2 contains POINTs, shape: (56+ million rows, 5 columns)

The "small join" takes 3 seconds to run in GeoPandas, but 5 minutes to run in PySpark with Apache Sedona. All the settings are based on this notebook:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL.ipynb
with these 3 lines in addition:

spark.conf.set("sedona.global.index", "true")
spark.conf.set("sedona.global.indextype", "rtree")
spark.conf.set("sedona.join.gridtype", "kdbtree")

based on the settings in this file:
https://github.com/iag-geo/spark_testing/blob/master/apache_sedona/02_run_spatial_query.py

I also tried spatial partitioning, creating RDDs, then JoinQuery, and after that JoinQueryRaw as well, but it again took around 5 minutes.

I tried out the "big join" with Apache Sedona. After an hour and a half, I received the following warnings and errors:

WARN BlockManager: Block rdd_53_1 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 187)
java.lang.OutOfMemoryError: Java heap space
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 187,5,main]
java.lang.OutOfMemoryError: Java heap space
WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 187, roberts-mbp, executor driver): java.lang.OutOfMemoryError: Java heap space
ERROR TaskSetManager: Task 1 in stage 11.0 failed 1 times; aborting job
ERROR Utils: Uncaught exception in thread Executor task launch worker for task 190
java.lang.NullPointerException

There must be some problem with my settings, but I cannot go forward without help from someone with more experience.

Do you have any recommendations on what to read or how to attempt the "big join"?

Have a nice evening,
Robert
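For reference, the SQL-side setup Robert describes can be wrapped up roughly as follows. The three conf keys come from the iag-geo script linked in the thread; the view names, the `geom` column, and the use of `ST_Contains` as the join predicate are hypothetical placeholders:

```python
# Sedona SQL join as described in the thread. sdf1/sdf2 and the "geom"
# column are hypothetical names; the three conf keys are taken from the
# referenced iag-geo script.
def run_sql_join(spark, sdf1, sdf2):
    """Polygon-contains-point join through Sedona SQL."""
    spark.conf.set("sedona.global.index", "true")
    spark.conf.set("sedona.global.indextype", "rtree")
    spark.conf.set("sedona.join.gridtype", "kdbtree")

    # Register the DataFrames so they can be joined in SQL
    sdf1.createOrReplaceTempView("polygons")
    sdf2.createOrReplaceTempView("points")

    # ST_Contains is a standard Sedona SQL predicate
    return spark.sql("""
        SELECT polygons.*, points.*
        FROM polygons, points
        WHERE ST_Contains(polygons.geom, points.geom)
    """)
```

As Adam notes above, this SQL path could not use the global index before the fix in PR 511, so on large inputs it falls back to the memory-hungry non-indexed join.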
Re: Installing Apache Sedona and its dependencies
Hi Robert,

The tutorial you found on our website is a step-by-step tutorial for Python Jupyter. In that tutorial, pipenv will install all dependencies from binder/Pipfile:
https://github.com/apache/incubator-sedona/tree/master/binder

If you run into any specific issues, you can post here and we can help you.

Thanks,
Jia
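Concretely, the binder-based setup Jia points to amounts to something like the following. This is a sketch: the repo layout (binder/Pipfile) matches the incubator-sedona master branch at the time, but verify the paths against the current repository.

```shell
# Sketch of the pipenv setup from the binder directory of the
# incubator-sedona repo; paths assume the master-branch layout.
git clone https://github.com/apache/incubator-sedona.git
cd incubator-sedona/binder

# pipenv reads binder/Pipfile and creates an isolated virtualenv
pip install pipenv
pipenv install

# Launch Jupyter inside that environment
pipenv run jupyter notebook
```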
Installing Apache Sedona and its dependencies
Hi all,

I am new to PySpark and programming. I would like to do a spatial join between two geographical datasets, one of which consists of 50+ million rows.

Is there anyone here who could explain to me, step by step, how to install Apache Sedona (GeoSpark) and its dependencies on a Mac? After the installation I would like to run it locally in a virtual environment, first in a Jupyter Notebook and then in a .py file.

On the official website I have found a quick start guide:
https://sedona.apache.org/download/overview/
and a Python Jupyter Notebook examples guide:
https://sedona.apache.org/tutorial/jupyter-notebook/
However, it is still not clear how to install it and make it run. Unfortunately, I didn't find any useful step-by-step guide with the help of Google or YouTube, and I feel stuck in an infinite loop of reading link after link, each explaining a different solution.

Thanks a lot in advance,
Robert
robertboz...@gmail.com