Re: Installing Apache Sedona and its dependencies
Can't comment on the runtime, but there was a bug that prevented the global index from being used in many cases (see https://github.com/apache/incubator-sedona/pull/511), including any join attempted directly from SQL.

The non-indexed join is very memory-inefficient right now (it loads all points/inner objects into memory at once), which is likely what caused the OOM error. The DynamicIndex join is the most memory-efficient, but you need to use the RDD API directly. Not sure when the next release will be, but until then big joins can't really be done from SQL.

Adam

--
Adam Binford
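The RDD-API route Adam recommends could look roughly like this. This is only a sketch based on the Sedona 1.0.x Python docs; `Adapter.toSpatialRdd`, the `"geometry"` column name, and the exact `SpatialJoinQueryFlat` signature are assumptions to verify against your installed version:

```python
# Sketch of a spatial join via the Sedona RDD API with spatial
# partitioning, instead of Sedona SQL. Module paths and signatures
# follow the 1.0.x Python docs; treat them as assumptions.
def run_rdd_join(spark, points_df, polygons_df):
    """Join two Sedona DataFrames (each with a "geometry" column)
    through the RDD API, partitioning both sides on a KDB-tree grid."""
    from sedona.core.enums import GridType
    from sedona.core.spatialOperator import JoinQuery
    from sedona.utils.adapter import Adapter

    # Convert the DataFrames to SpatialRDDs
    point_rdd = Adapter.toSpatialRdd(points_df, "geometry")
    polygon_rdd = Adapter.toSpatialRdd(polygons_df, "geometry")

    # Collect envelope/count statistics needed for partitioning
    point_rdd.analyze()
    polygon_rdd.analyze()

    # Partition both RDDs on the same KDB-tree grid so matching
    # geometries land in the same partitions
    point_rdd.spatialPartitioning(GridType.KDBTREE)
    polygon_rdd.spatialPartitioning(point_rdd.getPartitioner())

    # useIndex=True without a prebuilt index should take the on-the-fly
    # (dynamic) index path Adam mentions -- worth confirming against
    # your version's docs.
    return JoinQuery.SpatialJoinQueryFlat(
        point_rdd, polygon_rdd, True, True
    )
```

The function only touches Sedona inside its body, so it can be defined without a running cluster and called later with a live SparkSession.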
Re: Installing Apache Sedona and its dependencies
Hi Jia,

Thanks very much for your help! The setup worked, I managed to run Apache Sedona in a Jupyter Notebook :)

However, another problem has occurred. I have two cases:
1. small join: gdf1 contains POLYGONs, shape: (250 rows, 3 columns); gdf2 contains POINTs, shape: (2+ million rows, 5 columns)
2. big join: sdf1 contains POLYGONs, shape: (250 rows, 3 columns); sdf2 contains POINTs, shape: (56+ million rows, 5 columns)

The "small join" takes 3 seconds to run in GeoPandas, but 5 minutes to run in PySpark with Apache Sedona. All the settings are based on this notebook:
https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL.ipynb
with these 3 lines in addition:

spark.conf.set("sedona.global.index", "true")
spark.conf.set("sedona.global.indextype", "rtree")
spark.conf.set("sedona.join.gridtype", "kdbtree")

based on the settings in this file:
https://github.com/iag-geo/spark_testing/blob/master/apache_sedona/02_run_spatial_query.py

I also tried spatial partitioning, creating RDDs, then JoinQuery, and after that JoinQueryRaw as well, but it again took around 5 minutes.

I tried out the "big join" with Apache Sedona. After an hour and a half, I received the following warnings and errors:

WARN BlockManager: Block rdd_53_1 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 187)
java.lang.OutOfMemoryError: Java heap space
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 187,5,main]
java.lang.OutOfMemoryError: Java heap space
WARN TaskSetManager: Lost task 1.0 in stage 11.0 (TID 187, roberts-mbp, executor driver): java.lang.OutOfMemoryError: Java heap space
ERROR TaskSetManager: Task 1 in stage 11.0 failed 1 times; aborting job
ERROR Utils: Uncaught exception in thread Executor task launch worker for task 190
java.lang.NullPointerException

There must be some problem with my settings, but I cannot go forward without help from someone with more experience.

Do you have any recommendations on what to read or how to attempt the "big join"?

Have a nice evening,
Robert
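For reference, the SQL-side setup Robert describes can be wrapped up roughly as follows. The three conf keys come from the iag-geo script linked in the thread; the view names, the `geom` column, and the use of `ST_Contains` as the join predicate are hypothetical placeholders:

```python
# Sedona SQL join as described in the thread. sdf1/sdf2 and the "geom"
# column are hypothetical names; the three conf keys are taken from the
# referenced iag-geo script.
def run_sql_join(spark, sdf1, sdf2):
    """Polygon-contains-point join through Sedona SQL."""
    spark.conf.set("sedona.global.index", "true")
    spark.conf.set("sedona.global.indextype", "rtree")
    spark.conf.set("sedona.join.gridtype", "kdbtree")

    # Register the DataFrames so they can be joined in SQL
    sdf1.createOrReplaceTempView("polygons")
    sdf2.createOrReplaceTempView("points")

    # ST_Contains is a standard Sedona SQL predicate
    return spark.sql("""
        SELECT polygons.*, points.*
        FROM polygons, points
        WHERE ST_Contains(polygons.geom, points.geom)
    """)
```

As Adam notes above, this SQL path could not use the global index before the fix in PR 511, so on large inputs it falls back to the memory-hungry non-indexed join.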
Re: Installing Apache Sedona and its dependencies
Hi Robert,

The tutorial you found on our website is a step-by-step tutorial for Python Jupyter. In that tutorial, pipenv will install all dependencies from binder/Pipfile:
https://github.com/apache/incubator-sedona/tree/master/binder

If you run into any specific issues, you can post here and we can help you.

Thanks,
Jia
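Concretely, the binder-based setup Jia points to amounts to something like the following. This is a sketch: the repo layout (binder/Pipfile) matches the incubator-sedona master branch at the time, but verify the paths against the current repository.

```shell
# Sketch of the pipenv setup from the binder directory of the
# incubator-sedona repo; paths assume the master-branch layout.
git clone https://github.com/apache/incubator-sedona.git
cd incubator-sedona/binder

# pipenv reads binder/Pipfile and creates an isolated virtualenv
pip install pipenv
pipenv install

# Launch Jupyter inside that environment
pipenv run jupyter notebook
```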
Installing Apache Sedona and its dependencies
Hi all,

I am new to PySpark and programming. I would like to do a spatial join between two geographical datasets, one of which consists of 50+ million rows.

Is there anyone here who could explain to me, step by step, how to install Apache Sedona (GeoSpark) and its dependencies on a Mac? After the installation I would like to run it locally in a virtual environment, first in a Jupyter Notebook and then in a .py file.

On the official website I have found a quick start guide:
https://sedona.apache.org/download/overview/
and a Python Jupyter Notebook examples guide:
https://sedona.apache.org/tutorial/jupyter-notebook/
However, it is still not clear how to install it and make it run. Unfortunately, I didn't find any useful step-by-step guide with the help of Google or YouTube, and I feel stuck in an infinite loop of reading link after link, each explaining a different solution.

Thanks a lot in advance,
Robert
robertboz...@gmail.com