Kristin Cowalcijk created SEDONA-325:
----------------------------------------

             Summary: RS_FromGeoTiff is leaking file descriptors
                 Key: SEDONA-325
                 URL: https://issues.apache.org/jira/browse/SEDONA-325
             Project: Apache Sedona
          Issue Type: Bug
            Reporter: Kristin Cowalcijk


I tried loading a raster dataset composed of 20,000+ GeoTIFF images in a local 
Spark session using the following code:
{code:python}
df_binary = spark.read.format("binaryFile") \
    .option("pathGlobFilter", "*.tif") \
    .option("recursiveFileLookup", "true") \
    .load(DATA_ROOT_PATH + '/raster/EuroSAT_MS')
df_geotiff = df_binary \
    .withColumn("rast", expr("RS_FromGeoTiff(content)")) \
    .withColumn("name", expr("reverse(split(path, '/'))[0]")) \
    .select("name", "length", "rast")
df_geotiff.where("name LIKE 'Forest_%.tif'") \
    .selectExpr("name", "RS_BandAsArray(rast, 3) as band") \
    .orderBy("name") \
    .show()
{code}
The Spark job failed with the following error:
{code:java}
Py4JJavaError: An error occurred while calling o70.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 782 in stage 5.0 failed 1 times, most recent failure: Lost task 782.0 in stage 5.0 (TID 786) (kontinuation executor driver): java.io.FileNotFoundException: /home/kontinuation/documents/wherobots/notebooks/data/raster/EuroSAT_MS/Forest/Forest_2298.tif (Too many open files)

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.

        at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:212)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:270)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
        at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:664)
        at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
        at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}
The error indicates that the Spark job has too many files open. Running {{lsof}} to 
inspect the open file descriptors shows that most of them belong to temporary files 
prefixed with {{imageio}}:
{code:java}
java    3843951 kontinuation 1006u      REG              252,1   107244   1204728 /tmp/imageio3709666550975207536.tmp
java    3843951 kontinuation 1007u      REG              252,1   107244   1204729 /tmp/imageio7503001112441146978.tmp
java    3843951 kontinuation 1008u      REG              252,1   107244   1204730 /tmp/imageio1035759556272836613.tmp
java    3843951 kontinuation 1009u      REG              252,1   107244   1204731 /tmp/imageio451679980601844202.tmp
java    3843951 kontinuation 1010u      REG              252,1   107244   1204732 /tmp/imageio2111699718021158223.tmp
java    3843951 kontinuation 1011u      REG              252,1   107244   1204733 /tmp/imageio8919853818666809481.tmp
java    3843951 kontinuation 1012u      REG              252,1   107244   1204734 /tmp/imageio6956257348066899899.tmp
java    3843951 kontinuation 1013u      REG              252,1   107244   1204735 /tmp/imageio3045964803135174263.tmp
java    3843951 kontinuation 1014u      REG              252,1   107244   1204736 /tmp/imageio8138794596381465904.tmp
java    3843951 kontinuation 1015u      REG              252,1   107244   1204737 /tmp/imageio6991404647914889791.tmp
java    3843951 kontinuation 1016u      REG              252,1   107244   1204738 /tmp/imageio3098287432603901322.tmp
java    3843951 kontinuation 1017u      REG              252,1   107244   1204739 /tmp/imageio599912999779858439.tmp
java    3843951 kontinuation 1018u      REG              252,1   107244   1204740 /tmp/imageio8841430021636925470.tmp
java    3843951 kontinuation 1019u      REG              252,1   107244   1204741 /tmp/imageio8981079233288315985.tmp
java    3843951 kontinuation 1020u      REG              252,1   107244   1204742 /tmp/imageio3673591736487787612.tmp
java    3843951 kontinuation 1021u      REG              252,1   107244   1204743 /tmp/imageio8805168727392534534.tmp
java    3843951 kontinuation 1022u      REG              252,1   107244   1204744 /tmp/imageio441228595459753924.tmp
java    3843951 kontinuation 1023u      REG              252,1   107244   1204753 /tmp/imageio6548224310964783498.tmp
{code}
I believe this is caused by a bug in GeoTools: when reading a GeoTIFF from an input 
stream, it initializes a file-backed cache but does not close that cache when the 
grid coverage object is disposed. The temporary files named {{imageioXXXX}} were 
created by this file-backed cache. Once the number of rasters read exceeds the 
maximum number of open files, the job fails and the Spark session no longer responds 
properly to subsequent queries.
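For background on where the {{imageioXXXX.tmp}} files come from, here is a minimal, 
self-contained sketch of the underlying JDK {{javax.imageio}} behaviour (this is not 
Sedona's or GeoTools' actual code): when a plain {{InputStream}} is wrapped with 
ImageIO's default file-backed cache enabled, one temp file is created per stream and 
its descriptor stays open until the stream itself is closed, which matches the 
{{lsof}} output above.
{code:java}
import java.io.ByteArrayInputStream;

import javax.imageio.ImageIO;
import javax.imageio.stream.ImageInputStream;

public class ImageioCacheSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the GeoTIFF bytes held in the binaryFile 'content' column.
        byte[] content = new byte[1024];

        // With ImageIO's default setting (useCache = true), wrapping a plain InputStream
        // yields a FileCacheImageInputStream, which backs the stream with a temporary
        // /tmp/imageioXXXX.tmp file and holds its descriptor until close() is called.
        ImageInputStream cached = ImageIO.createImageInputStream(new ByteArrayInputStream(content));
        System.out.println(cached.getClass().getSimpleName()); // expected: FileCacheImageInputStream
        cached.close(); // skipping this close leaks one descriptor (and one temp file) per raster

        // Disabling the disk cache makes ImageIO fall back to an in-memory cache,
        // so no temporary file and no extra descriptor are involved.
        ImageIO.setUseCache(false);
        ImageInputStream inMemory = ImageIO.createImageInputStream(new ByteArrayInputStream(content));
        System.out.println(inMemory.getClass().getSimpleName()); // expected: MemoryCacheImageInputStream
        inMemory.close();
    }
}
{code}
Whatever shape the eventual fix in {{RS_FromGeoTiff}} takes, either closing the 
underlying {{ImageInputStream}} when the grid coverage is disposed or avoiding the 
file-backed cache altogether should stop the descriptor growth.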



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
