[ https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wang, Gang updated SPARK-29189:
-------------------------------
    Description: 
In our PROD env we run a pure Spark cluster, where computation is separated 
from the storage layer; I think this setup is pretty common. In such a 
deployment, data locality is never achievable. 
 The Spark scheduler has some configurations to reduce the waiting time for 
data locality (e.g. "spark.locality.wait"). The problem is that, in the file 
listing phase, the location information of every file, including all the 
blocks inside each file, is still fetched from the distributed file system. 
In a PROD environment a table can be so huge that fetching all this location 
information alone takes tens of seconds.
 To improve this scenario, Spark should provide an option to ignore data 
locality entirely: all we need in the file listing phase are the file 
locations, without any block location information.
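
To make the proposal concrete, here is a minimal sketch of the behaviour being requested. It is not Spark code: ToyFileSystem, FileStatus, list_files and the ignore_block_locations flag are all hypothetical names made up for illustration, and the in-memory dictionary stands in for a real distributed file system. (On HDFS, the Hadoop FileSystem API draws a similar line between listStatus and listLocatedStatus, but nothing here claims that is how the option would be implemented.)

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileStatus:
    path: str
    length: int
    # None means block locations were never requested for this file.
    block_hosts: Optional[List[List[str]]] = None


class ToyFileSystem:
    """Hypothetical stand-in for a distributed file system client."""

    def __init__(self, files: Dict[str, List[List[str]]], block_size: int = 128):
        self._files = files          # path -> one host list per block
        self._block_size = block_size
        self.block_lookups = 0       # how many block locations were fetched

    def list_files(self, ignore_block_locations: bool) -> List[FileStatus]:
        statuses = []
        for path, blocks in sorted(self._files.items()):
            length = len(blocks) * self._block_size
            if ignore_block_locations:
                # Only the file-level entry is returned; no per-block work.
                statuses.append(FileStatus(path, length))
            else:
                # Every block of every file costs one location lookup.
                self.block_lookups += len(blocks)
                statuses.append(
                    FileStatus(path, length, [list(hosts) for hosts in blocks])
                )
        return statuses


fs = ToyFileSystem({"/t/part-0": [["h1"], ["h2"]], "/t/part-1": [["h3"]]})
with_locations = fs.list_files(ignore_block_locations=False)
lookups_after_full_listing = fs.block_lookups
without_locations = fs.list_files(ignore_block_locations=True)
lookups_after_cheap_listing = fs.block_lookups
```

In this toy model the second listing returns the same paths and lengths but adds no block lookups, which is the saving the option is meant to provide when locality cannot be used anyway.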

 

We ran a benchmark in our PROD env; after ignoring the block locations, file 
listing got considerably faster.
||Table Size||Total File Number||Total Block Number||List File Duration (With Block Location)||List File Duration (Without Block Location)||
|22.6 T|30000|120000|16.841s|1.730s|
|28.8 T|42001|148964|10.099s|2.858s|
|3.4 T|20000|20000|5.833s|4.881s|
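
For readers who want the ratios rather than the raw durations, the following short calculation derives the speedup and the blocks-per-file ratio for each row. The numbers are copied from the table above; the variable and function names are only illustrative.

```python
# (table size, files, blocks, seconds with block locations, seconds without)
rows = [
    ("22.6 T", 30000, 120000, 16.841, 1.730),
    ("28.8 T", 42001, 148964, 10.099, 2.858),
    ("3.4 T", 20000, 20000, 5.833, 4.881),
]


def speedup(with_s: float, without_s: float) -> float:
    """How many times faster the listing is without block locations."""
    return with_s / without_s


speedups = {size: round(speedup(w, wo), 2) for size, _, _, w, wo in rows}
blocks_per_file = {size: round(b / f, 2) for size, f, b, _, _ in rows}

for size in speedups:
    print(f"{size}: {speedups[size]}x faster, "
          f"{blocks_per_file[size]} blocks per file")
```

This gives roughly 9.7x, 3.5x and 1.2x for the three tables. The smallest gain is on the table that has one block per file; that is consistent with the cost coming from per-block lookups, although three data points are not enough to establish it.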

 



> Add an option to ignore block locations when listing file
> ---------------------------------------------------------
>
>                 Key: SPARK-29189
>                 URL: https://issues.apache.org/jira/browse/SPARK-29189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wang, Gang
>            Priority: Major
>
> In our PROD env we run a pure Spark cluster, where computation is separated 
> from the storage layer; I think this setup is pretty common. In such a 
> deployment, data locality is never achievable. 
>  The Spark scheduler has some configurations to reduce the waiting time for 
> data locality (e.g. "spark.locality.wait"). The problem is that, in the file 
> listing phase, the location information of every file, including all the 
> blocks inside each file, is still fetched from the distributed file system. 
> In a PROD environment a table can be so huge that fetching all this location 
> information alone takes tens of seconds.
>  To improve this scenario, Spark should provide an option to ignore data 
> locality entirely: all we need in the file listing phase are the file 
> locations, without any block location information.
>  
> We ran a benchmark in our PROD env; after ignoring the block locations, file 
> listing got considerably faster.
> ||Table Size||Total File Number||Total Block Number||List File Duration (With Block Location)||List File Duration (Without Block Location)||
> |22.6 T|30000|120000|16.841s|1.730s|
> |28.8 T|42001|148964|10.099s|2.858s|
> |3.4 T|20000|20000|5.833s|4.881s|
>  


