[ 
https://issues.apache.org/jira/browse/SPARK-38062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484806#comment-17484806
 ] 

Erik Krogen edited comment on SPARK-38062 at 1/31/22, 5:33 PM:
---------------------------------------------------------------

I think the issue I'm reporting is very similar to SPARK-37318, is it not? The 
root cause in both cases is that the code within 
{{FallbackStorage}}/{{BlockManagerDecommissioner}} makes an assumption that 
"remote" is an invalid DNS entry. In the case of SPARK-37318, the issue 
reported was that "remote" resolved to a valid, reachable IP. In the case of 
this JIRA, "remote" resolved to an IP that wasn't reachable, but in both cases, 
the core issue was that "remote" was resolvable by DNS -- the behavior after 
the resolution isn't too important IMO. The production code is basically broken 
as soon as "remote" is resolvable by DNS.
I'm also proposing essentially reverting SPARK-37318, because my new proposed 
fix resolves the issue at the source (within the production code), rather than 
working around it with a change to the test code to hide the broken nature of 
the production code.
So it seems to me that these two issues are very closely related. Though, 
admittedly the link to SPARK-35584 may have been misplaced.


was (Author: xkrogen):
I think the issue I'm reporting is very similar to SPARK-37318, is it not? The 
root cause in both cases is that the code within 
{{FallbackStorage}}/{{BlockManagerDecommissioner}} makes an assumption that 
"remote" is an invalid DNS entry. In the case of SPARK-37318, the issue 
reported was that "remote" resolved to a valid, reachable IP. In the case of 
this JIRA, "remote" resolved to an IP that wasn't reachable, but in both cases, 
the core issue was that "remote" was resolvable by DNS -- the behavior after 
the resolution isn't too important IMO. The code is basically broken as soon as 
"remote" is resolvable by DNS.
I'm also proposing essentially reverting SPARK-37318, because my new proposed 
fix resolves the issue at the source (within the production code), rather than 
working around it with a change to the test code to hide the broken nature of 
the production code.
So it seems to me that these two issues are very closely related. Though, 
admittedly the link to SPARK-35584 may have been misplaced.

> FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname
> ------------------------------------------------------------------------
>
>                 Key: SPARK-38062
>                 URL: https://issues.apache.org/jira/browse/SPARK-38062
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1
>            Reporter: Erik Krogen
>            Priority: Major
>
> {{FallbackStorage}} uses a placeholder block manager ID:
> {code:scala}
> private[spark] object FallbackStorage extends Logging {
>   /** We use one block manager id as a place holder. */
>   val FALLBACK_BLOCK_MANAGER_ID: BlockManagerId = BlockManagerId("fallback", 
> "remote", 7337)
> {code}
> That second argument is normally interpreted as a hostname, but is passed as 
> the string "remote" in this case.
> {{BlockManager}} will consider this placeholder as one of the peers in some 
> cases:
> {code:language=scala|title=BlockManager.scala}
>   private[storage] def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = {
>     peerFetchLock.synchronized {
>       ...
>       if (cachedPeers.isEmpty &&
>           
> conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
>         Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)
>       } else {
>         cachedPeers
>       }
>     }
>   }
> {code}
> {{BlockManagerDecommissioner.ShuffleMigrationRunnable}} will then attempt to 
> perform an upload to this placeholder ID:
> {code:scala}
>             try {
>               blocks.foreach { case (blockId, buffer) =>
>                 logDebug(s"Migrating sub-block ${blockId}")
>                 bm.blockTransferService.uploadBlockSync(
>                   peer.host,
>                   peer.port,
>                   peer.executorId,
>                   blockId,
>                   buffer,
>                   StorageLevel.DISK_ONLY,
>                   null) // class tag, we don't need for shuffle
>                 logDebug(s"Migrated sub-block $blockId")
>               }
>               logInfo(s"Migrated $shuffleBlockInfo to $peer")
>             } catch {
>               case e: IOException =>
>                 ...
>                 if 
> (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < 
> blocks.size) {
>                   logWarning(s"Skipping block $shuffleBlockInfo, block 
> deleted.")
>                 } else if (fallbackStorage.isDefined) {
>                   fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
>                 } else {
>                   logError(s"Error occurred during migrating 
> $shuffleBlockInfo", e)
>                   keepRunning = false
>                 }
> {code}
> Since "remote" is not expected to be a resolvable hostname, an 
> {{IOException}} occurs, and {{fallbackStorage}} is used. But, we shouldn't 
> try to resolve this. First off, it's completely unnecessary and strange to be 
> treating the placeholder ID as a resolvable hostname, relying on an exception 
> to realize that we need to use the {{fallbackStorage}}.
> To make matters worse, in some network environments, "remote" may be a 
> resolvable hostname, completely breaking this functionality. In the 
> particular environment that I use for running automated tests, there is a DNS 
> entry for "remote" which, when you attempt to connect to it, will hang for a 
> long period of time. This essentially hangs the executor decommission 
> process, and in the case of unit tests, breaks {{FallbackStorageSuite}} as it 
> exceeds its timeouts. I'm not sure, but it's possible this is related to 
> SPARK-35584 as well (if sometimes in the GA environment, it takes a long time 
> for the OS to decide that "remote" is not a valid hostname).
> We shouldn't attempt to treat this placeholder ID as a real hostname.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to