Github user tejasapatil commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11628#discussion_r55752244
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
    @@ -157,8 +167,16 @@ private[spark] class PipedRDD[T: ClassTag](
         val lines = Source.fromInputStream(proc.getInputStream).getLines()
         new Iterator[String] {
           def next(): String = lines.next()
    +
    +      private def propagateChildThreadException(): Unit = {
    --- End diff --
    
    @srowen : Thanks for your comments.
    
    >> hasNext is supposed to be idempotent
    
    Failures / exception is a special case. What if you are iterating over 
something read over the network or disk and you haven't received the entire 
thing ? In those cases its fine for hasNext to fail after hitting problems like 
connection lost. The only way to avoid it would be to read everything off the 
source and only then start iteration which is not practical as the data might 
be large.
    
    >> You'd never see an exception if you never called it.
    
    It makes sense. Say if the next stage is doing LIMIT 10, and it could 
already read 10 rows from pipe(), its fine to not fail entire thing due to 
exception which could happen while pipe emits 1 million'th row.
    
    >> Why not check the result of the process exit before returning the 
iterator?
    
    The process may be still running while we return the iterator. The current 
model is to return iterator while the pipe process is processing the inputs. I 
think it makes sense to do so because for an input RDD having trillion records, 
waiting for the pipe to process all of the input and then return the iterator 
will be lot of time.
    
    >> This is already effectively blocking on the return from the process 
anyway.
    
    Can you please be specific which line you think is blocking ? I guess you 
meant line 167 in the PR but I don't see it to be blocking


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to