Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/11628#discussion_r55752244

--- Diff: core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala ---
@@ -157,8 +167,16 @@ private[spark] class PipedRDD[T: ClassTag](
     val lines = Source.fromInputStream(proc.getInputStream).getLines()

     new Iterator[String] {
       def next(): String = lines.next()
+
+      private def propagateChildThreadException(): Unit = {
--- End diff --

@srowen : Thanks for your comments.

>> hasNext is supposed to be idempotent

Failures / exceptions are a special case. What if you are iterating over something read over the network or from disk and you haven't received the entire thing? In those cases it is fine for hasNext to fail after hitting a problem like a lost connection. The only way to avoid that would be to read everything off the source before starting iteration, which is not practical since the data might be large.

>> You'd never see an exception if you never called it.

That makes sense. Say the next stage is doing LIMIT 10 and it has already read 10 rows from pipe(); it is fine not to fail the entire job because of an exception that happens while pipe emits its millionth row.

>> Why not check the result of the process exit before returning the iterator?

The process may still be running when we return the iterator. The current model is to return the iterator while the pipe process is still processing its inputs. I think that makes sense: for an input RDD with a trillion records, waiting for the pipe to finish processing all of the input before returning the iterator would take a long time.

>> This is already effectively blocking on the return from the process anyway.

Can you please be specific about which line you think is blocking? I guess you meant line 167 in the PR, but I don't see how that is blocking.
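To make the pattern under discussion concrete, here is a minimal, hedged sketch (class and method names are hypothetical, not the PR's actual code) of an iterator that lazily rethrows an exception recorded by a background thread. Because the check happens inside next()/hasNext(), a consumer that stops early (e.g. after LIMIT 10) never observes a failure that would only occur deeper into the stream, matching the behavior argued for above:

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch: wrap an iterator so that a failure recorded by a
// child thread (e.g. one feeding stdin of a piped process) is surfaced
// lazily, from next()/hasNext(), rather than checked up front.
class ChildErrorIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  // The child thread stores its failure here instead of dying silently.
  private val childException = new AtomicReference[Throwable](null)

  // Called by the child thread when it hits an error.
  def reportChildException(t: Throwable): Unit = childException.set(t)

  private def propagateChildException(): Unit = {
    val t = childException.get()
    if (t != null) throw new RuntimeException("Child thread failed", t)
  }

  override def hasNext: Boolean = {
    val result = underlying.hasNext
    // Only once the stream is drained do we surface a pending child failure;
    // an early-terminating consumer may never see it.
    if (!result) propagateChildException()
    result
  }

  override def next(): T = {
    propagateChildException()
    underlying.next()
  }
}
```

Usage-wise, a caller that reads only a prefix of the elements succeeds even if the child thread later reports an error, while a caller that drains the iterator sees the failure rethrown.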