Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74046962
  
    I've updated the implementation.
    
    Two background threads are used to load the log files:
    1. one thread to check the file list
    2. another to fetch and parse the log files
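
The two-thread arrangement described above could be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: the class `TwoThreadLogLoader`, its methods, and the `listFiles` / `parseFile` callbacks are hypothetical stand-ins for the real directory listing and log parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: all names are illustrative, not taken from the pull request.
class TwoThreadLogLoader {
    private final Supplier<List<String>> listFiles;   // assumed stand-in for listing the log directory
    private final Function<String, String> parseFile; // assumed stand-in for fetch + parse of one file
    private final BlockingQueue<String> unparsed = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final List<String> parsed = new CopyOnWriteArrayList<>();

    TwoThreadLogLoader(Supplier<List<String>> listFiles, Function<String, String> parseFile) {
        this.listFiles = listFiles;
        this.parseFile = parseFile;
    }

    // One pass of thread 1: enqueue every file that has not been seen before.
    void checkOnce() {
        for (String path : listFiles.get()) {
            if (seen.add(path)) {
                unparsed.add(path);
            }
        }
    }

    // One pass of thread 2: parse everything that is currently queued.
    void parseOnce() {
        String path;
        while ((path = unparsed.poll()) != null) {
            parsed.add(parseFile.apply(path));
        }
    }

    // Starts the two daemon threads: one checks the file list, one parses.
    void start(long checkIntervalMillis) {
        Thread checker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    checkOnce();
                    Thread.sleep(checkIntervalMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-checker");
        Thread parser = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // take() blocks until the checker thread enqueues a file.
                    parsed.add(parseFile.apply(unparsed.take()));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-parser");
        checker.setDaemon(true);
        parser.setDaemon(true);
        checker.start();
        parser.start();
    }

    int pending() {
        return unparsed.size();
    }

    // Copy handed to readers such as a UI thread.
    List<String> parsedSnapshot() {
        return new ArrayList<>(parsed);
    }
}
```

Because only one thread ever consumes the queue, files are parsed in the order they were enqueued and the parsed list has a single writer.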
    
    There may be race conditions if a thread pool is used to fetch 
and parse the log files.
    The following points must be taken care of:
    1. The threads in the pool share a common unparsed file list, which is 
produced by another thread
    2. The threads in the pool update a common parsed file list
    3. The unparsed file list is sorted by file update time
    4. The parsed file list is sorted by application finish time
    5. The UI thread can read both the unparsed file list and the parsed file 
list at the same time
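
To make the list above concrete, here is one way a thread-pool variant might guard the two shared lists. It is only a sketch under stated assumptions: `SharedLogLists` and its methods are hypothetical, both lists are assumed to be ordered newest first (the comment does not state the direction), and a single lock is just one of several possible designs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: class and method names are illustrative, not from the pull request.
class SharedLogLists {
    // A file path plus the timestamp its list is sorted by.
    static final class Entry {
        final String path;
        final long time;
        Entry(String path, long time) { this.path = path; this.time = time; }
    }

    // Consistent copy of both lists for the UI thread.
    static final class Snapshot {
        final List<String> unparsedPaths;
        final List<String> parsedPaths;
        Snapshot(List<String> unparsedPaths, List<String> parsedPaths) {
            this.unparsedPaths = unparsedPaths;
            this.parsedPaths = parsedPaths;
        }
    }

    // Assumption: newest first. Ties are broken by path so distinct files never collide.
    private static final Comparator<Entry> NEWEST_FIRST = (a, b) -> {
        int byTime = Long.compare(b.time, a.time);
        return byTime != 0 ? byTime : a.path.compareTo(b.path);
    };

    private final Object lock = new Object();
    private final TreeSet<Entry> unparsed = new TreeSet<>(NEWEST_FIRST); // sorted by file update time
    private final TreeSet<Entry> parsed = new TreeSet<>(NEWEST_FIRST);   // sorted by application finish time

    // Called by the thread that checks the file list (point 1).
    void addUnparsed(String path, long updateTime) {
        synchronized (lock) { unparsed.add(new Entry(path, updateTime)); }
    }

    // Called by pool workers: removal under the lock means no two workers claim the same file.
    // Returns null when nothing is waiting.
    Entry claimNext() {
        synchronized (lock) { return unparsed.pollFirst(); }
    }

    // Called by pool workers when a file has been parsed (point 2).
    void addParsed(String path, long finishTime) {
        synchronized (lock) { parsed.add(new Entry(path, finishTime)); }
    }

    // Called by the UI thread (point 5): both lists are copied under one lock.
    Snapshot snapshot() {
        synchronized (lock) {
            return new Snapshot(paths(unparsed), paths(parsed));
        }
    }

    private static List<String> paths(TreeSet<Entry> entries) {
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            out.add(e.path);
        }
        return out;
    }
}
```

Even in this sketch, a file that has been claimed but not yet parsed appears in neither list, so the UI can briefly miss it. That is the kind of extra bookkeeping the two-thread version avoids.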
    
    Other reasons why I chose the two-thread implementation: 
    1. If a thread pool is used, the network will become the next bottleneck.
    2. It's OK for users, at least for me, if the missing meta information 
finishes loading within 3 hours. They can at least visit the job detail webpage 
in the meantime.

