[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519600#comment-14519600 ]

Sean Owen commented on SPARK-7189:

I thought that was the point, but maybe I misunderstand: you have to err on the side of re-processing a file even if it doesn't look like it changed. Right?

History server will always reload the same file even when no log file is updated

Key: SPARK-7189
URL: https://issues.apache.org/jira/browse/SPARK-7189
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

History server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it periodically reloads the file(s) with the latest modification time even if nothing has changed. This is not necessary.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
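The reload rule described in the issue can be sketched roughly as follows. This is only an illustration in Python, not the history server's actual code: the function names, the dictionary-based directory listing, and the integer timestamps are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def select_logs_to_reload(listing: Dict[str, int], last_mtime: int) -> List[str]:
    """Return paths whose modification time is later than or equal to last_mtime.

    `listing` maps a (hypothetical) log path to its modification time.
    """
    return sorted(path for path, mtime in listing.items() if mtime >= last_mtime)


def poll(listing: Dict[str, int], last_mtime: int) -> Tuple[List[str], int]:
    """One polling round: pick files to reload and advance the remembered time."""
    selected = select_logs_to_reload(listing, last_mtime)
    new_last = max([last_mtime] + [listing[p] for p in selected])
    return selected, new_last


# Nothing changes between the two polls, yet the newest file is picked again.
listing = {"app-1": 100, "app-2": 200}
first, last = poll(listing, 0)
second, last = poll(listing, last)
```

Under these assumptions the second poll selects "app-2" again even though the listing is identical, which is the redundant reload the issue reports.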
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519761#comment-14519761 ]

Marcelo Vanzin commented on SPARK-7189:

bq. There will always be situations where several operations finish within a very short time

That's true. But if we want to go with the assumption that these files are only ever appended to, which is the case for the writer (EventLoggingListener), we can check {{(timestamp, file size)}} to detect modifications. Or even just file size, for that matter, and just use timestamp for ordering.
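A minimal sketch of the {{(timestamp, file size)}} check, assuming append-only logs. The names LogStat, needs_reload, and newest_first are hypothetical and exist only for this example; they are not part of Spark.

```python
from typing import Dict, List, NamedTuple, Optional


class LogStat(NamedTuple):
    mtime: int  # modification time, e.g. in milliseconds
    size: int   # file length in bytes


def needs_reload(previous: Optional[LogStat], current: LogStat) -> bool:
    """A file needs parsing if it is new or its (mtime, size) pair changed."""
    if previous is None:
        return True
    return current != previous


def newest_first(stats: Dict[str, LogStat]) -> List[str]:
    """Use the timestamp only for ordering, newest log first."""
    return sorted(stats, key=lambda path: stats[path].mtime, reverse=True)


unchanged = needs_reload(LogStat(100, 10), LogStat(100, 10))
appended = needs_reload(LogStat(100, 10), LogStat(100, 25))
order = newest_first({"app-1": LogStat(100, 10), "app-2": LogStat(200, 5)})
```

For the size-only variant mentioned in the comment, the comparison would look at `current.size != previous.size` alone. Either way the check relies on the append-only assumption: an edit that leaves the size unchanged within the same timestamp would go unnoticed.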
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520421#comment-14520421 ]

Marcelo Vanzin commented on SPARK-7189:

Correct, {{>=}} is still required. The trick here is to find an alternative way to detect whether logs that match the last timestamp really need to be parsed.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520428#comment-14520428 ]

Marcelo Vanzin commented on SPARK-7189:

Just redundant work. Not a big deal for small-to-medium sized apps, I guess, but if the last parsed app has a really large event log, then it may slow down the HS polling a bit.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520417#comment-14520417 ]

Sean Owen commented on SPARK-7189:

So is the outcome still that the check needs to be >=? To my understanding that is, at worst, a little overly conservative, and at best, required.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520426#comment-14520426 ]

Sean Owen commented on SPARK-7189:

What's the downside of parsing them anyway: just a little redundant work, or is it noticeably slower? If it's just a little slower but definitely correct, let's leave it.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520694#comment-14520694 ]

Zhang, Liye commented on SPARK-7189:

*>=* is always needed if we use the timestamp as one of the criteria.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520691#comment-14520691 ]

Zhang, Liye commented on SPARK-7189:

Yes, we can use file size to monitor file changes; that works for *write* operations. And if we introduce file size, we'll need a hash map to maintain the information, and this hash map can also be used to check whether a file has been renamed.
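One way the hash map idea could look, as a rough sketch only. SizeTracker and its scan method are hypothetical names, the listing is modelled as a plain dictionary from path to size, and the ".inprogress" suffix is used purely as an illustrative rename.

```python
from typing import Dict, List, Tuple


class SizeTracker:
    """Remembers the last seen size of each log path across polls."""

    def __init__(self) -> None:
        self._sizes: Dict[str, int] = {}

    def scan(self, listing: Dict[str, int]) -> Tuple[List[str], List[str]]:
        """Return (paths to reload, paths that disappeared since the last scan)."""
        changed = sorted(
            path for path, size in listing.items() if self._sizes.get(path) != size
        )
        gone = sorted(path for path in self._sizes if path not in listing)
        self._sizes = dict(listing)
        return changed, gone


tracker = SizeTracker()
first = tracker.scan({"app-1.inprogress": 10})
same = tracker.scan({"app-1.inprogress": 10})
grown = tracker.scan({"app-1.inprogress": 30})
renamed = tracker.scan({"app-1": 30})
```

In this sketch a rename shows up as one path disappearing and a new path appearing in the same scan, so the renamed file is reloaded under its new name while an unchanged file is skipped.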
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520697#comment-14520697 ]

Zhang, Liye commented on SPARK-7189:

Just redundant work, and we can leave it. I marked this issue as *minor*.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516852#comment-14516852 ]

Sean Owen commented on SPARK-7189:

Hm, I'd swear we had discussed this already and there was a good reason for it from [~vanzin], but I can't find the PR or JIRA now. I remember a PR changing the >= to > and the result was that it was on purpose. Not sure if this was a helpful comment but I do remember something like this.
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517136#comment-14517136 ]

Zhang, Liye commented on SPARK-7189:

Yes, I think the current solution is a tradeoff; we cannot simply change the >= to >, which would cause other problems. Anyway, I haven't thought of any other solution yet; maybe others have some novel/nice ideas.
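The thread does not spell out what the other problems are. One plausible scenario, assumed here purely for illustration, is a second log being written with the same timestamp as the newest one already seen, after a poll has finished. The helper below is hypothetical and not taken from Spark.

```python
from typing import Dict, List


def select(listing: Dict[str, int], last_mtime: int, inclusive: bool) -> List[str]:
    """Pick logs to reload using either >= (inclusive) or > (strict)."""
    if inclusive:
        return sorted(p for p, m in listing.items() if m >= last_mtime)
    return sorted(p for p, m in listing.items() if m > last_mtime)


# The first poll sees only app-1, last modified at t=100.
before = {"app-1": 100}
last = max(before.values())

# app-2 is then written within the same timestamp tick, after that poll.
after = {"app-1": 100, "app-2": 100}

strict_result = select(after, last, inclusive=False)
inclusive_result = select(after, last, inclusive=True)
```

With the strict comparison, app-2 is never picked up in this scenario; with the inclusive one it is, at the cost of re-reading app-1.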
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517289#comment-14517289 ]

Marcelo Vanzin commented on SPARK-7189:

Changing the {{>=}} causes problems. If you want to fix this, you need to keep track of the log files that were loaded at the last timestamp, and ignore them if they still have that same timestamp when you re-list the log directory.
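The bookkeeping suggested here might be sketched as below. This is an assumption-laden illustration rather than a proposed patch: LastTimestampTracker, its fields, and the dictionary listing are all made up for the example.

```python
from typing import Dict, List, Set


class LastTimestampTracker:
    """Keeps the >= comparison but skips files already loaded at the last timestamp."""

    def __init__(self) -> None:
        self.last_mtime = -1
        self.loaded_at_last: Set[str] = set()

    def select(self, listing: Dict[str, int]) -> List[str]:
        """Return the paths to parse for this listing and update the bookkeeping."""
        selected = sorted(
            path
            for path, mtime in listing.items()
            if mtime > self.last_mtime
            or (mtime == self.last_mtime and path not in self.loaded_at_last)
        )
        newest = max(listing.values(), default=self.last_mtime)
        if newest > self.last_mtime:
            # A newer timestamp was seen, so the remembered set starts over.
            self.last_mtime = newest
            self.loaded_at_last = set()
        self.loaded_at_last |= {p for p in selected if listing[p] == self.last_mtime}
        return selected


tracker = LastTimestampTracker()
poll1 = tracker.select({"app-1": 100, "app-2": 200})
poll2 = tracker.select({"app-1": 100, "app-2": 200})
poll3 = tracker.select({"app-1": 100, "app-2": 200, "app-3": 200})
poll4 = tracker.select({"app-1": 100, "app-2": 250, "app-3": 200})
```

A file that shares the last timestamp but was not seen before (app-3 in the third poll) is still loaded, while unchanged files are skipped. A file modified again without its timestamp changing would be missed by this sketch.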
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518721#comment-14518721 ]

Zhang, Liye commented on SPARK-7189:

Hi [~vanzin], I think using the timestamp is not that precise. This method is very similar to the approach that uses modification time. There will always be situations where several operations finish within a very short time (say less than 1 millisecond or even shorter). So timestamp and modification time cannot be trusted. The goal is to detect status changes of the files, including content changes (write operations) and permission changes (rename operations). `Inotify` can detect such changes, but it's not available in HDFS before version 2.7. One way to detect a change is to set a flag after each operation and reset the flag after reloading the file. But this would make the code really ugly, so it is a bad option.