[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519600#comment-14519600
 ] 

Sean Owen commented on SPARK-7189:
--

I thought that was the point, but maybe I misunderstand: you have to err on the 
side of re-processing a file even if it doesn't look like it changed. Right?

 History server will always reload the same file even when no log file is 
 updated
 

 Key: SPARK-7189
 URL: https://issues.apache.org/jira/browse/SPARK-7189
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

 History server checks every log file by its modification time. It will 
 reload the file if the file's modification time is later than or equal to the 
 latest modification time it remembered. So it will periodically reload the 
 file(s) with the latest modification time even if nothing has changed. This 
 is not necessary.
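
A minimal sketch of the behaviour described in this issue, written in Python as a simplified model rather than the actual history server code. The names `LogFile`, `files_to_reload`, and `poll`, and the sample paths and timestamps, are all hypothetical. It shows how a later-than-or-equal comparison keeps selecting the newest file on every poll even when nothing on disk has changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogFile:
    """Hypothetical stand-in for one event log entry in a directory listing."""
    path: str
    mtime_ms: int  # modification time in milliseconds


def files_to_reload(listing, last_seen_mtime_ms):
    """Return files whose modification time is later than or equal to the
    latest modification time remembered from the previous poll."""
    return [f for f in listing if f.mtime_ms >= last_seen_mtime_ms]


def poll(listing, last_seen_mtime_ms):
    """One polling cycle: pick files to reload and advance the remembered time."""
    reloaded = files_to_reload(listing, last_seen_mtime_ms)
    newest = max((f.mtime_ms for f in reloaded), default=last_seen_mtime_ms)
    return reloaded, newest


listing = [LogFile("app-1", 1000), LogFile("app-2", 2000)]

first, last_seen = poll(listing, 0)           # both files are new
second, last_seen = poll(listing, last_seen)  # nothing changed on disk
print([f.path for f in second])               # app-2 is selected again
```

In this model the second poll still returns `app-2`, which is the redundant reload the reporter describes.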



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519761#comment-14519761
 ] 

Marcelo Vanzin commented on SPARK-7189:
---

bq. There will always be situations where several operations finish within a 
very short time

That's true. But if we want to go with the assumption that these files are only 
ever appended to, which is the case for the writer (EventLoggingListener), we 
can check {{(timestamp, file size)}} to detect modifications. Or even just 
file size, for that matter, and just use timestamp for ordering.
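
The {{(timestamp, file size)}} check could look roughly like the following Python sketch. It is an illustration under the append-only assumption stated above, not code from Spark; the function name `needs_reload` and the sample values are made up for the example.

```python
def needs_reload(known, mtime_ms, size_bytes):
    """Decide whether a log must be parsed again.

    `known` is the (mtime_ms, size_bytes) pair recorded at the last parse,
    or None if the file has never been parsed.
    """
    if known is None:
        return True
    # For an append-only file, appending at least one byte makes the file
    # larger, so a changed size signals new content even when the timestamp
    # has not advanced.
    return (mtime_ms, size_bytes) != known


print(needs_reload(None, 2000, 512))          # never parsed
print(needs_reload((2000, 512), 2000, 512))   # unchanged
print(needs_reload((2000, 512), 2000, 640))   # appended within the same millisecond
```

Only the second call reports that no reload is needed. The check would not catch a rewrite that leaves both size and timestamp unchanged, which is why it relies on the append-only assumption.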




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520421#comment-14520421
 ] 

Marcelo Vanzin commented on SPARK-7189:
---

Correct, {{>=}} is still required. The trick here is to find an alternative way 
to detect whether logs that match the last timestamp really need to be parsed.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520428#comment-14520428
 ] 

Marcelo Vanzin commented on SPARK-7189:
---

Just redundant work. Not a big deal for small-to-medium sized apps, I guess, 
but if the last parsed app has a really large event log, then it may slow down 
the HS polling a bit.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520417#comment-14520417
 ] 

Sean Owen commented on SPARK-7189:
--

So is the outcome still that the check needs to be >=? To my understanding 
that is, at worst, a little overly conservative, and at best, required.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520426#comment-14520426
 ] 

Sean Owen commented on SPARK-7189:
--

What's the downside of parsing them anyway? Is it just a little redundant work, 
or is it noticeably slower? If it's just a little slower but definitely 
correct, let's leave it.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520694#comment-14520694
 ] 

Zhang, Liye commented on SPARK-7189:


*>=* is always needed if we use the timestamp as one of the criteria. 




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520691#comment-14520691
 ] 

Zhang, Liye commented on SPARK-7189:


Yes, we can use file size to monitor file changes; that works for *write* 
operations. And if we introduce file size, we'll need a hash map to maintain 
the information, and this hash map can also be used to check whether a file 
has been renamed. 
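
One way to picture the hash map idea is the Python sketch below. It is a simplified model with hypothetical names (`scan`, `seen`) and made-up file names, not the history server's implementation. The map goes from path to last observed size; a size change marks a write, and a rename shows up as one path disappearing while another appears.

```python
def scan(listing, seen):
    """Compare a directory listing against the sizes recorded last time.

    `listing` and `seen` both map path -> size in bytes. Returns the paths
    to parse, the paths that disappeared, and the updated map.
    """
    to_parse = [path for path, size in listing.items() if seen.get(path) != size]
    gone = [path for path in seen if path not in listing]
    return to_parse, gone, dict(listing)


seen = {}
polls = [
    {"app-1.inprogress": 100},   # new file
    {"app-1.inprogress": 100},   # unchanged
    {"app-1.inprogress": 250},   # appended to
    {"app-1": 250},              # renamed, same size
]
results = []
for listing in polls:
    to_parse, gone, seen = scan(listing, seen)
    results.append((to_parse, gone))
    print(to_parse, gone)
```

The unchanged poll selects nothing. This sketch treats the renamed file as a new path to parse and does not try to recognise it as the same log under a different name.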




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-29 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520697#comment-14520697
 ] 

Zhang, Liye commented on SPARK-7189:


Just redundant work, and we can leave it. I marked this issue as *minor*.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516852#comment-14516852
 ] 

Sean Owen commented on SPARK-7189:
--

Hm, I'd swear we had discussed this already and there was a good reason for it 
from [~vanzin], but I can't find the PR or JIRA now. I remember a PR changing 
the >= to > and the result was that it was on purpose. Not sure if this was a 
helpful comment but I do remember something like this.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-28 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517136#comment-14517136
 ] 

Zhang, Liye commented on SPARK-7189:


Yes, I think the current solution is a tradeoff; we cannot simply change the 
>= to >, which would cause other problems. Anyway, I haven't thought of any 
other solution yet; maybe others have some novel/nice ideas.




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-28 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517289#comment-14517289
 ] 

Marcelo Vanzin commented on SPARK-7189:
---

Changing the {{>=}} causes problems. If you want to fix this, you need to keep 
track of the log files that were loaded at the last timestamp, and ignore them 
if they still have that same timestamp when you re-list the log directory.
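
The bookkeeping suggested here might be sketched as follows. This is Python used as a simplified model, not the actual history server code; `poll_once`, `loaded_at_last`, and the sample paths and timestamps are hypothetical. Files at the last timestamp are still considered, but those already loaded at that timestamp are skipped.

```python
def poll_once(listing, last_mtime_ms, loaded_at_last):
    """One polling cycle.

    `listing` maps path -> modification time in ms. `loaded_at_last` is the
    set of paths already loaded whose modification time equals `last_mtime_ms`.
    Returns (paths to load, new last_mtime_ms, new loaded_at_last).
    """
    to_load = [
        path for path, mtime in listing.items()
        if mtime > last_mtime_ms
        or (mtime == last_mtime_ms and path not in loaded_at_last)
    ]
    newest = max([last_mtime_ms, *listing.values()])
    # Start a fresh set when the newest timestamp advances; otherwise keep
    # remembering what was already loaded at that timestamp.
    at_newest = set() if newest > last_mtime_ms else set(loaded_at_last)
    at_newest |= {p for p in to_load if listing[p] == newest}
    return to_load, newest, at_newest


listing = {"a": 1000, "b": 2000}
load1, last, loaded = poll_once(listing, 0, set())
load2, last, loaded = poll_once(listing, last, loaded)      # nothing changed
listing["c"] = 2000                                         # new file, same timestamp
load3, last, loaded = poll_once(listing, last, loaded)
print(load1, load2, load3)
```

In this model the unchanged poll loads nothing, while a new file that shares the last timestamp is still picked up. A file modified again without its timestamp advancing would be skipped, which is the limitation raised later in the thread.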




[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated

2015-04-28 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518721#comment-14518721
 ] 

Zhang, Liye commented on SPARK-7189:


Hi [~vanzin], I think using the timestamp is not that precise. This method is 
very similar to the approach that uses modification time. There will always be 
situations where several operations finish within a very short time (say less 
than 1 millisecond or even shorter). So timestamp and modification time cannot 
be trusted. 
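
To make the concern concrete, here is a small Python illustration, assuming modification times are stored with millisecond resolution. The microsecond event times and the helper `mtime_ms` are invented for the example. Two appends that fall in the same millisecond produce the same modification time, so a reader that parsed the file between them cannot tell from the timestamp alone that it changed.

```python
def mtime_ms(event_time_us):
    """Truncate a microsecond timestamp to whole milliseconds, as a
    millisecond-resolution modification time would."""
    return event_time_us // 1000


write_1_us = 2_000_100   # first append
parse_us = 2_000_400     # the file is read here, between the two appends
write_2_us = 2_000_900   # second append, same millisecond

seen_at_parse = mtime_ms(write_1_us)
seen_later = mtime_ms(write_2_us)

print(seen_at_parse == seen_later)   # the two appends are indistinguishable
print(seen_later > seen_at_parse)    # a strict "later than" check misses the second append
print(seen_later >= seen_at_parse)   # "later or equal" reloads, at the cost of redundant work
```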

The goal is to detect status changes of the files, including content changes 
(write operation) and permission changes (rename operation). `Inotify` can 
report such changes, but it's not available in HDFS before version 2.7. Another 
way to detect a change is to set a flag after each operation and reset the 
flag after reloading the file. But this would make the code really ugly, so it 
is a bad option. 
