[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105#comment-17105 ]

Robin Sommer commented on BIT-1215:
---

I haven't looked at the code yet, but if there's a hard line length limit in there, that's a problem. bro-cut shouldn't care how long lines are.

bro-cut should be rewritten in C for speed and to not depend on gawk

Key: BIT-1215
URL: https://bro-tracker.atlassian.net/browse/BIT-1215
Project: Bro Issue Tracker
Issue Type: Improvement
Components: Bro, bro-aux
Reporter: Daniel Thayer
Fix For: 2.4

The current implementation of bro-cut is too slow when processing large log files (it takes more than a minute to process a single log file a few hundred MB in size). Justin Azoff rewrote bro-cut in C and found that it runs an order of magnitude faster. Another benefit of a C version of bro-cut is that we will no longer depend on gawk for anything (and some of Bro's supported platforms do not include gawk by default).

--
This message was sent by Atlassian JIRA (v6.3-OD-08-005-WN#6328)
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev
Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
We are going to make it configurable and default to something like a 1000 KB line. Otherwise, you add a check on every line processed to see whether you need to reallocate memory, which seems inefficient for the sake of edge cases. Letting the user override the default is a good compromise, though.

On Jul 10, 2014, at 4:30 PM, Robin Sommer (JIRA) j...@bro-tracker.atlassian.net wrote:

> I haven't looked at the code yet but if there's hard line length limit in there, that's a problem. bro-cut shouldn't care how long lines are.
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107#comment-17107 ]

Justin Azoff commented on BIT-1215:
---

I think starting with 1M and growing the buffer 2x with realloc as needed is the way to go after all. We need (and already have) the check to see if fgets truncated the line. I think the only other thing to do would be to add an absolute maximum line length of 64M or so, to handle the case where someone accidentally runs bro-cut against a binary file (like a compressed Bro log) that just doesn't contain any newlines.
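The approach described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the actual bro-cut rewrite: the function name read_line, the return codes LINE_EOF and LINE_TOO_LONG, and the init_cap/max_cap parameters are all hypothetical. It shows the three pieces mentioned above: a starting buffer, the check for whether fgets truncated the line, and a 2x realloc up to an absolute ceiling.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical return codes; not taken from the real bro-cut source. */
#define LINE_EOF      (-1L)  /* end of input, read error, or out of memory */
#define LINE_TOO_LONG (-2L)  /* line would need more than max_cap bytes */

/*
 * Read one line of arbitrary length from fp into *buf. The buffer is
 * allocated on first use (when *buf is NULL) and doubled whenever fgets
 * fills it without reaching a newline. Returns the line length with the
 * trailing newline stripped, or one of the codes above.
 */
static long read_line(FILE *fp, char **buf, size_t *cap,
                      size_t init_cap, size_t max_cap)
{
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap != NULL);
    /* fgets takes an int size, so the ceiling is assumed to fit in one. */
    assert(init_cap >= 2 && init_cap <= max_cap && max_cap <= INT_MAX);

    if (*buf == NULL) {
        *buf = malloc(init_cap);
        if (*buf == NULL)
            return LINE_EOF;
        *cap = init_cap;
    }

    while (fgets(*buf + len, (int)(*cap - len), fp) != NULL) {
        len += strlen(*buf + len);

        if (len > 0 && (*buf)[len - 1] == '\n') {
            (*buf)[--len] = '\0';   /* complete line: strip the newline */
            return (long)len;
        }
        if (len + 1 < *cap)         /* buffer not full: final line had no newline */
            return (long)len;

        /* fgets truncated the line, so grow the buffer and keep reading. */
        if (*cap >= max_cap)
            return LINE_TOO_LONG;
        size_t new_cap = *cap * 2;
        if (new_cap > max_cap)
            new_cap = max_cap;
        char *tmp = realloc(*buf, new_cap);
        if (tmp == NULL)
            return LINE_EOF;
        *buf = tmp;
        *cap = new_cap;
    }
    return len > 0 ? (long)len : LINE_EOF;
}
```

With the numbers from the comment, a caller would pass 1 << 20 as init_cap and 64 << 20 as max_cap, keep buf and cap across calls so the buffer is reused, and free buf at the end. The extra cost per line is one comparison on the normal path; realloc only runs when a line actually overflows the current buffer. Note that the sketch relies on strlen, so it is not reliable for input containing NUL bytes, which is exactly the binary-file case the ceiling is meant to catch early.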
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108#comment-17108 ]

Robin Sommer commented on BIT-1215:
---

Yes to starting at 1M and growing with realloc. Maybe a bit less than 2x; exponential growth adds up quickly. :)

As for running bro-cut against a binary file, it would be nicer to recognize that case differently, for example by not finding a log header; that way we can give a good error message. If such a check is in place, I wouldn't bother with another check on line length: in the unlikely case that a file has a correct header but totally broken content, I'm sure there are plenty of other ways bro-cut would fail, and the worst that can happen here is running out of memory (which the OS will catch).
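The header check suggested here might look something like the sketch below. It assumes the input is an ASCII-format Bro log, whose first line begins with the tag #separator; the function name has_log_header is hypothetical and not taken from the bro-cut source. Because it reads a fixed number of bytes rather than a whole line, it cannot be made to allocate a huge buffer by a file without newlines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* First bytes of an ASCII Bro log: the "#separator" header tag. */
static const char HEADER_TAG[] = "#separator";

/*
 * Return 1 if the stream starts with the log header tag, 0 otherwise.
 * The bytes that were examined are consumed from the stream; on success
 * the caller knows they were exactly HEADER_TAG.
 */
static int has_log_header(FILE *fp)
{
    char head[sizeof HEADER_TAG - 1];   /* tag length without the NUL */
    size_t n;

    assert(fp != NULL);
    n = fread(head, 1, sizeof head, fp);
    /* memcmp rather than strcmp: binary input may contain NUL bytes. */
    return n == sizeof head && memcmp(head, HEADER_TAG, sizeof head) == 0;
}
```

A caller could run this once per input before processing any lines and print an error such as "input does not look like a Bro log" when it returns 0. Whether headerless input should be rejected outright, or only warned about, is a policy question this thread leaves open.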
Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
On Thu, Jul 10, 2014 at 17:27 -0500, you wrote:

> I think start with 1M and realloc 2x as needed is the way to go after all.

Yes. Maybe a bit less than 2x, exponential grows quickly. :)

> I think the only thing to do would be to add an absolute max line length of 64M or something to handle the case where someone accidentally runs bro-cut against a binary file (like a compressed bro log) that just doesn't contain any newlines.

Would be nicer to recognize that differently, like by not finding a log header; that way we can give a good error message. If such a check is in place, I wouldn't actually bother with another double-check on line length; in the unlikely case that the file has a correct header but totally broken content, I'm sure there are plenty other cases where bro-cut would fail, and it seems there's not more here that can happen in addition than running out of memory (which the OS will catch).
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
Daniel Thayer created BIT-1215:
--

Summary: bro-cut should be rewritten in C for speed and to not depend on gawk
Key: BIT-1215
URL: https://bro-tracker.atlassian.net/browse/BIT-1215
Project: Bro Issue Tracker
Issue Type: Improvement
Components: bro-aux
Reporter: Daniel Thayer
Fix For: 2.4
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Thayer updated BIT-1215:
---

Component/s: Bro
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102#comment-17102 ]

Daniel Thayer commented on BIT-1215:
---

Branch topic/dnthayer/ticket1215 in the bro and bro-aux repos contains the new bro-cut and a couple of doc changes (removing gawk from the list of optional Bro dependencies, and updating the btest sphinx PATH so that the documentation examples that use bro-cut can find the new bro-cut).
[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk
[ https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Thayer updated BIT-1215:
---

Status: Merge Request (was: Open)