[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Robin Sommer (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=17105#comment-17105
 ] 

Robin Sommer commented on BIT-1215:
---

I haven't looked at the code yet, but if there's a hard line length
limit in there, that's a problem. bro-cut shouldn't care how long
lines are.




 bro-cut should be rewritten in C for speed and to not depend on gawk
 

 Key: BIT-1215
 URL: https://bro-tracker.atlassian.net/browse/BIT-1215
 Project: Bro Issue Tracker
  Issue Type: Improvement
  Components: Bro, bro-aux
Reporter: Daniel Thayer
 Fix For: 2.4


 The current implementation of bro-cut is too slow when processing large log 
 files (takes more than a minute to process a single log file a few hundred MB 
 in size).  Justin Azoff rewrote bro-cut in C and found that it runs an order 
 of magnitude faster.  Another benefit of a C version of bro-cut is that we 
 will no longer depend on gawk for anything (and some of Bro's supported 
 platforms do not include gawk by default).



--
This message was sent by Atlassian JIRA
(v6.3-OD-08-005-WN#6328)
___
bro-dev mailing list
bro-dev@bro.org
http://mailman.icsi.berkeley.edu/mailman/listinfo/bro-dev


Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Slagell, Adam J
We are going to make the line length limit configurable, with a default of
something like 1000KB per line. Otherwise, you add a check on every line
processed to see whether memory needs to be reallocated, which seems
inefficient just to handle edge cases. Letting the user override the default
is a good compromise, though.

> On Jul 10, 2014, at 4:30 PM, Robin Sommer (JIRA) 
> j...@bro-tracker.atlassian.net wrote:
> 
> I haven't looked at the code yet but if there's hard line length
> limit in there, that's a problem. bro-cut shouldn't care how long
> lines are.



[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Justin Azoff (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=17107#comment-17107
 ] 

Justin Azoff commented on BIT-1215:
---

I think starting with a 1M buffer and growing it 2x with realloc as needed is
the way to go after all.  We need (and already have) the check to see if fgets
truncated the line.

I think the only thing to do would be to add an absolute max line length of 64M 
or something to handle the case where someone accidentally runs bro-cut against 
a binary file (like a compressed bro log) that just doesn't contain any 
newlines.
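
The approach described above (start with a fixed buffer, detect fgets
truncation, grow with realloc) could be sketched roughly as follows. This is
not the code from the branch; the function name read_line, the constants, and
the return convention are all assumptions made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_LINE_CAP (1024 * 1024) /* 1M starting size, as in the thread */
#define GROWTH_FACTOR 2                /* 2x growth, as in the thread */

/*
 * Hypothetical helper: read one complete line from fp into *buf, growing
 * *buf whenever fgets fills it without reaching a newline.
 * Pass *buf == NULL on the first call; *cap < 2 selects the default size.
 * Returns the line length (newline included when present), or -1 on end
 * of input with nothing read or on allocation failure.
 */
static long read_line(FILE *fp, char **buf, size_t *cap)
{
    size_t len = 0;

    assert(fp != NULL && buf != NULL && cap != NULL);

    if (*buf == NULL || *cap < 2) {
        size_t want = (*cap < 2) ? DEFAULT_LINE_CAP : *cap;
        char *fresh = realloc(*buf, want); /* acts like malloc when *buf is NULL */
        if (fresh == NULL)
            return -1;
        *buf = fresh;
        *cap = want;
    }

    for (;;) {
        /* Only the terminating NUL fits: the previous fgets was truncated. */
        if (*cap - len < 2) {
            char *bigger;
            if (*cap > SIZE_MAX / GROWTH_FACTOR)
                return -1;
            bigger = realloc(*buf, *cap * GROWTH_FACTOR);
            if (bigger == NULL)
                return -1;
            *buf = bigger;
            *cap *= GROWTH_FACTOR;
        }

        size_t room = *cap - len;
        int chunk = (room > INT_MAX) ? INT_MAX : (int)room;

        if (fgets(*buf + len, chunk, fp) == NULL)
            break; /* end of input or read error */

        len += strlen(*buf + len);

        /* A trailing newline means the whole line is in the buffer. */
        if (len > 0 && (*buf)[len - 1] == '\n')
            return (long)len;
    }

    /* Last line of the input had no trailing newline. */
    return (len > 0) ? (long)len : -1;
}
```

A caller would start with a NULL buffer and a capacity of 0, call read_line
in a loop until it returns -1, and free the buffer once at the end. The
capacity check is a single comparison per fgets call, and realloc only runs
for lines that outgrow the current buffer. An absolute maximum such as the
64M mentioned above would be one more comparison before the realloc.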


Re: [Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-10 Thread Robin Sommer



On Thu, Jul 10, 2014 at 17:27 -0500, you wrote:

> I think start with 1M and realloc 2x as needed is the way to go after
> all.

Yes. Maybe a bit less than 2x; exponential growth gets large quickly. :)

> I think the only thing to do would be to add an absolute max line
> length of 64M or something to handle the case where someone
> accidentally runs bro-cut against a binary file (like a compressed bro
> log) that just doesn't contain any newlines.

It would be nicer to recognize that differently, like by not finding a
log header; that way we can give a good error message. If such a check
is in place, I wouldn't actually bother with another double-check on
line length; in the unlikely case that the file has a correct header
but totally broken content, I'm sure there are plenty of other cases
where bro-cut would fail, and it seems the worst that can additionally
happen here is running out of memory (which the OS will catch).
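
A header check along the lines suggested above might look like the sketch
below. It assumes that a Bro ASCII log begins with a "#separator" line and
that the first bytes of the input are already in a buffer. The function
names and the wording of the error message are hypothetical, not taken from
the bro-cut source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: Bro ASCII logs start with a "#separator" header line. */
static const char HEADER_PREFIX[] = "#separator";

/* Return 1 if the first bytes of the input look like a Bro log header. */
static int looks_like_bro_log(const char *first_bytes, size_t len)
{
    size_t n = sizeof(HEADER_PREFIX) - 1; /* length without the NUL */

    assert(first_bytes != NULL);

    /* memcmp rather than strncmp, so binary data with NUL bytes is fine. */
    return len >= n && memcmp(first_bytes, HEADER_PREFIX, n) == 0;
}

/* Hypothetical error path: print a readable message for non-log input. */
static int check_log_header(const char *name, const char *first_bytes,
                            size_t len)
{
    if (!looks_like_bro_log(first_bytes, len)) {
        fprintf(stderr,
                "bro-cut: %s does not look like a Bro log (no %s header)\n",
                name, HEADER_PREFIX);
        return 0;
    }
    return 1;
}
```

With a check like this run on the first bytes read, a compressed log would be
rejected with a clear message before any line buffer has a chance to grow.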


[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-09 Thread Daniel Thayer (JIRA)
Daniel Thayer created BIT-1215:
--

 Summary: bro-cut should be rewritten in C for speed and to not depend on gawk
 Key: BIT-1215
 URL: https://bro-tracker.atlassian.net/browse/BIT-1215
 Project: Bro Issue Tracker
  Issue Type: Improvement
  Components: bro-aux
Reporter: Daniel Thayer
 Fix For: 2.4


The current implementation of bro-cut is too slow when processing large log 
files (takes more than a minute to process a single log file a few hundred MB 
in size).  Justin Azoff rewrote bro-cut in C and found that it runs an order of 
magnitude faster.  Another benefit of a C version of bro-cut is that we will no 
longer depend on gawk for anything (and some of Bro's supported platforms do 
not include gawk by default).






[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-09 Thread Daniel Thayer (JIRA)

 [ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Thayer updated BIT-1215:
---
Component/s: Bro



[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-09 Thread Daniel Thayer (JIRA)

[ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=17102#comment-17102
 ] 

Daniel Thayer commented on BIT-1215:


Branch topic/dnthayer/ticket1215 in the bro and bro-aux repos contains
the new bro-cut and a couple of doc changes (removing gawk from the
list of optional Bro dependencies, and updating the btest sphinx PATH so
that the documentation examples that use bro-cut can find the new bro-cut).




[Bro-Dev] [JIRA] (BIT-1215) bro-cut should be rewritten in C for speed and to not depend on gawk

2014-07-09 Thread Daniel Thayer (JIRA)

 [ 
https://bro-tracker.atlassian.net/browse/BIT-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Thayer updated BIT-1215:
---
Status: Merge Request  (was: Open)
