bug#26576: -v when used with -C

2017-04-21 Thread Assaf Gordon

On Thu, Apr 20, 2017 at 02:34:47PM -0500, Eric Blake wrote:
> On 04/20/2017 11:51 AM, Assaf Gordon wrote:
>> If I may suggest the following sed program:
>>
>>  $ sed -n ':x 1,2{N;bx} ; /UGLY/{ N;N;z;bx }; /./P;N;D' file
>
> Works as long as lines 1 and 2 do not contain UGLY. But misbehaves if
> UGLY appears early:
>
> [...]
>
> Also misbehaves if two occurrences of UGLY appear with overlapping context:
>
> [...]
>
> May be fixable with even more magic, perhaps by using the hold buffer to
> track the status of the last three lines, and suppressing output if any
> of the last three inputs were UGLY.  But more complicated than I want to
> spend time on for the sake of this email.



Good catch, thanks for pointing this out.

Indeed, that was an ad-hoc script, suitable for some limited scenarios
but not robust as a general solution.

-assaf







bug#26576: -v when used with -C

2017-04-20 Thread Eric Blake
On 04/20/2017 11:51 AM, Assaf Gordon wrote:

> If I may suggest the following sed program:
> 
>  $ cat file
>  a
>  b
>  c
>  bla1
>  bla2
>  UGLY
>  bla3
>  bla4
>  e
>  f
>  g
> 
>  $ sed -n ':x 1,2{N;bx} ; /UGLY/{ N;N;z;bx }; /./P;N;D' file

Works as long as lines 1 and 2 do not contain UGLY. But misbehaves if
UGLY appears early:

$ printf '2\nUGLY\n3\n4\nc\nd\n' | sed -n ':x 1,2{N;bx};
/UGLY/{N;N;z;bx}; /./P;N;D'
d

Oops - missed c.

Also misbehaves if two occurrences of UGLY appear with overlapping context:

$ printf 'a\nb\n1\n2\nUGLY\n3\nUGLY\n4\n5\nc\nd\n' | sed -n ':x
1,2{N;bx}; /UGLY/{N;N;z;bx}; /./P;N;D'
a
b
4
5
c
d

Oops - didn't filter 4 and 5.

May be fixable with even more magic, perhaps by using the hold buffer to
track the status of the last three lines, and suppressing output if any
of the last three inputs were UGLY.  But more complicated than I want to
spend time on for the sake of this email.
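
The bookkeeping described above (remember whether a recent line was
UGLY, and hold output back until later lines are known to be clean) is
arguably easier to write in awk than with sed's hold buffer. The
following is only a one-pass sketch of that idea, not the sed
hold-buffer solution; the function name stream_exclude and its argument
order (context size, then regex) are made up for illustration.

```shell
# Hypothetical helper: stream_exclude NUM REGEX
# Reads stdin and drops every line matching REGEX plus NUM lines on
# each side, in a single pass.
stream_exclude() {
    awk -v n="$1" -v pat="$2" '
        BEGIN { head = 0; tail = 0; skip = 0 }
        # A match discards the held-back lines and poisons the next n.
        $0 ~ pat { while (head < tail) delete q[head++]; skip = n; next }
        # Still inside the trailing context of an earlier match.
        skip > 0 { skip--; next }
        # Clean line: hold it back until n later lines have been seen.
        {
            q[tail++] = $0
            if (tail - head > n) { print q[head]; delete q[head++] }
        }
        # End of input: nothing can poison the held-back lines any more.
        END { while (head < tail) print q[head++] }
    '
}

# The two inputs that trip up the sed one-liner:
printf '2\nUGLY\n3\n4\nc\nd\n' | stream_exclude 2 UGLY
printf 'a\nb\n1\n2\nUGLY\n3\nUGLY\n4\n5\nc\nd\n' | stream_exclude 2 UGLY
```

The first command should print c and d, the second a, b, c and d.
Because a match is tested before the skip counter, a second UGLY inside
the trailing context simply restarts the count, which covers the
overlapping case.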

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#26576: -v when used with -C

2017-04-20 Thread 積丹尼 Dan Jacobson
Yes, those are brilliant uses of sed. However, for now

‘-v’
‘--invert-match’
 Invert the sense of matching, to select non-matching lines.  (‘-v’
 is specified by POSIX.)

perhaps should mention that "-v is processed before -C, -A, and -B, not after."





bug#26576: -v when used with -C

2017-04-20 Thread Assaf Gordon

Hello,

On Thu, Apr 20, 2017 at 11:26:47AM -0500, Eric Blake wrote:
> On 04/20/2017 10:37 AM, 積丹尼 Dan Jacobson wrote:
>> I want to do
>> $ cat file|some_program
>> but I must exclude the UGLY line and its two neighbors.
>>
>> OK I have found the UGLY line, and its two neighbors
>> $ grep -C 2 UGLY file
>> bla
>> bla
>> UGLY
>> bla
>> bla
>>
>> but I have no way to exclude them before piping to some_program.
>
> It's very much a corner case, so I'm not sure it's worth burning an
> option and complicating grep to do this, plus waiting for a future
> version of grep with the proposed new option to percolate to your
> machines, when you can already accomplish the same task using existing
> tools (admittedly with more complexity).




If I may suggest the following sed program:

 $ cat file
 a
 b
 c
 bla1
 bla2
 UGLY
 bla3
 bla4
 e
 f
 g

 $ sed -n ':x 1,2{N;bx} ; /UGLY/{ N;N;z;bx }; /./P;N;D' file
 a
 b
 c
 e
 f
 g


The combination of N/P/D commands uses sed's pattern space
as a FIFO buffer (N appends the next input line, P prints the first
line, D deletes the first line).
In between, if the pattern space contains the marker UGLY,
the entire buffer is deleted and the cycle is restarted.

Specifically:

1. ':x 1,2{N;bx}' => Load the buffer with the first two lines.

2. '/UGLY/ {N;N;z;bx}' => If the marker is found in the pattern
  space (which should contain 3 lines now),
  consume two more lines (N;N), clear the buffer (z) and
  jump to the beginning.
  'z' is a GNU extension. It can be replaced with 's/.*//'.

3. '/./P' => If the pattern space isn't empty, print it up to
  the first newline (that is, print its first line).

4. 'N;D' => Read the next line from the input file and append
  it to the pattern space, then delete the first line from the
  pattern space (the same line that was printed with 'P').
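
For reference, the four steps above can be laid out with one sed
command per line and a comment for each step. This is the same program,
only reformatted; the wrapper name drop_ugly is made up here, and 'z'
still requires GNU sed.

```shell
# Hypothetical wrapper name; the sed program itself is unchanged.
drop_ugly() {
    sed -n '
        :x
        # Step 1: preload the pattern space with the first lines.
        1,2{N;bx}
        # Step 2: marker found in the window: read two more lines,
        # empty the window (z is a GNU extension) and start over.
        /UGLY/{N;N;z;bx}
        # Step 3: print the first line, if the window is not empty.
        /./P
        # Step 4: append the next input line, delete the first line.
        N
        D
    ' "$@"
}

printf '%s\n' a b c bla1 bla2 UGLY bla3 bla4 e f g | drop_ugly
```

With the sample input this prints a, b, c, e, f and g, one per line.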



The following program can be used to learn a bit more about how
the N/P/D commands work. It uses 'l' to print the content
of the pattern space, and you can see it behaves like a FIFO:

 $ sed -n ':x 1,2{N;bx} ; l;P;N;D' file
 a\nb\nc$
 a
 b\nc\nbla1$
 b
 c\nbla1\nbla2$
 c
 bla1\nbla2\nUGLY$
 bla1
 bla2\nUGLY\nbla3$
 bla2
 UGLY\nbla3\nbla4$
 UGLY
 bla3\nbla4\ne$
 bla3
 bla4\ne\nf$
 bla4
 e\nf\ng$
 e


More information about sed's buffers can be found here:
https://www.gnu.org/software/sed/manual/sed.html#advanced-sed

hope this helps,
regards,
- assaf









bug#26576: -v when used with -C

2017-04-20 Thread Eric Blake
On 04/20/2017 11:38 AM, 積丹尼 Dan Jacobson wrote:
> Yes, if somebody ever adds this option perhaps call it --compliment.

Except that you mean --complement (you are not praising the lines, but
making an opposite selection of lines).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#26576: -v when used with -C

2017-04-20 Thread 積丹尼 Dan Jacobson
Yes, if somebody ever adds this option perhaps call it --compliment.





bug#26576: -v when used with -C

2017-04-20 Thread Eric Blake
On 04/20/2017 10:37 AM, 積丹尼 Dan Jacobson wrote:
> I want to do
> $ cat file|some_program
> but I must exclude the UGLY line and its two neighbors.
> 
> OK I have found the UGLY line, and its two neighbors
> $ grep -C 2 UGLY file
> bla
> bla
> UGLY
> bla
> bla
> 
> but I have no way to exclude them before piping to some_program.

So it sounds like you are asking for some sort of new --invert-output,
which toggles which lines to display.  Revisiting my example, it would
change:

$ seq 10 | grep -C 2 5
3
4
5
6
7

into:

$ seq 10 | grep -C 2 5 --invert-output
1
2
--
8
9
10

as well as:

$ seq 10 | grep -C 2 -v 5
1
2
3
4
5
6
7
8
9
10
$ seq 10 | grep -C 2 -v '[3-8]'
1
2
3
4
--
7
8
9
10

into:

$ seq 10 | grep -C 2 -v 5 --invert-output
$ seq 10 | grep -C 2 -v '[3-8]' --invert-output
5
6

It's very much a corner case, so I'm not sure it's worth burning an
option and complicating grep to do this, plus waiting for a future
version of grep with the proposed new option to percolate to your
machines, when you can already accomplish the same task using existing
tools (admittedly with more complexity).

For example, you can use sed twice if the data is in a file that can be
re-read or easily regenerated (in this case, I'm skipping d, h, and any
line within -C1 of the ugly lines):

$ printf %s\\n a b c d e f g h i j > file
$ ugly=$(sed -n '/[dh]/ =' file)
$ sed "$(for line in $ugly; do echo "$((line-1)),$((line+1))d;";
   done)" file
a
b
f
j
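
A variation on the same two-pass idea is to let grep work out the
context ranges itself: with -n, grep prefixes matching lines with "N:"
and context lines with "N-", and those prefixes can be turned into sed
delete commands. This is only a sketch; the function name
exclude_context and its argument order are made up for illustration,
and the input must be a file that can be read twice.

```shell
# Hypothetical helper: exclude_context NUM REGEX FILE
# Prints FILE minus every line that 'grep -C NUM REGEX FILE' shows.
exclude_context() {
    num=$1 pattern=$2 file=$3
    # Pass 1: grep -n -C lists "N:text" (match) and "N-text" (context);
    # keep only N and turn it into the sed command "Nd" (delete line N).
    script=$(grep -n -C "$num" -e "$pattern" -- "$file" |
             sed -n 's/^\([0-9][0-9]*\)[:-].*/\1d/p')
    # Pass 2: apply the delete commands to the same file.
    sed "$script" "$file"
}

seq 10 > nums.txt            # nums.txt is just a scratch file
exclude_context 2 5 nums.txt
rm -f nums.txt
```

For the seq 10 input this prints 1, 2, 8, 9 and 10. Overlapping context
is not a problem here, because grep has already merged the ranges.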

Or it should be easy enough to write an awk script that stashes all
input lines into one array, then checks for regular expression matches,
and sets multiple entries in a corresponding poison array to 1 (based on
how many lines of context you want to poison), then in an END block only
print out lines if the corresponding poison[] entry is not 1.  Although
I'll leave that as an exercise for the reader.
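
One possible take on that exercise, as a rough sketch rather than a
polished tool: the function name poison_filter and its arguments
(context size, then regex) are invented here, and the whole input is
held in memory, as the description implies.

```shell
# Hypothetical helper: poison_filter NUM REGEX
# Reads stdin, drops every line matching REGEX plus NUM lines of
# context on each side.
poison_filter() {
    awk -v n="$1" -v pat="$2" '
        # Stash every input line.
        { line[NR] = $0 }
        # On a match, poison the line and its n neighbours on each side.
        $0 ~ pat { for (i = NR - n; i <= NR + n; i++) poison[i] = 1 }
        # At the end, print only the lines that were never poisoned.
        END {
            for (i = 1; i <= NR; i++)
                if (!(i in poison)) print line[i]
        }
    '
}

printf '%s\n' a b c d e f g h i j | poison_filter 1 '[dh]'
```

This prints a, b, f and j, the same result as the two-pass sed example.
Note that awk -v interprets backslash escapes in the regex string, so
patterns containing backslashes need extra care.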

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#26576: -v when used with -C

2017-04-20 Thread 積丹尼 Dan Jacobson
I want to do
$ cat file|some_program
but I must exclude the UGLY line and its two neighbors.

OK I have found the UGLY line, and its two neighbors
$ grep -C 2 UGLY file
bla
bla
UGLY
bla
bla

but I have no way to exclude them before piping to some_program.





bug#26576: -v when used with -C

2017-04-20 Thread Eric Blake
On 04/20/2017 10:14 AM, 積丹尼 Dan Jacobson wrote:
> Mmmm, OK, but grep still needs an additional future option to print just
> the missing set...

What output are you wanting?  If all you want is the non-matching lines,
don't ask for context (since the context will include matching lines).

If you want your request to be acted on, please demonstrate with some
sample input and the resulting output you want to accomplish, and then
we can help you figure out if that particular output can already be
generated using existing options.  But your vague request to "print just
the missing set" doesn't tell me what you really want.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#26576: -v when used with -C

2017-04-20 Thread 積丹尼 Dan Jacobson
Mmmm, OK, but grep still needs an additional future option to print just
the missing set...





bug#26576: -v when used with -C

2017-04-20 Thread Eric Blake
tag 26576 notabug
thanks

On 04/20/2017 09:39 AM, 積丹尼 Dan Jacobson wrote:
> You know if this only gets five lines,
> grep -C 2 ZZZ 1.vcf|wc - 1.vcf
>     5     5   197 -
>  1686  1731 83630 1.vcf
> then this
> grep -C 2 -v ZZZ 1.vcf|wc - 1.vcf
>  1686  1731 83630 -
>  1686  1731 83630 1.vcf
> should get all EXCEPT five lines.

Not necessarily true.  Let's simplify your example to something that
doesn't require knowing the contents of 1.vcf:

$ seq 10 | grep -C 2 5
3
4
5
6
7

That says show all lines that match the regex '5', as well as (up to) 2
context lines on either side.  So we get a total output of five lines,
even though only one of those five lines actually matched.

Now the converse:

$ seq 10 | grep -C 2 -v 5
1
2
3
4
5
6
7
8
9
10

That says to show all lines that do not match the regex '5', as well as
(up to) 2 context lines on either side.  So we get a total output of ten
lines, but that is comprised of 4 matching lines, 1 context line, and 5
more matching lines (grep was smart enough to consolidate the two tail
lines after 4 and the two head lines before 6 into a single output line,
rather than displaying two independent chunks).
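
One way to see which output lines are selected and which are merely
context is to add -n: grep then separates the line number from the text
with ':' on selected lines and with '-' on context lines.

```shell
# ':' after the line number marks a selected (here: non-matching) line,
# '-' marks a line that is shown only as context.
seq 10 | grep -n -C 2 -v 5
```

Here only line 5 comes out with a '-' separator (as "5-5"); the other
nine lines use ':'.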

For further proof that -C and -v are correctly working together, try
something that excludes enough context lines to actually get two hunks:

$ seq 10 | grep -C 2 -v '[3-8]'
1
2
3
4
--
7
8
9
10

Now you're matching 2 lines, then 2 lines tail context, then a hunk
separator, then 2 lines head context, then 2 more matching lines.

Therefore, I'm tagging this as not a bug.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





bug#26576: -v when used with -C

2017-04-20 Thread 積丹尼 Dan Jacobson
You know if this only gets five lines,
grep -C 2 ZZZ 1.vcf|wc - 1.vcf
    5     5   197 -
 1686  1731 83630 1.vcf
then this
grep -C 2 -v ZZZ 1.vcf|wc - 1.vcf
 1686  1731 83630 -
 1686  1731 83630 1.vcf
should get all EXCEPT five lines.