Re: Extending R.E. Syntax (was: Re: Eliminating binary from a text file)

2015-07-21 Thread Shlomi Fish
Hi Omer,

On Mon, Jul 20, 2015 at 9:46 PM, Omer Zak  wrote:

> Instead of creating a separate bgrep, it would have been better to be
> able to extend the syntax of regular expressions (in egrep, Perl and
> other platforms) to allow specification of binary strings having
> arbitrary length by means of a hex string.
>
> This would come instead of making it very cumbersome to specify strings
> longer than one character (\xnn or \u or equivalent - see also:
> http://www.regular-expressions.info/unicode.html).
>
>
Well, you can already match binary substrings inside Perl regular
expressions using the method you describe (\xHH\xHH\xHH, etc.). In Perl
you can do something like:

my $bin_string = [Binary string generated by whatever means necessary];

if ($haystack =~ / ... \Q$bin_string\E ... /)
{
}

So given the rarity of matching binary strings, it seems like a good
compromise.
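
The same idea can be sketched in Python, as a rough analogue of the Perl snippet above rather than a translation of it. Here re.escape() plays the role of \Q...\E, and the needle is given as a hex string, as Omer suggested. The function name find_binary and its parameters are made up for illustration.

```python
import re


def find_binary(haystack: bytes, needle_hex: str) -> int:
    """Return the offset of the first literal match of the hex-specified bytes, or -1."""
    # Turn a hex string such as "00ff2e" into the bytes b"\x00\xff."
    needle = bytes.fromhex(needle_hex)
    # re.escape() makes every byte match literally, like \Q...\E in Perl,
    # so the escaped needle can be embedded in a larger bytes pattern.
    match = re.search(re.escape(needle), haystack)
    return match.start() if match else -1
```

For example, find_binary(data, "00ff2e") searches for the three bytes 0x00 0xFF 0x2E, and the 0x2E (".") is not treated as a wildcard.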

> And if we are at it, it would have been nice to add to all R.E. engines
> hooks to allow private extensions of R.E. syntax, in order to allow
> people to concisely express special parsing requirements.
>
>
Perl 5 has allowed plugging in different (and possibly custom) regular
expression engines since version 5.10; see the perlreapi documentation.

Regards,

-- Shlomi



> --- Omer
>
>
-- 
--
Shlomi Fish http://www.shlomifish.org/

Chuck Norris helps the gods that help themselves.

Please reply to list if it's a mailing list post - http://shlom.in/reply .
___
Linux-il mailing list
Linux-il@cs.huji.ac.il
http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il


Re: Eliminating binary from a text file

2015-07-21 Thread Orna Agmon Ben-Yehuda
Amos, we have a winner!!! Exactly what I looked for!
Thanks
Orna

On Tue, Jul 21, 2015 at 7:53 AM, Amos Shapira 
wrote:

> Then how about:
>
> "grep -v -P -a '\x00' file"?
>
> Based on http://superuser.com/a/612336/27453. Explanation of the flags:
>
> -v - inverse - print NON-matching lines
> -P - use Perl regexp
> -a - force treating the file as a text file
>
> On 21 July 2015 at 13:39, Shachar Shemesh  wrote:
>
>>  On 21/07/15 00:22, Boruch Baum wrote:
>>
>> I see that I'm late to the discussion and that your original problem has
>> morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
>> command. See `man tr'.
>>
>>
>>  Read the original question again. She needs to eliminate the entire line
>> where a corruption happened, not just the corrupt bytes themselves.
>>
>> Shachar


-- 
Orna Agmon Ben-Yehuda.
http://ladypine.org


Re: Extending R.E. Syntax (was: Re: Eliminating binary from a text file)

2015-07-20 Thread Alexander Sukholitko
Hi,
Possibly the command "strings file_name > new_file_name" could resolve
this problem.
Thank you.
Alex

On Mon, Jul 20, 2015 at 9:46 PM, Omer Zak  wrote:

> Instead of creating a separate bgrep, it would have been better to be
> able to extend the syntax of regular expressions (in egrep, Perl and
> other platforms) to allow specification of binary strings having
> arbitrary length by means of a hex string.
>
> This would come instead of making it very cumbersome to specify strings
> longer than one character (\xnn or \u or equivalent - see also:
> http://www.regular-expressions.info/unicode.html).
>
> And if we are at it, it would have been nice to add to all R.E. engines
> hooks to allow private extensions of R.E. syntax, in order to allow
> people to concisely express special parsing requirements.
>
> --- Omer
>
>
> On Mon, 2015-07-20 at 21:24 +0300, Shachar Shemesh wrote:
> > On 20/07/15 11:56, Orna Agmon Ben-Yehuda wrote:
> >
> > > Hello everyone,
> > >
> > >
> > > I often have damaged text files (due to a lovely storage system).
> > > The files are of different formats, although I can usually assume
> > > they contain spaces. The files are structured as lines.
> > >
> > >
> > > Every once in a while, the lovely destruction (ahmstorage)
> > > system inserts binary garbage to the file. I wish to fix the files
> > > by removing the cancer without leaving any leftovers. That is, I
> > > want to lose partial lines.
> > >
> > >
> > > I tried using grep with all sorts of keys, but it did not do the
> > > trick.
> > > strings catches too little - it leaves partial lines.
> > > Is there an elegant  way to  do the trick line-wise?
> > >
> > >
> > > Thanks
> > > Orna
> > >
> > http://debugmo.de/2009/04/bgrep-a-binary-grep/


Re: Eliminating binary from a text file

2015-07-20 Thread Amos Shapira
Then how about:

"grep -v -P -a '\x00' file"?

Based on http://superuser.com/a/612336/27453. Explanation of the flags:

-v - inverse - print NON-matching lines
-P - use Perl regexp
-a - force treating the file as a text file
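
For comparison, here is a minimal Python sketch of the same filter: keep only the lines that contain no NUL byte. It is an illustration under assumptions, not part of the grep solution; the names lines_without_nul and clean_file are hypothetical.

```python
def lines_without_nul(lines):
    """Yield only the lines (as bytes) that contain no NUL byte."""
    for line in lines:
        if b"\x00" not in line:
            yield line


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping every line that contains a NUL."""
    # Binary mode keeps Python from trying to decode the garbage bytes.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(lines_without_nul(src))
```

Calling clean_file("file", "newfile") would then correspond to redirecting the grep output to a new file.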

On 21 July 2015 at 13:39, Shachar Shemesh  wrote:

>  On 21/07/15 00:22, Boruch Baum wrote:
>
> I see that I'm late to the discussion and that your original problem has
> morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
> command. See `man tr'.
>
>
>  Read the original question again. She needs to eliminate the entire line
> where a corruption happened, not just the corrupt bytes themselves.
>
> Shachar


Re: Eliminating binary from a text file

2015-07-20 Thread Orna Agmon Ben-Yehuda
tr does what strings does: it clears only the bad characters, not the full
bad line. Not what I wanted.
But now that I know what data I have there, I can clear it using any script
(back to Dov's solution).

Thanks everyone!

On Tue, Jul 21, 2015 at 2:20 AM, Amos Shapira 
wrote:

> +1 for "tr -d '\0' < file > newfile", based on the updated description.
> But "prevention is better than a cure" - find a way to avoid this in the
> first place.
>
> On 21 July 2015 at 07:22, Boruch Baum  wrote:
>
>> I see that I'm late to the discussion and that your original problem has
>> morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
>> command. See `man tr'.
>>
>> On 07/20/2015 04:56 AM, Orna Agmon Ben-Yehuda wrote:
>> > Hello everyone,
>> >
>> > I often have damaged text files (due to a lovely storage system). The
>> files
>> > are of different formats, although I can usually assume they contain
>> > spaces. The files are structured as lines.
>> >
>> > Every once in a while, the lovely destruction (ahmstorage) system
>> > inserts binary garbage to the file. I wish to fix the files by removing
>> the
>> > cancer without leaving any leftovers. That is, I want to lose partial
>> lines.
>> >
>> > I tried using grep with all sorts of keys, but it did not do the trick.
>> > strings catches too little - it leaves partial lines.
>> > Is there an elegant  way to  do the trick line-wise?
>> >
>> > Thanks
>> > Orna


-- 
Orna Agmon Ben-Yehuda.
http://ladypine.org


Re: Eliminating binary from a text file

2015-07-20 Thread Shachar Shemesh
On 21/07/15 00:22, Boruch Baum wrote:
> I see that I'm late to the discussion and that your original problem has
> morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
> command. See `man tr'.
>
Read the original question again. She needs to eliminate the entire line
where a corruption happened, not just the corrupt bytes themselves.
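
To make the difference concrete, here is a small Python illustration on made-up sample data. The first result is what deleting bytes (as `tr -d '\0'` does) produces; the second is the line-wise removal that was asked for.

```python
sample = b"good line\npartial\x00\x00\x00\nanother good line\n"

# Byte-wise deletion: the NULs go away, but the damaged line's remains stay.
bytes_removed = sample.replace(b"\x00", b"")

# Line-wise deletion: every line that contains a NUL is dropped entirely.
lines_removed = b"".join(
    line for line in sample.splitlines(keepends=True) if b"\x00" not in line
)

print(bytes_removed)  # b'good line\npartial\nanother good line\n'
print(lines_removed)  # b'good line\nanother good line\n'
```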

Shachar


Re: Eliminating binary from a text file

2015-07-20 Thread Amos Shapira
+1 for "tr -d '\0' < file > newfile", based on the updated description.
But "prevention is better than a cure" - find a way to avoid this in the
first place.

On 21 July 2015 at 07:22, Boruch Baum  wrote:

> I see that I'm late to the discussion and that your original problem has
> morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
> command. See `man tr'.
>
> On 07/20/2015 04:56 AM, Orna Agmon Ben-Yehuda wrote:
> > Hello everyone,
> >
> > I often have damaged text files (due to a lovely storage system). The
> files
> > are of different formats, although I can usually assume they contain
> > spaces. The files are structured as lines.
> >
> > Every once in a while, the lovely destruction (ahmstorage) system
> > inserts binary garbage to the file. I wish to fix the files by removing
> the
> > cancer without leaving any leftovers. That is, I want to lose partial
> lines.
> >
> > I tried using grep with all sorts of keys, but it did not do the trick.
> > strings catches too little - it leaves partial lines.
> > Is there an elegant  way to  do the trick line-wise?
> >
> > Thanks
> > Orna


Re: Eliminating binary from a text file

2015-07-20 Thread Boruch Baum
I see that I'm late to the discussion and that your original problem has
morphed a bit. Maybe the simplest and oldest solution is the `tr -d'
command. See `man tr'.

On 07/20/2015 04:56 AM, Orna Agmon Ben-Yehuda wrote:
> Hello everyone,
> 
> I often have damaged text files (due to a lovely storage system). The files
> are of different formats, although I can usually assume they contain
> spaces. The files are structured as lines.
> 
> Every once in a while, the lovely destruction (ahmstorage) system
> inserts binary garbage to the file. I wish to fix the files by removing the
> cancer without leaving any leftovers. That is, I want to lose partial lines.
> 
> I tried using grep with all sorts of keys, but it did not do the trick.
> strings catches too little - it leaves partial lines.
> Is there an elegant  way to  do the trick line-wise?
> 
> Thanks
> Orna


-- 
hkp://keys.gnupg.net
CA45 09B5 5351 7C11 A9D1  7286 0036 9E45 1595 8BC0




Re: Eliminating binary from a text file

2015-07-20 Thread Orna Agmon Ben-Yehuda
The bad data is NUL bytes (I did not have hexedit, but was introduced to
hexedit mode in Emacs, which proved useful).

In the meantime, Muli Ben-Yehuda suggested preventing the mess in the first
place. The corrupted file is the output of a C program. The problem is that
the program keeps writing to the file without verifying that the data was
actually written. On a normal filesystem I would not care, but mine fails
several times a day. The NULs are empty data: the program did an fseek
forward, but the data was never written to the file.

The solution I am testing is syncing. The options I was given were:
1. to mount the filesystem such that it always syncs,
2. to sync everything the user is running at a certain point, or
3. to fsync just the problematic file when I stop writing to it.

I am currently testing the third option, for one file only. It is likely to
hurt performance the least.
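
The third option can be sketched roughly as follows. The real program is written in C, where the analogous calls would be fflush() followed by fsync(fileno(fp)); this Python version only illustrates the idea, and the function name write_and_sync is made up.

```python
import os


def write_and_sync(path, data: bytes) -> None:
    """Append data to path and ask the OS to push it to stable storage."""
    with open(path, "ab") as fh:
        fh.write(data)
        # Move the data from Python's user-space buffer to the OS.
        fh.flush()
        # Ask the OS to flush its cache for this one file to the device.
        # This raises OSError if the sync fails, so errors are not silent.
        os.fsync(fh.fileno())
```

Syncing a single file this way leaves the rest of the system's I/O unaffected, which is why it should cost less than a sync-always mount.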



On Mon, Jul 20, 2015 at 1:40 PM, Rabin Yasharzadehe  wrote:

> Can you provide an example of bad lines and how you would like them to
> look after you fix them?
>
> --
> Rabin
>
> On Mon, Jul 20, 2015 at 11:56 AM, Orna Agmon Ben-Yehuda <
> ladyp...@gmail.com> wrote:
>
>> Hello everyone,
>>
>> I often have damaged text files (due to a lovely storage system). The
>> files are of different formats, although I can usually assume they contain
>> spaces. The files are structured as lines.
>>
>> Every once in a while, the lovely destruction (ahmstorage) system
>> inserts binary garbage to the file. I wish to fix the files by removing the
>> cancer without leaving any leftovers. That is, I want to lose partial lines.
>>
>> I tried using grep with all sorts of keys, but it did not do the trick.
>> strings catches too little - it leaves partial lines.
>> Is there an elegant  way to  do the trick line-wise?
>>
>> Thanks
>> Orna


-- 
Orna Agmon Ben-Yehuda.
http://ladypine.org


Extending R.E. Syntax (was: Re: Eliminating binary from a text file)

2015-07-20 Thread Omer Zak
Instead of creating a separate bgrep, it would have been better to be
able to extend the syntax of regular expressions (in egrep, Perl and
other platforms) to allow specification of binary strings having
arbitrary length by means of a hex string.

This would come instead of making it very cumbersome to specify strings
longer than one character (\xnn or \u or equivalent - see also:
http://www.regular-expressions.info/unicode.html).

And if we are at it, it would have been nice to add to all R.E. engines
hooks to allow private extensions of R.E. syntax, in order to allow
people to concisely express special parsing requirements.

--- Omer


On Mon, 2015-07-20 at 21:24 +0300, Shachar Shemesh wrote:
> On 20/07/15 11:56, Orna Agmon Ben-Yehuda wrote:
> 
> > Hello everyone, 
> > 
> > 
> > I often have damaged text files (due to a lovely storage system).
> > The files are of different formats, although I can usually assume
> > they contain spaces. The files are structured as lines.
> > 
> > 
> > Every once in a while, the lovely destruction (ahmstorage)
> > system inserts binary garbage to the file. I wish to fix the files
> > by removing the cancer without leaving any leftovers. That is, I
> > want to lose partial lines.
> > 
> > 
> > I tried using grep with all sorts of keys, but it did not do the
> > trick.
> > strings catches too little - it leaves partial lines.
> > Is there an elegant  way to  do the trick line-wise?
> > 
> > 
> > Thanks
> > Orna
> > 
> http://debugmo.de/2009/04/bgrep-a-binary-grep/
-- 
What happens if one mixes together evolution with time travel to the
past?  See: http://www.zak.co.il/a/stuff/opinions/eng/evol_tm.html
My own blog is at http://www.zak.co.il/tddpirate/

My opinions, as expressed in this E-mail message, are mine alone.
They do not represent the official policy of any organization with which
I may be affiliated in any way.
WARNING TO SPAMMERS:  at http://www.zak.co.il/spamwarning.html




Re: Eliminating binary from a text file

2015-07-20 Thread Shachar Shemesh
On 20/07/15 11:56, Orna Agmon Ben-Yehuda wrote:
> Hello everyone,
>
> I often have damaged text files (due to a lovely storage system). The
> files are of different formats, although I can usually assume they
> contain spaces. The files are structured as lines.
>
> Every once in a while, the lovely destruction (ahmstorage) system
> inserts binary garbage to the file. I wish to fix the files by
> removing the cancer without leaving any leftovers. That is, I want to
> lose partial lines.
>
> I tried using grep with all sorts of keys, but it did not do the trick.
> strings catches too little - it leaves partial lines.
> Is there an elegant  way to  do the trick line-wise?
>
> Thanks
> Orna
http://debugmo.de/2009/04/bgrep-a-binary-grep/

Shachar


Re: Eliminating binary from a text file

2015-07-20 Thread Rabin Yasharzadehe
Can you provide an example of bad lines and how you would like them to
look after you fix them?

--
Rabin

On Mon, Jul 20, 2015 at 11:56 AM, Orna Agmon Ben-Yehuda 
wrote:

> Hello everyone,
>
> I often have damaged text files (due to a lovely storage system). The
> files are of different formats, although I can usually assume they contain
> spaces. The files are structured as lines.
>
> Every once in a while, the lovely destruction (ahmstorage) system
> inserts binary garbage to the file. I wish to fix the files by removing the
> cancer without leaving any leftovers. That is, I want to lose partial lines.
>
> I tried using grep with all sorts of keys, but it did not do the trick.
> strings catches too little - it leaves partial lines.
> Is there an elegant  way to  do the trick line-wise?
>
> Thanks
> Orna


Re: Eliminating binary from a text file

2015-07-20 Thread Shlomi Fish
Hi Orna,

On Mon, Jul 20, 2015 at 11:56 AM, Orna Agmon Ben-Yehuda 
wrote:

> Hello everyone,
>
> I often have damaged text files (due to a lovely storage system). The
> files are of different formats, although I can usually assume they contain
> spaces. The files are structured as lines.
>
> Every once in a while, the lovely destruction (ahmstorage) system
> inserts binary garbage to the file. I wish to fix the files by removing the
> cancer without leaving any leftovers. That is, I want to lose partial lines.
>
> I tried using grep with all sorts of keys, but it did not do the trick.
> strings catches too little - it leaves partial lines.
> Is there an elegant  way to  do the trick line-wise?
>
>
It would help to know exactly which lines you wish to eliminate. Otherwise,
you can do various tasks like that using perl -lane (possibly together with
the -i flag). E.g. (untested):

$ export THRESH=5
$ perl -lan -E 'print unless ((() = /([\x80-\xFF])/g) > $ENV{THRESH})' \
    < existing-file.txt > new-file.txt

The "ruby" executable has similar flags (with Ruby's expression syntax,
naturally).
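
For readers who prefer Python, the one-liner's test can be sketched like this: count the bytes in the range 0x80-0xFF on each line and keep the line unless the count exceeds the threshold. The function name keep_line is hypothetical, and the default of 5 simply mirrors the THRESH value above.

```python
def keep_line(line: bytes, thresh: int = 5) -> bool:
    """Return True unless the line has more than `thresh` bytes in 0x80-0xFF."""
    # Iterating over a bytes object yields integers in the range 0-255.
    high_bytes = sum(1 for b in line if b >= 0x80)
    return high_bytes <= thresh
```

Note that this test does not catch NUL bytes, which fall below 0x80.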

Hope it helps.

Regards,

— Shlomi Fish


-- 
Chuck Norris helps the gods that help themselves.

Please reply to list if it's a mailing list post - http://shlom.in/reply .


Re: Eliminating binary from a text file

2015-07-20 Thread Dov Grobgeld
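Why not do it with a short Python script? Something like (not tested):


import os


def islineok(line):
    # Stub so the script runs; it keeps every line by default.
    return True


for dirpath, dirnames, filenames in os.walk('damagedfilesystem'):
    for fn in filenames:
        if fn.endswith('.txt') and not fn.endswith('-fixed.txt'):
            path = os.path.join(dirpath, fn)
            new_path = path[:-len('.txt')] + '-fixed.txt'
            # Binary mode, so garbage bytes cannot trigger decoding errors.
            with open(path, 'rb') as in_fh, open(new_path, 'wb') as out_fh:
                for line in in_fh:
                    if islineok(line):
                        out_fh.write(line)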


Just fill in islineok() with whatever logic you want.
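
One possible islineok(), as a sketch only: it rejects any line containing a NUL or another control byte below 0x20, apart from tab, line feed and carriage return. The choice of allowed bytes is an assumption; bytes of 0x80 and above are accepted because they may be legitimate non-ASCII text.

```python
# Control bytes that are normal in text files: tab, line feed, carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def islineok(line) -> bool:
    """Return True if the line has no NUL or other unexpected control bytes."""
    # Accept both bytes and str, so the check works however the file was opened.
    if isinstance(line, str):
        data = line.encode("utf-8", "surrogateescape")
    else:
        data = line
    return all(b >= 0x20 or b in ALLOWED_CONTROL for b in data)
```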

Regards,
Dov


On Mon, Jul 20, 2015 at 11:56 AM, Orna Agmon Ben-Yehuda 
wrote:

> Hello everyone,
>
> I often have damaged text files (due to a lovely storage system). The
> files are of different formats, although I can usually assume they contain
> spaces. The files are structured as lines.
>
> Every once in a while, the lovely destruction (ahmstorage) system
> inserts binary garbage to the file. I wish to fix the files by removing the
> cancer without leaving any leftovers. That is, I want to lose partial lines.
>
> I tried using grep with all sorts of keys, but it did not do the trick.
> strings catches too little - it leaves partial lines.
> Is there an elegant  way to  do the trick line-wise?
>
> Thanks
> Orna


Eliminating binary from a text file

2015-07-20 Thread Orna Agmon Ben-Yehuda
Hello everyone,

I often have damaged text files (due to a lovely storage system). The files
are of different formats, although I can usually assume they contain
spaces. The files are structured as lines.

Every once in a while, the lovely destruction (ahmstorage) system
inserts binary garbage to the file. I wish to fix the files by removing the
cancer without leaving any leftovers. That is, I want to lose partial lines.

I tried using grep with all sorts of keys, but it did not do the trick.
strings catches too little - it leaves partial lines.
Is there an elegant  way to  do the trick line-wise?

Thanks
Orna

-- 
Orna Agmon Ben-Yehuda.
http://ladypine.org