Vlad,

xsv has a fixlengths command that might help:

xsv fixlengths --help

Transforms CSV data so that all records have the same length. The length is
the length of the longest record in the data (not counting trailing empty fields,
but at least 1). Records with smaller lengths are padded with empty fields.

This requires two complete scans of the CSV data: one for determining the
record size and one for the actual transform. Because of this, the input
given must be a file and not stdin.

Alternatively, if --length is set, then all records are forced to that length.
This requires a single pass and can be done with stdin.

Usage:
    xsv fixlengths [options] [<input>]

fixlengths options:
    -l, --length <arg>     Forcefully set the length of each record. If a
                           record is not the size given, then it is truncated
                           or expanded as appropriate.

Common options:
    -h, --help             Display this message
    -o, --output <file>    Write output to <file> instead of stdout.
    -d, --delimiter <arg>  The field delimiter for reading CSV data.
                           Must be a single character. (default: ,)
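For anyone without xsv installed, the behaviour described in the help text can be approximated in a few lines of Python. This is only a rough sketch of what the help text describes, not xsv's actual implementation: the name `fix_lengths` is hypothetical, the `;` delimiter is assumed from Vlad's sample, and truncating longer records to the computed length in the two-pass mode is an assumption about how trailing empty fields are treated.

```python
import csv
import io


def used_length(row):
    """Number of fields in a record, not counting trailing empty fields."""
    n = len(row)
    while n > 0 and row[n - 1] == "":
        n -= 1
    return n


def fix_lengths(text, length=None, delimiter=";"):
    """Pad or truncate every record in `text` to the same number of fields."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if length is None:
        # First pass: longest record ignoring trailing empties, but at least 1.
        length = max([used_length(row) for row in rows] + [1])
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        # Second pass: pad short records with empty fields, cut long ones.
        writer.writerow((row + [""] * length)[:length])
    return out.getvalue()
```

Passing `length=36` would correspond to the `--length` variant for the 36-column sample file in this thread.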

HTH,

Jean Jourdain
On Friday, March 28, 2025 at 7:16:34 PM UTC+1 GP wrote:

> Your Pattern Playground results are perplexing. Using your first post's 
> example CSV data, the grep:
>
>
>
> \d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*
>
> matches every line except the first (column labels) line.
>
> To figure out what the problem might be on your system with your local
> language configuration, use either BBEdit's Pattern Playground or regex101
> and rebuild the grep pattern from scratch, left to right, one
> semicolon-delimited field pattern at a time. E.g., start with \d{3}; which
> should find/highlight 7 matches in each line of the example CSV data. Then
> add \w{3}; for a total grep of \d{3};\w{3}; which should highlight the
> leading 200;BAG; in each line of the example. Continue like that until the
> newly added field pattern no longer matches the left side of every line in
> the example data. Something in that line's (or those lines') field/column
> doesn't meet the matching criteria of the pattern part you just added.
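The same left-to-right rebuild can be automated outside BBEdit. The following is a minimal Python sketch, not anything BBEdit does itself: the names `FIELD_PATTERNS` and `first_failing_field` are hypothetical, the field patterns are copied from the sort pattern in this thread with the capture groups dropped, and the match is anchored at the start of the line (an assumption).

```python
import re

# One entry per semicolon-terminated field of the thread's pattern (35 in
# total; the final MESSAGE field has no trailing semicolon and is left out).
FIELD_PATTERNS = (
    [r"\d{3}", r"\w{3}", r"[^;]*", r"[^;]*", r"\d{10}",
     r"\w{2}", r"\d{2}", r"\d{5}"]
    + [r"[^;]*"] * 6
    + [r"\w{2}", r"\d{2}", r"\d{5}", r"[^;]*", r"\d{12}", r"[^;]*",
       r"[^;]*", r"\d{8}", r"[^;]*", r"\d{12}", r"[^;]*", r"[^;]*"]
    + [r"\d"] * 8
    + [r"[^;]*"]
)


def first_failing_field(line, patterns=FIELD_PATTERNS):
    """Return the 1-based position of the first field pattern that breaks
    the match when the pattern is rebuilt left to right, or None."""
    for n in range(1, len(patterns) + 1):
        prefix = "".join(p + ";" for p in patterns[:n])
        if re.match(prefix, line) is None:
            return n
    return None
```

For a record that was cut off after the house number, this would report field 13; for the column labels line it reports field 1.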
>
> Besides sorting, a working grep pattern has another use: with BBEdit's
> Text -> Process Lines Containing... you can find all lines that do NOT
> contain that grep pattern, which will help in finding malformed CSV data in
> the large CSV data files you're working with.
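For files too large to inspect comfortably, the "lines that do NOT contain the pattern" check can also be sketched in Python. This is an illustration under assumptions, not a description of how BBEdit works: `SORT_PATTERN` is the grep pattern from this thread split over three string literals, and `non_matching_lines` is a hypothetical helper name.

```python
import re

# The grep pattern quoted in this thread, split only for readability.
SORT_PATTERN = re.compile(
    r"\d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;"
    r"([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;"
    r"\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*"
)


def non_matching_lines(lines, pattern=SORT_PATTERN):
    """Yield (line_number, text) for every line the pattern is not found in."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if pattern.search(text) is None:
            yield number, text
```

Iterating over an open file object instead of a list keeps memory use low, which matters for files of several hundred MB.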
> On Friday, March 28, 2025 at 7:12:03 AM UTC-7 Vlad Ghitulescu wrote:
>
> Hey GP
>
>
> I corrected the error regarding „Specific sub-patterns:“, but this didn’t
> seem to change anything: ADRC_POST_CODE1 is still not sorted.
>
> [image: CleanShot 2025-03-28 at 10.02.07.png]
>
> The command also gave no recognizable sign that it had finished, so I’m not
> sure whether it also had problems with line 25816, where the CRLF follows
> a house number (see previous emails).
>
> However, BBEdit’s Pattern Playground shows no results when searching with
> the regex
>
> [image: CleanShot 2025-03-28 at 10.09.51.png]
>
> I’ll take the regex to regex101 (thanks for the hint!) and see if I can
> spot an error.
>
>
>
> Regards,
> Vlad
>
>
>
>
> On 26.03.2025 at 19:42, GP <[email protected]> wrote:
>
> First, in your Sort Lines dialog screenshot, you need to select the 
> "Specific sub-patterns:" option instead of "Entire match" in order for the 
> lines to be sorted by your column sorting criteria (MSGNO, ADRC_COUNTRY, 
> ADRC_REGION, ADRC_POST_CODE1, ADRC_CITY1, ADRC_CITY2, ADRC_STREET and 
> ADRC_HOUSE_NUM1). Since the sort lines grep pattern:
>
>
> \d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*
>
> will match every line in your example, using the "Entire match" option 
> devolves the sort into a simple whole line string sort which would put the 
> MSGNO (i.e. \8 in the example) column contents last instead of first in the 
> sort order. (See the "Sort Lines" section in Chapter 5 of the BBEdit User 
> Manual for details of using sub-pattern sort ordering.)
>
> With the "Entire match" option, if you look at lines 2 onward, the left
> part of each line is the same until you get to the part of the string with
> the ADRC_ADDRNUMBER characters, so the differences in that part of the
> string are what Sort Lines' "Entire match" uses to determine the ordering
> of the whole-line strings.
>
> Using the "Specific sub-patterns:" option is what allows you to specify
> which substring part(s) of a string/line, concatenated in what order, will
> be used to determine the sort ordering between whole strings/lines.
>
> To see what's going on with Sort Lines' "Specific sub-patterns:" option
> you can use BBEdit's Pattern Playground to see which concatenated
> substring of a line is being used to determine line sort ordering. For
> "Search pattern:" put:
>
>
> \d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*
>
> and for "Replace pattern" put:
>
> \8\1\2\3\4\5\6\7
>
> and for "Contents of" choose an open example file.
>
> As you step through each grep pattern match (using the Next button), the
> "Replacement text:" field will show you the concatenated string composed
> from the capture group substrings, in the order given, of the whole matched
> string/line. It is that "Replacement text:" string that Sort Lines uses for
> "Specific sub-patterns:" option sorting evaluation.
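The idea of sorting on a key assembled from capture groups can be mimicked in Python for experimentation. This is a sketch under assumptions and not BBEdit's implementation: `sub_pattern_key` and `sort_lines` are hypothetical names, Python compares strings by code point rather than by any locale-aware collation, and placing non-matching lines last in their original order is an arbitrary choice made here.

```python
import re

# The sort pattern quoted in this thread, split only for readability.
SORT_PATTERN = re.compile(
    r"\d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;"
    r"([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;"
    r"\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*"
)


def sub_pattern_key(line):
    """Build the equivalent of the \\8\\1\\2\\3\\4\\5\\6\\7 replacement text."""
    match = SORT_PATTERN.match(line)
    if match is None:
        # Assumption: lines the pattern does not match sort after all others.
        return (1, "")
    return (0, "".join(match.group(8, 1, 2, 3, 4, 5, 6, 7)))


def sort_lines(lines):
    """Sort lines by the concatenated capture groups (stable sort)."""
    return sorted(lines, key=sub_pattern_key)
```

Printing `sub_pattern_key(line)` for a few lines gives roughly what the "Replacement text:" field shows in Pattern Playground.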
>
> P.S. If an explanation of what the parts of a grep regular expression are
> specifying would help, https://regex101.com has a pretty good
> explanation panel that explains what each bit of a regular expression is
> doing.
> On Wednesday, March 26, 2025 at 6:24:57 AM UTC-7 Vlad Ghitulescu wrote:
>
> Hey GP
>
>
> And thanks for the suggestion!
>
> I tried the sort-solution before trying to understand the regex itself 😶
>
> I pasted into Text —> Sort Lines… like this
>
> [image: CleanShot 2025-03-26 at 08.24.24.png]
>
> but after Sort it doesn’t look like the postal code column was considered
>
> [image: CleanShot 2025-03-26 at 08.25.19.png]
>
> Did I miss something?
>
> Thanks again!
>
>
> Regards,
> Vlad
>
>
>
>
>
> On 25.03.2025 at 22:32, GP <[email protected]> wrote:
>
> As a follow up...
>
> BBEdit's Pattern Playground is a great help in constructing tedious grep 
> patterns like you'll need for your filtering and sorting needs. The really 
> tedious part is getting the field position(s) you want to filter or sort on 
> so you can modify that field's match pattern to conform to the desired 
> filter or sorting criteria.
>
> For example, for your "Filter all lines that have ADR_CHK_KZ = 1", using
> Text -> Process Lines Containing ... with the grep pattern:
>
>
>
> \d{3};\w{3};[^;]*;[^;]*;\d{10};\w{2};\d{2};\d{5};[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;(1);[^;]*;[^\n]*
>
> will do the trick. For filtering you don't need the group capturing on the 
> 1 but it is useful with Pattern Playground to verify you're getting the 
> right field position and field contents matched.
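If counting field positions by hand gets too tedious, a script can look the position up from the column labels line instead. This is a different technique from the grep pattern above, shown as a minimal Python sketch: `filter_rows` is a hypothetical name, and the sketch assumes a `;` delimiter, a header in the first line, and no quoted fields containing line breaks.

```python
import csv
import io


def filter_rows(text, column="ADR_CHK_KZ", value="1", delimiter=";"):
    """Return (header, rows) keeping only rows whose `column` equals `value`."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # The field position is looked up by name instead of being counted by hand.
    position = header.index(column)
    kept = [
        row for row in reader
        if len(row) > position and row[position] == value
    ]
    return header, kept
```

Rows that are too short to contain the column (for example records broken by a stray CRLF) are simply skipped here.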
>
> For your "Sort the file by MSGNO, ADRC_COUNTRY, ADRC_REGION, 
> ADRC_POST_CODE1, ADRC_CITY1, ADRC_CITY2, ADRC_STREET and ADRC_HOUSE_NUM1" 
> using Text -> Sort Lines ... with a grep pattern of:
>
>
> \d{3};\w{3};[^;]*;[^;]*;\d{10};(\w{2});(\d{2});(\d{5});([^;]*);[^;]*;([^;]*);([^;]*);([^;]*);[^;]*;\w{2};\d{2};\d{5};[^;]*;\d{12};[^;]*;[^;]*;\d{8};[^;]*;\d{12};[^;]*;[^;]*;\d;\d;\d;\d;\d;\d;\d;\d;([^;]*);[^\n]*
>
> with "Specific sub-patterns" selected and \8\1\2\3\4\5\6\7 in its fill-in
> field will sort your example text using your desired field ordering.
> On Tuesday, March 25, 2025 at 12:53:47 PM UTC-7 GP wrote:
>
> For filtering, look at Text -> Process Lines Containing ... and for
> sorting, Text -> Sort Lines ..., using grep patterns to identify what you
> want to match for filtering and which sub-pattern field or fields you want
> to sort on.
>
> If the number of fields in your sample is representative of the real CSV 
> files you're working with, it is going to be something of a pain in the 
> rear coming up with the grep patterns needed to accomplish the desired 
> filtering and sorting.
>
> On Tuesday, March 25, 2025 at 11:03:35 AM UTC-7 Vlad Ghitulescu wrote:
>
> Hey, 
>
>
> I use BBEdit very often while working with big CSV-files (300 - 500 MB, up 
> to 4 million rows) looking like this: 
>
> MANDT;BU;IDENTIFIER;OBJNR;ADRC_ADDRNUMBER;ADRC_COUNTRY;ADRC_REGION;ADRC_POST_CODE1;ADRC_CITY1;ADRC_CITY_EXT;ADRC_CITY2;ADRC_STREET;ADRC_HOUSE_NUM1;ADRC_HOUSE_NUM2;LOKAREF_COUNTRY;LOKAREF_REGION;LOKAREF_POST_CODE1;LOKAREF_CITY1;LOKAREF_CITY_CODE;LOKAREF_CITY_EXT;LOKAREF_CITY2;LOKAREF_CITYP_CODE;LOKAREF_STREET;LOKAREF_STRT_CODE;LOKAREF_HOUSE_NUM1;LOKAREF_HOUSE_NUM2;COUNTRY_KZ;REGION_KZ;POST_CODE1_KZ;CITY1_KZ;CITY_EXT_KZ;CITY2_KZ;STREET_KZ;ADR_CHK_KZ;MSGNO;MESSAGE
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723592;DE;09;86415;Mering;;Sankt Afra;Egerländer Straße;;;DE;09;86415;Mering;500000002795;, Schwab;Sankt Afra;00000006;Egerländerstraße;910011919800;;;0;0;0;0;1;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723657;DE;09;85655;Aying;;Kaps;Kaps;;;DE;09;85653;Aying;500000002262;;Kaps;00000010;Kaps;700055566100;;;0;0;1;0;3;0;0;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723658;DE;09;83083;Riedering;;Patting;Patting;;;DE;09;83083;Riedering;500000002552;b Rosenheim, Oberbay;Patting;00000037;Pattinger Straße;910003809300;;;0;0;0;0;1;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723674;DE;09;85655;Aying;;Großhelfendorf;Hirschbergstraße;;;DE;09;85653;Aying;500000002262;;Großhelfendorf;00000007;Hirschbergstraße;910002873200;;;0;0;1;0;3;0;0;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723878;DE;09;93336;Altmannstein;;Berghausen;Altmannsteiner Str.;;;DE;09;93336;Altmannstein;500000005266;;Berghausen;00000003;Altmannsteiner Straße;910001339100;;;0;0;0;0;3;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723908;DE;09;93336;Altmannstein;;Berghausen;Altmannsteiner Str.;;;DE;09;93336;Altmannstein;500000005266;;Berghausen;00000003;Altmannsteiner Straße;910001339100;;;0;0;0;0;3;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723918;DE;09;93336;Altmannstein;;Berghausen;Altmannsteiner Str.;;;DE;09;93336;Altmannstein;500000005266;;Berghausen;00000003;Altmannsteiner Straße;910001339100;;;0;0;0;0;3;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007723956;DE;09;93336;Altmannstein;;Berghausen;Altmannsteiner Str.;;;DE;09;93336;Altmannstein;500000005266;;Berghausen;00000003;Altmannsteiner Straße;910001339100;;;0;0;0;0;3;0;1;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007724554;DE;09;95131;Schwarzenbach a.Wald;;Schwarzenbach a Wald;Walter-Münch-Straße;;;DE;09;95131;Schwarzenbach a.Wald;500000011836;;Schwarzenbach a.Wald;00000001;Walter-Münch-Straße;910007835500;;;0;0;0;0;3;1;0;1;;
> 200;BAG;20250324080508_/ETN/PM_EAV_ADR_CHK_ADRC_V14157F;;0007724593;DE;09;95131;Schwarzenbach a.Wald;;Schwarzenbach a Wald;Walter-Münch-Straße;;;DE;09;95131;Schwarzenbach a.Wald;500000011836;;Schwarzenbach a.Wald;00000001;Walter-Münch-Straße;910007835500;;;0;0;0;0;3;1;0;1;;
>
> Once in a while I’d like to filter or sort such huge files by one or more 
> columns, like: 
>
> 1. Filter all lines that have ADR_CHK_KZ = 1 or 
> 2. Sort the file by MSGNO, ADRC_COUNTRY, ADRC_REGION, ADRC_POST_CODE1, 
> ADRC_CITY1, ADRC_CITY2, ADRC_STREET and ADRC_HOUSE_NUM1. 
>
> Is there a way to do this sort of task with BBEdit?
>
> Thanks! 
>
>
> Regards, 
> Vlad 
>
>
>
>
> -- 
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or believe that the application isn't working correctly, please 
> email "[email protected]" rather than posting here. Follow @bbedit on 
> Mastodon: <https://mastodon.social/@bbedit>
> --- 
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
>
> To view this discussion visit 
> https://groups.google.com/d/msgid/bbedit/50130484-14eb-4298-b762-800f88b2c66en%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/bbedit/50130484-14eb-4298-b762-800f88b2c66en%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>
>
