Re: [racket-dev] `string-split'
An hour and a half ago, Ryan Culpepper wrote:
> Instead of trying to design a 'string-split' that is both
> miraculously intuitive and profoundly flexible, why not design it
> like a Model-T

Invalid analogy: the issue is not flexibility, it's making something
that is simple (first) and useful (second) in most cases.

An hour and a half ago, Michael W wrote:
> (TL;DR: I'd suggest two functions: one (string-words str) function
> that does Eli's way, and one (string-split str sep) that does it
> Laurent's way).

I don't think that we argued on what it should do, rather it looks
like we're both looking for whatever option looks best...

> > -> (string-split " st  ring")
> > '("" "st" "" "ring")
> >
> > which is why I think that the above is a better definition in terms of
> > newbie-ness.
>
> No, every other language I've worked with does that.
> [...]

The examples you're quoting are the equivalents of our `regexp-split',
which works in a similar way and is not going to change. We're talking
about some watered-down version that is easier to use.

Just now, Laurent wrote:
> (TL;DR: I'd suggest two functions: one (string-words str)
> function that does Eli's way, and one (string-split str sep)
> that does it Laurent's way).
>
> That would be a good option to me, considering that "my way" is with
> remaining ""s in the output list. The question remains if a string
> can be accepted for sep, in which case the empty string must be
> considered, as pointed out in the Lua discussion. Though a single
> char should be sufficient for nearly all simple cases.

I think that I have a good conclusion here, I'll post on a new thread.

--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
_
Racket Developers list: http://lists.racket-lang.org/dev
Re: [racket-dev] `string-split'
> (TL;DR: I'd suggest two functions: one (string-words str)
> function that does Eli's way, and one (string-split str sep) that
> does it Laurent's way).

That would be a good option to me, considering that "my way" is with
remaining ""s in the output list.
The question remains if a string can be accepted for sep, in which case
the empty string must be considered, as pointed out in the Lua discussion.
Though a single char should be sufficient for nearly all simple cases.

Laurent
Re: [racket-dev] `string-split'
(TL;DR: I'd suggest two functions: one (string-words str) function that
does Eli's way, and one (string-split str sep) that does it Laurent's
way).

50 minutes ago, Eli Barzilay wrote:
> That doesn't seem right -- with this you get
>
>   -> (string-split " st  ring")
>   '("" "st" "" "ring")
>
> which is why I think that the above is a better definition in terms of
> newbie-ness.

No, every other language I've worked with does that.

$ python
Python 3.2.2 (default, Nov 21 2011, 16:51:01)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> " st  ring".split(" ")
['', 'st', '', 'ring']

$ node
> " st  ring".split(" ")
[ '', 'st', '', 'ring' ]

$ php -a
php > var_dump(split(" ", " str  ing"));
array(4) {
  [0]=>
  string(0) ""
  [1]=>
  string(3) "str"
  [2]=>
  string(0) ""
  [3]=>
  string(3) "ing"
}

Haskell uses two functions; one which eliminates contiguous runs and
one which doesn't (and comes from an entire external library, sheesh!
though it's easy to write your own):

$ ghci
Prelude> words " str  ing"
["str","ing"]
Prelude> Data.List.Split.splitOn " " " str  ing"
["","str","","ing"]

Ruby has the weirdest behavior, which I consider to be a bug:

$ irb
irb(main):001:0> " st  ring".split(" ")
=> ["st", "ring"]
irb(main):002:0> " st  ring".split(/ /)
=> ["", "st", "", "ring"]

The ruby docs say:
http://www.ruby-doc.org/core-1.9.3/String.html

  If pattern is a String, then its contents are used as the delimiter
  when splitting str. If pattern is a single space, str is split on
  whitespace, with leading whitespace and runs of contiguous
  whitespace characters ignored.

  If pattern is a Regexp, str is divided where the pattern matches.
  Whenever the pattern matches a zero-length string, str is split
  into individual characters. If pattern contains groups, the
  respective matches will be returned in the array as well.

In looking for Lua (which doesn't include one, by the way), I found
http://lua-users.org/wiki/SplitJoin which has a big summary of the
issues.
--
For the Future!
_mike
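The two-function suggestion in the TL;DR can be sketched in Racket roughly as follows. This is only an illustration: `string-words' is the name from the TL;DR, `split-on' is a hypothetical stand-in for the proposed two-argument `string-split', and neither definition is claimed to be what `racket/string' provides.

```racket
#lang racket/base

;; Sketch only: `string-words' follows the TL;DR, `split-on' is a
;; hypothetical name for the proposed two-argument `string-split'.

;; "Eli's way": return the runs of non-whitespace, never any "".
(define (string-words str)
  (regexp-match* #px"\\S+" str))

;; "Laurent's way": split on a literal separator and keep empty
;; strings, like Python's or JavaScript's split(" ").
(define (split-on str sep)
  (regexp-split (regexp-quote sep) str))

(string-words " st  ring")   ; => '("st" "ring")
(split-on " st  ring" " ")   ; => '("" "st" "" "ring")
```

Keeping the two behaviours in separate functions means neither one needs a flag to switch between dropping and keeping the empty strings.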
Re: [racket-dev] `string-split'
Instead of trying to design a 'string-split' that is both miraculously
intuitive and profoundly flexible, why not design it like a Model-T and
then write a guide/cookbook for how to use regexps to do all of the
common cases that the extremely limited 'string-split' doesn't handle?

I suspect that writing such a guide will expose a few cases where
common patterns can be turned into functions (similar to
'regexp-replace-quote').

Ryan

On 04/19/2012 07:27 AM, Eli Barzilay wrote:
> Just now, Laurent wrote:
> >     1. Laurent: Does this make more sense?
> >
> > Yes, this definitely makes more sense to me. It would then treat
> > (string-split "aXXby" "X") just like the " " case.
> >
> > Although if you want to find the columns of a latex line like
> > "x && y & z" you will have the wrong result. Maybe use an optional
> > argument to remove the empty strings? (not sure)
>
> (This complicates things...)
>
> First, I don't think that there's a need to make it able to do stuff
> like that -- either you go with regexps, or you use combinations like
>
>   (map string-trim (string-split "x && y & z" "&"))
>
> >     4. Related to Q3: what does "xy" as that argument mean exactly?
> >       a. #rx"[xy]"
> >       b. #rx"[xy]+"
> >       c. #rx"xy"
> >       d. #rx"(?:xy)+"
> >
> > Good question. d. would be the simplest case for newbies, but
> > b. might be more useful. I think several other languages avoid this
> > issue by using only one character as the separator.
>
> The complication is that with " " or " \t" it seems that you'd want b,
> and with "&" you'd want c. (Maybe even make "&" equivalent to
> #rx" *& *" -- that looks like it's too much guessing.)
>
> And you're also making a point for:
>
>   e. Throw an error, must be a single-character string.
>
> BTW, this question is important because it affects other functions, so
> I'd like to resolve it before doing anything.
Re: [racket-dev] `string-split'
> > 4. Related to Q3: what does "xy" as that argument mean exactly?
> >   a. #rx"[xy]"
> >   b. #rx"[xy]+"
> >   c. #rx"xy"
> >   d. #rx"(?:xy)+"
> >
> > Good question. d. would be the simplest case for newbies, but
> > b. might be more useful. I think several other languages avoid this
> > issue by using only one character as the separator.
>
> The complication is that with " " or " \t" it seems that you'd want b,
> and with "&" you'd want c. (Maybe even make "&" equivalent to
> #rx" *& *" -- that looks like it's too much guessing.)
>
> And you're also making a point for:
>
>   e. Throw an error, must be a single-character string.
>
> BTW, this question is important because it affects other functions, so
> I'd like to resolve it before doing anything.

If we make things as simple-but-useful as possible, then I'd go for a
single char separator with option b/d. (I don't think there are many
cases where one would want a string as a separator?)

Personally, I don't much like functions that ask for a character,
because the #\ looks ugly to me, but it still makes more sense than
asking for a string that must have a single character.

Laurent
Re: [racket-dev] `string-split'
Just now, Laurent wrote:
>     1. Laurent: Does this make more sense?
>
> Yes, this definitely makes more sense to me. It would then treat
> (string-split "aXXby" "X") just like the " " case.
>
> Although if you want to find the columns of a latex line like
> "x && y & z" you will have the wrong result. Maybe use an optional
> argument to remove the empty strings? (not sure)

(This complicates things...)

First, I don't think that there's a need to make it able to do stuff
like that -- either you go with regexps, or you use combinations like

  (map string-trim (string-split "x && y & z" "&"))

>     4. Related to Q3: what does "xy" as that argument mean exactly?
>       a. #rx"[xy]"
>       b. #rx"[xy]+"
>       c. #rx"xy"
>       d. #rx"(?:xy)+"
>
> Good question. d. would be the simplest case for newbies, but
> b. might be more useful. I think several other languages avoid this
> issue by using only one character as the separator.

The complication is that with " " or " \t" it seems that you'd want b,
and with "&" you'd want c. (Maybe even make "&" equivalent to
#rx" *& *" -- that looks like it's too much guessing.)

And you're also making a point for:

  e. Throw an error, must be a single-character string.

BTW, this question is important because it affects other functions, so
I'd like to resolve it before doing anything.

--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
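A small sketch of the "combination" approach for the LaTeX row, assuming a splitter that keeps empty strings. `split-on' is a hypothetical helper standing in for the two-argument `string-split' under discussion; `string-trim' is taken from `racket/string'.

```racket
#lang racket/base
(require racket/string) ; for string-trim

;; Hypothetical helper: split on a literal separator, keeping empty fields.
(define (split-on str sep)
  (regexp-split (regexp-quote sep) str))

;; Columns of a LaTeX table row; the empty second column survives.
(define columns
  (map string-trim (split-on "x && y & z" "&")))

columns ; => '("x" "" "y" "z")
```

The combination only recovers the empty column if the splitter itself does not drop empty strings, which is the trade-off being debated here.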
Re: [racket-dev] `string-split'
> > 4. Related to Q3: what does "xy" as that argument mean exactly?
> >   a. #rx"[xy]"
> >   b. #rx"[xy]+"
> >   c. #rx"xy"
> >   d. #rx"(?:xy)+"
>
> Good question. d. would be the simplest case for newbies, but b. might be
> more useful.

It would make more sense that a string really is a string, not a set of
characters.
Without going as far as srfi-14, a set could be a list of strings or
characters, but maybe this is not needed.

Laurent
Re: [racket-dev] `string-split'
> Continuing with this line, it seems that a better definition is as
> follows:
>
>   (define (string-split str [sep " "])
>     (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))
>
> Except that the full definition could be a bit more efficient.
>
> Three questions:
>
> 1. Laurent: Does this make more sense?

Yes, this definitely makes more sense to me.
It would then treat (string-split "aXXby" "X") just like the " " case.

Although if you want to find the columns of a latex line like "x && y & z"
you will have the wrong result.
Maybe use an optional argument to remove the empty strings? (not sure)

> 2. Matthew: Is there any reason to make the #f-as-default part of the
>    interface? (Even with the new reply I don't see a necessity for
>    this -- if the target is newbies, then I think that keeping it as a
>    string is simpler...)

There is probably no need for #f with the new spec.

> 4. Related to Q3: what does "xy" as that argument mean exactly?
>   a. #rx"[xy]"
>   b. #rx"[xy]+"
>   c. #rx"xy"
>   d. #rx"(?:xy)+"

Good question. d. would be the simplest case for newbies, but b. might be
more useful.
I think several other languages avoid this issue by using only one character
as the separator.

Laurent
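To make the "wrong result" for the LaTeX row concrete, here is the quoted definition run on the examples from this message. It is renamed to the hypothetical `split-dropping-empties' so that it cannot clash with any library binding; the body is otherwise the one quoted above.

```racket
#lang racket/base

;; The quoted definition under a hypothetical name.
(define (split-dropping-empties str [sep " "])
  (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

(split-dropping-empties "aXXby" "X")      ; => '("a" "by")
(split-dropping-empties " st  ring")      ; => '("st" "ring")
;; The LaTeX row loses its empty second column:
(split-dropping-empties "x && y & z" "&") ; => '("x " " y " " z")
```

Because every "" is removed, a repeated separator behaves like a single one, which is convenient for spaces and lossy for column data.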
Re: [racket-dev] `string-split'
[Meta-note: I'm not flatly objecting to these, just trying to clarify
the exact behavior and the possible effects on other functions.]

10 minutes ago, Laurent wrote:
> >   (define (string-split str [sep #px"\\s+"])
> >     (remove* '("") (regexp-split sep str)))
>
> Nearly, I meant something more like this:
>
> (define (string-split str [splitter " "])
>   (regexp-split (regexp-quote splitter) str))
>
> No regexp from the user POV, and much easier to use with little
> knowledge.

That doesn't seem right -- with this you get

  -> (string-split " st  ring")
  '("" "st" "" "ring")

which is why I think that the above is a better definition in terms of
newbie-ness.

10 minutes ago, Matthew Flatt wrote:
> I agree with this: we should add `string-split', the one-argument case
> should be as Eli wrote, and the two-argument case should be as Laurent
> wrote. (Probably the optional second argument should be string-or-#f,
> where #f means to use #px"\\s+".)

Continuing with this line, it seems that a better definition is as
follows:

  (define (string-split str [sep " "])
    (remove* '("") (regexp-split (regexp-quote (or sep " ")) str)))

Except that the full definition could be a bit more efficient.

Three questions:

1. Laurent: Does this make more sense?

2. Matthew: Is there any reason to make the #f-as-default part of the
   interface? (Even with the new reply I don't see a necessity for
   this -- if the target is newbies, then I think that keeping it as a
   string is simpler...)

3. There's also the point of how this optional argument plays with
   other functions in `racket/string'. If it works as above, then
   `string-trim' and `string-normalize-spaces' should change
   accordingly so they take the same kind of simplified "regexp" as
   input.

4. Related to Q3: what does "xy" as that argument mean exactly?
   a. #rx"[xy]"
   b. #rx"[xy]+"
   c. #rx"xy"
   d. #rx"(?:xy)+"

--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
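The four readings of "xy" in Q4 can be compared by running `regexp-split' with each candidate regexp on one input. The sample string below is an arbitrary choice made for this illustration, picked so that all four options give different results.

```racket
#lang racket/base

;; One sample input, split with each candidate reading of the separator "xy".
(define sample "axyxybyxc")

(define option-a (regexp-split #rx"[xy]" sample))     ; any one of the chars
(define option-b (regexp-split #rx"[xy]+" sample))    ; runs of those chars
(define option-c (regexp-split #rx"xy" sample))       ; the literal string
(define option-d (regexp-split #rx"(?:xy)+" sample))  ; runs of the literal string

option-a ; => '("a" "" "" "" "b" "" "c")
option-b ; => '("a" "b" "c")
option-c ; => '("a" "" "byxc")
option-d ; => '("a" "byxc")
```

Options a and b treat the string as a set of characters, while c and d treat it as a literal string; within each pair, the `+' only decides whether adjacent separators produce empty strings.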
Re: [racket-dev] `string-split'
A few minutes ago, Laurent wrote:
>
> Then instead of #f one idea is to go one step further and consider
> different useful cases based on input symbols like 'whitespaces,
> 'non-alpha, etc. ? Or even a list of string/symbols that can be used
> as a splitter. That would make a more powerful function for
> sure. (It's just that I'm troubled by the uniqueness of this magical
> default argument)

(This is something that I do object to... It leads to srfi-14 which
is one overkill way for that, and we already have regexps that do
that. So I think that "simple" is a major point.)

--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!
Re: [racket-dev] `string-split'
On Thu, Apr 19, 2012 at 14:53, Matthew Flatt wrote:
> At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
> > On Thu, Apr 19, 2012 at 14:33, Matthew Flatt wrote:
> >
> > > I agree with this: we should add `string-split', the one-argument case
> > > should be as Eli wrote,
> >
> > About this I'm not sure, as one cannot reproduce this behavior by
> > providing an argument (or it could make the difference between
> > string-as-not-regexps and regexps? Wouldn't this be different from
> > other places?).
>
> I'm suggesting that supplying `#f' as the argument would be the same as
> not supplying the argument.
>
> It is a special case, though. I don't mind the specialness here,
> because I see the job of `string-split' as making a couple of useful
> special cases easy (as opposed to the generality of `regexp-split').

Then instead of #f one idea is to go one step further and consider
different useful cases based on input symbols like 'whitespaces,
'non-alpha, etc. ?
Or even a list of string/symbols that can be used as a splitter.
That would make a more powerful function for sure.
(It's just that I'm troubled by the uniqueness of this magical default
argument)

Laurent

> > It would then appear somewhat magical. To me the " " default splitter
> > seems more intuitive.
> >
> > Laurent
> >
> > > and the two-argument case should be as Laurent
> > > wrote. (Probably the optional second argument should be string-or-#f,
> > > where #f means to use #px"\\s+".)
> > >
> > > At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > > > > (define (string-split str [sep #px"\\s+"])
> > > > >   (remove* '("") (regexp-split sep str)))
> > > >
> > > > Nearly, I meant something more like this:
> > > >
> > > > (define (string-split str [splitter " "])
> > > >   (regexp-split (regexp-quote splitter) str))
> > > >
> > > > No regexp from the user POV, and much easier to use with little
> > > > knowledge.
Re: [racket-dev] `string-split'
At Thu, 19 Apr 2012 14:43:44 +0200, Laurent wrote:
> On Thu, Apr 19, 2012 at 14:33, Matthew Flatt wrote:
>
> > I agree with this: we should add `string-split', the one-argument case
> > should be as Eli wrote,
>
> About this I'm not sure, as one cannot reproduce this behavior by providing
> an argument (or it could make the difference between string-as-not-regexps
> and regexps? Wouldn't this be different from other places?).

I'm suggesting that supplying `#f' as the argument would be the same as
not supplying the argument.

It is a special case, though. I don't mind the specialness here,
because I see the job of `string-split' as making a couple of useful
special cases easy (as opposed to the generality of `regexp-split').

> It would then appear somewhat magical. To me the " " default splitter seems
> more intuitive.
>
> Laurent
>
> > and the two-argument case should be as Laurent
> > wrote. (Probably the optional second argument should be string-or-#f,
> > where #f means to use #px"\\s+".)
> >
> > At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > > > (define (string-split str [sep #px"\\s+"])
> > > >   (remove* '("") (regexp-split sep str)))
> > >
> > > Nearly, I meant something more like this:
> > >
> > > (define (string-split str [splitter " "])
> > >   (regexp-split (regexp-quote splitter) str))
> > >
> > > No regexp from the user POV, and much easier to use with little
> > > knowledge.
Re: [racket-dev] `string-split'
On Thu, Apr 19, 2012 at 14:33, Matthew Flatt wrote:
> I agree with this: we should add `string-split', the one-argument case
> should be as Eli wrote,

About this I'm not sure, as one cannot reproduce this behavior by providing
an argument (or it could make the difference between string-as-not-regexps
and regexps? Wouldn't this be different from other places?).
It would then appear somewhat magical. To me the " " default splitter seems
more intuitive.

Laurent

> and the two-argument case should be as Laurent
> wrote. (Probably the optional second argument should be string-or-#f,
> where #f means to use #px"\\s+".)
>
> At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > > (define (string-split str [sep #px"\\s+"])
> > >   (remove* '("") (regexp-split sep str)))
> >
> > Nearly, I meant something more like this:
> >
> > (define (string-split str [splitter " "])
> >   (regexp-split (regexp-quote splitter) str))
> >
> > No regexp from the user POV, and much easier to use with little
> > knowledge.
Re: [racket-dev] `string-split'
I think Laurent pointed out in his initial message that beginners may be
intimidated by regexps. I agree. Plus someone who isn't fluent with
regexps may be more comfortable with string-split. Last but not least, a
program documents itself more clearly with string-split than with a
regexp.

On Apr 19, 2012, at 8:21 AM, Eli Barzilay wrote:
> [Changed title to talk about each one separately.]
>
> Two hours ago, Laurent wrote:
>> One string function that I often find useful in various scripting
>> languages is a `string-split' (explode in php). It can be done with
>> `regexp-split', but having something more along the lines of a
>> `string-split' should belong to a racket/string lib I think. Plus
>> it would be symmetric with `string-join', which already is in
>> racket/string (or at least a doc line pointing to regexp-split
>> should be added there).
>
> If you mean something like this:
>
>   (define (string-split str) (regexp-match* #px"\\S+" str))
>
> ?
>
> If so, then I see a much weaker point for it -- unlike other small
> utilities, this one doesn't even compose two function calls.
>
> The very weak point here is if you want a default argument that
> specifies the gaps to split on rather than the words:
>
>   (define (string-split str [sep #px"\\s+"])
>     (remove* '("") (regexp-split sep str)))
>
> but that *does* use regexps, so I don't see the point, still...
>
> --
>           ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
>                     http://barzilay.org/                   Maze is Life!
Re: [racket-dev] `string-split'
I agree with this: we should add `string-split', the one-argument case
should be as Eli wrote, and the two-argument case should be as Laurent
wrote. (Probably the optional second argument should be string-or-#f,
where #f means to use #px"\\s+".)

At Thu, 19 Apr 2012 14:30:31 +0200, Laurent wrote:
> > (define (string-split str [sep #px"\\s+"])
> >   (remove* '("") (regexp-split sep str)))
>
> Nearly, I meant something more like this:
>
> (define (string-split str [splitter " "])
>   (regexp-split (regexp-quote splitter) str))
>
> No regexp from the user POV, and much easier to use with little knowledge.
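One possible reading of the string-or-#f interface, as a sketch only. `split/default' is a hypothetical name used for illustration, and the exact behaviour of each branch is an assumption based on the two definitions quoted in this thread.

```racket
#lang racket/base

;; Sketch of the string-or-#f interface; `split/default' is a
;; hypothetical name, not an existing library function.
(define (split/default str [sep #f])
  (if sep
      ;; Two-argument case, as Laurent wrote: literal separator, keep "".
      (regexp-split (regexp-quote sep) str)
      ;; One-argument (or #f) case, as Eli wrote: whitespace runs, no "".
      (remove* '("") (regexp-split #px"\\s+" str))))

(split/default " st  ring")     ; => '("st" "ring")
(split/default " st  ring" #f)  ; => '("st" "ring")
(split/default " st  ring" " ") ; => '("" "st" "" "ring")
```

Under this reading the default cannot be reproduced by passing any string, which is the "special case" the rest of the thread goes on to discuss.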
Re: [racket-dev] `string-split'
> (define (string-split str [sep #px"\\s+"])
>   (remove* '("") (regexp-split sep str)))

Nearly, I meant something more like this:

(define (string-split str [splitter " "])
  (regexp-split (regexp-quote splitter) str))

No regexp from the user POV, and much easier to use with little knowledge.
Re: [racket-dev] `string-split'
On Thu, Apr 19, 2012 at 8:21 AM, Eli Barzilay wrote:
>
> Two hours ago, Laurent wrote:
>> One string function that I often find useful in various scripting
>> languages is a `string-split' (explode in php). It can be done with
>> `regexp-split', but having something more along the lines of a
>> `string-split' should belong to a racket/string lib I think. Plus
>> it would be symmetric with `string-join', which already is in
>> racket/string (or at least a doc line pointing to regexp-split
>> should be added there).
>
> If you mean something like this:
>
>   (define (string-split str) (regexp-match* #px"\\S+" str))
>
> ?
>
> If so, then I see a much weaker point for it -- unlike other small
> utilities, this one doesn't even compose two function calls.

It composes one function call (with an extremely complex API) with one
domain-specific language (that lots of people don't
know/understand/use) into one extremely simple but useful function.

> The very weak point here is if you want a default argument that
> specifies the gaps to split on rather than the words:
>
>   (define (string-split str [sep #px"\\s+"])
>     (remove* '("") (regexp-split sep str)))
>
> but that *does* use regexps, so I don't see the point, still...

Note that (string-split str ";") works given that implementation,
which I think makes it both easy-to-understand and useful.

--
sam th
sa...@ccs.neu.edu
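A quick check of the ";" observation, using the second quoted definition under the hypothetical name `split-by-pattern'. It also shows the flip side: a string passed to `regexp-split' is still read as a regexp, so separators containing metacharacters need `regexp-quote'.

```racket
#lang racket/base

;; The second definition quoted above, renamed (hypothetically) so it
;; cannot clash with any library binding.
(define (split-by-pattern str [sep #px"\\s+"])
  (remove* '("") (regexp-split sep str)))

(split-by-pattern "a;b;;c" ";") ; => '("a" "b" "c")

;; A string separator is still read as a regexp, so "." matches every
;; character rather than a literal dot:
(split-by-pattern "a.b.c" ".")  ; => '()
(split-by-pattern "a.b.c" (regexp-quote ".")) ; => '("a" "b" "c")
```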
Re: [racket-dev] `string-split'
[Changed title to talk about each one separately.]

Two hours ago, Laurent wrote:
> One string function that I often find useful in various scripting
> languages is a `string-split' (explode in php). It can be done with
> `regexp-split', but having something more along the lines of a
> `string-split' should belong to a racket/string lib I think. Plus
> it would be symmetric with `string-join', which already is in
> racket/string (or at least a doc line pointing to regexp-split
> should be added there).

If you mean something like this:

  (define (string-split str) (regexp-match* #px"\\S+" str))

?

If so, then I see a much weaker point for it -- unlike other small
utilities, this one doesn't even compose two function calls.

The very weak point here is if you want a default argument that
specifies the gaps to split on rather than the words:

  (define (string-split str [sep #px"\\s+"])
    (remove* '("") (regexp-split sep str)))

but that *does* use regexps, so I don't see the point, still...

--
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                    http://barzilay.org/                   Maze is Life!