Re: Splitter quiz / survey

Sean Kelly Mon, 27 Apr 2009 15:40:13 -0700

== Quote from Andrei Alexandrescu (seewebsiteforem...@erdani.org)'s article
> Brad Roberts wrote:
> > Without looking at the docs, code, or compiling and running a test, what will
> > this do:
> >
> >     foreach (x; splitter(",a,b,", ","))
> >         writefln("x = %s", x);
> >
> > I'll make it multiple choice:
> >
> > choice 1)
> >   x = a
> >   x = b
> >
> > choice 2)
> >   x =
> >   x = a
> >   x = b
> >
> > choice 3)
> >   x = a
> >   x = b
> >   x =
> >
> > choice 4)
> >   x =
> >   x = a
> >   x = b
> >   x =
> >
> Thanks for bringing this to attention, Brad. Splitter does what Perl's
> split does: 2. This means comma is an item terminator and not an item
> separator.


Interesting.  It never occurred to me to think of the comma as a terminator.
I'm actually quite surprised by Perl's behavior.

> Why did I think this is a good idea? Because in most cases, I
> was thankful to Perl's split that it does exactly the right thing.
> Whenever I read text from linguistic corpora, I see that words (or other
> word properties) are separated by spaces. There is never a space before
> the first word on a line, but there is often a trailing space at the end
> of the line. Why? Because the text was processed by a program that
> output "word, ' '" or "tag, ' '" for each word or tag. Then if I split
> the text by whitespace, I'd be annoyed to see that trailing spaces do
> matter.

Only because the program that generated this text was doing something
unexpected though, right?

> For the same reason, C accepts enum X { a, b, } but not ,a ,b.
> Mechanically generating enum values is easier if each value has a
> trailing comma.

This has always seemed weird to me.  C doesn't accept a trailing comma
in function parameter lists.  I don't mind it accepting commas in enum
blocks mostly because leaving a trailing comma in multi-line blocks
can mean a smaller diff if I want to append new elements to the block
later, but it certainly isn't sufficient to justify the syntax IMO.

> Similarly, when you split a text by '\n', a leading empty line is
> important, whereas you wouldn't expect a final '\n' to introduce an
> empty line.

I very well may.  It really depends on the use.

> Now clearly there are cases in which leading or trailing empty items are
> both important. I'm just saying they are more rare.

I think there are two issues worth considering here.  First is semantics--
the term "split" clearly suggests a division between two things.  Second
is that it's easier to throw out null strings than to infer their existence
from a function that doesn't communicate this information.  A similar
issue was raised regarding readln preserving line terminators in the
strings it returns.
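
To make the second point concrete, here is a minimal sketch. It assumes a
splitter with separator semantics, that is, one that reports every empty
field (I believe this is what std.algorithm.splitter does in recent Phobos
releases, but treat that as an assumption). The helper names splitAll and
splitNonEmpty are hypothetical and exist only for illustration.

```d
import std.algorithm : filter, splitter;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper: keep every field, including empty ones.
// Assumes splitter treats the delimiter as a separator.
string[] splitAll(string text, string sep)
{
    return text.splitter(sep).array;
}

// Hypothetical helper: the caller discards empty fields when they are
// not wanted. This direction is always possible; the reverse is not.
string[] splitNonEmpty(string text, string sep)
{
    return text.splitter(sep).filter!(s => s.length > 0).array;
}

void main()
{
    writeln(splitAll(",a,b,", ","));      // all four fields under separator semantics
    writeln(splitNonEmpty(",a,b,", ",")); // ["a", "b"]
}
```

Going from the first result to the second is a one-line filter. Going the
other way is not possible, because the empty fields are already gone.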

As for rarity, CSV is an extremely popular format for tabulated text files,
and split seems like a natural fit for processing lines from such files.  I'd
think that processing such files would at least be a very common need
in the business sector.
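
As a rough illustration of why this matters for CSV, here is a sketch that
counts the columns in a row whose last two cells are empty. It is a naive
split with no handling of quoted fields, csvFields is a hypothetical helper,
and it again assumes a splitter with separator semantics.

```d
import std.algorithm : splitter;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper: naive CSV field split (no quoting or escaping).
// Assumes splitter keeps empty fields, including trailing ones.
string[] csvFields(string line)
{
    return line.splitter(",").array;
}

void main()
{
    // Four columns, the last two empty.
    auto row = csvFields("1,Smith,,");
    writeln(row.length); // 4 under separator semantics
}
```

With terminator semantics the same row would report fewer columns, and it
would no longer line up with the header row.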

> We could add an
> enumerated parameter to Splitter:
> enum PleaseFindAGoodName { terminator, separator }
> foreach (line; splitter(",a,b,", ","))
>      ... terminator is implicit ...
> foreach (line; splitter(",a,b,", ",", PleaseFindAGoodName.separator))
>      ... separator ...
> We might just go with the terminator semantics and ask people who need
> separator semantics to use a stripl() or a munch() prior to splitting.

Did you perhaps mean the reverse?  It would be easy enough to strip
trailing whitespace from your text files to get the behavior you expect,
but I don't see how this would help people who consider the trailing
token significant (the separator case).
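
For what it's worth, here is a sketch of how terminator behavior could be
layered on top of a separator-style split. The name splitTerminated is
hypothetical, and the code assumes a splitter that keeps empty fields. Note
that it drops only a single trailing empty field, whereas Perl's split drops
all of them.

```d
import std.algorithm : splitter;
import std.array : array;
import std.stdio : writeln;

// Hypothetical helper: terminator semantics built on a separator split.
// Assumes splitter reports the empty field after a trailing delimiter.
string[] splitTerminated(string text, string sep)
{
    auto fields = text.splitter(sep).array;
    // A trailing delimiter leaves one empty final field; drop just that one.
    if (fields.length > 0 && fields[$ - 1].length == 0)
        fields = fields[0 .. $ - 1];
    return fields;
}

void main()
{
    writeln(splitTerminated(",a,b,", ",")); // leading empty field kept, trailing one dropped
}
```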
