On Monday, 30 December 2013 at 22:38:43 UTC, Brad Anderson wrote:
On Monday, 30 December 2013 at 22:30:02 UTC, John Colvin wrote:
On Monday, 30 December 2013 at 22:17:21 UTC, John Colvin wrote:
On Monday, 30 December 2013 at 21:40:58 UTC, Thomas Gann
wrote:
I've written a Markov bot in D, and I have function whose
job it is to take an input string, convert all newline
characters to spaces and all uppercase letters to lowercase,
and then return an array of words that are generated by
splitting the string up by whitespace. Here is the function
is question:
string[] split_sentence(string input)
{
string line;
foreach(c; input)
{
if(c == '\n' || c == '\r')
line ~= ' ';
else
line ~= c.toLower();
}
return line.splitter(' ').filter!(a => a.length).array;
}
Obviously, one issue is that because the string is
immutable, I can't modify it directly, and so I actually
build an entirely new string in place. I would have just
made a mutable duplicate of the input and modify that, but
then I would get errors returning, because it expects
string[] and not char[][]. Is there a more elegant way to do
what I'm doing?
A few points:
by declaring a new string and appending to it you are risking
a lot of allocations. Either use std.array.appender or
allocate the array with the correct size at the beginning.
using .array on the end of the ufcs chain is yet another
allocation. It can be avoided using std.algorithm.copy to
copy the result back in to 'line'
In my opinion the whole API would be better as range-based:
auto splitSentence(R)(R input)
if(isInputRange!R)
{
return input
.map!(c => (c == "\n"[0] || c == "\r"[0]) ? ' ' :
c.toLower)
.splitter!(' ')
.filter!(a => !(a.empty));
}
sorry, ignore that attempt, it's woefully broken...
Re: weird literal syntax, you didn't happen to be using dpaste
to test and have trouble with character literals, did you?
Because I did and thought I was going insane until realized
DPaste was broken.
oohhh so that was what that was. Anyway, here's what I have so
far:
import std.range : isInputRange;
auto splitSentence(R)(R input)
if(isInputRange!R)
{
import std.algorithm : map, splitter, filter;
import std.uni : toLower;
import std.range : ElementType;
dchar preProc(ElementType!R c)
{
return (c == '\n' || c == '\r') ? ' ' : c.toLower;
}
return input
.map!preProc
.splitter!(c => c == ' ')
.filter!(a => !(a.empty));
}
I have to have the function instead of a lamda in order to get
the return type correct. I guess I could cast or use std.conv.to
instead.
Anyway, the big problem I've hit is that AFAICT std.algorithm
makes a complete mess of unicode and i can't find a byCodeUnit
range anywhere in order to make it correct.