Re: GSOC Idea.

2016-03-19 Thread Russel Winder via Digitalmars-d
On Wed, 2016-03-02 at 20:21 +, Marco via Digitalmars-d wrote:
> Hi, I am Marco, a CS student at UCL. This is both a presentation 
> post and also post where I ask some suggestions about an idea for 
> a GSOC project.
[…]

Students from UCL definitely worth supporting. :-)

Did you have more of a think about what you might do in the D arena?

-- 
Russel.
=
Dr Russel Winder  t: +44 20 7585 2200   voip: sip:russel.win...@ekiga.net
41 Buckmaster Roadm: +44 7770 465 077   xmpp: rus...@winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder



signature.asc
Description: This is a digitally signed message part


Re: GSOC Idea.

2016-03-04 Thread Johannes Pfau via Digitalmars-d
Am Thu, 03 Mar 2016 14:09:33 +
schrieb Marco :

> On Thursday, 3 March 2016 at 09:29:38 UTC, Johannes Pfau wrote:
> > Am Thu, 3 Mar 2016 16:14:24 +1300
> > schrieb Rikki Cattermole :
> >
> > [...]  
> 
> > The arduino Due uses a Cortex M3, so arduino could mean ARM or 
> > AVR.
> > GDC can target AVR devices with minimal changes. The main 
> > problem
> > is that you really can't use druntime / typeinfo on such small 
> > devices
> > and that some features (e.g. simply using structs) require
> > TypeInfo. I have some outdated commits here to disable the 
> > TypeInfo
> > stuff:
> > https://github.com/D-Programming-microD/GDC/commits/microD
> >
> > [...]  
> 
> Thank you for your detailed answers. Now I do understand that 
> working with D and arduino maybe be out of my capability. I may 
> now decide to work on porting a parser generator library to 
> phobos. I do know that pegged is a parser generator in D. Can 
> this new project be easier? I am also quite interested in 
> parsing, so that would be a very good experience for me.
> 
> Sorry for being such a newbie, I hope you understand my 
> willingness to learn.

Sorry if my answer scared you away from that task :-) I think it's not
necessarily difficult, but as there's no finished/polished
D implementation for embedded devices yet it's pioneering work.
And as it affects many areas in the D ecosystem (compiler,druntime,dub)
and some low-level programming it's much easier for someone
already familiar with low-level D implementation details and/or
embedded programming.

I don't know anything about parser generators, so somebody else will
have to answer these questions. You should probably start a new thread
with an updated title for that though, to get some more attention.


Re: GSOC Idea.

2016-03-03 Thread Rikki Cattermole via Digitalmars-d

On 04/03/16 3:09 AM, Marco wrote:

On Thursday, 3 March 2016 at 09:29:38 UTC, Johannes Pfau wrote:

Am Thu, 3 Mar 2016 16:14:24 +1300
schrieb Rikki Cattermole :

[...]



The arduino Due uses a Cortex M3, so arduino could mean ARM or AVR.
GDC can target AVR devices with minimal changes. The main problem
is that you really can't use druntime / typeinfo on such small devices
and that some features (e.g. simply using structs) require
TypeInfo. I have some outdated commits here to disable the TypeInfo
stuff:
https://github.com/D-Programming-microD/GDC/commits/microD

[...]


Thank you for your detailed answers. Now I do understand that working
with D and arduino maybe be out of my capability. I may now decide to
work on porting a parser generator library to phobos. I do know that
pegged is a parser generator in D. Can this new project be easier? I am
also quite interested in parsing, so that would be a very good
experience for me.

Sorry for being such a newbie, I hope you understand my willingness to
learn.


You may want to stay clear of CTFE which is how pegged works primarily.


Re: GSOC Idea.

2016-03-03 Thread Marco via Digitalmars-d

On Thursday, 3 March 2016 at 09:29:38 UTC, Johannes Pfau wrote:

Am Thu, 3 Mar 2016 16:14:24 +1300
schrieb Rikki Cattermole :

[...]


The arduino Due uses a Cortex M3, so arduino could mean ARM or 
AVR.
GDC can target AVR devices with minimal changes. The main 
problem
is that you really can't use druntime / typeinfo on such small 
devices

and that some features (e.g. simply using structs) require
TypeInfo. I have some outdated commits here to disable the 
TypeInfo

stuff:
https://github.com/D-Programming-microD/GDC/commits/microD

[...]


Thank you for your detailed answers. Now I do understand that 
working with D and arduino maybe be out of my capability. I may 
now decide to work on porting a parser generator library to 
phobos. I do know that pegged is a parser generator in D. Can 
this new project be easier? I am also quite interested in 
parsing, so that would be a very good experience for me.


Sorry for being such a newbie, I hope you understand my 
willingness to learn.


Re: GSOC Idea.

2016-03-03 Thread Johannes Pfau via Digitalmars-d
Am Thu, 3 Mar 2016 16:14:24 +1300
schrieb Rikki Cattermole :

> On 03/03/16 9:21 AM, Marco wrote:
> > Hi, I am Marco, a CS student at UCL. This is both a presentation
> > post and also post where I ask some suggestions about an idea for a
> > GSOC project.
> >
> > I am a first year student and I do now that I may be too
> > inexperienced for such difficult projects, but my willingness to
> > learn is big. I am very fascinated by the world of compilers. Due
> > to this fact I would really appreciate to be selected this summer
> > and to work on a project for the D Lang.
> >
> > My main idea is to port the D Lang to the arduino environment. I do
> > know that in past there has been a proposed project for GSOC very
> > similar to this idea, but I can't find any actual implementation of
> > this idea on the Internet.
> >
> > Would the task be too difficult? Or with constant hard work would be
> > doable?
> >
> > Thank you in advance.
> >
> > (In the case that I will not be selected in the GSOC programme, I
> > will probably continue to be an active member of the development of
> > the D Lang, due to my real interest for it. So this post is my
> > first step in this community)  
> 
> Half a year ago Jens Bauer was offering Cortex-M's for free for
> anyone who wanted one. They would probably be better then Arduino for
> targeting.
> 

The arduino Due uses a Cortex M3, so arduino could mean ARM or AVR.
GDC can target AVR devices with minimal changes. The main problem
is that you really can't use druntime / typeinfo on such small devices
and that some features (e.g. simply using structs) require
TypeInfo. I have some outdated commits here to disable the TypeInfo
stuff:
https://github.com/D-Programming-microD/GDC/commits/microD

and some proof of concept code:
https://github.com/D-Programming-microD/avr-playground/blob/master/src/test.d


@Marco as I've programmed AVR devices before, but never used arduino:
when you say 'arduino environment', do you want to port D to the
arduino hardware / boards or do you also want to interface to arduino
software libraries? Running D programs on Arduino hardware shouldn't be
very difficult, but interfacing Arduino software libraries could be.
IIRC some of the libraries are in C++ and we only have limited C++
interop. And the parts that are in C probably use lots of macros and
macros can't be used from D.



> Here is one I'm wanting to get: 
> http://www.dx.com/p/cortex-m3-stm32f103c8t6-stm32-development-board-w-swd-socket-st-link-v2-programmer-emulator-395848#.VterNJx97IU
> 
> Fairly cheap and yes you can use STM32's with Arduino.
> 
> 
> You will need to basically rebuild druntime from scratch and use
> either ldc or gdc. Unless of course you want to add an ARM target to
> dmd which would be very appreciated (but incredibly a lot harder).

to summarize:

* Interfacing with Arduino software: medium to difficult
* Rewriting IO / register access from scratch in D: easy-medium: proof
  of concepts for low level access exist (here:
  
https://github.com/D-Programming-microD/avr-playground/blob/master/src/util/mmio.d)
  and Mike had a class based concept. The boring part is generating
  code for all registers. I've got some old code which parses Atmel AVR
  datasheets which could be used for this task.

* Then there's the question whether you base the D port on a C library
  (avr libc) which is easier and allows for C/C++ interop, or whether
  you write the startup files in D as well. Both approaches have been
  used before, but I'd go for the simpler C-library approach first.

* Once low-level access to the registers is available, defining a
  high-level interface is an interesting task. How do you implement
  board configuration (e.g. system clock speed). Could we implement
  some higher level GPIO abstraction / USART abstraction which could
  even allow future libraries for higher-level protocols (such as
  onewire) to work efficiently on different architectures without
  porting efforts? Do we implement interrupts in any way? And the all
  these abstractions need to be efficient in code size and computation
  overhead.

* GDC on ARM: Should already work without changes. You probably want
  druntime on ARM based devices anyway, so TypeInfo is less of a
  problem and there's some existing work from Timo in this area:
  https://bitbucket.org/timosi/minlibd
* GDC on AVR: Requires some minimal changes to the
  GDC compiler. I could take care of that. If we can't get some feature
  upstream early enough, we can always work with patched compilers.
  We need to work without druntime here, which requires some more
  compiler changes for TypeInfo stuff. We'll probably have to work with
  a patched compiler for that, as I don't think we can upstream these
  changes fast enough (should go to DMD first, then GDC).

TLDR: You don't really have to work on the compiler for this project,
but there's some low-level D code that needs to be written. Once
the basics are 

Re: GSOC Idea.

2016-03-02 Thread Rikki Cattermole via Digitalmars-d

On 03/03/16 9:21 AM, Marco wrote:

Hi, I am Marco, a CS student at UCL. This is both a presentation post
and also post where I ask some suggestions about an idea for a GSOC
project.

I am a first year student and I do now that I may be too inexperienced
for such difficult projects, but my willingness to learn is big. I am
very fascinated by the world of compilers. Due to this fact I would
really appreciate to be selected this summer and to work on a project
for the D Lang.

My main idea is to port the D Lang to the arduino environment. I do know
that in past there has been a proposed project for GSOC very similar to
this idea, but I can't find any actual implementation of this idea on
the Internet.

Would the task be too difficult? Or with constant hard work would be
doable?

Thank you in advance.

(In the case that I will not be selected in the GSOC programme, I will
probably continue to be an active member of the development of the D
Lang, due to my real interest for it. So this post is my first step in
this community)


Half a year ago Jens Bauer was offering Cortex-M's for free for anyone 
who wanted one. They would probably be better then Arduino for targeting.


Here is one I'm wanting to get: 
http://www.dx.com/p/cortex-m3-stm32f103c8t6-stm32-development-board-w-swd-socket-st-link-v2-programmer-emulator-395848#.VterNJx97IU


Fairly cheap and yes you can use STM32's with Arduino.


You will need to basically rebuild druntime from scratch and use either 
ldc or gdc. Unless of course you want to add an ARM target to dmd which 
would be very appreciated (but incredibly a lot harder).


GSOC Idea.

2016-03-02 Thread Marco via Digitalmars-d
Hi, I am Marco, a CS student at UCL. This is both a presentation 
post and also post where I ask some suggestions about an idea for 
a GSOC project.


I am a first year student and I do now that I may be too 
inexperienced for such difficult projects, but my willingness to 
learn is big. I am very fascinated by the world of compilers. Due 
to this fact I would really appreciate to be selected this summer 
and to work on a project for the D Lang.


My main idea is to port the D Lang to the arduino environment. I 
do know that in past there has been a proposed project for GSOC 
very similar to this idea, but I can't find any actual 
implementation of this idea on the Internet.


Would the task be too difficult? Or with constant hard work would 
be doable?


Thank you in advance.

(In the case that I will not be selected in the GSOC programme, I 
will probably continue to be an active member of the development 
of the D Lang, due to my real interest for it. So this post is my 
first step in this community)


Re: [GSOC idea] enhance regular expressions?

2011-04-10 Thread Dmitry Olshansky

On 02.04.2011 14:58, Peter Chadwick wrote:

Dmitry Olshanskydmitry.o...@gmail.com  writes:


On 31.03.2011 22:16,petevi...@yahoo.com.au  wrote:

Dmitry Olshanskydmitry.o...@gmail.com   writes:

--- somewhat informal draft ---


Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable
and optimized, I think it could be fixed.

Great.


I've been working on a regular expression library myself based around
Russ Cox article athttp://swtch.com/~rsc/regexp/regexp2.html  and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/


Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems
that parsing a pattern is somewhat convulted, but I'll defer any
critics or decisions until I looked in closer detail.

I'm open to any criticisms or suggestions you may have.


Ok, I think I got my criticisms sorted out.
Sorry for taking so much time, I've got really involved week, attending 
to a couple events and, in fact, helping organize one.


First things first - I like your structured way of expressing regex VM  
instructions by introducing distinct types for each, but let's examine 
what good does it get you in the end.

Function printProgram gets us first impressions:
Inst* inst = cast(Inst*)program[pos];
final switch( *instType )
 {
...
case REInst.IChar:
 InstIChar* inst = cast(InstIChar*)program[pos]; //disadvantage

 writefln( IChar %s, inst.c ); //benefit
  pos += InstIChar.sizeof;//not used //well to actually 
benefit of introduced types, should it be (*inst).sizeof ? ...


Another use case before jumping to conclusions:
auto instChar = cast(InstChar*)inst;//disadvantage
if ( instChar.c != thisChar )//benefit
return 0;
state_pc += InstChar.sizeof;//not used

Which means that the whole typing isn't exploited much, but gets in your 
way far to often.
The idea to avoid all the mess with double casting is to introduce tag 
union:

struct IR{
uint opcode;
union{
struct {
dchar c;
}
struct {
ubyte[8] bitmap;
}
//etc..
}
}
Then the double cast is removed:
auto inst = cast(IR*)program[pos];
switch(inst.InstType){
...
case REInst.IChar:

 writefln( IChar %s, inst.c ); pos += InstIChar.sizeof;


}
The only problem that remains is that we've lost the actual size of 
struct which varies based on an opcode value, and  constructors...

Solving both at once :
struct WrapedIR(REInst InstType,T)
{
enum length = instLength(InstType);
@property auto bytes(){// see later
return (cast(byte*)data)[0..length];
}
T data;
   alias data this;
}
where instLength is the unsafe place :
size_t instLength(REInst type){
switch(type){
case REInst.Char: case REInst.IChar:
return sizeof(uint)+sizeof(dchar);
...
}
}
and having helper function (which may use @property) like :
auto makeInst(IRInst inst)() { return WrappedIR!(inst,IR)(); }

will get us pretty much the same thing (or better) everywhere you create 
instructions:

auto charInst = makeInst!(REInst.Char)();

Which gets us to how you are creating instructions actually:
  static string MakeREInst( string instType, string instName )
{
string result =
byte[~instType~.sizeof] ~instName~Buf;~
instType~* ~instName~ = 
cast(~instType~*)~instName~Buf[0];~

*~instName~ = ~instType~();;

return result;
}
which is this after mix-in:
byte[instType.sizeof] instNameBuf;
instType* instName = cast(InstType*)instNameBuf[0];
*instName = instType();

Even disregarding the suggested approach to IR representation, the 
better alternative I think would be the opposite - stack allocate 
struct, and treat it as array of bytes  when need be.
Isn't this one of prime cases for introducing types in the first place? 
Also, why mixins?
instType instName = instType(); //can't be any  worse then 
mixin(MakeREInst( instType, instName );


Then for getting bytes of any struct use this one-liner (analogous to 
'bytes' of the above IR tag union):

ubyte[] bytesOf(T)(T* t){
return (cast(byte*)t)[0..T.sizeof];
}
Essentially you need to treat instructions as bytes exactly once: when 
appending or inserting them into VM program.

program ~=  splitInstBuf  //append magically mixed-in buffer
...
becomes
program ~=  bytesOf(splitInst)
...
Doesn't look half bad, and compiler wouldn't allow missing '' if anything.
Look  feel clearly could be enhanced by changing pointer argument to 
ref argument in bytesOf, but I don't 

Re: [GSOC idea] enhance regular expressions?

2011-04-02 Thread Peter Chadwick
Dmitry Olshansky dmitry.o...@gmail.com writes:

 On 31.03.2011 22:16, petevi...@yahoo.com.au wrote:
 Dmitry Olshanskydmitry.o...@gmail.com  writes:
 --- somewhat informal draft ---

 Hi Dmitry,

 It's good to know that there is interest in improving regular
 expressions in the D standard library. I've run into a number of
 problems using std.regex myself, so I'm keen so see either fixes for
 std.regex or an improved replacement.
 Thanks for the interest.
 Indeed there are problems, though the engine itself is quite capable
 and optimized, I think it could be fixed.

Great.

 I've been working on a regular expression library myself based around
 Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
 std.regex range interface. It includes a backtracking and Thompson
 engine. If you are interested it is on github:

 https://github.com/PeteChadwick/pcregexeng

 The ddoc documentation is here:
 http://petechadwick.github.com/pcregexeng/

 Wow, I'll definitely look into it.
 Overall it seems solid even though incomplete. From glance it seems
 that parsing a pattern is somewhat convulted, but I'll defer any
 critics or decisions until I looked in closer detail.

I'm open to any criticisms or suggestions you may have.

 I see you also using the test from std.regex, sadly it's somewhat
 broken (the one with test vectors).
 All it's checking proves that *there is a match*, not *what* was matched.

You're right. I'm now updating the the code to make use of the other
info in the test vectors for more reliable tests.

 Anyway, I'm gathering what I have in a way of fixes as my first pull
 request, wish me luck ;)

Good luck! I look forward to seeing your work.



Re: [GSOC idea] enhance regular expressions?

2011-03-31 Thread petevik38
Dmitry Olshansky dmitry.o...@gmail.com writes:
 --- somewhat informal draft ---


Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

I've been working on a regular expression library myself based around
Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/

 Having regular expression (regex) engine in standard library is
 totally expected for any modern language.
 The solid and performant implementation of regex is one of the
 greatest points of pride (if not of sale).

Couldn't agree more.

 While concrete results and benchmarks got frequently biased, there are
 some general conclusions:
 regexes usually came in two flavors: Finite Automation (FA, be it
 deterministic  or nondeterministic) and  backtracking, each of these
 with their set of traits.
 FA are more stable O(N) in time on input, however they are usually
 more limited as implementing features such as backreferences becomes
 problematic, also not to mention the time for constructing DFA from
 pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

 Backtracking allows easily implementing some interesting features like
 backreferencing, lookahead/lookbehind and such, but has a dreadful
 exponential time on input in worst case.
 Not willing to go deeper into the performance/feature/usage topic,
 fairly good picture of various regular expression engines can be found
 here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
 now idea why he claims RE2 is feature-rich.

 Now speaking of D, what we have now is essentially an optimized
 backtracking with some not implemented features it may very well have
 had. Also consider important issue that need proper resolution
 http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
 for a change.

 The proposal consists of three parts:

 1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
 got accustomed with this engine internals and confident that I can get
 it done. That's a short-term solution at best and something I plan to
 do in spare time even if it's not considered a compelling GSOC
 proposal ;)

I would really like to see this.

 2)  Make another one FA-based from scratch, here things go somewhat
 speculative.
 I plan to start with NFA, leaving DFA on later. Probably NFA with
 cached DFA states as optimization tuning aka memory vs time. Someone
 brought  http://swtch.com/~rsc/regexp/regexp1.html, it could be a
 starting point, I see some optimization opportunities I'd like to
 try. Supporting unicode of all flavors is of no question. The API for
 this engine may stay the same or it even  may be integrated into
 existing one, making the type of engine a compile-time parameter with
 default being Regex.Backtracking. Suppose:
 auto dfa = regex!Regex.NFA((D|N)?FA,g);

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.

 3) When I saw Don's comments on fixing CTFE, I remembered something D
 can do _better_  than most others mainstream compiled languages - 
 metaprogramming. So having a StaticRegex compiled at compile-time
 (inevitable calambour) would give us an  edge over comparative
 implementations. Also I can imagine quite a lot of applications not
 requiring *any* regex from non-constant pattern. Major exceptions are
 editors with find/replace and certain scripting/parsing toolkits.
 It looks quite natural:
 ...
 enum sregNum= StaticRegex![-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+);
 ...
 assert(match(PI: 3.1415926,sregNum).hit == 3.1415926);

 The best speed gain most likely would be obtained by generating the
 whole DFA at compile time, I believe such possibility  was previously
 discussed on the NG.

This would be a real nice feature.

 This diversity in kinds of regexes may warrant making regex engine an
 interface ( for homogeneous storage and parameter passing) with
 concrete implementations being classes.

 Another thing to sort out is style: are we sticking with ECMA or
 switching to Perl / GNU? Shall we provide also different versions of
 syntax?


Re: [GSOC idea] enhance regular expressions?

2011-03-31 Thread Dmitry Olshansky

On 31.03.2011 22:16, petevi...@yahoo.com.au wrote:

Dmitry Olshanskydmitry.o...@gmail.com  writes:

--- somewhat informal draft ---


Hi Dmitry,

It's good to know that there is interest in improving regular
expressions in the D standard library. I've run into a number of
problems using std.regex myself, so I'm keen so see either fixes for
std.regex or an improved replacement.

Thanks for the interest.
Indeed there are problems, though the engine itself is quite capable and 
optimized, I think it could be fixed.

I've been working on a regular expression library myself based around
Russ Cox article at http://swtch.com/~rsc/regexp/regexp2.html and the
std.regex range interface. It includes a backtracking and Thompson
engine. If you are interested it is on github:

https://github.com/PeteChadwick/pcregexeng

The ddoc documentation is here:
http://petechadwick.github.com/pcregexeng/


Wow, I'll definitely look into it.
Overall it seems solid even though incomplete. From glance it seems that 
parsing a pattern is somewhat convulted, but I'll defer any critics or 
decisions until I looked in closer detail.
I see you also using the test from std.regex, sadly it's somewhat broken 
(the one with test vectors).

All it's checking proves that *there is a match*, not *what* was matched.

Anyway, I'm gathering what I have in a way of fixes as my first pull 
request, wish me luck ;)

Having regular expression (regex) engine in standard library is
totally expected for any modern language.
The solid and performant implementation of regex is one of the
greatest points of pride (if not of sale).

Couldn't agree more.


While concrete results and benchmarks got frequently biased, there are
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it
deterministic  or nondeterministic) and  backtracking, each of these
with their set of traits.
FA are more stable O(N) in time on input, however they are usually
more limited as implementing features such as backreferences becomes
problematic, also not to mention the time for constructing DFA from
pattern.

For an NFA, the complexity of the regular expression also affects the
execution time as there will probably me more states to process per
character. There is also more storage and copying required for
captures. I haven't looked closely at using a DFA, and I'm not sure how
captures would work with one.

Right, for NFA the execution time in worst case O(n*m) n - input, m - 
pattern length.
As for DFA, it seems the process of construction full DFA eagerly from 
NFA would grab a lot of ram  time O(2^^m) in worst case, which makes it 
useful only on certain amounts of input, hence the idea to compute it at 
compile time.



Backtracking allows easily implementing some interesting features like
backreferencing, lookahead/lookbehind and such, but has a dreadful
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,
fairly good picture of various regular expression engines can be found
here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have
now idea why he claims RE2 is feature-rich.

Now speaking of D, what we have now is essentially an optimized
backtracking with some not implemented features it may very well have
had. Also consider important issue that need proper resolution
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs
for a change.

The proposal consists of three parts:

1) Enhance and bugfix the existing engine in std.regex. Sometime ago I
got accustomed with this engine internals and confident that I can get
it done. That's a short-term solution at best and something I plan to
do in spare time even if it's not considered a compelling GSOC
proposal ;)

I would really like to see this.


2)  Make another one FA-based from scratch, here things go somewhat
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with
cached DFA states as optimization tuning aka memory vs time. Someone
brought  http://swtch.com/~rsc/regexp/regexp1.html, it could be a
starting point, I see some optimization opportunities I'd like to
try. Supporting unicode of all flavors is of no question. The API for
this engine may stay the same or it even  may be integrated into
existing one, making the type of engine a compile-time parameter with
default being Regex.Backtracking. Suppose:
auto dfa = regex!Regex.NFA((D|N)?FA,g);

I think the existing range based API is very good. I also like the idea
of the regular expression engine being a compile time parameter.


3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_  than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an  edge over comparative
implementations. Also I can imagine quite a lot of applications not
requiring *any* regex from non-constant pattern. Major exceptions are
editors with 

Re: [GSOC idea] enhance regular expressions?

2011-03-30 Thread Dmitry Olshansky

On 29.03.2011 17:11, spir wrote:

On 03/29/2011 02:16 PM, KennyTM~ wrote:

On Mar 29, 11 18:56, bearophile wrote:

Dmitry Olshansky:

Others (except (?Pname) and (?P=name) ) also considered common 
extensions
and I planed to add them plus regex comment (#...) where all of ... 
simply

have no effect on matching.


Beside the (#...) comments in Python you have also the verbose 
regex, that
allow to put whispace and free #... comments with no parentheses. I 
find this

one of the nicest features, because it allows you to format your regex.

Bye,
bearophile


You can also format regex with

r\d+ // matches the integer part
~ r(?:\.\d+)?; // matches the fractional part

:)


Nice trick, thank you!


Or even without '~' using implicit adjacent strings concatenation
(though I vaguely remember that the conclusion was that ~ operator 
string will also happen at compile time soon)

string pat = `\d+` // matches the integer part
`(?:\.\d+)?`; // matches the fractional part
assert(pat == `\d+(?:\.\d+)?`);

This I think somewhat diminishes the need for verbose regex, isn't it? 
It may be even documented somewhere in examples...
Overall for now it seems that having different syntax style (like perl 
and etc.) is not something people find useful/interesting.

So I'll be sticking with ECMA + extensions.

--
Dmitry Olshansky



Re: [GSOC idea] enhance regular expressions?

2011-03-29 Thread bearophile
Dmitry Olshansky:

 Others (except (?Pname) and (?P=name) ) also considered common extensions 
 and I planed to add them plus  regex comment (#...) where all of ... simply 
 have no effect on matching.

Beside the (#...) comments in Python you have also the verbose regex, that 
allow to put whispace and free #... comments with no parentheses. I find this 
one of the nicest features, because it allows you to format your regex.

Bye,
bearophile


Re: [GSOC idea] enhance regular expressions?

2011-03-29 Thread KennyTM~

On Mar 29, 11 18:56, bearophile wrote:

Dmitry Olshansky:


Others (except (?Pname) and (?P=name) ) also considered common extensions and 
I planed to add them plus  regex comment (#...) where all of ... simply have no 
effect on matching.


Beside the (#...) comments in Python you have also the verbose regex, that 
allow to put whispace and free #... comments with no parentheses. I find this 
one of the nicest features, because it allows you to format your regex.

Bye,
bearophile


You can also format regex with

 r\d+ // matches the integer part
  ~  r(?:\.\d+)?; // matches the fractional part

:)


Re: [GSOC idea] enhance regular expressions?

2011-03-29 Thread spir

On 03/29/2011 02:16 PM, KennyTM~ wrote:

On Mar 29, 11 18:56, bearophile wrote:

Dmitry Olshansky:


Others (except (?Pname) and (?P=name) ) also considered common extensions
and I planed to add them plus regex comment (#...) where all of ... simply
have no effect on matching.


Beside the (#...) comments in Python you have also the verbose regex, that
allow to put whispace and free #... comments with no parentheses. I find this
one of the nicest features, because it allows you to format your regex.

Bye,
bearophile


You can also format regex with

r\d+ // matches the integer part
~ r(?:\.\d+)?; // matches the fractional part

:)


Nice trick, thank you!

Denis
--
_
vita es estrany
spir.wikidot.com



Re: [GSOC idea] enhance regular expressions?

2011-03-29 Thread spir

On 03/29/2011 12:56 PM, bearophile wrote:

Dmitry Olshansky:


Others (except (?Pname) and (?P=name) ) also considered common extensions and 
I planed to add them plus  regex comment (#...) where all of ... simply have no 
effect on matching.


Beside the (#...) comments in Python you have also the verbose regex, that 
allow to put whispace and free #... comments with no parentheses. I find this 
one of the nicest features, because it allows you to format your regex.


Same for me. That the feature #1 of python regexes.

valuePattern = r
[a-zA-Z][a-zA-Z0-9]* |  # symbol or
[+-]?[0-9]+(\.[0-9]+)?  # number


Denis
--
_
vita es estrany
spir.wikidot.com



[GSOC idea] enhance regular expressions?

2011-03-28 Thread Dmitry Olshansky

Hello,
(this is a long post, you may skip this introduction)
I was sticking around D's NG for about a year and following it's 
advancements very closely. I was busy ruthlessly wasting my spare time 
for testing this or that of cool features,  converting some personal C++ 
projects. Maybe it's time to introduce myself: I'm student in the middle 
of getting my master's degree in Moscow Institute of Physics and 
Technology. My main interests for the last years were C++, 
metaprogramming, compiler theory and, lately, robotics and 
microcontrollers. I got attracted to D2 as a natural way to reinvest my 
C++ knowledge (honestly, C++0x is such a mess) and, in fact, looked 
forward to do some hackery on some real-world compiler (that would be 
DMD). As sort of warm-up to get better grasp on the language and 
real-world interpreters alike I did some work on DMDScript (porting it 
to D2 among others) it's available on 
http://dsource.org/projects/dmdscript-2. Note that I haven't touched it 
in couple of dmd releases, it may need cosmetic fixes.


To summarize, I'm no stranger to regular expressions and parsing and 
want to utilize this experience to help community, what follows is my 
first attempt at proposal, any comments and ideas are warmly welcome.


--- somewhat informal draft ---

Having regular expression (regex) engine in standard library is totally 
expected for any modern language.
The solid and performant implementation of regex is one of the greatest 
points of pride (if not of sale).
While concrete results and benchmarks got frequently biased, there are 
some general conclusions:
regexes usually came in two flavors: Finite Automation (FA, be it 
deterministic  or nondeterministic) and  backtracking, each of these 
with their set of traits.
FA are more stable O(N) in time on input, however they are usually more 
limited as implementing features such as backreferences becomes 
problematic, also not to mention the time for constructing DFA from pattern.
Backtracking allows easily implementing some interesting features like 
backreferencing, lookahead/lookbehind and such, but has a dreadful 
exponential time on input in worst case.
Not willing to go deeper into the performance/feature/usage topic,  
fairly good picture of various regular expression engines can be found 
here:  http://lh3lh3.users.sourceforge.net/reb.shtml. I though have now 
idea why he claims RE2 is feature-rich.


Now speaking of D, what we have now is essentially an optimized 
backtracking with some not implemented features it may very well have 
had. Also consider important issue that need proper resolution 
http://d.puremagic.com/issues/show_bug.cgi?id=941. This situation begs 
for a change.


The proposal consists of three parts:

1) Enhance and bugfix the existing engine in std.regex. Sometime ago I 
got accustomed with this engine internals and confident that I can get 
it done. That's a short-term solution at best and something I plan to do 
in spare time even if it's not considered a compelling GSOC proposal ;)


2)  Make another one FA-based from scratch, here things go somewhat 
speculative.
I plan to start with NFA, leaving DFA on later. Probably NFA with cached 
DFA states as optimization tuning aka memory vs time. Someone brought  
http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point, 
I see some optimization opportunities I'd like to try. Supporting 
unicode of all flavors is of no question. The API for this engine may 
stay the same or it even  may be integrated into existing one, making 
the type of engine a compile-time parameter with default being 
Regex.Backtracking. Suppose:

auto dfa = regex!Regex.NFA((D|N)?FA,g);

3) When I saw Don's comments on fixing CTFE, I remembered something D 
can do _better_  than most others mainstream compiled languages - 
metaprogramming. So having a StaticRegex compiled at compile-time 
(inevitable calambour) would give us an  edge over comparative 
implementations. Also I can imagine quite a lot of applications not 
requiring *any* regex from non-constant pattern. Major exceptions are 
editors with find/replace and certain scripting/parsing toolkits.

It looks quite natural:
...
enum sregNum= StaticRegex![-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+);
...
assert(match(PI: 3.1415926,sregNum).hit == 3.1415926);

The best speed gain most likely would be obtained by generating the 
whole DFA at compile time, I believe such possibility  was previously 
discussed on the NG.


This diversity in kinds of regexes may warrant making regex engine an 
interface ( for homogeneous storage and parameter passing) with concrete 
implementations being classes.


Another thing to sort out is style: are we sticking with ECMA or 
switching to Perl / GNU? Shall we provide also different versions of  
syntax?



--
Dmitry Olshansky



Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread bearophile
Dmitry Olshansky:

 http://dsource.org/projects/dmdscript-2. Note that I haven't touched it 
 in couple of dmd releases, it may need cosmetic fixes.

I will try it.


 http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point, 

Before fully embracing the contents of that page, look at its critics too.


 3) When I saw Don's comments on fixing CTFE, I remembered something D 
 can do _better_  than most others mainstream compiled languages - 
 metaprogramming. So having a StaticRegex compiled at compile-time 
 (inevitable calambour) would give us an  edge over comparative 
 implementations.

But it inflates the binary too, if overused, because each RE pattern becomes a 
compiled function (or more than one).


 Another thing to sort out is style: are we sticking with ECMA or 
 switching to Perl / GNU? Shall we provide also different versions of  
 syntax?

One RE syntax in a language is more than enough :-) But there are some features 
of the Python REs that I'd like to see bolted-on to the ECMA REs used by D. So 
I suggest a superset of ECMA.

Bye,
bearophile


Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread Dmitry Olshansky


On 29.03.2011 1:47, bearophile wrote:

Dmitry Olshansky:


http://dsource.org/projects/dmdscript-2. Note that I haven't touched it
in couple of dmd releases, it may need cosmetic fixes.

I will try it.


No problem, feel free to report any issues. There should be working 
bugtracker if I remember correctly.

http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point,

Before fully embracing the contents of that page, look at its critics too.



I've seen them, and they are expected, also I thought I clearly 
presented some of logical considerations.

3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_  than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an  edge over comparative
implementations.

But it inflates the binary too, if overused, because each RE pattern becomes a 
compiled function (or more than one).

Well, in case that result would be a data structure or bytecode created 
at the compile-time using CTFE. The function for matching or replacing 
using the resulting structure/program would be the same as run-time one. 
It's inflating binary like any static data. Then 
StaticRegex!whatever should be convenience wrapper hiding all this 
complexity. Anyway I partly agree, there are limitations and 
considerations when using such technique. And it would rely on CTFE that 
does not leak memory like hell.



Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

One RE syntax in a language is more than enough :-) But there are some features 
of the Python REs that I'd like to see bolted-on to the ECMA REs used by D.

BTW which ones? Now is the time to propose them.


So I suggest a superset of ECMA.


Your vote counted ;)


Bye,
bearophile



--
Dmitry Olshansky



Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread bearophile
Dmitry Olshansky:

 BTW which ones? Now is the time to propose them.

Verbose regular expressions, that allow to put space and comments, to format 
REs more like code and less like a cryptic puzzle language.

(?:...) A non-grouping version of regular parentheses. Matches whatever regular 
expression is inside the parentheses, but the substring matched by the group 
cannot be retrieved after performing a match or referenced later in the pattern.

(?Pname...) Similar to regular parentheses, but the substring matched by the 
group is accessible via the symbolic group name name. A symbolic group is also 
a numbered group.

(?P=name) Matches whatever text was matched by the earlier group named name.

(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume 
any of the string.

(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The 
contained pattern must only match strings of some fixed length.

(?=...) Positive lookbehind assertion. Matches if the current position in the 
string is preceded by a match for ... that ends at the current position. The 
contained pattern must only match strings of some fixed length.

(?!...) Negative lookbehind assertion. Matches if the current position in the 
string is not preceded by a match for  The contained pattern must only 
match strings of some fixed length. Patterns which start with negative 
lookbehind assertions may match at the beginning of the string being searched.

Bye,
bearophile


Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread spir

On 03/28/2011 11:47 PM, bearophile wrote:

Dmitry Olshansky:


http://dsource.org/projects/dmdscript-2. Note that I haven't touched it
in couple of dmd releases, it may need cosmetic fixes.


I will try it.



http://swtch.com/~rsc/regexp/regexp1.html, it could be a starting point,


Before fully embracing the contents of that page, look at its critics too.



3) When I saw Don's comments on fixing CTFE, I remembered something D
can do _better_  than most others mainstream compiled languages -
metaprogramming. So having a StaticRegex compiled at compile-time
(inevitable calambour) would give us an  edge over comparative
implementations.


But it inflates the binary too, if overused, because each RE pattern becomes a 
compiled function (or more than one).



Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?


One RE syntax in a language is more than enough :-) But there are some features 
of the Python REs that I'd like to see bolted-on to the ECMA REs used by D. So 
I suggest a superset of ECMA.


... especially the ability to insert free spacing to clarify / comment the 
string pattern (meaning literal whitespace is escaped like '.', etc) (dunno if 
D regex supports that already though).


Denis
--
_
vita es estrany
spir.wikidot.com



Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread spir

On 03/29/2011 12:40 AM, bearophile wrote:

Dmitry Olshansky:


BTW which ones? Now is the time to propose them.


Verbose regular expressions, that allow to put space and comments, to format 
REs more like code and less like a cryptic puzzle language.

(?:...) A non-grouping version of regular parentheses. Matches whatever regular 
expression is inside the parentheses, but the substring matched by the group 
cannot be retrieved after performing a match or referenced later in the pattern.

(?Pname...) Similar to regular parentheses, but the substring matched by the 
group is accessible via the symbolic group name name. A symbolic group is also a 
numbered group.

(?P=name) Matches whatever text was matched by the earlier group named name.

(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume 
any of the string.

(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The 
contained pattern must only match strings of some fixed length.

(?=...) Positive lookbehind assertion. Matches if the current position in the 
string is preceded by a match for ... that ends at the current position. The 
contained pattern must only match strings of some fixed length.

(?!...) Negative lookbehind assertion. Matches if the current position in the 
string is not preceded by a match for  The contained pattern must only match 
strings of some fixed length. Patterns which start with negative lookbehind 
assertions may match at the beginning of the string being searched.


FWIW: (probably due to domain?) I have never used positive lookahead, and even 
less lookbehinds.
Also, what I like in py regexes is that they don't match anywhere as default 
(there is a find method).
What I find annoying is single-line-only as default (like D regexes, 
unfortunately).


Denis
--
_
vita es estrany
spir.wikidot.com



Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread KennyTM~

On Mar 29, 11 06:21, Dmitry Olshansky wrote:

Another thing to sort out is style: are we sticking with ECMA or
switching to Perl / GNU? Shall we provide also different versions of
syntax?

One RE syntax in a language is more than enough :-) But there are some
features of the Python REs that I'd like to see bolted-on to the ECMA
REs used by D.

BTW which ones? Now is the time to propose them.


- Unicode character properties i.e. \p{Lu} and \P{Lu}.
- Look-behinds i.e. (?=...) and (?!...)


These should be the most useful extra patterns that covers 99.9% 
(made-up on the spot) of all regexes.


Re: [GSOC idea] enhance regular expressions?

2011-03-28 Thread Dmitry Olshansky

On 29.03.2011 2:40, bearophile wrote:

Dmitry Olshansky:


BTW which ones? Now is the time to propose them.

Verbose regular expressions, that allow to put space and comments, to format 
REs more like code and less like a cryptic puzzle language.

(?:...) A non-grouping version of regular parentheses. Matches whatever regular 
expression is inside the parentheses, but the substring matched by the group 
cannot be retrieved after performing a match or referenced later in the pattern.

(?Pname...) Similar to regular parentheses, but the substring matched by the 
group is accessible via the symbolic group name name. A symbolic group is also a 
numbered group.

(?P=name) Matches whatever text was matched by the earlier group named name.

(?=...) Lookahead assertion. Matches if ... matches next, but doesn't consume 
any of the string.

(?!...) Negative lookahead assertion. Matches if ... doesn't match next. The 
contained pattern must only match strings of some fixed length.

(?=...) Positive lookbehind assertion. Matches if the current position in the 
string is preceded by a match for ... that ends at the current position. The 
contained pattern must only match strings of some fixed length.

(?!...) Negative lookbehind assertion. Matches if the current position in the 
string is not preceded by a match for  The contained pattern must only match 
strings of some fixed length. Patterns which start with negative lookbehind 
assertions may match at the beginning of the string being searched.

Yeah, that's something I vaguely called ... not implemented features it 
may very well have had.


More then that (?:...), (?!...), (?=...) area part of ECMA v3 spec. And I got 
a preliminary support for them in my patch 
herehttp://d.puremagic.com/issues/show_bug.cgi?id=5673.
Others (except (?Pname) and (?P=name) ) also considered common extensions and 
I planed to add them plus  regex comment (#...) where all of ... simply have no 
effect on matching.



Bye,
bearophile



--
Dmitry Olshansky