GSoC Regexp engine

2007-05-31 Thread Ian Young

Hi all,
I'm Ian, one of the two students working on improving the regexp
engine in Vim for this year's Google Summer of Code.  I haven't had a
whole lot to contribute as of yet, but now that work is underway, I'll
probably pop up here asking lots of questions some days.

Right now we're working on getting things set up and building a
testing suite, but I thought I would spark some discussion on a design
decision that will be coming up after we finish this phase, which is
whether to implement the new model ourselves, or use an alternative
engine, like TRE: http://laurikari.net/tre/. I'm tempted to
implement one ourselves, as it's an intellectually stimulating
prospect, but that doesn't mean I won't listen to reason if TRE or
another option is far better. I don't know much about the internals of
TRE, but according to previous posts to this list, it utilizes three
engines: a slow one for handling backreferences (presumably similar to
Vim's current engine), a fast one for most cases (what we are looking
to implement), and one for their 'fuzzy matching' feature.

I have a couple questions to start things off. First: I couldn't see
much need for 'fuzzy matching' in Vim, but some of you are probably
much better acquainted with regexp use cases than I am.  Would this be
a useful feature to have available?  Second: We might have to do some
gymnastics to work with multibyte characters, as discussed here: 
http://tech.groups.yahoo.com/group/vimdev/message/46408. I haven't
worked with multibyte characters before, so I'm not clear on the
subtleties.  Would this translation to wide characters before passing
to the engine cause much of a performance hit and/or be excessively
complicated to implement? On a side note, TRE's main page says it has
both wide character and multibyte character support. I couldn't find a
version history, so I'm not sure if this is a new feature that Nikolai
isn't aware of, or if we need something more.
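
One way to picture the translation step asked about above is to decode the multibyte text once, match on the decoded characters, and map match positions back to byte offsets. The sketch below uses Python's `re` module purely as a stand-in engine. `decode_with_offsets` and `byte_span` are hypothetical helper names, and nothing here reflects how Vim or TRE actually handle this.

```python
import re

def decode_with_offsets(data, encoding="utf-8"):
    """Decode bytes and record the byte offset at which each character starts.

    offsets[i] is the byte offset of character i; offsets[-1] == len(data).
    Assumes a stateless encoding such as UTF-8 (no byte-order mark).
    """
    text = data.decode(encoding)
    offsets = [0]
    for ch in text:
        offsets.append(offsets[-1] + len(ch.encode(encoding)))
    return text, offsets

def byte_span(match, offsets):
    """Translate a character-based match span into a byte-based span."""
    return offsets[match.start()], offsets[match.end()]

# Example: two of the characters below take two bytes each in UTF-8.
data = "na\u00efve caf\u00e9".encode("utf-8")
text, offsets = decode_with_offsets(data)
m = re.search("caf\u00e9", text)
start, end = byte_span(m, offsets)
```

In this sketch the extra cost is one pass over the text plus an offset table with one entry per character; whether that is acceptable inside an editor is exactly the open question.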

I'm interested to hear what you all have to say. We don't need to make
this decision until the middle of next week at the earliest, but I thought
I would get the discussion going now.

Ian


Re: GSoC Regexp engine

2007-05-31 Thread Brian Gupta

I have also heard good things about the PCRE (Perl Compatible Regex
Library). You may want to consider it as an option.

http://www.pcre.org/

-Brian



Re: GSoC Regexp engine

2007-05-31 Thread Nikolai Weibull

On 5/31/07, Brian Gupta [EMAIL PROTECTED] wrote:

> I have also heard good things about the PCRE (Perl Compatible Regex
> Library). You may want to consider it as an option.


PCRE is crap.

It is crap, because it uses the same, crappy, backtracking method that
Vim, and most other crappy regex (note: not regular expression)
libraries use, which is exactly the kind of crap that this GSoC
project is aiming to scrap.
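
The difference being argued about can be sketched on a toy pattern language (literal characters, each optionally followed by `?`, matched against the whole text). The code below is an illustrative assumption, not code from Vim, PCRE, or TRE: `backtrack_match` tries alternatives one at a time, while `stateset_match` tracks every reachable pattern position at once, in the style of a Thompson NFA simulation. All function names are made up for this example.

```python
def parse(pattern):
    """Turn 'ab?c' into [('a', False), ('b', True), ('c', False)]."""
    items = []
    for ch in pattern:
        if ch == "?":
            items[-1] = (items[-1][0], True)   # mark previous literal optional
        else:
            items.append((ch, False))
    return items

def backtrack_match(pattern, text):
    """Backtracking matcher. Returns (matched, number_of_steps)."""
    items = parse(pattern)
    steps = 0

    def go(i, j):
        nonlocal steps
        steps += 1
        if i == len(items):
            return j == len(text)
        ch, optional = items[i]
        # Greedy: try to consume a character first, then try skipping.
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        return optional and go(i + 1, j)

    return go(0, 0), steps

def stateset_match(pattern, text):
    """State-set simulation. Returns (matched, number_of_steps)."""
    items = parse(pattern)
    end = len(items)

    def closure(states):
        # An optional item may be skipped without consuming input.
        result = set()
        for s in states:
            result.add(s)
            while s < end and items[s][1]:
                s += 1
                result.add(s)
        return result

    steps = 0
    current = closure({0})
    for c in text:
        following = set()
        for s in current:
            steps += 1
            if s < end and items[s][0] == c:
                following.add(s + 1)
        current = closure(following)
    return end in current, steps

n = 12
pattern, text = "a?" * n + "a" * n, "a" * n
bt_ok, bt_steps = backtrack_match(pattern, text)
ss_ok, ss_steps = stateset_match(pattern, text)
```

On this input both functions report a match, but the backtracking version has to explore every consume/skip combination for the optional items before it finds one, so its step count grows exponentially with `n`. The state-set version does work bounded by the text length times the pattern length.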

 nikocrap


Re: GSoC Regexp engine

2007-05-31 Thread Nikolai Weibull

On 5/31/07, Ian Young [EMAIL PROTECTED] wrote:


> I'm Ian, one of the two students working on improving the regexp
> engine in Vim for this year's Google Summer of Code.  I haven't had a
> whole lot to contribute as of yet, but now that work is underway, I'll
> probably pop up here asking lots of questions some days.
>
> Right now we're working on getting things set up and building a
> testing suite, but I thought I would spark some discussion on a design
> decision that will be coming up after we finish this phase, which is
> whether to implement the new model ourselves, or use an alternative
> engine, like TRE: http://laurikari.net/tre/. I'm tempted to
> implement one ourselves, as it's an intellectually stimulating
> prospect, but that doesn't mean I won't listen to reason if TRE or
> another option is far better. I don't know much about the internals of
> TRE, but according to previous posts to this list, it utilizes three
> engines: a slow one for handling backreferences (presumably similar to
> Vim's current engine), a fast one for most cases (what we are looking
> to implement), and one for their 'fuzzy matching' feature.
>
> I have a couple questions to start things off. First: I couldn't see
> much need for 'fuzzy matching' in Vim, but some of you are probably
> much better acquainted with regexp use cases than I am.  Would this be
> a useful feature to have available?

> Second: We might have to do some
> gymnastics to work with multibyte characters, as discussed here:
> http://tech.groups.yahoo.com/group/vimdev/message/46408. I haven't
> worked with multibyte characters before, so I'm not clear on the
> subtleties.  Would this translation to wide characters before passing
> to the engine cause much of a performance hit and/or be excessively
> complicated to implement? On a side note, TRE's main page says it has
> both wide character and multibyte character support. I couldn't find a
> version history, so I'm not sure if this is a new feature that Nikolai
> isn't aware of, or if we need something more.


It supports

* Byte matching, that is, raw bytes
* Wide characters, that is, whatever wchar_t is
* Multi-byte characters, that is, whatever mbrtowc supports
* Streams, that is, objects that feed TRE characters as it needs them

It would be pretty easy to set up a stream object that would feed TRE
characters.  It would only have to keep track of where it was in the
buffer and basically request more of the buffer as TRE needs it.
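
A minimal sketch of the kind of stream object described above, written in Python rather than against TRE's actual C interface (which is not shown in this thread). `BufferStream`, `next_char`, and `drain` are hypothetical names; the buffer is modeled as a list of lines, with a newline delivered at the end of each line.

```python
class BufferStream:
    """Feeds characters one at a time from a buffer modeled as a list of lines."""

    def __init__(self, lines):
        self.lines = lines
        self.lnum = 0   # index of the current line
        self.col = 0    # index of the next character within that line

    def next_char(self):
        """Return the next character, or None once the buffer is exhausted."""
        if self.lnum >= len(self.lines):
            return None
        line = self.lines[self.lnum]
        if self.col < len(line):
            ch = line[self.col]
            self.col += 1
            return ch
        # End of the current line: deliver a newline and move to the next line.
        self.lnum += 1
        self.col = 0
        return "\n"

def drain(stream):
    """Pull characters until the stream is exhausted (stands in for the engine)."""
    chars = []
    while True:
        ch = stream.next_char()
        if ch is None:
            break
        chars.append(ch)
    return "".join(chars)
```

The only state the stream keeps is its position (`lnum`, `col`), which matches the description: it tracks where it is in the buffer and hands over more text only when asked.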

It should be noted that there are quite a few bugs in TRE that relate
to the interaction of quantifiers.  I have discussed this privately
with Ville, but neither of us has been able to resolve it.  It has
also been discussed here:

http://laurikari.net/pipermail/tre-general/2007-February/thread.html

where Chris Kuklewicz suggests a solution to the problem that seems to
work.  It is a somewhat costly solution, but it may be worth it in all
its simplicity.  Chris has written an implementation of TDFAs for
Haskell that is quite simple and manages to both outperform all other
regex libraries for Haskell and still pass all POSIX tests.  Here's
the announcement:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg11442.html

This will, sadly, be of no use to us, but it does show that TDFAs are
a possibility, and that the problems TRE has with quantifiers can be
resolved.

As for fuzzy matching, it seems to be a feature that never
really caught on.  Agrep has long enjoyed the status of being one of
the few commands that remain to be implemented for the GNU project
(can't seem to find the list right now, so I can't provide a link).
This does, however, seem to indicate that no one has cared enough
about it to implement and distribute it with GNU.  It can be a quite
interesting thing to have, but it's perhaps not useful enough to care
about at this stage.
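
For readers unfamiliar with fuzzy matching: the idea is to report matches that are within a small number of edits (insertions, deletions, substitutions) of the pattern. The sketch below uses the classic dynamic-programming approach to approximate substring search (often attributed to Sellers). It is not how agrep or TRE implement it, it handles only literal patterns, and `fuzzy_find` is a name made up here.

```python
def fuzzy_find(pattern, text, max_errors):
    """Return the 1-based end positions in text where some substring is within
    max_errors edits of pattern."""
    m = len(pattern)
    # prev[i] = fewest edits to turn pattern[:i] into a substring ending here.
    prev = list(range(m + 1))
    hits = []
    for j, c in enumerate(text, 1):
        cur = [0]                       # a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur.append(min(prev[i - 1] + cost,   # match or substitute
                           prev[i] + 1,          # extra character in text
                           cur[i - 1] + 1))      # character missing from text
        if cur[m] <= max_errors:
            hits.append(j)
        prev = cur
    return hits

exact = fuzzy_find("regex", "the regex engine", 0)
close = fuzzy_find("regexp", "the regex engine", 1)
```

With zero allowed errors this degenerates to plain substring search. Allowing one error lets `regexp` match where the text only contains `regex`.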

Also, you won't have time to implement this yourself.  Seriously.  It
takes a lot of work to write an efficient and
as-compatible-as-possible implementation, and a summer
isn't nearly enough time to complete said work.  I think that what's
most important here is to set up a test suite and the code required to
interface with a library, such as TRE.  That way one can always hook
in another library when it gets written.

Finally, good to hear from you. I think we all look forward to being
able to enjoy the fruits of your hard labor ;-).

 nikolai


Re: GSoC Regexp engine

2007-05-31 Thread Charles E Campbell Jr

Ian Young wrote:


> I have a couple questions to start things off. First: I couldn't see
> much need for 'fuzzy matching' in Vim, but some of you are probably
> much better acquainted with regexp use cases than I am.  Would this be
> a useful feature to have available?


As you likely know, fuzzy matching hasn't been available in Vim.  One place
it has been useful is in suggesting spelling corrections; I myself used agrep
in the engspchk.vim plugin to support fuzzy matching.

Bram already has a spelling error suggestion feature, so I have no idea if the
fuzzy regex would help with it or not.

What I think could be more useful would be boolean logic for regexp.  My
LogiPat plugin provides this capability, but undoubtedly it'd be better if
somehow it could be incorporated.  The resulting patterns from LogiPat seem
to me to be somewhat opaque.
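
To give an idea of what boolean logic on patterns can look like, here is a rough sketch using lookahead assertions in Python's `re` module. It does not reproduce LogiPat's syntax or its output. `all_of` and `none_of` are hypothetical helpers, and the point is mainly that the combined pattern is harder to read than its parts.

```python
import re

def all_of(*patterns):
    """Prefix requiring every pattern to match somewhere in the line."""
    return "".join("(?=.*(?:%s))" % p for p in patterns)

def none_of(*patterns):
    """Prefix requiring that no pattern matches anywhere in the line."""
    return "".join("(?!.*(?:%s))" % p for p in patterns)

# "foo AND NOT bar", applied to one line at a time.
foo_not_bar = re.compile("^" + all_of("foo") + none_of("bar"))

lines = ["a foo line", "foo and bar", "nothing here"]
selected = [line for line in lines if foo_not_bar.search(line)]
```

The generated pattern here is `^(?=.*(?:foo))(?!.*(?:bar))`, which already illustrates the opacity: the intent "foo and not bar" is no longer obvious from the pattern text.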


Regards,
Chip Campbell



Re: GSoC Regexp engine

2007-05-31 Thread Nikolai Weibull

On 5/31/07, Charles E Campbell Jr [EMAIL PROTECTED] wrote:


> What I think could be more useful would be boolean logic for regexp.  My
> LogiPat plugin provides this capability, but undoubtedly it'd be better if
> somehow it could be incorporated.  The resulting patterns from LogiPat seem
> to me to be somewhat opaque.


What would be even cooler would be to use regular relations, as that
would allow for far superior substitution possibilities to what
:substitute has to offer.
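
As a rough illustration of the idea: a regular relation can be realized by a finite-state transducer, a machine that reads input symbols and emits output symbols as it moves between states. The toy transducer below hard-codes a single rule (rewrite every `ab` as `x`). The state layout and the name `transduce_ab_to_x` are assumptions made for this sketch and are not taken from any tool mentioned in this thread.

```python
def transduce_ab_to_x(text):
    """Two-state transducer that rewrites each 'ab' as 'x', left to right."""
    out = []
    pending_a = False   # state 1: an 'a' has been read but not yet emitted
    for c in text:
        if pending_a:
            if c == "b":
                out.append("x")        # 'a' followed by 'b': emit replacement
                pending_a = False
            elif c == "a":
                out.append("a")        # emit the earlier 'a', keep this one pending
            else:
                out.append("a" + c)    # no match: flush the 'a' and the new char
                pending_a = False
        elif c == "a":
            pending_a = True           # hold output until the next character is seen
        else:
            out.append(c)
    if pending_a:
        out.append("a")                # flush a trailing 'a' at end of input
    return "".join(out)

samples = ["", "ab", "aab", "aba", "ba", "cabbage", "aaab b"]
results = [transduce_ab_to_x(s) for s in samples]
```

A single hard-coded rule like this does no more than a plain substitution would. What the sketch shows is only the mechanism: output is produced by state transitions instead of by a separate replace step after matching.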

I've long considered writing a text editor around regular relations,
and was actually hoping to get a Ph.D. based on using regular
relations in interactive processes, but that sadly never happened.

 nikolai


Re: GSoC Regexp engine

2007-05-31 Thread Nikolai Weibull

On 5/31/07, Nikolai Weibull [EMAIL PROTECTED] wrote:


> What would be even cooler would be to use regular relations, as that
> would allow for far superior substitution possibilities to what
> :substitute has to offer.


(Someone asked off-list what regular relations were.  If anyone else
is interested, here's what I responded with.)

Here are some papers on regular relations:

http://citeseer.comp.nus.edu.sg/karttunen95replace.html
http://citeseer.comp.nus.edu.sg/karttunen96regular.html

Also see

http://www.xrce.xerox.com/competencies/content-analysis/fst/home.en.html

nikolai

P.S.
Please don't top-post.
D.S.


Defining new visual-mode motions?

2007-05-31 Thread Joseph Barker

Hello, all.

I was recently helping someone out with a vim script (camelcasemotion.vim) 
which adds additional motion commands (they treat camel-cased words 
(WordsLikeThis) as separate words, rather than as a single word). This is 
easy enough to do in normal and operator-pending mode. It seems to be very 
complicated to do this in visual mode, though -- calling a function (or 
anything that lets you move the cursor) seems to force you to leave visual 
mode (i.e., doing `vmap ,w :<C-U>call MoveCursor()` will move the cursor to
the right place, but you're no longer in visual mode).
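
For reference, the kind of motion being described can be sketched outside Vim as a search for the next word start, where an uppercase letter following a lowercase letter or digit also counts as a word start. This is a Python illustration only; it is not taken from camelcasemotion.vim, and `next_camel_word_start` is a hypothetical name.

```python
import re

# A word start is either an alphanumeric character not preceded by one,
# or an uppercase letter directly after a lowercase letter or digit.
_WORD_START = re.compile(r"(?<![A-Za-z0-9])[A-Za-z0-9]|(?<=[a-z0-9])[A-Z]")

def next_camel_word_start(line, col):
    """Return the column of the next camel-case word start after col, or None."""
    m = _WORD_START.search(line, col + 1)
    return m.start() if m else None

line = "WordsLikeThis and more"
stops = []
col = 0
while True:
    col = next_camel_word_start(line, col)
    if col is None:
        break
    stops.append(col)
```

Starting from column 0, the cursor would stop at `Like`, `This`, `and`, and `more` in turn. Computing the target column is the easy part; the question in this message is about applying such a motion without leaving visual mode.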

My approach to this was to call the movement function, set a mark, select 
the previous visual block (with gv) and then jump to the mark that was 
previously set. The mapping that I created to deal with this is the 
following:

vmap <silent> ,w @="\33:\25call <SID>CamelCaseMotion('w',1,'v')"<CR><CR>m`gvg``

This seems somewhat inelegant, and also clobbers a mark to be able to 
accomplish its magic. Is there an easier way to accomplish the same thing? 
It seems like there should be, but I was unable to figure one out.

Thanks for your help.

JKB


Re: confirm unsubscribe from vim-dev@vim.org

2007-05-31 Thread Spencer Collyer

On 1 Jun 2007 05:59:49 -, [EMAIL PROTECTED] wrote:
 Hi! This is the ezmlm program. I'm managing the
 vim-dev@vim.org mailing list.
 
 To confirm that you would like
 
[EMAIL PROTECTED]
 
 removed from the vim-dev mailing list, please send an empty reply 
 to this address:
 
[EMAIL PROTECTED]
 
 Usually, this happens when you just hit the reply button.
 If this does not work, simply copy the address and paste it into
 the To: field of a new message.
 
 I haven't checked whether your address is currently on the mailing
 list. To see what address you used to subscribe, look at the messages
 you are receiving from the mailing list. Each message has your
 address hidden inside its return path; for example, [EMAIL PROTECTED]
 receives messages with return path:
 vim-dev-return-number[EMAIL PROTECTED]
 
 Some mail programs are broken and cannot handle long addresses. If you
 cannot reply to this request, instead send a message to
 [EMAIL PROTECTED] and put the entire address listed above
 into the Subject: line.
 
 


-- 
 Eagles may soar, but weasels don't get sucked into jet engines 
6:13am up 93 days 12:56, 17 users, load average: 2.88, 2.72, 1.65
Registered Linux User #232457 | LFS ID 11703