date:20120601

Re: [PHP-DEV] php interpreter

2012-06-01 Thread Stas Malyshev

Hi!

> I've seen this statement before about the impact of caching the actual
> compilation (or mere tokenization?) to bytecode being very small
> compared to the impact of avoiding disk access. I am curious if there
> are any measurements breaking this down. Read-only access to code in
> files already buffered by the OS (not files read for the first time)
> ought to be very fast.

We did some measurements a long time ago at Zend, but I don't have the
numbers right now, and the engine has changed so much since then that
they are probably irrelevant anyway. However, the main gist is right:
the time saved on compilation is not that much. One reason is that some
of the data structures used by the engine are dynamic (class tables,
class variables, static variables, etc.), which means a lot of data
still needs to be processed to make a script stored in SHM runnable.
That greatly decreases the savings from not compiling it. The disk read,
however, is still saved, and since unlike compilation it's a system call
that talks to a potentially very slow (compared to memory) device, the
savings are significant. Even with the OS cache, you still have context
switches, copying of the data, etc. With some work I think it is
possible to make a PHP script run with zero system calls spent on
loading script files.
-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

[PHP-DEV] NEWS again (was: PHP 5.4.4RC2 Released)

2012-06-01 Thread Gustavo Lopes


On Thu, 31 May 2012 21:01:50 -0400, David Soria Parra wrote:

> We would like to announce the second RC of the 5.4.4 version. This
> is mainly a bugfix release. The release includes a fix for a weakness
> in crypt()'s DES implementation (CVE-2012-2143). Please test it and
> notify us of any problems you may encounter. The full list of the
> fixes is as always in the NEWS file.

Sorry to bring this up again, but they aren't. 5.3 NEWS entries are not
being merged.

Right now, NEWS is pretty useless. If I want to know whether some
change is in a given release, 5.4 NEWS won't tell me that.

For instance, 0f180a63 (a stream_get_line() fix) was committed to 5.3
on April 7. It is most definitely in 5.4.4RC2:


$ git merge-base 0f180a63e php-5.4.4RC2
0f180a63ebb2d65bbe49b68d2430639b20443e9a

However, there's no mention in NEWS.

The current policy of changing only the lowest branch NEWS obviously 
can only work if these changes are then merged to the most recent 
branches on release. If the RMs are unwilling to do such merging, we 
should change the policy to require updating the NEWS files in every 
stable branch to which the fix was merged.


--
Gustavo Lopes


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 01 Jun 2012 09:40:19 +1000, David Muir wrote:
> Coming from a pleb, my only concern is the name if the class is in the
> global scope. A BreakIterator to me sounds like something related to
> breaking out of a looping structure, and not something used for
> iterating over various language structure boundaries.
> If it's in an ICU namespace, then it's not a problem, as it's clearly
> related to Unicode.

We currently don't use namespaces in any of the core extensions. All
the other symbols in ext/intl are in the global namespace; to put
BreakIterator in a new namespace would be inconsistent -- and to move
the whole extension into one would be a huge BC break.

As to the name chosen for the class, it just mirrors the name used in
ICU. In some cases, we prefixed the class name with Intl, in order to
minimize the likelihood of symbol collisions or to distinguish it from
other similar functionality in PHP (something namespaces would be more
appropriate for), but otherwise we prefer to keep the symbol names used
in ICU in order to make it easy for people who already know the native
API.

Additionally, I think your concerns are exaggerated. The symbol
BreakIterator can only be used in contexts where it's obvious it's a
class name, as in BreakIterator::createWordInstance('en').


--
Gustavo Lopes


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Benjamin Eberlei

How about IntlBreakIterator? I agree with David that the naming is very
weird; it doesn't hint at something from Intl, but rather at another
crazy SPL iterator :-)


[PHP-DEV] PHP 5.3.14RC2 Released

2012-06-01 Thread Johannes Schlüter

Hi!

We would like to announce the second RC of the 5.3.14 version. This
is mainly a bugfix release. The release includes a fix for a weakness
in crypt()'s DES implementation (CVE-2012-2143). Please test it and
notify us of any problems you may encounter. The full list of the
fixes is, as always, in the NEWS file.

You can download the packages from:

http://downloads.php.net/johannes/php-5.3.14RC2.tar.bz2
http://downloads.php.net/johannes/php-5.3.14RC2.tar.gz

The Windows team provides Windows binaries for the release.
As always, you can find them at:

http://windows.php.net/qa/

Regards,
  Johannes



Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Pierre Joye

hi,

On Fri, Jun 1, 2012 at 10:02 AM, Benjamin Eberlei kont...@beberlei.de wrote:
> How about IntlBreakIterator? I agree with David that the naming is very
> weird, it doesn't hint at something from Intl but another crazy spl
> iterator :-)

I agree too. BreakIterator is a very common name and I suspect
naming conflicts may happen.

Cheers,
-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 1 Jun 2012 12:58:37 +0200, Pierre Joye wrote:
> On Fri, Jun 1, 2012 at 10:02 AM, Benjamin Eberlei
> kont...@beberlei.de wrote:
>> How about IntlBreakIterator? I agree with David that the naming is
>> very weird, it doesn't hint at something from Intl but another crazy
>> spl iterator :-)

Aside from date-related classes -- which could be confused with stuff
from ext/date or even ext/calendar -- no other classes have Intl in
their name. Does SpoofChecker hint at something from intl?
ResourceBundle? ICU is a rather large library, and while
internationalization is a common theme, the APIs have diverse
functionality and therefore diverse names. Plus, SPL does not have a
monopoly on the *Iterator names.

> I agree too. BreakIterator is a very common name and I suspect
> possible naming conflicts may happen.

So would you have RuleBasedBreakIterator renamed
IntlRuleBasedBreakIterator too?... I find it very hard to believe that
BreakIterator is a very common name, but I'm open to evidence that
points otherwise. This argument could maybe be made for
'Transliterator', which was added in 5.4.


--
Gustavo Lopes


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Maciek Sokolewicz


On 01-06-2012 13:34, Gustavo Lopes wrote:


> So would you have RuleBasedBreakIterator renamed
> IntlRuleBasedBreakIterator too?... I find it very hard to believe that
> BreakIterator is a very common name, but I'm open to evidence that
> points otherwise. This argument could maybe be made for
> 'Transliterator', which was added in 5.4.

In my personal opinion, all Intl classes should be prefixed with Intl.
It's not so much that BreakIterator is a very common name, but rather a
very ambiguous name that may point to many different things. The fact
that multiple people have already posted here that at first they
thought BreakIterator had something to do with the break statement is
a rather solid hint that the function of this class is not
immediately clear. Prefixing it with Intl immediately makes it clear
that it belongs to the Intl superfamily, and limits the potential
misunderstandings a lot. I actually still don't understand why not all
Intl classes are prefixed. Isn't that the usual procedure, e.g. for
MySQLi and pretty much all other extensions?


- Tul


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Pierre Joye

hi,

On Fri, Jun 1, 2012 at 1:34 PM, Gustavo Lopes glo...@nebm.ist.utl.pt wrote:

> So would you have RuleBasedBreakIterator renamed IntlRuleBasedBreakIterator
> too?...

Ideally we would, yes, although they are less common and less likely to
be seen as part of another API.

> I find it very hard to believe that BreakIterator is a very
> common name, but I'm open to evidence that points otherwise. This argument
> could maybe be made for 'Transliterator', which was added in 5.4.

Transliterator is not as confusing as BreakIterator, sorry.

I would not care much if these were longer, less confusing or less
common names. But with this one, the risk of conflicting with existing
code may be too high not to be discussed.

Cheers,
--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Nikita Popov

On Fri, Jun 1, 2012 at 9:57 AM, Gustavo Lopes glo...@nebm.ist.utl.pt wrote:
> We currently don't use namespaces in any of the core extensions.

Does anything prevent us from starting to do so?

> other symbols in ext/intl are in the global namespace; to put BreakIterator
> in a new namespace would be inconsistent -- and to put the whole extension
> would be a huge BC break.

It sure would be a bit inconsistent, but if you see it as "All new
Intl classes will go into the Intl namespace" it makes perfect sense in
my eyes. Also, at least in theory, one could alias all intl classes to
namespaced variants (though I'm not sure that's really necessary.)
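
The aliasing idea can be sketched with class_alias(). This is only an
illustration under stated assumptions: LegacyBreakIterator is a
hypothetical userland stand-in for a class that already exists in the
global namespace, and Intl\BreakIterator is a hypothetical namespaced
name, not something ext/intl defines. A userland class is used because
class_alias() only accepts internal classes as of PHP 8.3.

```php
<?php
// Hypothetical stand-in for a class that already lives in the global namespace.
class LegacyBreakIterator
{
    public static function createWordInstance($locale)
    {
        return new self();
    }
}

// Expose the same class under a namespaced name. Both names now refer to
// one class, so existing code that uses the global name keeps working.
class_alias('LegacyBreakIterator', 'Intl\\BreakIterator');

$it = \Intl\BreakIterator::createWordInstance('en');
var_dump($it instanceof LegacyBreakIterator);
```

An alias is not a second class: get_class() on the object still reports
the original name, which is why this approach avoids a BC break.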

Nikita


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 1 Jun 2012 15:37:30 +0200, Pierre Joye wrote:
> On Fri, Jun 1, 2012 at 1:34 PM, Gustavo Lopes
> glo...@nebm.ist.utl.pt wrote:
>> So would you have RuleBasedBreakIterator renamed
>> IntlRuleBasedBreakIterator too?...
>
> Ideally we would yes, while they are less common and less aimed to be
> seen as part of another API.
>
>> I find it very hard to believe that BreakIterator is a very
>> common name, but I'm open to evidence that points otherwise. This
>> argument could maybe be made for 'Transliterator', which was added
>> in 5.4.
>
> Transliterator is not confusing as BreakIterator, sorry.

You removed the quoting that provided context, but I was responding to
your claim that it was a very common name and that you suspected
naming conflicts might happen.

But in fact Transliterator is much more confusing than
BreakIterator. The name Transliterator is an ICU artifact of the past;
that module is now called Text Transformation, as it provides a generic
text transformation API, not one specifically for transliteration.


--
Gustavo Lopes


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 1 Jun 2012 15:56:59 +0200, Nikita Popov wrote:
> On Fri, Jun 1, 2012 at 9:57 AM, Gustavo Lopes
> glo...@nebm.ist.utl.pt wrote:
>> We currently don't use namespaces in any of the core extensions.
>
> Does anything prevent us from starting to do so?
>
>> other symbols in ext/intl are in the global namespace; to put
>> BreakIterator in a new namespace would be inconsistent -- and to put
>> the whole extension would be a huge BC break.
>
> It sure would be a bit inconsistent, but if you see it as "All new
> Intl classes will go into the Intl namespace" it makes perfect sense
> in my eyes.

You say that it makes perfect sense, but you don't explain why.

> Also, at least in theory, one could alias all intl classes to
> namespaced variants (though I'm not sure that's really necessary.)

Yes, that would be the only sane way to do it, but I really don't see a
benefit large enough to compensate for treating classes differently
depending on some arbitrary line like when they were added. The
only real benefit of namespaces is to avoid name collisions, but most
new projects use namespaces and we can easily avoid name collisions in
the PHP core.

Plus, remember ext/intl is maintained in PECL too, where it supports
PHP 5.2.

Anyway, this is getting a bit off-topic.

--
Gustavo Lopes


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 01 Jun 2012 15:35:13 +0200, Maciek Sokolewicz wrote:
> In my personal opinion, all Intl classes should be prefixed with
> Intl. It's not so much that BreakIterator is a very common name, but
> rather a very ambiguous name that may point to many different things.
> Just by the fact that multiple people have already posted here that
> at first they thought BreakIterator had something to do with the break
> statement gives you a rather solid hint that the function of this
> class is not immediately clear. Prefixing it with Intl immediately
> makes it clear that it belongs to the Intl superfamily, and limits
> the potential misunderstandings a lot. I actually still don't
> understand why not all Intl classes are prefixed? Isn't that the
> usual procedure? eg. for MySQLi, and pretty much all other extensions?

We've had the convention of prefixing function names with some
extension prefix, but this convention has not been as marked for class
names -- perhaps because there were not so many of them and so there
were fewer collision/confusion problems.

In any case, I'll rename the classes before merging.

--
Gustavo Lopes


[PHP-DEV] domdocument loadhtml and encoding

2012-06-01 Thread Tjerk Meesters

Gentlemen,

Regarding this bug report: https://bugs.php.net/bug.php?id=49705

As more developers move away from using regular expressions to parse
HTML and start using DOMDocument, I've noticed that quite a few
stumble over encoding issues. They're not bugs, because it's
documented (I think) that if a document is loaded using
::loadHTMLFile() or if it contains a content-type meta tag which
specifies the character encoding it will work as expected.

So far I've suggested a hack that involves adding the meta-tag in
front of the string that contains the HTML. As horrible as it seems,
that does the job!
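
A minimal sketch of that hack, assuming the markup is already UTF-8 and
the DOM extension is available. The helper name load_html_as_utf8() is
hypothetical and only used for illustration:

```php
<?php
// Hypothetical helper: prepend a content-type meta tag so that the
// HTML parser treats the markup as UTF-8.
function load_html_as_utf8($html)
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

if (class_exists('DOMDocument')) {
    // "e with acute accent" is written as raw UTF-8 bytes to keep the source ASCII.
    $doc = load_html_as_utf8("<p>caf\xC3\xA9</p>");
    echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
}
```

The injected meta element stays in the resulting document, so code that
serializes the DOM again may want to remove it afterwards.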

That said, I'm hoping to get enough internals support to add a
parameter to ::loadHTML() that sets / overrides the default character
set when processing the document; when given, any meta tags
pertaining to character set encoding should be ignored (AFAIK that's
also the browser's behavior).

Btw, there's another patch that also introduces a new parameter to
::loadHTML(), which has gone into the 5.4 branch
(https://bugs.php.net/bug.php?id=54037), so it looks like this would
be the second (optional) parameter then.

Thoughts?

-- 
--
Tjerk


[PHP-DEV] Was php bug (IMHO). Have fix. Re: Bad eval() leading to response code 500

2012-06-01 Thread Todd Ruth

eval() does indeed set the response code to 500 upon failure.
Is that a bug?  I'll file a report because I don't believe
leaving the response code at 500 is consistent with the
statement from the php.net page about eval():

"If there is a parse error in the evaluated code, eval()
returns FALSE and execution of the following code continues normally."

I don't think leaving the response code at 500 is consistent
with "continues normally".  I believe the fix is one line, adding
 EG(current_execute_data)->opline->extended_value != ZEND_EVAL
to the if clauses before setting the error header in main.c
at line 1132.

In case anyone finds my post from last night while doing a search 
of the archives, I'll explain more below and answer my question 
about debugging.

What I didn't realize at the time of my post last night is that 
browsers don't mind receiving a 500 as long as everything else 
looks good.  For example, the following web page:

<?php
eval('0+');
print "hello world\n";
?>

looks fine in a browser so I assumed (oops!) that it was returning
code 200.  If you try doing a wget on that example, it complains
about the response code 500.  (My big ugly application uses AJAX
and the 500 caused my AJAX framework to reject the page.)

Other than the code 500, everything seems to proceed normally.
All of the other code is executed normally.  The content of the
web page is normal and is displayed well in a browser unless you
have something checking for unhappy response codes.
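
Until the engine is changed, a userland workaround could look roughly
like the sketch below. It is an assumption-laden illustration, not a
tested fix for the bug: eval_keeping_status() is a hypothetical helper,
http_response_code() requires PHP 5.4 or later, and the catch block only
matters on PHP 7 and later, where a parse error in eval() throws
ParseError instead of returning FALSE.

```php
<?php
// Hypothetical helper: run eval() and restore the HTTP status code that
// was in effect before the call if a failure changed it.
function eval_keeping_status($code)
{
    // Returns false when no status code is available (e.g. on the CLI).
    $before = http_response_code();

    try {
        $result = @eval($code);
    } catch (ParseError $e) {
        // PHP 7+ reports parse errors as exceptions.
        $result = false;
    }

    if (is_int($before) && !headers_sent() && http_response_code() !== $before) {
        http_response_code($before);
    }

    return $result;
}

var_dump(eval_keeping_status('0+;'));
print "hello world\n";
```

Restoring the previous code, rather than forcing 200, keeps any status
the application had set deliberately before calling eval().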

The answer to my question about watching the headers in the
debugger turned out to be pretty easy:
watch sapi_globals.sapi_headers
watch sapi_globals.sapi_headers.http_response_code
It would not have been so simple with ZTS on.  (In that case,
the TSRM macros come into play.)



Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Pierre Joye

HI,

On Fri, Jun 1, 2012 at 5:02 PM, Gustavo Lopes glo...@nebm.ist.utl.pt wrote:

> In any case, I'll rename the classes before merging.

You may have missed part of my replies. One key part was: to discuss
it before doing anything.

This has only been a one-day discussion, and I don't feel like we have
a long-term decision about what to do in this area. Rather than dealing
with this one class only, I would prefer to solve this problem once
and for all (other intl classes/cases).

Cheers,
-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org


Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Stas Malyshev

Hi!

> I've wrapped ICU's BreakIterator and RuleBasedBreakIterator. I stopped
> short of adding a procedural interface. I think there's a larger
> expectation of having an OOP interface when working with iterators.
> What do you think? If there's no procedural interface, I'll change the
> instances of zend_parse_methods to zpp for performance.

Nice! I remember we had TextIterator in PHP 6; IIRC that was the reason
BreakIterator never found its way into intl.

> BreakIterator also exposes other native methods:
> getAvailableLocales(), getLocale() and factory methods to build
> several predefined types of BreakIterators: createWordInstance()
> for word boundaries, createCharacterInstance() for locale
> dependent notions of characters, createSentenceInstance() for
> sentences, createLineInstance() and createTitleInstance() -- for
> title casing breaks. These factories currently return

One thing I notice here is that with this API it is not possible to
choose the iteration unit programmatically - you'd have to do a
switch for that. Do you think it may be a good idea to have a generic
function that allows choosing the unit programmatically?

What is the notion of characters - is it grapheme characters? Is there
an option to iterate over code points too? Not sure if it's useful, just
curious, as we used to have it in PHP 6 IIRC.

About getAvailableLocales() - what does this actually do? Does it list
all available locales in the system, the ones that have BreakIterator
rules, or something else? If it's not related to BI, I'm not sure we
need to have it in BI. What is the intended usage of it? Maybe it should
be part of the Locale class?

> Note that BreakIterator is an iterator only in the sense of the
> first 'Iterator' in 'IteratorIterator', i.e., it does not
> implement the Iterator interface. The reason is that there is
> no sensible implementation for Iterator::key(). Using it for

Doesn't it have a notion of current position? If so, key should be the
current position.

Will this BreakIterator be usable in foreach? I'm not sure I understand
that from this description - understanding this without any usage
examples, RFCs or code snippets for intended usage is really hard, and I
think we should really start with providing those. I would expect this
class to work like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
   echo "Word number $i is $word\n";
}

or at least like this:

foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
   echo "Next word at position $i is: $word\n";
}

Is that the model? If not, I think we need to wrap the C API to make this
possible, because this is what people expect from an iterator in PHP.

> Finally, I added a convenience method to BreakIterator:
> getPartsIterator(). This provides an IntlIterator, backed
> by the BreakIterator PHP object (i.e. moving the pointer or
> changing the text in BreakIterator affects the iterator
> and also moving the iterator affects the backing BreakIterator),
> which allows traversing the text between each boundary.

How is that text being traversed - by code
points/characters/graphemes/bytes?

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227


[PHP-DEV] 5.4.3 type hint handling

2012-06-01 Thread Anatoliy Belsky

Hi,

I'm experiencing an issue adding type hints to the function prototypes.
The following definition gives the "unknown typehint" error when invoking
a function:

ZEND_BEGIN_ARG_INFO_EX(arg_info_trader_adosc, 0, 0, 4)
ZEND_ARG_TYPE_INFO(0,  high, IS_ARRAY, 0)
ZEND_ARG_TYPE_INFO(0,  low, IS_ARRAY, 0)
ZEND_ARG_TYPE_INFO(0,  close, IS_ARRAY, 0)
ZEND_ARG_TYPE_INFO(0,  volume, IS_ARRAY, 0)
ZEND_ARG_TYPE_INFO(0,  fastPeriod, IS_LONG, 1)
ZEND_ARG_TYPE_INFO(0,  slowPeriod, IS_LONG, 1)
ZEND_END_ARG_INFO();

The reason I tripped over this is that I want to generate the XML doc
proto for the extension. Therefore I'm using the extended ZEND_ARG_INFO
version. Without type hints there are no param types in the XML.

Quickly looking at the sources, I realize that 5.4.3 has an explicit type
hint check, which was previously ignored in 5.3:

http://lxr.php.net/opengrok/xref/PHP_5_4/Zend/zend_execute.c#600

The reason for writing this is not to start a new discussion about scalar
types, for God's sake no :), but just to point out the collision between
the current core and the doc generator. A simple way to fix this would be
to restore the old 5.3 behaviour of just passing scalar types through. Or
maybe there is a simple solution for this, even though 5.4.3 has already
been released?

Cheers

Anatoliy



Re: [PHP-DEV] BreakIterator

2012-06-01 Thread Gustavo Lopes


On Fri, 01 Jun 2012 11:31:13 -0700, Stas Malyshev wrote:
>> BreakIterator also exposes other native methods:
>> getAvailableLocales(), getLocale() and factory methods to build
>> several predefined types of BreakIterators: createWordInstance()
>> for word boundaries, createCharacterInstance() for locale
>> dependent notions of characters, createSentenceInstance() for
>> sentences, createLineInstance() and createTitleInstance() -- for
>> title casing breaks. These factories currently return
>
> One thing I notice here is that with this API it is not possible to
> programmatically choose what is the iteration unit - you'd have to do
> a switch for that. Do you think it may be a good idea to have a generic
> function that allows to choose the unit programmatically?

You can create a RuleBasedBreakIterator with any rules you choose. The
rules are basically a set of regular expressions; ICU has two matching
modes -- by default it tries the longest match, but it can also chain
together rules. There are rules to advance, to go back and to go to a
safe position from an arbitrary position in the two directions. The ICU
user guide to which I linked in the first e-mail has more details.

> What is the notion of characters - is it grapheme characters? Is
> there option to iterate over code points too - not sure if it's useful
> just curious, as we used to have it in PHP 6 IIRC.

Yes, they are grapheme clusters. ICU has a special rule for Thai, but
from what I see in the tracker, it's obsolete with recent versions of
Unicode (possibly the root rule is now generic enough).

To iterate over code points, you can build a very simple
RuleBasedBreakIterator -- new RuleBasedBreakIterator('.;'). See this
example here: https://gist.github.com/2843005
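
For readers on later PHP versions: in the intl extension as released
with PHP 5.5 the class is named IntlBreakIterator, and a ready-made
code point iterator is available through
IntlBreakIterator::createCodePointInstance(). The following is a
minimal sketch, assuming the intl extension is loaded;
code_point_boundaries() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the boundary offsets between code points.
// Offsets are positions in the UTF-8 encoded string, not character counts.
function code_point_boundaries($text)
{
    $bi = IntlBreakIterator::createCodePointInstance();
    $bi->setText($text);

    $boundaries = array();
    for ($pos = $bi->first(); $pos !== IntlBreakIterator::DONE; $pos = $bi->next()) {
        $boundaries[] = $pos;
    }

    return $boundaries;
}

if (extension_loaded('intl')) {
    // "n with tilde" is written as raw UTF-8 bytes to keep the source ASCII.
    print_r(code_point_boundaries("a\xC3\xB1b"));
}
```

The explicit first()/next() loop mirrors the ICU API described in this
thread, where the iterator yields boundary positions rather than text.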




> About getAvailableLocales() - what this actually does? Does it list
> all available locales in the system, ones that have BreakIterator
> rules, or something else? If it's not related to BI, I'm not sure we
> need to have it in BI. What is the intended usage of it? Maybe it
> should be part of Locale class?

Right now, the ICU implementation just calls
Locale::getAvailableLocales(), but its description is "Gets all the
available locales that has localized text boundary data." so I suppose
it could return a different set in the future.



>> Note that BreakIterator is an iterator only in the sense of the
>> first 'Iterator' in 'IteratorIterator', i.e., it does not
>> implement the Iterator interface. The reason is that there is
>> no sensible implementation for Iterator::key(). Using it for
>
> Doesn't it have a notion of current position? If so, key should be
> the current position.
>
> Will this BreakIterator be usable in foreach? I'm not sure I
> understand it from this description - understanding this without any
> usage examples, RFCs or code snippets for intended usage is really
> hard and I think we should really start with doing that. I would
> expect this class to work like this:
>
> foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
>    echo "Word number $i is $word\n";
> }
>
> or at least like this:
>
> foreach (BreakIterator::createWordInstance("blah blah blah") as $i => $word) {
>    echo "Next word at position $i is: $word\n";
> }
>
> Is it the model? If not, I think we need to wrap the C API to make
> this possible, because this is what people expect in PHP from the
> iterator.

My options here were: the BreakIterator mirrors its ICU namesake -- it
iterates over breaks, i.e., boundaries in the text. Hence, the iterator
returns the *positions* of the several boundaries. Therefore, the
position cannot also be used for the key.

Acknowledging that getting the text between the boundaries was going to
be a common scenario, I added a method, getPartsIterator(), that yields
the text between each boundary. Hence, there is one less element in this
iterator than in the BreakIterator.

Neither of the iterators implements getKey(), so when traversing, the
keys will be 0, 1, 2... It would probably be a good idea to change the
parts iterator to give the left boundary as the key. That way one could
do:


$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
foreach ($bi->getPartsIterator() as $k => $v) {
    echo "$v is at position $k\n";
}

instead of

$bi = BreakIterator::createWordInstance(NULL);
$bi->setText($foo);
$pos = $bi->first();
foreach ($bi->getPartsIterator() as $v) {
    echo "$v is at position $pos\n";
    $pos = $bi->current();
}
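
For reference, the intl extension as released with PHP 5.5 (where the
class is named IntlBreakIterator) lets the caller pick the key type of
the parts iterator, and IntlPartsIterator::KEY_LEFT gives the left
boundary as the key. A minimal sketch, assuming the intl extension is
loaded; parts_by_left_boundary() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: maps each part's left boundary offset to its text.
function parts_by_left_boundary($text, $locale = 'en')
{
    $bi = IntlBreakIterator::createWordInstance($locale);
    $bi->setText($text);

    $parts = array();
    foreach ($bi->getPartsIterator(IntlPartsIterator::KEY_LEFT) as $pos => $part) {
        $parts[$pos] = $part;
    }

    return $parts;
}

if (extension_loaded('intl')) {
    foreach (parts_by_left_boundary('foo bar') as $pos => $part) {
        echo "'$part' is at position $pos\n";
    }
}
```

Note that the word iterator also reports the whitespace between words
as a part, so callers that only want words need to filter the result.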

Another possibility would be to have the break iterator itself behave
as the parts iterator for iteration purposes. I don't think that is a
good idea. Even though BreakIterator does not implement Iterator, people
would expect next() and current() to return the next and current iterator
values, while they would be returning the iteration keys.


By the way, you can look at the test cases in the tree on github for 
examples: 
https://github.com/cataphract/php-src/commit/d289c3977ed4ba8d9ba127e5af9f709b19b8e1ba


Thanks for the comments!

--
Gustavo Lopes


Re: [PHP-DEV] 5.4.3 type hint handling

2012-06-01 Thread Felipe Pena

Hi,

2012/6/1 Anatoliy Belsky a...@php.net:
> Hi,
>
> I'm experiencing an issue adding type hints to the function prototypes.
> The following definition gives the "unknown typehint" error when invoking
> a function
>
> ZEND_BEGIN_ARG_INFO_EX(arg_info_trader_adosc, 0, 0, 4)
>    ZEND_ARG_TYPE_INFO(0,  high, IS_ARRAY, 0)
>    ZEND_ARG_TYPE_INFO(0,  low, IS_ARRAY, 0)
>    ZEND_ARG_TYPE_INFO(0,  close, IS_ARRAY, 0)
>    ZEND_ARG_TYPE_INFO(0,  volume, IS_ARRAY, 0)
>    ZEND_ARG_TYPE_INFO(0,  fastPeriod, IS_LONG, 1)
>    ZEND_ARG_TYPE_INFO(0,  slowPeriod, IS_LONG, 1)
> ZEND_END_ARG_INFO();

We do not use ZEND_ARG_TYPE_INFO() with types that are not covered by
type hint support (i.e. string, integer, double, resource).

-- 
Regards,
Felipe Pena


Re: [PHP-DEV] NEWS again

2012-06-01 Thread Christopher Jones




On 06/01/2012 12:48 AM, Gustavo Lopes wrote:

> If the RMs are unwilling to do such merging, we should change the
> policy to require updating the NEWS files in every stable branch to
> which the fix was merged.

This makes sense to me.

Chris

--
christopher.jo...@oracle.com
http://twitter.com/#!/ghrd
