[PHP] Using preg_match to find Japanese text

2006-08-05 Thread Dave M G

PHP list,

While I'm only just learning about regular expressions in another 
thread, I still seem to be finding exceptional situations which have me 
questioning the extent to which preg expressions can be implemented.


(The following contains UTF-8 encoded Japanese text. Apologies if it 
comes out as ASCII gibberish.)


What I have are sentences that look like this:
気温 【きおん】 (n) atmospheric temperature; (P); EP
について (exp) concerning; along; under; per; KD

I want to divide the first line into three variables, $word, $reading, 
and $meaning. And I want to divide the second line into two variables, 
$word and $meaning.


If I can figure out how to extract the first variable, $word, then I can 
figure out the rest. But that first step seems to be a doozy.


The way I see it, I could do it two ways. One is to pull out all the 
characters up to the first occurrence of a space, and assume that it's 
Japanese. Not that I'm sure how to write that expression, but maybe I 
could.
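
A minimal sketch of this first approach, assuming the headword never contains whitespace and the input is valid UTF-8. The function name `first_word` is hypothetical and is used here only for illustration.

```php
<?php
// Hypothetical helper: return everything before the first whitespace character.
// Assumes $line is valid UTF-8 and the headword itself contains no whitespace.
function first_word($line)
{
    // \S+ matches a run of non-whitespace characters; the u modifier makes
    // PCRE treat the pattern and the subject as UTF-8 rather than raw bytes.
    if (preg_match('/^(\S+)/u', $line, $m)) {
        return $m[1];
    }
    return null; // empty line, or a line that starts with whitespace
}

$word = first_word('気温 【きおん】 (n) atmospheric temperature; (P); EP');
echo $word, "\n";
```

This relies entirely on position, so a line with leading whitespace or a missing headword would need separate handling.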


But it seems like it would be a lot more sophisticated if I could 
determine if a word was Japanese by testing its Unicode value or some 
similar method. That way I would be less vulnerable to slight 
variabilities in positioning of words in the source material.
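
One way to test by character type rather than by position is to use PCRE's Unicode script properties together with the `u` modifier. The sketch below assumes a PCRE build with Unicode property support and valid UTF-8 input; the function name `leading_japanese` is hypothetical.

```php
<?php
// Hypothetical helper: return the leading run of Japanese characters, or null.
// Assumes PCRE has Unicode property support and $text is valid UTF-8.
function leading_japanese($text)
{
    // Han covers kanji; Hiragana and Katakana cover the two kana scripts.
    // U+30FC (the prolonged sound mark) has the Unicode script "Common",
    // so it is listed explicitly alongside the script properties.
    $pattern = '/^[\p{Han}\p{Hiragana}\p{Katakana}\x{30FC}]+/u';
    if (preg_match($pattern, $text, $m)) {
        return $m[0];
    }
    return null; // the text does not start with kanji or kana
}

echo leading_japanese('について (exp) concerning; along; under; per; KD'), "\n";
```

Note that the Han script also covers Chinese text, so this identifies the writing system, not the language.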


Looking at all the multibyte related functions in the PHP manual, it 
seems there are options for testing the type of encoding, but not for 
the type of language or character set.

http://jp2.php.net/manual/en/ref.mbstring.php
However, I could be wrong about this (and it would be nice if I were).

Searching the web, I came across this guy's script to test if characters 
were above the usual ASCII range in Unicode, and could therefore be 
assumed to be Japanese:

http://www.randomchaos.com/documents/?source=php_and_unicode

But this seems unwieldy, as I think, if I understand it correctly, I'd 
have to test each individual word. I could use it to test if there was 
any Japanese at all in a string, but I'm not confident I could use it to 
extract words.


So I'm a little stuck. If anyone has any advice to help get me started, 
it would be much appreciated.


Thank you for your time and help.

--
Dave M G

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Using preg_match to find Japanese text

2006-08-06 Thread MOKULEN_IMADICA
Dear all and Dave,

>I want to divide the first line into three variables, $word, $reading, 
>and $meaning. And I want to divide the second line into two variables, 
>$word and $meaning.

What you (Dave) want seems to resemble what I tackled last month.
I tried to sort out some Japanese characters which can be input to my
server but cannot be displayed correctly in the main Japanese mobile
phone browsers through my poor program.

>The following contains UTF-8 encoded Japanese text. 

If you specify EUC-JP in the [mbstring] section of "php.ini"
(this is the usual way when using Japanese characters in PHP),
I propose first changing that to Shift-JIS to solve this problem.
You probably specify UTF-8 as the charset in your ".php" files.
I propose changing that to Shift-JIS as well.

As far as I have tried, it is difficult to extract a specific Japanese
word.

I think that you should add a processing step on your server which
first replaces each Japanese word with another language (English).

In my case, I first made a list of the words which aren't displayed
correctly in the main Japanese mobile phone browsers through my poor
program. Then I checked each word against that list and programmed it
to sort out every "non-displayed word" (and replace it with an image).

What I wrote above is probably not the advice which you (Dave) asked for.
I have not yet looked at the linked pages in your e-mail. I am sorry for
my poor advice. Anyway, I am sending this to you.

Thank you for being interested in Japanese. >Dave

Thank you for developing [mbstring] in php.ini. >developers
Madoca





Re: [PHP] Using preg_match to find Japanese text

2006-08-07 Thread Richard Lynch
On Sat, August 5, 2006 9:06 pm, Dave M G wrote:
> While I'm only just learning about regular expressions in another
> thread, I still seem to be finding exceptional situations which have me
> questioning the extent to which preg expressions can be implemented.
>
> (The following contains UTF-8 encoded Japanese text. Apologies if it
> comes out as ASCII gibberish.)
>
> What I have are sentences that look like this:
> 気温 【きおん】 (n) atmospheric temperature; (P); EP
> について (exp) concerning; along; under; per; KD

Can you be sure that '(' will not appear in the Japanese part?

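// [^(]* stops at the first '(' (a greedy .* would run on to the last one);
// the u modifier makes PCRE treat the text as UTF-8.
preg_match('/^([^(]*)(\\(.*)$/u', $text, $parts);
echo "Japanese: $parts[1]\n";
echo "Definition: $parts[2]\n";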

Then you could break apart the Japanese part based on whether there
are or aren't the delimiters for the "reading" -- they looked kinda
like parentheses before my ascii-centric email munged them.

You might even be able to combine it all into one big preg_match if
you worked at it.
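
A rough sketch of what such a single expression might look like, assuming the reading (when present) is wrapped in 【 】 and the definition starts at the first '(' after it. The function name `parse_entry` and the array keys are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper: split a dictionary line into word, reading and meaning.
// Assumes valid UTF-8, an optional reading inside 【 】, and a meaning that
// begins with "(" right after the word or the reading.
function parse_entry($line)
{
    // Group 1: the word (non-whitespace run). Group 2: the optional reading.
    // Group 3: the rest of the line, starting at the opening parenthesis.
    $pattern = '/^(\S+)\s+(?:【([^】]*)】\s+)?(\(.*)$/u';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return array(
        'word'    => $m[1],
        'reading' => $m[2] === '' ? null : $m[2], // '' when there is no 【 】 part
        'meaning' => $m[3],
    );
}

print_r(parse_entry('気温 【きおん】 (n) atmospheric temperature; (P); EP'));
print_r(parse_entry('について (exp) concerning; along; under; per; KD'));
```

Lines that do not fit this layout return null, so they can be logged and inspected rather than silently mis-split.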

-- 
Like Music?
http://l-i-e.com/artists.htm




Re: [PHP] Using preg_match to find Japanese text

2006-08-08 Thread Dave M G

Richard, Madoka,

Thank you for your insights into searching for Japanese characters.

I've decided to stick with searching for words as determined by the 
placement of spaces within the source text.


Thank you for your time and advice.

--
Dave M G
