Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-24 Thread Kasia Tuszynska
Hi, 
We have a customer in Japan who would be interested in this fix, in the future. 
Would you like me to enter it as an official Postgres bug?
Sincerely,
Kasia 

-----Original Message-----
From: Tatsuo Ishii [mailto:is...@postgresql.org]
Sent: Tuesday, March 22, 2011 10:17 PM
To: itagaki.takah...@gmail.com
Cc: Kasia Tuszynska; pgsql-bugs@postgresql.org
Subject: Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

> Agreed, if the encoding is added as a user-defined encoding.
> I don't want to add any more built-in encodings just for the Japanese language.

I do not agree here. Adding one more encoding/conversion is not a big
deal.

Anyway, these solutions would become available only after one or two
releases at the earliest. The realistic solution available today is to
replace the default conversion between EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-24 Thread Itagaki Takahiro
On Fri, Mar 25, 2011 at 03:33, Kasia Tuszynska <ktuszyn...@esri.com> wrote:
> We have a customer in Japan who would be interested in this fix, in the
> future. Would you like me to enter it as an official Postgres bug?

Not a bug at all -- there are at least 3 versions of the EUC-JP encoding,
and postgres supports just one of them. I think that won't change in the
near term. So, for now, you would need to define a CONVERSION for your
purpose.

However, I think we could provide a set of conversion procedures for the
ambiguous Japanese encodings as an extension outside of core.

-- 
Itagaki Takahiro



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-24 Thread Tatsuo Ishii
> We have a customer in Japan who would be interested in this fix, in the
> future. Would you like me to enter it as an official Postgres bug?
> Sincerely,

As I stated before, I don't regard this as a bug.

BTW, I wonder why you don't use CREATE CONVERSION, which can solve your
customer's problem today...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Tatsuo Ishii
> Hi,
> I was wondering if this was considered a bug, and if so what were the plans
> to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php
>
> I searched the: pgsql-bug archive and found nothing
> I also searched the wiki to do list and found nothing
> But I could have missed it.

I don't consider it a bug.

We map the WAVE DASH of EUC-JP (0xa1c1) to U+FF5E, not U+301C. U+FF5E
and U+301C look the same, but they are different code points for some
reason I don't know. On the other hand, EUC-JP has only one code point
for WAVE DASH. So if we want to do a round trip conversion between
EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
chosen U+FF5E. If we change the mapping, many existing applications
would break.
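
For reference, the byte sequence 0xe3809c in the error message is the UTF-8
form of U+301C. The following is a minimal sketch in Python, used here only
for illustration (it is not PostgreSQL code, and the helper name describe is
hypothetical). It prints both look-alike code points with their Unicode names
and UTF-8 bytes:

```python
import unicodedata

# The two look-alike code points discussed above.
WAVE_DASH = "\u301c"        # the character rejected in the error message
FULLWIDTH_TILDE = "\uff5e"  # the character mapped to EUC-JP 0xa1c1


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME -> 0x<utf-8 bytes in hex>' for one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} -> 0x{ch.encode('utf-8').hex()}"


for ch in (WAVE_DASH, FULLWIDTH_TILDE):
    print(describe(ch))
# U+301C WAVE DASH -> 0xe3809c
# U+FF5E FULLWIDTH TILDE -> 0xefbd9e
```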

The same can be said of the MINUS sign.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Itagaki Takahiro
On Wed, Mar 23, 2011 at 08:05, Kasia Tuszynska <ktuszyn...@esri.com> wrote:
> I was wondering if this was considered a bug, and if so what were the plans
> to fix it: http://archives.postgresql.org/pgsql-bugs/2005-08/msg00211.php

The wave dash issue is not postgres-specific; some other converters just
replace the character with '?', whereas postgres throws an error.
I guess there is no possibility of supporting ambiguous character mappings
in the default conversions, but you can define more relaxed conversion
procedures for your purpose.


BTW, we cannot use non-default conversion procedures from SQL commands,
right?  If that were allowed, we could use a relaxed conversion
for the initial load, like this:

=# SET character_conversion TO utf8_to_eucjp_relaxed;
=# COPY tbl FROM '/file_with_wave_dashes.utf8.tsv';
=# RESET character_conversion;

Another idea is to allow creating new encoding names and to define
the above conversion procs as their default:

=# CREATE ENCODING eucjp_relaxed;
=# CREATE DEFAULT CONVERSION xxx FOR utf8 TO eucjp_relaxed
 FROM utf8_to_eucjp_relaxed;

I think an overhaul of conversion support is a TODO item.

-- 
Itagaki Takahiro



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Itagaki Takahiro
On Wed, Mar 23, 2011 at 10:58, Tatsuo Ishii <is...@postgresql.org> wrote:
> So if we want to do a round trip conversion between
> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
> chosen U+FF5E. If we change the mapping, many existing applications
> would break.

I have heard requests a few times for an additional one-directional
conversion from U+301C to EUC-JP (0xa1c1). It should not break existing
applications. We already have non-round-trip conversions for IBM and NEC
extended characters in SJIS. The policy does not seem so strict to me.
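
The effect of such a one-directional entry can be sketched with a toy model.
The Python below is only an illustration: the table names and the to_eucjp
helper are hypothetical, and the one-entry tables are assumptions standing in
for the real conversion tables. It shows that data starting in EUC-JP still
round-trips, while data starting as U+301C is accepted but comes back as
U+FF5E:

```python
# Toy model of the mapping discussed above; these one-entry tables are
# illustrative assumptions, not PostgreSQL's actual conversion tables.
EUCJP_WAVE_DASH = b"\xa1\xc1"

# EUC-JP -> Unicode: exactly one target, as in the current default mapping.
TO_UNICODE = {EUCJP_WAVE_DASH: "\uff5e"}

# Unicode -> EUC-JP: the strict table is the exact inverse of TO_UNICODE ...
STRICT_TO_EUCJP = {"\uff5e": EUCJP_WAVE_DASH}
# ... and the relaxed table adds a one-directional entry for U+301C.
RELAXED_TO_EUCJP = {**STRICT_TO_EUCJP, "\u301c": EUCJP_WAVE_DASH}


def to_eucjp(ch: str, table: dict) -> bytes:
    """Convert one character, raising an error if the table has no entry."""
    if ch not in table:
        raise ValueError(f"character U+{ord(ch):04X} has no equivalent in EUC_JP")
    return table[ch]


# Data that starts in EUC-JP still round-trips under the relaxed table.
assert to_eucjp(TO_UNICODE[EUCJP_WAVE_DASH], RELAXED_TO_EUCJP) == EUCJP_WAVE_DASH

# Data that starts as U+301C is accepted, but is read back as U+FF5E.
stored = to_eucjp("\u301c", RELAXED_TO_EUCJP)
print(f"U+{ord(TO_UNICODE[stored]):04X}")  # U+FF5E
```

In this model nothing changes for existing EUC-JP data, but a UTF-8 client
that writes U+301C does not read the same code point back.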

Anyway, we might need to revisit this area in the near term because of the
Unicode emoji issue.

-- 
Itagaki Takahiro



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Tatsuo Ishii
>> So if we want to do a round trip conversion between
>> EUC-JP and UTF-8, we have to choose either U+FF5E OR U+301C. We have
>> chosen U+FF5E. If we change the mapping, many existing applications
>> would break.
>
> I have heard requests a few times for an additional one-directional
> conversion from U+301C to EUC-JP (0xa1c1). It should not break existing
> applications. We already have non-round-trip conversions for IBM and NEC
> extended characters in SJIS. The policy does not seem so strict to me.

Doesn't breaking round-trip conversion between EUC-JP and UTF-8 itself
break backward compatibility?

I think the best we can do here is to add a new encoding and a default
conversion for it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Itagaki Takahiro
On Wed, Mar 23, 2011 at 13:02, Tatsuo Ishii <is...@postgresql.org> wrote:
> I think the best we can do here is to add a new encoding and a default
> conversion for it.

Agreed, if the encoding is added as a user-defined encoding.
I don't want to add any more built-in encodings just for the Japanese language.

-- 
Itagaki Takahiro



Re: [BUGS] ERROR: character 0xe3809c of encoding UTF8 has no equivalent in EUC_JP

2011-03-22 Thread Tatsuo Ishii
> Agreed, if the encoding is added as a user-defined encoding.
> I don't want to add any more built-in encodings just for the Japanese language.

I do not agree here. Adding one more encoding/conversion is not a big
deal.

Anyway, these solutions would become available only after one or two
releases at the earliest. The realistic solution available today is to
replace the default conversion between EUC-JP and UTF-8.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
