Re: IranSystem to Unicode (UTF-8) converter
> Hello, I don't know whether you have this software or not; if you do, please send it to me.

I wrote a PHP script to do just that a couple of days ago at work. It's relatively simple, using Roozbeh Pournader's conversion table. All you have to do is read the input string byte by byte and output the corresponding UTF-8 sequences in reverse order. The only gotcha I faced was that Latin characters (or numbers) in the middle of the text must not be reversed. This is caused by the way IranSystem encodes strings.

Ehsan
___
PersianComputing mailing list
PersianComputing@lists.sharif.edu
http://lists.sharif.edu/mailman/listinfo/persiancomputing
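The approach described above can be sketched roughly as follows. This is not the PHP script from the message: the byte values in `DEMO_TABLE` are hypothetical placeholders rather than real IranSystem code points, and treating a group of Latin words separated by spaces as a single run is an assumption of this sketch.

```python
import re

# HYPOTHETICAL table for illustration only: these byte values are placeholders,
# not the real IranSystem code points. A real converter would load the full
# conversion table instead.
DEMO_TABLE = {
    0x90: "\u0627",  # ARABIC LETTER ALEF
    0x91: "\u0628",  # ARABIC LETTER BEH
    0x92: "\u0633",  # ARABIC LETTER SEEN
}

# A Latin run: ASCII letters/digits, possibly several words separated by spaces.
LATIN_RUN = re.compile(r"[A-Za-z0-9]+(?: +[A-Za-z0-9]+)*")


def visual_to_logical(data: bytes, table=DEMO_TABLE) -> str:
    """Convert one line of visually ordered bytes to a logically ordered string."""
    # Map bytes one at a time; ASCII passes through, unknown bytes become U+FFFD.
    chars = [table.get(b, chr(b) if b < 0x80 else "\ufffd") for b in data]
    # Reverse the whole line to get the reading order of the Persian text.
    reversed_text = "".join(reversed(chars))
    # The reversal also flipped Latin letters and numbers, so flip them back.
    return LATIN_RUN.sub(lambda m: m.group(0)[::-1], reversed_text)


if __name__ == "__main__":
    sample = bytes([0x92, 0x91, 0x20]) + b"abc 12" + bytes([0x20, 0x90])
    print(visual_to_logical(sample).encode("utf-8"))
```

The function returns a Python string; calling `.encode("utf-8")` on the result gives the UTF-8 output.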
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
> Dear Ehsan,
> You suggested a creative solution. Thank you.
> My application consists of a database and two user interfaces.
> The first UI is used for data entry, where I parse a given XML file, extract and "Romanize" its data based on a "Persian-Roman Conversion Map", and then insert it into the DB. Luckily, PHP provides a very fast function for such conversions, named strtr(). Now I have a "Roman DB".
> The second UI is used for data retrieval (searching), where I "Romanize" the given search argument and look for it through the DB records. The results are decoded and converted back to Persian before being sent to stdout.

I've actually implemented this approach in a project. I have not yet published the code, but if you want, I can make it available under the GPL.

Ehsan
Re: MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode
> One solution would be to augment a DB capability at the application level. That is, instead of a search or select qualified by a SQL where clause, simply get everything (select *) and then let the application filter what you want. Then, when your given DB provides that operation by itself, simplify your application and delegate that to the DB (Query Engine).

Another solution is to make the db believe your text is English. This could be done by "romanizing" the text before inserting it into the db, and converting it back to Unicode after reading it from the db and before displaying it to the user. To do so, choose a Roman letter for each Persian letter, then read the Persian characters one by one, look each of them up in a conversion table, and write the equivalent Roman character to the output.

However, this has a downside: IIRC MySQL's full-text search is case-insensitive, and if I'm right about that, you'd have to choose all the Roman characters from one case (upper or lower). In addition, the data stored in the db might be difficult or impossible to use without such a conversion. You are the one who should judge the tradeoffs before deciding whether to use this method.

For some good romanizing scripts, check out http://home.byu.net/jmd56/download.html.

Ehsan
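As a rough illustration of the table lookup described above, here is a minimal Python sketch (the thread itself uses PHP's strtr()). The four-letter mapping is hypothetical and far from complete; the only properties it tries to demonstrate are that the table is one-to-one and uses a single case.

```python
# HYPOTHETICAL, incomplete mapping used only for illustration.
# All Roman letters are lowercase, so a case-insensitive search cannot
# confuse two different Persian letters.
PERSIAN_TO_ROMAN = {
    "\u0628": "b",  # BEH
    "\u0627": "a",  # ALEF
    "\u0631": "r",  # REH
    "\u0646": "n",  # NOON
}
ROMAN_TO_PERSIAN = {roman: persian for persian, roman in PERSIAN_TO_ROMAN.items()}

# The mapping must be one-to-one, otherwise the text cannot be converted back.
assert len(ROMAN_TO_PERSIAN) == len(PERSIAN_TO_ROMAN)

TO_ROMAN = str.maketrans(PERSIAN_TO_ROMAN)
TO_PERSIAN = str.maketrans(ROMAN_TO_PERSIAN)


def romanize(text: str) -> str:
    """Replace each mapped Persian letter before storing the text in the db."""
    return text.translate(TO_ROMAN)


def deromanize(text: str) -> str:
    """Convert stored text back to Persian before displaying it."""
    return text.translate(TO_PERSIAN)


if __name__ == "__main__":
    word = "\u0628\u0627\u0631\u0627\u0646"
    stored = romanize(word)
    print(stored, deromanize(stored) == word)
```

Note that `deromanize` would also convert any genuine Latin text stored in the same column, which is one face of the "difficult to use without conversion" tradeoff mentioned above.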
Re: problem in mysql data display
Sadeq Naqashzade wrote:
> Salaam,
> One of my friends has the same problem (but I do not). I'm using mysqli and he is using the mysql extension. Try mysqli; this may help you.
> - Sadeq

Thanks, but I wasn't the one who asked the question! I'm CCing the OP as well as the list.

Ehsan
Re: problem in mysql data display
mzz wrote:
> Hi everyone,
> I have a problem with a MySQL database. When I review my table, which contains data in Persian, in phpMyAdmin, I can see and edit the data correctly, but when I use my own PHP script to query my tables, it displays the table data as '?' (question marks). I am using MySQL server 4.1, PHP 4.xx, and UTF-8 encoding in my pages. OS: Win2000 Server.
> Regards,
> zarbizade.

Can you dump the table into a file from the PHP script and then make sure the data in the file is correct (and in UTF-8 encoding)?

Ehsan
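One way to check such a dump is sketched below in Python. The function name and the three result labels are made up for this example; it only distinguishes the cases that matter for this kind of debugging.

```python
def diagnose_dump(data: bytes) -> str:
    """Classify the raw bytes of a dump file written by the script."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are in some other encoding (for example a legacy code page).
        return "not-utf8"
    if any("\u0600" <= ch <= "\u06ff" for ch in text):
        # Arabic-script characters survived, so the data itself is fine.
        return "ok"
    if "?" in text:
        # Literal question marks: the text was already lost before it was written,
        # which points at the connection or conversion step, not at the page.
        return "question-marks"
    return "no-persian-text"


if __name__ == "__main__":
    print(diagnose_dump("\u0633\u0644\u0627\u0645".encode("utf-8")))
    print(diagnose_dump(b"????"))
    print(diagnose_dump(b"\xd3\xe1\xc7\xe3"))
```

In practice the bytes would come from reading the dump file in binary mode, for example `open(path, "rb").read()`.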
Number display in Firefox
Hi all,

I just found something cool in Firefox which I had not come across before, and thought some of you might not know about it either. As far as I can tell this is related to Gecko, so it should affect all Mozilla-based applications, though I have not tested it anywhere except Firefox 1.0.

The default rendering behavior for numbers appearing inside Persian text in Mozilla is to show them as Latin digits (1 2 3 ...), whereas in IE it depends on the context (whether the direction of the containing text is rtl or ltr). To make Firefox respect the direction of the text in this regard, you can add the following line to your user.js file:

user_pref("bidi.numeral", 1);

which sets the number rendering mode to "context". This enables ASCII digits entered inside Persian text to be rendered as Persian numbers (۱ ۲ ۳ ...). Of course, this does not affect the rendering of numbers explicitly entered using Unicode character codes.

FWIW,
Ehsan
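To illustrate what "context" rendering means, here is a rough Python sketch. It is not Gecko's implementation: it simply assumes that an ASCII digit takes its shape from the nearest preceding strongly directional character, and it rewrites the text itself, whereas the browser only changes how the digits are drawn.

```python
import unicodedata

# ASCII digits 0-9 mapped to the Persian digits U+06F0..U+06F9.
PERSIAN_DIGITS = {ord("0") + i: chr(0x06F0 + i) for i in range(10)}


def contextual_digits(text: str) -> str:
    """Show ASCII digits as Persian digits when they follow right-to-left text."""
    out = []
    rtl_context = False
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("AL", "R"):      # strong right-to-left character
            rtl_context = True
        elif bidi == "L":            # strong left-to-right character
            rtl_context = False
        if "0" <= ch <= "9" and rtl_context:
            ch = ch.translate(PERSIAN_DIGITS)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(contextual_digits("\u0633\u0627\u0644 2004"))
    print(contextual_digits("year 2004"))
```

Digits that were already typed as Persian digits are left alone, which matches the remark above about explicitly entered Unicode characters.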
RE: A new Persian Unicode keyboard
> The problem, as some of you might have guessed, is the direction switching. Given an application like MS Word, my keyboard correctly sends the characters, and Word gives them the right form. But sometimes some characters (mainly the “shared” chars), and often the blinking caret, appear on the wrong side of the line. What can be done to make the shared characters (like “!”) appear on the correct side?

The caret problem can be fixed with Word’s RTL command. But mixing English and Persian letters in the same line often leads to unpredictable outcomes. The rule of thumb is: use RTL paragraphs when writing Persian text (which might contain English text within it), and use LTR paragraphs when writing English text (which might contain Persian text within it).

> Is there an algorithm governing these situations that I can use to modify the output to remedy this?

There is an algorithm called the Unicode Bidirectional Algorithm, the details of which are available on Unicode.org. As you might have guessed, Word doesn't provide a correct implementation of this algorithm (nor does any other text editor that I know of to date). There's a library being developed called FriBidi, of which Behdad is the project maintainer, IIRC. It might help you, but probably not with Word. I guess Behdad would be able to make more profound comments on this.

-Ehsan Akhgari
www.farda-tech.com
List Owner: MSVC@BeginThread.com
[WWW: http://www.beginthread.com/Ehsan ]
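The reason "shared" characters such as "!" move around is that the Bidirectional Algorithm treats them as neutral: they have no direction of their own and take one from the surrounding text or from the paragraph direction. A small Python sketch using the standard `unicodedata` module shows the classes involved; the helper name is made up for this example.

```python
import unicodedata


def bidi_classes(text: str):
    """Return (character, bidirectional class) pairs for each character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


if __name__ == "__main__":
    # ALEF is a strong right-to-left letter (AL), "a" is strong left-to-right (L),
    # while "!" is an "other neutral" (ON) and the space is whitespace (WS).
    for ch, cls in bidi_classes("\u0627a! 1"):
        print(repr(ch), cls)
```

Because "!" is neutral, a trailing "!" after Persian text lands on the left only if the paragraph direction is RTL, which is consistent with the rule of thumb given above.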
RE: Persian in Windows Applications
> I'm going to develop a Windows application and I want to use Persian in the user interface. I'm using Windows XP and Unicode in the programming language. But is there any trick or rule to make the application work fine on older Windows versions (98, ME)? Or does just using Unicode make everything fine?

Win9x does not support Unicode internally. M$ has developed the so-called MSLU[1], which provides Unicode compatibility at the Windows API level for Win9x. I have used it, and it indeed works, but be warned that these OSes do *not* support Unicode anyway, and all MSLU can do is implement API stubs for the Unicode versions of Win32 functions (such as CreateFileW), which allows you to build your app in Unicode mode in Visual C++.

What I've ended up doing in the past is building the whole UI as HTML and embedding an HTML rendering engine in my app. I've used the WebBrowser control (the same control used by IE). This requires you to distribute a customized[2] version of IE with "Arabic" support built in along with your app, and to write some JavaScript code to enable the user to type Persian in your application even if they don't have a Persian keyboard installed (you can find several JS scripts on the web as starting points for this purpose). You can also use Gecko, Mozilla's great HTML rendering engine, instead.

If you decide to use the WebBrowser control, check out http://www.beginthread.com/Article/Ehsan/WebBrowser%20Goodies/ for some articles about customizations of the control that you may need in your own applications.

All of this, of course, applies to Visual C++. If you use some other programming tool, then you'll have to research on your own, though I think that few support MSLU.

[1] You can download it from http://www.microsoft.com/msdownload/platformsdk/sdkupdate/psdkredist.htm.
[2] You can deploy a customized IE install using the IE Administration Kit (IEAK).

-Ehsan Akhgari
RE: Farsi in MS Powerpoint
> Hi
>
> I would like to write Farsi in Microsoft PowerPoint for presentation
> purposes. Would it be possible at all? If yes, how can this be done?
> What alternatives are available?
>
> I appreciate your help.

It is possible. You should simply switch to a Persian keyboard and type your text. I seem to remember that some versions of MS PowerPoint did not support right-to-left text properly (I don't remember exactly what the problem was).

A very good alternative to MS PowerPoint is OpenOffice.org (www.openoffice.org) version 1.1.3. I have used it to create Persian presentations with no problems.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
RE: openoffice & zwnj
> That's a famous bug that will happen in applications. KDE also had
> that bug for quite a time until Behdad fixed it. The bug is because
> the application or the rendering engine asks the font for a glyph for
> the character, where it shouldn't.
> The application or the rendering engine should not pass ZWNJ (and a
> few other "invisible" Unicode characters) down.

Great to know it's been fixed. Do you know exactly which version of KDE first included the fix? I've noticed that this bug seriously affects the usability of KDE for Persian computing.

Thanks,
- Ehsan Akhgari
RE: Parsnegar to Unicode conversion AND phonetic Farsi keyboard with English keyboard
Mr. Khazaee misdirected this email to me personally, so I thought I'd send it to the whole list.

> -----Original Message-----
> From: khazaee
> Sent: 2004/12/18 10:27 AM
> To: Ehsan Akhgari
> Subject: RE: Parsnegar to Unicode conversion AND phonetic Farsi keyboard with English keyboard
>
> Do you want to define a user-defined keyboard for the Linux operating system or not?
> For Linux, you can refer to the Persian keyboard on farsilinux.org.
> You can change the position of the Persian letters on your keyboard easily.
> Regards.
>
> ---------- Original Message ----------
> From: "Ehsan Akhgari"
> Date: Fri, 17 Dec 2004 22:50:03 +0330
>
> > > Also, I was wondering if anyone knows a way of defining a user-defined
> > > keyboard to use with Farsi Unicode, similar to Parsnegar which allows
> > > to define a phonetic Farsi keyboard with English keyboards, so that,
> > > when typing in Microsoft Word in Farsi, I could use key "J" for letter
> > > "jim", "A" for letter "alef", etc.
> >
> > You need your custom keyboard layout. M$ has a tool for that:
> > Microsoft Keyboard Layout Creator. You can use it to create your fully
> > (well, nearly fully) customized keyboard layout for Windows.
> >
> > - Ehsan Akhgari
RE: Parsnegar to Unicode conversion AND phonetic Farsi keyboard with English keyboard
> Also, I was wondering if anyone knows a way of defining a user-defined keyboard to use with Farsi Unicode, similar to Parsnegar which allows to define a phonetic Farsi keyboard with English keyboards, so that, when typing in Microsoft Word in Farsi, I could use key “J” for letter “jim”, “A” for letter “alef”, etc.

You need your custom keyboard layout. M$ has a tool for that: Microsoft Keyboard Layout Creator. You can use it to create your fully (well, nearly fully) customized keyboard layout for Windows.

-Ehsan Akhgari
RE: Miscellaneous web issues
> Roozbeh, it has been a long time and I don't remember your answer to this
> email. What happened to this new dll?

AFAIK, it still has not been put on SourceForge. If you're interested, I can mail it to you off-list.

- Ehsan Akhgari
RE: farsiweb.info
> Humm, would you check http://farsitex.org/? I think it worked in IE
> when I designed it.

Done. It looks pretty good; only the non-link items in the left-hand menu might not be very readable (or it might be my less-than-perfect eyesight).

- Ehsan Akhgari
RE: farsiweb.info
> Ah, that's a good sign, that none of us at FarsiWeb uses IE anymore!
> BTW, IIRC, 8bit transparent PNG works in IE too.

I'm not sure. What I can say for sure is that the image won't render correctly in IE. Hmm, BTW, at a second look, IE fails to render the layout correctly as well! Of course that's not as bad as how the background image looks.

- Ehsan Akhgari
RE: farsiweb.info
> Hi friends,
>
> The FarsiWeb Project's website <http://farsiweb.info/> is now
> up-to-date with a new Wiki system.

Congrats on the new site! I took a quick look, and I have a comment regarding the design. It seems to me that you're using a transparent PNG file as the background for the pages. IE doesn't support this feature of PNG files correctly, so the pages render half unreadable in IE. I suggest changing this, and the easiest way would be not to use a transparent PNG (no need for that, anyway - just let the background be white). Fortunately, real browsers (Firefox and Mozilla) render it just fine!

Other than that, the layout seems very nice. Thanks for your efforts.

- Ehsan Akhgari
RE: Vi/Emacs editor with RTL support
> Not anything really useful. Vim has a rightleft mode (:set
> rightleft), which is useful for ONLY RIGHT-TO-LEFT text.
>
> Emacs, it's worse: there's an emacs-unicode branch, an
> emacs-bidi branch, and the emacs-head branch. They are
> trying to merge the three of them for a few years now!

Thanks for your reply, Behdad. So, is there any editor you would recommend that has good support for bidirectional (Persian and English) text, preferably with HTML support (though an editor without HTML support would also be just fine)? The latest one I'm working with is Bluefish, but it has some minor problems, and I'm looking to see if there's something better available.

TIA,
- Ehsan Akhgari
Learn Linux in Persian: http://www.persian-linux.org/
Vi/Emacs editor with RTL support
Hi all,

Sorry if this question is too basic. Is anyone aware of a version of the vi editor (preferably) or Emacs that supports right-to-left languages, including Persian? If they already support this, should I do anything special to turn RTL support on in those applications?

Thanks in advance,
- Ehsan Akhgari
RE: Linux teaching website
> BTW Ehsan, I consider this off-topic. This is about Persian support in
> software and computers, software written to handle Persian text, etc.
> This is not a list to gather volunteers for a website that happens to
> be about an operating system and in Persian.
>
> Not that I'm not personally interested, but only that it is off-topic.

Oh, I'm sorry for posting off-topic to the list. I'll try not to do so again. :-)

- Ehsan Akhgari
RE: Persian translation of GNOME
> > I've got to give them both a test, and if I don't like them, I'll
> > write my own tools. :-)
>
> That's what is considered reinventing a wheel ;-). You can just get
> on and improve gtranslator.

Sure, that's why I added the "if I don't like them" condition, which, apparently, is not the case!

> I prefer you start right away too. ROOZBEH, hello, wake up... :-)
> It's supposed to attract GNOME-lovers. The problem is that I can't
> find any time to fire it up... Perhaps after FarsiWeb set up its wiki
> system.

I've been hearing about FarsiWeb's wiki for quite a while. What makes starting it up so difficult? Anything I can help with?

- Ehsan Akhgari
RE: Persian translation of GNOME
> There are a couple tools to help translation. KBabel is the one from
> KDE project, and there's a gtranslator for a more GNOME-like look.

I've got to give them both a test, and if I don't like them, I'll write my own tools. :-)

> I remember Roozbeh was preparing a guide for Persian GNOME
> translators. There is also a list for that that Roozbeh will
> subscribe you eventually. The translation process is definitely not
> as easy as it is for a left-to-right language.
> Also we are a bit picky about words to use, want to conform to the
> Persian Academy translations and other sources... But help is
> definitely welcome.

I suppose I'll receive a list of such words with the approved translations, won't I? I personally have a low opinion of most of those "translated" words that the Persian Academy has assigned (I'll *never* call computers "Raayaaneh"!) but some of them sound meaningful, and anyway I'm not here to enforce my personal preferences, but to help!

> Roozbeh is a bit busier than before these days. If you didn't get
> ANY feedback on these, come back in September again and I will use my
> privileges :-).

Fine - although I'd prefer to start right away, since the occasions on which I have spare time are pretty scarce, and I'd like to use them well.

> Since you are in Iran now, you may also want to join gnome-ir-list on
> http://lists.gnome.org/ and help starting GNOME enthusiasm in Iran;
> this great desktop has been left in cold there...

I did. Hmm, the list doesn't seem to want to attract many people, does it? I had to type the URL by hand, and if it were not for my personal experience with Mailman, I would never have found its subscription page! Maybe you'd like to make the list more visible...

- Ehsan Akhgari
Linux teaching website
Hi all,

Is there any interest in a Persian website dedicated to teaching Linux from the ground up?

I've been spending some time looking for Linux teaching websites on the net, and I've found a number of them. Most of them contain only a handful of Linux-related tips. A few actually attempt to teach Linux, but they don't have a good teaching program for getting beginners started -- all they provide is a guide for a certain application or aspect of the system. And there are several which are mostly dedicated to Linux discussions/news, which don't fall in this category.

Now, what I have in mind is this. As a Persian user, one needs a Persian teaching resource which does not assume any previous experience and teaches Linux from the ground up, in a way that lets the reader follow from Lesson 1 upward. And the whole teaching material will be free, both as in freedom and as in free "maa-oshaeer" (beer). :-)

Do you guys think this is a good idea? Do you have any ideas about things to add, or exclude, maybe?

I also need help, if anyone is willing and able to give it. I'm going to write the "Linux from the command line" lessons myself, which start from the ls/cd commands and go up to more advanced command line tricks and shell programming methods. After that I might consider writing about a graphical desktop, an application (or an app suite), or a specific task (like networking with Linux, for example). I think it would be very nice if several parallel topics could be started simultaneously, but I don't have enough time for that myself, so I need help. If anyone is able to write about such a topic from the ground up and on a lesson-by-lesson basis, I'd be grateful to have their help. If anyone is able to write Linux tips & tricks, that would be nice as well. We can also open up forums if some of you do the favor of answering questions there (since I won't have enough time...)

In case anyone decides to join, I think I would use MovableType as the publishing system, so it would be easy for anyone to get started writing articles.

Ideas/questions/comments/suggestions?

Thanks in advance!
- Ehsan Akhgari
Persian translation of GNOME
I'd like to help in translating the GNOME 2.8 po files. I noticed that Roozbeh is the leader of the Persian translation team. I'd like to know how I can contribute. Should I send patches to Roozbeh himself, or do something else? Also, are there any tools which can help in the translation (instead of manually editing the po files)?

Thanks!
- Ehsan Akhgari
RE: Persian UTF-8 MySql collation
> [Ehsan, you just replied to me. Answering on list.]

My bad. Sorry, I meant to reply to the list.

> Well, you may wish to read a couple documents. Read Unicode Collation
> Algorithm for example. Just read the intro or something like that.
> The point is that Persian Collation is only a small table fed to the
> Unicode Collation Algorithm.
> So yes, there is a free Persian collation implementation, Glibc +
> fa_IR locale.

Good point, thanks. I'll investigate it.

> What you have seen is the binary encoded table. The source is in the
> fa_IR locale source file.

Thanks, I'll try Googling for it.

> Guys, both of you, if you don't have Glib,

You mean glibc, right?

> and your system
> does not provide what you need, you:
>
> * Either forget about Persian Collation, or
> * Implement your own minimal collation, or

That's what I have in mind, currently.

> * Consider using something like Glibc or uClibc with Persian
> locale as a library. Not sure how uClibc deals with Persian
> locale.

Thanks again,
- Ehsan Akhgari
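For the "implement your own minimal collation" option, a sketch could look like the following Python code. It is a deliberate simplification and not the fa_IR locale table: it ranks only the 33 basic letters (with ALEF WITH MADDA first), folds the Arabic forms of KAF and YEH onto the Persian ones, and ignores diacritics, hamza variants and everything else a full collation would handle.

```python
# Basic Persian alphabet in dictionary order, given as code points so that
# look-alike Arabic letters cannot sneak in by accident.
PERSIAN_ORDER = [
    0x0622, 0x0627, 0x0628, 0x067E, 0x062A, 0x062B, 0x062C, 0x0686,
    0x062D, 0x062E, 0x062F, 0x0630, 0x0631, 0x0632, 0x0698, 0x0633,
    0x0634, 0x0635, 0x0636, 0x0637, 0x0638, 0x0639, 0x063A, 0x0641,
    0x0642, 0x06A9, 0x06AF, 0x0644, 0x0645, 0x0646, 0x0648, 0x0647,
    0x06CC,
]
RANK = {chr(cp): i for i, cp in enumerate(PERSIAN_ORDER)}
# Treat ARABIC KAF and ARABIC YEH like their Persian counterparts.
RANK["\u0643"] = RANK["\u06a9"]
RANK["\u064a"] = RANK["\u06cc"]


def persian_key(word: str):
    """Sort key: ranked letters first, anything else afterwards by code point."""
    return [RANK.get(ch, len(PERSIAN_ORDER) + ord(ch)) for ch in word]


if __name__ == "__main__":
    words = ["\u06af\u0644", "\u067e\u062f\u0631", "\u0628\u0627\u062f", "\u0641\u0631"]
    print(sorted(words))                   # code point order: PEH and GAF come last
    print(sorted(words, key=persian_key))  # dictionary order
```

Sorting by plain code point puts PEH, CHEH, ZHEH and GAF after all the other letters, because Unicode places those four letters after the basic Arabic block.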
RE: Persian UTF-8 MySql collation
> Right. I was thinking about adding UTF-8 Persian collation to MySql
> 4.1.x
> - our project will involve a fairly large amount of data, so we'd like
> to have the option of sorting at the DB level.

I've never tested MySQL 4.1.x. Have you tried it? How is the UTF-8 support? Have you tried Persian collation in MySQL 4.1.x to see how much better it is than in 4.0.x? Unfortunately I'm not willing to look into 4.1.x at this time, since it's in beta and we don't use beta products on our production servers, so doing so would do no good to my project.

> ... which is why we're hoping to use MySql 4.1.x

I'd give it a try if I were in your shoes.

> Nope, no Persian collation file for MySql 4.1.x as far as I can see
> (which is where we came in!)

How does 4.1.x handle Persian sorting? Like 4.0.x?

- Ehsan Akhgari
RE: Persian UTF-8 MySql collation
> That might work for Ehsan, but it sadly wouldn't save much effort for
> us since PHP doesn't do Persian UTF-8 collation (that I've been able
> to get working anyway), or provide access to strxfrm()
>
> :-(
>
> - which is why MySql seemed the least bad option.

Hmmm, if you've compiled PHP with glibc, I suppose you could simply do the following (code not tested):

And yes, PHP doesn't provide access to strxfrm, but I think it's trivial to write a PHP extension which provides that function.

- Ehsan Akhgari
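The code from the original message is not preserved here. As a separate illustration of the same idea (letting the C library's locale data do the sorting), here is a Python sketch using the standard `locale` module. The locale name `fa_IR.UTF-8` is an assumption about how the Persian locale is installed on a given system, and the function returns None when it is not available.

```python
import locale


def sort_with_locale(words, locale_name="fa_IR.UTF-8"):
    """Sort words using the collation rules of the given C library locale.

    Returns None if the locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            return None
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    result = sort_with_locale(["\u06af\u0644", "\u0628\u0627\u062f", "\u067e\u062f\u0631"])
    print(result if result is not None else "fa_IR.UTF-8 locale is not installed")
```

As noted elsewhere in the thread, this only gives stable results if the program always runs under the same locale and the locale data does not change.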
RE: Persian UTF-8 MySql collation
> Ehsan - are you thinking about adding glibc collation to the
> strings/ctype-MYSET.c file? Or something more fundamental?

Well, to tell you the truth, I'm not really sure, since I've not checked the MySQL source tree yet. But yes, I'm going to see if glibc support can be incorporated into MySQL's charset handling mechanism.

> I think you and the team I'm working with are trying to do
> the same thing - it would be great if we could work together
> and come up with a solution that anyone else can use too.

I looked around a bit, and it seems like MySQL 4.1.x will be supporting UTF-8. MySQL 4.0.x doesn't have that support (the version I'm using on the production server is 4.0.18-standard). Because of that, incorporating that support into MySQL might require a lot more work than I currently imagine. Unfortunately, in that case I'll have to leave MySQL as it is and sort the data at the client side (less efficient, but requiring less development time). Since the application I'm working on doesn't store very big chunks of data in the db, I may decide to sacrifice performance for development time.

> What's involved in creating a collation file? These pages:
> http://dev.mysql.com/doc/mysql/en/Adding_character_set.html
> http://dev.mysql.com/doc/mysql/en/Character_arrays.html
> http://dev.mysql.com/doc/mysql/en/String_collating.html
> seem to say that it's not too difficult, if you know what
> you're doing?
> (Which I don't. I'm just a humble PHP programmer)

Well, that seems to be for single-byte code pages. The Persian character encoding used in glibc is UTF-8, and that will require patching the MySQL source code. And like I said, because of MySQL's lack of UTF-8 support, it might require more work than I imagine. I think I can handle it from a technical point of view (I'm good at C/C++) but I'm quite pressed for free time...

> ... it seems it would be great to create a mySql Persian
> collation file rather than changing the source, with all the
> problems that would lead to of having to re-patch the code
> every time there's a new MySql release? Or is that inevitable?

Well, if we decide to change the MySQL source code, we can submit our patches to the MySQL team, and hopefully they will incorporate them into their new releases. Of course, in that case we might have to look into adding that support to MySQL 4.1.x as well (if it doesn't already have it). So there's no need for re-patching. There's just a need for time! :-)

In case I decide not to spend the time on developing Persian collation support in MySQL, I'll be glad to help your team if they need technical programming help. In that case, I'll let you know off-list (please remind me if you don't get any note from me within a week).

- Ehsan Akhgari
RE: Persian UTF-8 MySql collation
> It's not any easy to do what you are saying here, unless you
> make sure you ALWAYS run your mysql under the same (fa_IR)
> locale, and that the locale data does not change. Any Glibc
> version >= 2.2 should be Ok.

I think I'll give it a try anyway, but I'm wondering how useful it is, considering the fact that MySQL 4.1.x (currently in beta) will be UTF-8 enabled... Anyway, thanks a lot for your comments.

- Ehsan Akhgari
RE: Persian UTF-8 MySql collation
> For proper sorting using Glibc, it's not enough that the
> application use Glibc, but it should call the sorting
> function of Glibc too! (which apparently MySql does not).

Right. I'd like to spend some time trying to patch the MySQL sources to use the glibc collation functions before I give up and sort the data at the client side. Would you mind letting me know which version of glibc I should be using? Also, is there any resource/documentation/how-to available which can guide me in this job?

Thanks!
- Ehsan Akhgari
RE: Persian UTF-8 MySql collation
> You can do proper Persian sorting using either glibc > (available in all GNU/Linux distributions), or ICU (available > from http://oss.software.ibm.com/icu/). I have tested both MySQL 4.0.15 on WinXP and the default MySQL which comes with Fedora Core 1, and neither could handle Persian sorting correctly. They both seemed to start sorting from letter "FEH" to "YEH" and then picking up "CHEH", "ZHEH", "GEH" and "PEH", and then starting from "ALEF" to "GHEIN". It's possible that the Windows version has not been compiled with glibc, but the Linux version is most likely compiled with glibc, I think. Do I need to compile MySQL manually? If so, is any particular version of glibc required, or do I need to specify any particular compilation options? Thanks in advance, - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
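For the fallback of sorting on the client side, one option is an explicit alphabet-order key. The following is only a minimal sketch: `PERSIAN_ORDER` and `persian_sort_key` are hypothetical names, the alphabet string is an assumption about the desired order (alef-madda first, then the 32 letters), and it is not a substitute for proper glibc or ICU collation.

```python
# Persian alphabet in dictionary order (an assumption for illustration;
# real collation rules also cover diacritics, hamza forms, digits, etc.).
PERSIAN_ORDER = "آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"
RANK = {ch: i for i, ch in enumerate(PERSIAN_ORDER)}


def persian_sort_key(word):
    # Letters outside the table sort after all Persian letters, by code point.
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]


words = ["گل", "پدر", "آب", "فردا", "چرا", "ژاله", "باد"]

# Plain code-point order puts PEH, CHEH, ZHEH and GAF after FEH.
print(sorted(words))

# Alphabet order keeps them in their dictionary positions.
print(sorted(words, key=persian_sort_key))
```

The plain sort shows the same kind of misplacement of PEH, CHEH, ZHEH and GAF that any code-point-based comparison produces, because those four letters sit in a later Unicode range than the rest of the alphabet.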
RE: Persian-English Dictionary -- Was: Iranian Mac User group
> > I volunteer to implement a web interface for the dictionary, > Excellent! > You'll have to make it so that whether the user types in bi[ZWNJ]kaar, > bikaar, or bi kaar, the word will be found! Yes, that's right. This is relatively easy to implement. > > but I think we'll need other > > people's help as well, because I would guess the whole data > would be *huge*. > Will this require separate dedicated server(s)? > (I'm thinking about Behdad and the Persian Digital Library here...) Hmmm, not necessarily *dedicated*. As long as there's enough web space for some part of the data to reside on the server, and I have access to it to install an application which processes the queries locally, it doesn't really have to be dedicated, unless the server's already fully loaded by other tasks. I don't think we'll need dedicated servers for this job. The process of searching can be done fast enough. - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
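The requirement that bi[ZWNJ]kaar, bikaar and bi kaar all find the same entry can be met by normalizing both the stored headwords and the query to one lookup key. A minimal sketch, assuming a simple in-memory index; `lookup_key` and `index` are hypothetical names, and the romanized spellings stand in for the Persian script.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def lookup_key(term):
    # Drop ZWNJ and all whitespace so the three spellings collapse to one key.
    return "".join(ch for ch in term if ch != ZWNJ and not ch.isspace())


# Index the dictionary under the normalized key, not the raw headword.
index = {lookup_key("bi" + ZWNJ + "kaar"): "unemployed"}

for query in ("bi" + ZWNJ + "kaar", "bikaar", "bi kaar"):
    print(repr(query), "->", index.get(lookup_key(query)))
```

Removing every space also merges genuinely separate words in multi-word queries, so a real interface would probably apply this per headword, or only try the collapsed form after an exact match fails.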
RE: Miscellaneous web issues
> I would appreciate if you send me the exact process you used and the DLL, so we can publish it on the FarsiWeb website on SourceForge.

OK. I'll send the step-by-step process to the list, and will send you the relevant files off-list, so that you can put them on SourceForge. Here are the steps I took to accomplish the job:

1. After installing the Microsoft Keyboard Layout Creator (MSKLC) tool, I inspected its install directory and found that it ships with a version of the MS C/C++ compiler (cl.exe) in the directory C:\Program Files\Microsoft Keyboard Layout Creator\bin\i386. This assured me that the tool creates a C source file and feeds it to the compiler to create the layout DLL. Now I needed to know the location of the generated source file, and also the command-line parameters passed to the compiler.
2. To get the command-line options passed to the compiler, I wrote a simple application which appends its command-line arguments to a log.txt file. This application is called shim.cpp, and is shipped in the src package inside the shim directory. It can simply be compiled to shim.exe using the command "cl shim.cpp".
3. I moved all of the .exe files in the C:\Program Files\Microsoft Keyboard Layout Creator\bin\i386 directory elsewhere, and copied shim.exe into that directory under each of the moved files' names. So now I had a cl.exe, rc.exe, link.exe, etc. in that directory which were all actually the shim.exe program. This let me find out the command-line options that MSKLC passes to the compiler tools, so that I could imitate them manually.
4. I opened MSKLC, and selected the File | Load Existing Keyboard menu item to load the "Persian experimental standard" keyboard (version 1.0.3.13) that I had already grabbed from the sf.net repository.
5. I selected the Project | Build DLL and Setup Package menu item to build the DLL. The tool invoked my shim tool instead of each of the compiler's tools (see Step 3 above.)
6. I created the directory C:\Program Files\Microsoft Keyboard Layout Creator\hack, and created a build.bat file there, which executes the compiler's tools with the command lines that MSKLC passed to them.
7. I copied the keyboard layout source files generated by MSKLC from the temporary directory to the hack directory as well.
8. I edited Persian.c to change the shift state code for the Space key from ' ' to 0x200C. The patched line is line 268 in the original file copied from the temp directory.
9. I edited Persian.rc to change the version number from 1.0.3.13 to 1.0.3.14, so that I could tell my modified Persian.dll from the original FarsiWeb one.
10. I ran build.bat, and voila! Persian.dll version 1.0.3.14 got built. Then I just had to put it in place of the version 1.0.3.13 DLL in the original FarsiWeb package. The installer didn't need any change.

Now I just ran the installer to uninstall the old version and install the new version, and I had my keyboard working with Shift+Space.

I'm sending Roozbeh two files: Persian-src-1_0_3_14.zip, which contains the modified source files, and Persian-1_0_3_14.zip, which contains the DLL plus the installer, which I guess he'll make available through SourceForge.

I'm open to questions/comments. Please don't hesitate if you have any.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
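The shim idea from Step 2 is small enough to sketch. The original shim.cpp is C++ and is not reproduced here; what follows is only a Python illustration of the same technique (append whatever command line the program was started with to a log file). `log_invocation` is a hypothetical name, and the arguments in the demo are made up, not the ones MSKLC actually passes.

```python
import os
import sys
import tempfile


def log_invocation(argv, log_path="log.txt"):
    # Append the full command line (program name plus arguments) as one line,
    # so that successive tool invocations accumulate in call order.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(" ".join(argv) + "\n")


# A real stand-in executable would simply call: log_invocation(sys.argv)
# Demo with made-up command lines, written to a temporary directory.
demo_log = os.path.join(tempfile.mkdtemp(), "log.txt")
log_invocation(["cl.exe", "-c", "Persian.c"], demo_log)
log_invocation(["link.exe", "Persian.obj"], demo_log)

with open(demo_log, encoding="utf-8") as log:
    print(log.read(), end="")
```

Because each stand-in copy logs its own name as the first item, one log file is enough to reconstruct which tool was called with which options, and in what order.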
RE: Misinformation!
> There's a difference in the case of C++ standard and web standards: Writing non-standard C++ code only produces compile-time problems, but if you happen to compile the code, it works correctly (or supposed to do so).

Well, that's not exactly so. Some non-conformant behavior tends to produce (maybe subtle) differences at runtime. But I see what your point is.

> But it's quite a different case in web. 30-40 percent is low enough to get ignored, counting that the other way you are sacrificing the other 60-70% for not being able to find the document by searching in Google. And note that even with Win9x and a recent IE, and updated fonts, there's no problem.

I'd definitely do so if the Google search problem couldn't be solved. But I've been using the method I mentioned in my other post to solve that problem as well. This was the best way I could think of to get the best of both worlds, but I'm wide open to suggestions/improvements to this idea.

> About using HTML entities, no matter what the encoding of the page is, HTML entities generate Unicode characters.

They do on most browsers, but browsers are not required to do so. Consider a browser which can't handle UTF-8 well (or at all).

> It's quite common to see people exporting Persian documents in MS Word, and get an HTML page encoded in MS Arabic encoding, with Persian Yeh and Keh encoded in HTML entities.

Yes, and that will make their document even more difficult for search engines to index. And of course, I'd argue that CP1256/ISO-8859-6 is not suitable for Persian documents, but that's another story perhaps.

> PS. BTW, I just found that using Harakat (kasre, fathe, ...) also prevent a hit in Google search :(. That's quite expected, but perhaps I should reconsider my habbit of putting those tiny marks everywhere.

That's another sad fact. I really think that Google must seriously consider handling such details in their indexing process. That's also one of the things that AriaSearch.com handles.

---

Hmmm, now that we're here, how about gathering some volunteers who can work with Google to fix some of these problems? In the past, I've contacted Google on a number of occasions about small problems in their services, and they seemed quite willing to fix them. Maybe this way we could have a more Persian-friendly Google in the future. If you feel that this is a good idea, I'd be pleased to take part in that team. Comments?

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
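The harakat problem is the kind of detail an indexer can handle by removing the vowel marks from both the indexed text and the query. A minimal sketch of that idea, not a description of what Google or AriaSearch.com actually do; `strip_harakat` is a hypothetical helper.

```python
import unicodedata


def strip_harakat(text):
    # Harakat such as fatha (U+064E), damma (U+064F) and kasra (U+0650) are
    # nonspacing combining marks (Unicode category "Mn"); drop all of them.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


vocalized = "\u0645\u064E\u0646"  # MEEM + FATHA + NOON
plain = "\u0645\u0646"            # MEEM + NOON

print(strip_harakat(vocalized) == plain)
```

Filtering on the whole "Mn" category also removes other combining marks, such as a combining hamza above, so an indexer that needs to keep some of those would list the code points to drop explicitly instead.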
RE: Misinformation!
> Here is a solution (in fact a hack) that if implemented correctly, can resolve some of the issues till people and Google start using correct software:
>
> With a little tweaking, the web servers can translate the correct Unicode to the incorrect unicode desired so much by the Win9X users. That is, the web severs looks at the browser request, and if it can detect Win9X, translates all U+06CC's in the document to U+064A (and all other required translations). The same technique could be used to fool google into generating correct search results. That, is the web server generates a Win9X friendly version of the document and appends it to the original document. You can also allocate tags that the user of the web server can disable or enable some of these features. This may even make one gain some advatnage over other web hosting companies.

That solves half of the problem. On Win9x, the key d on the keyboard inserts an Arabic YEH, and on Win2K+, it inserts FARSI YEH. So, if you use this method, when a user on Win9x types a word containing yeh into Google's search box, they won't find your site. The best hack (or solution, as one might call it) I've found for this is feeding Google a version of the page which contains both forms of each word (using YEH and FARSI YEH), so that the chances of Google finding your page for a certain keyword are maximized. Of course, certain measures must be taken to prevent bad results; for example, the proximity of the words must not be disturbed. Nevertheless, this causes other problems, such as a distorted keyword density, which cannot be solved reliably. The problem really must be fixed in the search engine code, and such hacks have their own downsides. The search engine project I've been working on handles this (and the analogous ARABIC KAF versus KEHEH problem) among other problems of searching in Persian text.

> Of course, the solution above is only a transient one, and it is up to people to upgrade their Win9X machines to something that is Unicode-compliant, also it is up to Google to program their systems such that it can understand that both U+06CC and U+064A are the same shape and hence should be regarded the same for searching unless user requests otherwise. This is the same as case-insensitive search that is usually implemented by mapping all upper and lower case characters -- in documents and queries alike -- to uppercase.

Yeah, that's right. Of course, great attention must be paid so that it doesn't break Arabic search results.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]

He who sees the abyss, but with eagle's eyes - he who with eagle's talons grasps the abyss: he has courage. -Thus Spoke Zarathustra, F. W. Nietzsche
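The case-folding analogy above translates directly into code: map both spellings to one form in documents and queries alike. A minimal sketch under that assumption; `FOLD` and `fold_for_search` are hypothetical names, and the table covers only the YEH and KAF pairs discussed in this thread.

```python
# Fold the Arabic code points onto the ones used for Persian.
FOLD = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def fold_for_search(text):
    # Apply the same folding to indexed text and to queries, just as a
    # case-insensitive search maps both sides to a single case.
    return text.translate(FOLD)


typed_on_win9x = "\u0641\u0627\u0631\u0633\u064A"   # ends in ARABIC YEH
typed_on_win2k = "\u0641\u0627\u0631\u0633\u06CC"   # ends in FARSI YEH

print(fold_for_search(typed_on_win9x) == fold_for_search(typed_on_win2k))
```

As the reply notes, this folding is only safe for content known to be Persian; applied blindly it would change Arabic text, so it belongs behind a per-language switch.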
RE: Misinformation!
> Unfortunately this kind of misinforming is quite popular in weblogs, where people only care about being visible to more people.

I confess that I'm one of those who use this technique on their web sites. I don't believe it's correct, and I don't think of it even as a semi-elegant solution. It's a solution which just works on the largest number of platforms. By inspecting the web server logs, I notice that an average of 30-40 percent of the visitors are still using Win9x. Hopefully one can start dropping support for Win9x users as their number is constantly decreasing, but right now, if I choose the standards-compliant route of using FARSI YEH everywhere, those Win9x-ers will not be able to browse my sites.

I have high respect for, and a strong inclination toward, standards. I'm mostly a C++ programmer, and I'm one of those "preachers" of the C++ Standard. However, today's C++ compilers are still not fully compliant with the C++ Standard, so whenever someone asks me for advice on how to accomplish a certain task on a non-conformant compiler, I show them the non-standard way, and also mention the standard way, so that they know what the *right* way is, and also what the way to get their job done right now is. I see little difference in web standards land.

Of course this 'solution' (if it can be called so) poses other problems, such as the inability of search engine spiders such as Google's to correctly index words that exist with both forms of YEH, which must be addressed separately. Also, if you choose to use the FARSI YEH form everywhere, such problems will occur again (for example, a Win9x-er can neither see your pages correctly nor find them in Google if they query for a word containing YEH.)

> They even go on and use HTML entities (like &#1626;) instead of UTF-8, just because if the user's browser is set to something other than auto and UTF-8, the page is still rendered correctly...

This one is silly, and I don't see how it can solve any problem. The browsers are required to be able to correctly resolve such numerical entities only if the page's encoding is already UTF-8, and if it is so, why not use UTF-8 encoded characters in the first place? Also, some agents have difficulties interpreting such numerical forms. Furthermore, maintaining them is impossible (not just hard), and they can't even be treated as text by most software packages (for example, they can't be searched for by many programs.) And last but not least, for a regular Persian document, they're likely to more than double the document size. They have their own uses, of course, but I don't see any sense in using them instead of UTF-8 characters for regular web pages.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
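The size claim is easy to check for a single word. A minimal sketch that compares the UTF-8 byte count of a five-letter Persian word with the same word written as decimal numeric character references; the sample word is an arbitrary choice.

```python
# "Farsi" in Persian script: FEH, ALEF, REH, SEEN, FARSI YEH.
text = "\u0641\u0627\u0631\u0633\u06CC"

# Every letter in this range takes two bytes in UTF-8.
utf8_size = len(text.encode("utf-8"))

# The same letters as decimal numeric character references, e.g. "&#1740;".
entities = "".join("&#{};".format(ord(ch)) for ch in text)
entity_size = len(entities.encode("ascii"))

print(utf8_size, entity_size)  # 10 35
print(entities)
```

For pure letters the references come to 3.5 times the UTF-8 size; spaces, punctuation and markup are the same size in both forms, which is why a whole document grows by a smaller factor than a single word does.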
RE: Persian-English Dictionary -- Was: Iranian Mac User group
[snip]
> I'm sure this dictionary must have been funded by the Iranian government and no profits expected. I'm shocked to see that less than a dozen US universities have purchased it. I should think the author and publisher would be very happy to see it put online and all the efforts go to some use. Surely they will agree if their name is kept with the data! As for the technical part, I no longer have any doubts as to the abilities of the members of this group, especially after hearing the keyboard hack job for the sake of the ZWNJ earlier today!

:-) I did the keyboard job just because I thought it's a lot easier to use Shift+Space instead of Shift+B, and also because I was in the process of typing in a lot of Persian data. It took only about half an hour (not counting the time to download the MSKLC tool, of course) and improved my typing speed considerably.

About your proposal, I'm personally interested in doing the technical part of the job. I volunteer to implement a web interface for the dictionary, and I can also provide the hosting for the web interface. I can provide some amount of web space for the data as well, but I think we'll need other people's help too, because I would guess the whole data would be *huge*. If the data has to reside on multiple web servers, I can code some sort of distributed query mechanism which fetches the definitions from remote web servers and displays them to the end user transparently.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
RE: Miscellaneous web issues
> There is no C/C++ source file. The source is a data file that MSKLC compiles into the DLL. If the data file contains ZWNJ on shift-space, it fails to compile. Microsoft developers confirmed that this is a bug.

Well, I did a little bit of investigation on this. I downloaded the MSKLC (MS Keyboard Layout Creator) tool, and took a look at it. This tool generates C source code from the data you feed to it, and then compiles this C code in order to generate the keyboard layout DLL. The bug, which expects Space to only insert a space character, is at the MSKLC level. IOW, if the generated C source code is patched correctly, and then compiled with the same compiler switches that the MSKLC tool passes to the compiler, ZWNJ can be successfully assigned to the Shift+Space combination.

I did this, and installed the new DLL on my system, and it works beautifully. It's the same keyboard layout, only Shift+Space inserts a ZWNJ instead of a space. I thought I would submit it to SourceForge so that everyone can use the new layout. Roozbeh, let me know if it would be okay for me to send the files to you to get them onto SourceForge, or if I should do something else.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
RE: Miscellaneous web issues
> > Thanks for the links. Seems like a very handy keyboard. > BTW, why the > > Shift-Space combination does not work? > > Bug in Microsoft keyboard layout creation tool. Use "Shift-B" > temporarily. Thanks. I've not done any work in this arena, so what I propose here might make no sense. Sorry if that's so. But, the M$ page on the keyboard layout creation tool says the tool "simplifies" the process of creating a keyboard layout. Would there be any way to assign ZWNJ to Shift+Space by coding the keyboard layout tool manually? If you can send me the C/C++ source file off-list, I'll try to investigate it further. If not, I guess Shift+B is not that bad as well. The keyboard layout rocks, even without having Shift+Space in place. :-) - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: Miscellaneous web issues
> What is notepad? A text editor? Text editors should not insert a UTF-8 BOM either. The problem is that Microsoft sometimes invents non-standard things and then pushes it so hard that Unicode adds it to parts of the standard (or an FAQ). "Microsoft conventions for .txt files" in the Unicode FAQ looks sarcastic to me.

Well, maybe you're right, but I don't see how a text editor is supposed to know the encoding of a file without some kind of mark. See, HTTP transfers the character set using the Content-Type response header. In HTML, it's specified with a <meta> tag. In XML, the default encoding is UTF-8, and if a document is encoded in another encoding, it must be specified in the PI. Plain text files have no means of identifying the character encoding, so a single text file can be interpreted as UTF-7, UTF-8, UTF-16, UTF-32, etc. if there's nothing to declare the exact character encoding used.

The point here is that the protocols which do not allow a BOM are those which provide other means of specifying the character encoding. A given byte stream can have multiple interpretations depending on which character encoding you use to interpret it, and there must be some way to resolve this ambiguity.

YMMV,
- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
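The "some kind of mark" argument can be made concrete with a small BOM sniffer of the sort a plain-text editor might run before falling back to guessing. A minimal sketch; `BOMS` and `sniff_bom` are hypothetical names, and only the BOM constants from Python's standard `codecs` module are used.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the longer marks have to be tested first.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    # Return (encoding name, BOM length), or (None, 0) if no mark is present.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF8 + "salaam".encode("utf-8")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))

# Without a mark the bytes alone do not say which encoding was used.
print(sniff_bom("salaam".encode("utf-8")))
```

The returned BOM length lets the caller skip the mark before decoding, which is also the step an HTML or JavaScript toolchain needs when it receives a file saved by Notepad.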
RE: Miscellaneous web issues
> You can re-live its creation here in the archives:
> http://lists.sharif.edu/pipermail/persiancomputing/2003-June/000538.html

[snip]

Thanks for the links. Seems like a very handy keyboard. BTW, why doesn't the Shift-Space combination work?

> Done!

Beautiful!

> I hope the Mozilla users appreciate all this trouble.
>
> Thanks again for all your help!

You're welcome! :-)

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
RE: Miscellaneous web issues
> It appears taking a break is the best cure. Some progress: Yes. It certainly is. Good to hear the problem's solved. [snip] > Find/Replace [the invisible] ZWNJ in Notepad is no problem becuase I > have the Persian Experimental Keyboard and ZWNJ is right on Shift-b. > Although I can't actually SEE that I've typed ZWNJ in the Find box, it > really is there. So now in my .js array, I have a few Persian words > with \u200c right in the middle of the Persian script. Interesting. Sorry for my ignorance, but is that keyboard available publicly? > It doesn't seem like the browsers should be able to handle that but > now I see it's not a problem. Why not? The \u syntax allows you to represent Unicode characters in JavaScript. > Only thing I have to > remember is to re-open the Notepad file in a non-WYSIWYG editor and > delete that BOM creature. > > Mozilla is now able to "find" my words containing ZWNJ which was the > whole point of this exercise. > > One small problem still remains: in Mozilla, if you click on any Tajik > word, it shows you the Persian counterpart in the popup. > But Mozilla is not able to display the ZWNJ so that is ignored. > I'm not sure what to do to solve this. Well, on Mozilla1.2.1 that I tested it on, if you replaces ZWNJ in the description of the Tajik array indices with then it seems to work happily. Try giving it a test. - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: Miscellaneous web issues
> First of all, thank you very much for all the patient and lengthy > explanations. Very nice of you to share so many tips! > (Thanks to the others too who answered on and off list!) Happy to help! [snip] > Now that 2 people have said to change ZWNJ to \u200c, I tried that but > it didn't work. I don't think I have the right tool. > > I couldn't do it in Notepad because as I said, it's WYSIWYG in Persian > script so if I do a global replacement and stick \u200c in the middle > of Persian script, that's obviously not going to work (and I also > tried it for good measure and it didn't work but there may be many > reasons it didn't work out using Notepad.) I don't know what you mean here. Why it doesn't work in Notepad? Note that on Windows XP, you can't type ZWNJ inside the Find/Replace dialog box - you need to copy/paste it from inside the Notepad text editor window. Another reason why not to use Notepad. > Then, since you recommended Frontpage, I tried that. Earlier, it had > not even occured to me to attempt to open a .js file in Frontpage > (version > 2000.) This time I fooled it by changing the extension from .js to > .html and so was able to open it in html view where all the unicode > was in numeric style. I changed all the to \u200c but now I > see that also has not worked. Well, I don't know what the problem is here... BTW, FrontPage 2003 can open the .js file (using File | Open, or drag and drop) and render the UTF-8 characters without converting them to numeric entities just fine. Don't try putting them in an HTML file. Don't know about FrontPage 2000, though. > I think I'm not going to use Notepad for making bidirectional arrays > from now on! That is insane to go to such great lengths! Yeah, it's definitely so. > Not sure what you have in mind here, but at this point, I"ll be glad > just to make it work with ZWNJ. In the JS code, try to replace the trailing ZWNJ-raa and ZWNJ-o with nothing using a regex. 
HTH, - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
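The regex suggested above (dropping a trailing ZWNJ-raa or ZWNJ-o) might look like the following. This is a Python sketch of the idea rather than the JavaScript used on the page, and it assumes that "raa" is spelled REH + ALEF and "o" is WAW; `TRAILING_PARTICLE` and `strip_trailing_particle` are hypothetical names.

```python
import re

# ZWNJ followed by "raa" (REH + ALEF) or "o" (WAW), only at the end of a word.
TRAILING_PARTICLE = re.compile("\u200c(?:\u0631\u0627|\u0648)$")


def strip_trailing_particle(word):
    # Replace the matched suffix with nothing; other words pass through as-is.
    return TRAILING_PARTICLE.sub("", word)


ketaab = "\u06A9\u062A\u0627\u0628"  # KEHEH, TEH, ALEF, BEH
print(strip_trailing_particle(ketaab + "\u200c\u0631\u0627") == ketaab)
print(strip_trailing_particle(ketaab + "\u200c\u0648") == ketaab)
print(strip_trailing_particle(ketaab) == ketaab)
```

The same pattern carries over to a JavaScript regular expression literal, since JavaScript regexes accept the same `\u200c`-style escapes.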
RE: Miscellaneous web issues
> An important note: what Notepad does here is only "acceptable". It's > not even recommended. HTML 4 clearly doesn't allow a UTF-8 BOM appear > before the HTML tag. Notepad is supposed to be a text editor. A text > editor shouldn't insert markup by itself. BTW, ISIRI 6219 strongly > discourages the use of a BOM in UTF-8 files. The problem here is that web protocols (HTML for example) don't allow the BOM, and Notepad is not an HTML editor, so there's nothing to prevent it from adding the BOM. Check out: http://www.unicode.org/faq/utf_bom.html#28 - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] 'I generally take life as it comes my way', said Death. ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: Miscellaneous web issues
e correspondence between > languages for the purposes of this project. I was wishing I had > Behdad's beloved U+202F, the Narrow No-Break Space for this operation! You can leave them as they are, and handle them in the JavaScript code (trim them off of the end of the Tajik words maybe.) > 6. I embedded the fonts again. Looks beautiful on WInXP/IE6 and > limited others. I presume it looks terrible on the rest. > Still thinking about what to do about that. Behnam, how's the Tajik > looking on your Mac? A big (IMO) problem with font embedding is that if users save the document on their HD (using IE of course) then the fonts will be gone. Not a professional image, if you ask me. That's why I try to stick with the std fonts, and use other formats when a custom font is absolutely necessary (PDF being my favorite). Not the best of solutions, of course, but works for me. Hope this helps, - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: IranL10nInfo
> Iranian guys, would you please do a short statistical survey? I've never come across Amordad. And I was born in (A)Mordad... Ehsan ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: Iranian Calendar
> What we should look for, is clear and reasonable objection. There hasn't been any such objection for "Iranian calendar".

I think it's the most reasonable term when you look at it from a foreigner's point of view. They're not interested in what Jalali means, or in the astronomical details of the calendar. I think "Iranian Calendar" best identifies the subject as the calendar officially used in Iran, and that sounds like the most reasonable name it can get. Of course, that's all my opinion, hence my "personal preference"... :-)

> My rewording of the FarsiWeb opinion is that the 2820-year Birashk calendar is the best implementable arithmetic calendar. The law *is* different and the practice *may be* different, but this is the best we can find. The "showraa-ye aalie-e taghvim" (of the Islamic Republic of Iran) holds the authority on the Iranian calendar, and they don't even disclose the calendar of 1384 if you ask them to, let aside telling the algorithm they use to anybody (which includes other governmental bodies, like "saazmaan-e modiriat va barnaame-rizi-e keshvar" and "showraa-ye aali-e anformaatik").

Yeah, unfortunately a lot of such supposed-to-be-known-to-all information here is treated like military secrets. Lovely.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
RE: Persian PC-Kimmo 0.8 released
Thanks for your reply, Jon. > Thanks for asking. All the words are in > tab-separated text files, as in noun.lex, verb.lex, > etc. They get converted to a kimmo-usable file such > as fa-noun.lex, fa-verb.lex, etc. using the db2lex perl scripts in the > scripts directory. The verb and adjective files use a specific script > written for them; all others use the plain script. Also see the > orthography.txt file for the romanization scheme. It also has some > other goodies. > > I would love add any additions you might make to the lexicon in the > next release. I suppose I can use roman2unicode to convert the roman encoding into readable plain text (I'm not fast on reading the roman notation). That way, I can import the data into Excel, sort it alphabetically, and start adding new stuff... > As you can see, it needs a little more work on the morphophonemic > rules, but it should work fine for stemming purposes. Yes, it's pretty good at recognizing the stem of the word. > Hans Nelson is the man to talk to. He's working on a Kimmo output to > XML program. I don't know much about > it, but here's his email: [EMAIL PROTECTED] Thanks for your hint. I'll try to contact him. In case you're interested, I can send the final result of our discussion to you off-list. - Ehsan Akhgari Farda Technology (http://www.farda-tech.com/) List Owner: [EMAIL PROTECTED] [ Email: [EMAIL PROTECTED] ] [ WWW: http://www.beginthread.com/Ehsan ] ___ PersianComputing mailing list [EMAIL PROTECTED] http://lists.sharif.edu/mailman/listinfo/persiancomputing
RE: Persian PC-Kimmo 0.8 released
> For anyone who's interested, Persian PC-Kimmo version 0.8 has just been released. It's available here:
>
> http://home.byu.net/jmd56/download/persian-pckimmo-0.8.tar.gz

Thanks, Jon, for releasing this version. It looks a lot better than the previous one!

> The biggest thing holding them back from being a 1.0 is a relatively small lexicon (~1350 words). The morphology engine achieves about two-thirds recognition on a corpus of about 3.5 million words. And of course, it's GPL'ed.

Hmmm, do you have a list of the words in the current lexicon? (I'm not familiar with PC-KIMMO specific commands, so I can't parse them on my own.) What should I do to help add more words?

> Any helpful feedback would be appreciated.

I find the new tree-style recognition very helpful:

    n+mi+]+im
    NEG+DUR+come.PRES+1P

    1:            Top
                   |
                  Verb
            _______|_______
      VNEGPREFIX        VNStem
          n+          ____|____
         NEG+     VPREFIX    VStem
                    mi+        |
                    DUR+     V1Stem
                            ___|____
                        V2Stem   VPSUFFIX
                           |       +im
                        V3Stem     +1P
                           |
                           V
                           ]
                       come.PRES

    Top:
    [ cat: Top ]

    1 parse found

    n+mi+]+m
    NEG+DUR+come.PRES+1S

    1:            Top
                   |
                  Verb
            _______|_______
      VNEGPREFIX        VNStem
          n+          ____|____
         NEG+     VPREFIX    VStem
                    mi+        |
                    DUR+     V1Stem
                            ___|____
                        V2Stem   VPSUFFIX
                           |       +m
                        V3Stem     +1S
                           |
                           V
                           ]
                       come.PRES

    Top:
    [ cat: Top ]

    1 parse found

I was wondering if there's some way to retrieve the tree-structured data in a format which is easy to parse (the ASCII style is too difficult for a computer program to parse), something like an XML format maybe?

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
[ WWW: http://www.beginthread.com/Ehsan ]
RE: Farsi Stemming Algorithm
> (I'll just reply to your other post here) I guess I didn't know about
> a new pc-parse release. Where did you get the newest source code?
> That's terrific news for me.

Well, the release I downloaded is approximately one year old, but here's the URL I got it from:

ftp://ftp.sil.org/software/unix/pc-parse-src-20030321.tgz

To build it, I just ran the typical "./configure; make; make install;" - there was nothing more to it. Which compiler version did you use to compile it? Let me know if you still have compilation problems. I might be able to help if I can reproduce them here.

> I'm very interested in any work you'd work on, including a PHP
> extension. Maybe SIL.org might be interested as well.

Actually, what I'm working on is an English/Persian search engine that can be placed on any site without downloading or installing anything. It's nearly finished; I only have to translate the web UI into Persian and implement Persian stemming in the engine. Originally I planned to implement a stemming algorithm myself, but I realized that I'm no expert in Persian grammar or linguistics, so I'd rather use a solution that already works, and your work seems to be the *best* choice. The PHP extension would be quite a thin wrapper, but I'll definitely send you the source code when I'm finished. You're also welcome to a copy of the search engine's source code if you're interested.

> Give me a week and I'll email them to the email address in your
> signature, unless you tell me otherwise.

Thanks a lot! I highly appreciate your great help.

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
RE: Farsi Stemming Algorithm
> One of the things that drives me nuts about the software is that it
> claims to run on Solaris/Sparc, Win/x86, MacOS, or BSD, but apparently
> no Linux (I have a Sparc box, so I'm lucky :-). The source code is
> downloadable, but it currently doesn't seem to compile on Linux/x86.
> It does have a callable C interface, as documented in the kimmolib.txt
> in this file [2]. In fact, I'm working on an AI program that calls
> PC-Kimmo to do morphology. Batch mode is used via the 'take' command,
> and using a .tak file.

Here's an update. I tried to build the whole pc-parse package on Linux (RedHat 9.0) using gcc 3.2.2, and it compiled without a single problem. I also tried running PC-Kimmo, and it worked smoothly. I noticed that in the README, they claim to have tested the build process on the following platforms:

1. Debian GNU/Linux 2.2 (kernel 2.2.17) / gcc 2.95.2, glibc 2.1.3-24
2. Red Hat Linux 7.3 (kernel 2.4.18) / gcc 2.96, glibc 2.2.5-34
3. Red Hat Linux 8.0 (kernel 2.4.18-14) / gcc 3.2-7, glibc 2.2.93-5
4. OpenBSD 3.1 / gcc 2.95.3
5. Mac OS X (10.2) / gcc 3.1
6. cygwin 1.3.10-1 (Windows XP Pro) / gcc 2.95.3-5

Maybe you're trying an older version?

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
RE: Farsi Stemming Algorithm
> It's a two-level morphology engine, so basically it resolves a surface
> form to a lexical form, or lexical to surface form.
> For example, if I give it a newspaper word like 'nmiAim'
> (نميايم -- I am not coming), it
> will resolve to 'n+mi+A+m', taking into account any morpheme boundary
> changes (like the yeh here). More documentation is found here [1].

Thanks for the information. I tried nmiAim, but unfortunately didn't get any results. However, I noticed that it does recognize some words; for example, it resolves xuAhd (khahad) to xuAh+d. It seems like a perfect tool for my job. Thanks for the nice work!

> One of the things that drives me nuts about the software is that it
> claims to run on Solaris/Sparc, Win/x86, MacOS, or BSD, but apparently
> no Linux (I have a Sparc box, so I'm lucky :-). The source code is
> downloadable, but it currently doesn't seem to compile on Linux/x86.
> It does have a callable C interface, as documented in the kimmolib.txt
> in this file [2]. In fact, I'm working on an AI program that calls
> PC-Kimmo to do morphology. Batch mode is used via the 'take' command,
> and using a .tak file.

I downloaded the source, took a look at it, and found this file: pc-parse-20030321/pckimmo/r.c, which seems to be exactly what I'm after, a C interface for the recognition engine. Too bad it doesn't compile on Linux, though, because I'm planning to use it in a PHP extension that must run on both Linux/x86 and Win32. However, if the source doesn't need a full rewrite, I can fix it to compile on Linux as well (I'm a C/C++ programmer more than anything!). Are you interested in the fixed sources? I could send them to you, or make them available online if there's enough interest. Also let me know if you'd be interested in the PHP extension.

> Don't be too disappointed about version 0.5 of the Persian
> implementation -- it was released 2 years ago
> ;-) I've reworked almost every aspect of it since then, so hopefully
> it will work better.
> Have fun.

Hmmm, would it be possible for me to get a copy of your latest work before you publish it? I'd be grateful if you could send it to me. Thanks!

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
RE: Farsi Stemming Algorithm
Thanks a lot, Jon, for your reply.

> The only one that I'm aware of is found here [1]. But it
> seems hard to get any other information about this stemmer.

Yes, it definitely seems so. The only Farsi stemmer I've been aware of myself is http://www.isri.unlv.edu/publications/isripub/Taghva2003-02.pdf . I contacted Dr. Taghva some time ago about his stemmer, but never heard back from him.

> While the aim is a little different from a stemmer, a Persian
> morphological engine is being developed. The one available
> for download [2] is a couple versions behind current
> development, but it still yields decent results. Version 0.5
> is public domain, and newer versions will be under the
> General Public License. A new version will be released in a
> couple of months.

I downloaded this package and looked into it. It seems to be useful for my job. However, this is the first time I've heard of PC-Kimmo, so I was kind of lost when trying to figure the whole thing out. I was wondering if you could provide me with some additional info (or URLs; I didn't find any myself) about this software, especially how it can be used on Linux in batch mode. Does PC-Kimmo come with a callable C interface?

Thanks a lot!

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)
Farsi Stemming Algorithm
Hi all,

Does anyone know of a free stemming algorithm for Farsi, like the Porter algorithm for English?

Thanks a lot!

- Ehsan Akhgari
Farda Technology (http://www.farda-tech.com/)