Re: [Mono-devel-list] How to handle huge string collation resources?
Hi,

>> Cant the same be said about the table that Atsushi is creating? Do we really worry about the 200k replicated between Mono and libmono.so.
>
> Not too much ;-). Well, it might make a difference if on the same box somebody is using mono and libmono.so -- but not a big issue.

They are not only 200KB tables (that number is just for CJK support which is rarely required) but about 500KB as a whole.

> I've no idea what kind of data is in the table, and if it is endian dependent. If it's just a byte array, we'd be fine...

Well, I need more clarification for them.

- For standard collation support (numbers are in length):
  - byte[] : 153360
  - int[] : 13552
  - some extent of additional byte[] : maybe less than 5000
- for CJK support which is rarely used:
  - ushort[] : 2 * 3 + 3
  - byte[] : 2
- for String.Normalize() which is only for .NET 2.0:
  - byte[] : 3
  - int[] : 2700
  - short[] : 17600
  - maybe additional int[] : 4000

So, they are mostly byte for collation, except for CJK support which is not referenced unless we use CompareInfo methods from CJK cultures. So I think managed resources won't be bad here.

For String Normalization related stuff, I still think it would be better to have them in mscorlib.dll since they are unused in the default profile.

> If we want to include the stuff in the C runtime, we pretty much have to check in the generated file to SVN. We can't run C# code before the runtime is compiled. OTOH, if we include the stuff as a managed resource to corlib, we could run a mono app at that point to generate the file. While this stuff is in active development, checking in a large file to svn is probably going to make mono-patches-list a bit annoying.

I agree.

> So maybe the best plan would be a managed resource for now, and once the table is stable, moving it into C if that makes a substantial performance difference.

Ok, then right now I need to hack on corlib Makefile to run make under Mono.Globalization.Unicode.
Atsushi Eno ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-devel-list] How to handle huge string collation resources?
On Wed, 2005-06-22 at 04:26 +0900, Atsushi Eno wrote:
> 3. run make. It will automatically downloads some files from some sites. For now without this step the build b0rks.

Of course, this will need to be changed ;-).

> Here is a serious problem. In step 3 it makes 1.2MB of a C# source file that results in 500KB increase of mscorlib.dll.

If you are generating a file in C#, you are going to be managing memory badly. C# has no sense of a const array. When you say:

    static readonly int [] x = { ... }

This array is actually allocated in the GC *at runtime*.

Doing it in a header file would be an option. Not really ideal because it gets into our package three times (once for the statically linked mono binary, another for libmono.so, another for libmono.a).

The best option is to have it as a resource in a dll. The runtime memory maps that.

> And for about 200KB of data, they are just for CJK cultures so they won't be used unless we use those cultures to handle culture-sensitive CJK collation. That is mostly waste of memory.

Not if the data doesn't get paged in ;-).

> - CompareInfo or whatever holds those tables as static variables.
> - If the variable is null, then it tries to load the internally stored table via runtime icall_1. However at this stage it returns null, since nothing is stored.
> - Then, CompareInfo or whatever loads table-only assembly via reflection and loads table into memory, and then invokes an icall_2 that sets the table as runtime internal table.
> - Next time CompareInfo tries to fill the table, icall_1 will return the table.

The memory system essentially does that via the mmap call, however it is hidden from view.

-- Ben
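The lazy-loading steps quoted in this message can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `table_get` and `table_set` stand in for the proposed icall_1 and icall_2, and `load_table_from_assembly` is a hypothetical placeholder for loading the table-only assembly. None of these names come from the Mono runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Runtime-internal slot for the table; NULL until someone stores it. */
static const unsigned char *runtime_table = NULL;
static int load_count = 0;  /* how many times the expensive load ran */

/* Stand-in for icall_1: return the stored table, or NULL if none yet. */
const unsigned char *table_get(void)
{
    return runtime_table;
}

/* Stand-in for icall_2: register a loaded table with the runtime. */
void table_set(const unsigned char *table)
{
    assert(table != NULL);
    runtime_table = table;
}

/* Hypothetical placeholder for loading the table-only assembly. */
static const unsigned char sample_table[] = { 10, 20, 30 };

const unsigned char *load_table_from_assembly(void)
{
    load_count++;
    return sample_table;
}

/* What CompareInfo would do each time it needs the table. */
const unsigned char *ensure_table(void)
{
    const unsigned char *table = table_get();
    if (table == NULL) {
        /* First use: load once, then hand the table to the runtime. */
        table = load_table_from_assembly();
        table_set(table);
    }
    return table;
}

/* Exposed only so the sketch can be checked. */
int table_load_count(void)
{
    return load_count;
}
```

Calling `ensure_table()` repeatedly runs the load only once; every later call gets the pointer back from the `table_get` stand-in. The sketch ignores threading, which a real runtime would have to handle.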
Re: [Mono-devel-list] How to handle huge string collation resources?
Hey,

Ben Maurer wrote:
> On Wed, 2005-06-22 at 04:26 +0900, Atsushi Eno wrote:
>> 3. run make. It will automatically downloads some files from some sites. For now without this step the build b0rks.
>
> Of course, this will need to be changed ;-).

duh ;-) It will be checked in when we decide how to handle it.

>> Here is a serious problem. In step 3 it makes 1.2MB of a C# source file that results in 500KB increase of mscorlib.dll.
>
> If you are generating a file in C#, you are going to be managing memory badly. C# has no sense of a const array. When you say:
>
>     static readonly int [] x = { ... }
>
> This array is actually allocated in the GC *at runtime*.
>
> Doing it in a header file would be an option. Not really ideal because it gets into our package three times (once for the statically linked mono binary, another for libmono.so, another for libmono.a).
>
> The best option is to have it as a resource in a dll. The runtime memory maps that.

Oh, I didn't know that resources are mmapped. Yeah, then that sounds the best way.

BTW doesn't that mean all that kind of culture resources had better become managed resources, unless they are required at runtime? We also have huge culture-info-table.h and char-conversions.h in metadata.

>> And for about 200KB of data, they are just for CJK cultures so they won't be used unless we use those cultures to handle culture-sensitive CJK collation. That is mostly waste of memory.
>
> Not if the data doesn't get paged in ;-).
>
> The memory system essentially does that via the mmap call, however it is hidden from view.

Well, they will be hidden from view, but don't they still eat memory when mscorlib.dll is loaded? Don't they still get paged?

Atsushi Eno
Re: [Mono-devel-list] How to handle huge string collation resources?
On Thu, 2005-06-23 at 12:41 +0900, Atsushi Eno wrote:
> Ben Maurer wrote:
>>> Here is a serious problem. In step 3 it makes 1.2MB of a C# source file that results in 500KB increase of mscorlib.dll.
>>
>> If you are generating a file in C#, you are going to be managing memory badly. C# has no sense of a const array. When you say:
>>
>>     static readonly int [] x = { ... }
>>
>> This array is actually allocated in the GC *at runtime*.
>>
>> Doing it in a header file would be an option. Not really ideal because it gets into our package three times (once for the statically linked mono binary, another for libmono.so, another for libmono.a).
>>
>> The best option is to have it as a resource in a dll. The runtime memory maps that.
>
> Oh, I didn't know that resources are mmapped. Yeah, then that sounds the best way.
>
> BTW doesn't that mean all that kind of culture resources had better become managed resources, unless they are required at runtime? We also have huge culture-info-table.h and char-conversions.h in metadata.

They are in C, where they are a const array. One advantage of having them there is that we don't have to do conversions for different endian systems.

>>> And for about 200KB of data, they are just for CJK cultures so they won't be used unless we use those cultures to handle culture-sensitive CJK collation. That is mostly waste of memory.
>>
>> Not if the data doesn't get paged in ;-).
>>
>> The memory system essentially does that via the mmap call, however it is hidden from view.
>
> Well, they will be hidden from view, but don't they still eat memory when mscorlib.dll is loaded? Don't they still get paged?

Memory in a mapped file that is never touched never gets read from the disk, nor is physical memory allocated for it.

-- Ben
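To make the point about untouched mapped memory concrete, here is a minimal POSIX sketch. It assumes a system with `mmap`; `map_table` and `demo_touch_one_page` are hypothetical names, and the 512 KiB temporary file merely stands in for a table resource of roughly the size discussed in the thread. The program maps the whole file but reads a single byte, so the kernel only needs to fault in the page that holds it. Whether the other pages really stay non-resident is up to the operating system and is not verified here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only. Returns NULL on failure. */
const unsigned char *map_table(int fd, size_t *len_out)
{
    struct stat st;

    assert(len_out != NULL);
    if (fstat(fd, &st) != 0 || st.st_size == 0)
        return NULL;
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}

/* Build a 512 KiB stand-in "table" file, map it, and touch one byte.
 * Returns the byte read, or -1 on any error. */
int demo_touch_one_page(void)
{
    enum { CHUNK = 4096, CHUNKS = 128 };   /* 128 * 4096 = 512 KiB */
    unsigned char chunk[CHUNK];
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    memset(chunk, 0xAB, sizeof chunk);
    for (int i = 0; i < CHUNKS; i++) {
        if (fwrite(chunk, 1, sizeof chunk, f) != sizeof chunk) {
            fclose(f);
            return -1;
        }
    }
    if (fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    size_t len = 0;
    const unsigned char *table = map_table(fileno(f), &len);
    if (table == NULL) {
        fclose(f);
        return -1;
    }

    /* Only the page containing the last byte has to be faulted in. */
    int value = table[len - 1];

    munmap((void *)table, len);
    fclose(f);
    return value;
}
```

The mapping itself reserves address space only; physical pages are supplied on first access, which is the behavior Ben describes for resources that are never used.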
Re: [Mono-devel-list] How to handle huge string collation resources?
Hello,

>> BTW doesn't that mean all that kind of culture resources had better become managed resources, unless they are required at runtime? We also have huge culture-info-table.h and char-conversions.h in metadata.
>
> They are in C, where they are a const array. One advantage of having them there is that we don't have to do conversions for different endian systems.

Can't the same be said about the table that Atsushi is creating?

Do we really worry about the 200k replicated between Mono and libmono.so? And the libmono.a is only shipped to those developing.

The resource approach also has the downside that access to it would be through the Stream interface (even if mmapped), while getting a pointer to a statically defined C array would allow the corlib code to access it without any wrapper code.

Both approaches enjoy the mmap benefits.

Miguel.
Re: [Mono-devel-list] How to handle huge string collation resources?
Hello,

Just to follow up some more: since the generated tables contain various arrays of short and int elements, I'd rather go down the path of embedding them into the C code, so we get the automatic endian adjustment rather than forcing the managed code to deal with it.

We can add an internal call to get the pointers to the various tables and take it from there.
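A rough sketch of what "embed the tables in C and hand out pointers" could look like is given here. Everything in it is assumed for illustration: the table contents are made-up sample values, and `lookup_table`, `table_ref`, and the `TABLE_*` identifiers are hypothetical names, not Mono's actual internal call. The point it shows is that `const` arrays of 16-bit and 32-bit elements are compiled in the target's native byte order, so the consumer needs no conversion step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up sample data standing in for the generated tables. */
static const uint8_t  sample_bytes[]  = { 0x01, 0x02, 0x03, 0x04 };
static const uint16_t sample_shorts[] = { 0x0041, 0x00C0, 0x0300 };
static const uint32_t sample_ints[]   = { 0x00010000, 0x0002A6D6 };

enum table_id { TABLE_BYTES, TABLE_SHORTS, TABLE_INTS, TABLE_COUNT };

/* What a pointer-returning internal call might hand back. */
typedef struct {
    const void *data;      /* start of the read-only array */
    size_t      count;     /* number of elements */
    size_t      elem_size; /* size of one element in bytes */
} table_ref;

#define TABLE_REF(arr) \
    ((table_ref){ (arr), sizeof (arr) / sizeof (arr)[0], sizeof (arr)[0] })

/* Return a reference to one of the embedded tables. */
table_ref lookup_table(enum table_id id)
{
    assert((int)id >= 0 && (int)id < (int)TABLE_COUNT);
    switch (id) {
    case TABLE_BYTES:  return TABLE_REF(sample_bytes);
    case TABLE_SHORTS: return TABLE_REF(sample_shorts);
    case TABLE_INTS:   return TABLE_REF(sample_ints);
    default:           return (table_ref){ NULL, 0, 0 };
    }
}
```

Because the arrays are `const` and have static storage, a typical toolchain places them in read-only data that is mapped from the binary, which is why this approach shares the demand-paging behavior discussed earlier in the thread.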
Re: [Mono-devel-list] How to handle huge string collation resources?
On Thu, 2005-06-23 at 01:05 -0400, Miguel de Icaza wrote:
>>> BTW doesn't that mean all that kind of culture resources had better become managed resources, unless they are required at runtime? We also have huge culture-info-table.h and char-conversions.h in metadata.
>>
>> They are in C, where they are a const array. One advantage of having them there is that we don't have to do conversions for different endian systems.
>
> Cant the same be said about the table that Atsushi is creating?
>
> Do we really worry about the 200k replicated between Mono and libmono.so.

Not too much ;-). Well, it might make a difference if on the same box somebody is using mono and libmono.so -- but not a big issue.

I've no idea what kind of data is in the table, and if it is endian dependent. If it's just a byte array, we'd be fine...

> The resource approach has also the downside that access to it would be through the Stream interface (even if mmapped) while getting a pointer to a C-statically defined array would allow the corlib code to access it without any wrapper code.

Well, since we are in corlib, we can get the void* ;-).

Anyways, I think the two approaches don't make much difference in terms of performance (except possibly the endian stuff). Where there is a difference is in terms of development model:

If we want to include the stuff in the C runtime, we pretty much have to check in the generated file to SVN. We can't run C# code before the runtime is compiled. OTOH, if we include the stuff as a managed resource to corlib, we could run a mono app at that point to generate the file. While this stuff is in active development, checking in a large file to svn is probably going to make mono-patches-list a bit annoying.

So maybe the best plan would be a managed resource for now, and once the table is stable, moving it into C if that makes a substantial performance difference.
-- Ben
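The "endian stuff" mentioned in this message is the extra step a byte-oriented resource would need for its short and int tables. The sketch below assumes, purely for illustration, that such a resource stores multi-byte values in little-endian order (the thread does not specify a byte order); `read_u16_le`, `read_u32_le`, and `sample_resource` are hypothetical names. Assembling each value from individual bytes gives the same result on little-endian and big-endian hosts, whereas a plain byte table needs no such handling at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit value stored little-endian, independent of host order. */
uint16_t read_u16_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit value stored little-endian, independent of host order. */
uint32_t read_u32_le(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Made-up resource bytes: one 16-bit value followed by one 32-bit value. */
const unsigned char sample_resource[] = {
    0x34, 0x12,             /* 0x1234 */
    0x78, 0x56, 0x34, 0x12  /* 0x12345678 */
};
```

This per-element decoding is the cost that the C-embedded approach avoids, since the compiler emits the arrays in the host's own byte order.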
[Mono-devel-list] How to handle huge string collation resources?
Hello,

Finally I got my managed collation engine working, though it is far from the complete form I am aiming for and is mostly conceptual for now (it does not handle many things and performs badly). For now it handles ASCII case sensitivity, a large part of the CompareOptions flags, and a large part of diacritical mark processing.

Here are the steps to make it available:

1. apply the attached patch against mcs/class/corlib.
2. go to mcs/class/corlib/Mono.Globalization.Unicode
3. run make. It will automatically download some files from some sites. For now without this step the build b0rks.
4. make corlib as usual.
5. set the MONO_USE_MANAGED_COLLATION environment variable to yes.

Here is a serious problem. In step 3 it generates 1.2MB of C# source that results in a 500KB increase of mscorlib.dll. It could be made a C header, i.e. runtime source, like the existing culture-info-table.h. But it is still huge. And about 200KB of the data is just for CJK cultures, so it won't be used unless we use those cultures to handle culture-sensitive CJK collation. That is mostly a waste of memory.

One possible solution is to create a different assembly and load the tables like this:

- CompareInfo or whatever holds those tables as static variables.
- If the variable is null, then it tries to load the internally stored table via runtime icall_1. However at this stage it returns null, since nothing is stored.
- Then, CompareInfo or whatever loads the table-only assembly via reflection and loads the table into memory, and then invokes an icall_2 that sets the table as the runtime internal table.
- Next time CompareInfo tries to fill the table, icall_1 will return the table.

In fact the same discussion also applies to string normalization tables (to support String.Normalize() introduced in .NET 2.0).

Any good ideas for this problem?
Thanks,
Atsushi Eno

Index: corlib.dll.sources
===
--- corlib.dll.sources	(revision 46284)
+++ corlib.dll.sources	(working copy)
@@ -8,6 +8,12 @@
 Microsoft.Win32/Win32RegistryApi.cs
 Microsoft.Win32/Win32ResultCode.cs
 Microsoft.Win32.SafeHandles/SafeFileHandle.cs
+Mono.Globalization.Unicode/CodePointIndexer.cs
+Mono.Globalization.Unicode/MSCompatUnicodeTable.cs
+Mono.Globalization.Unicode/MSCompatUnicodeTableUtil.cs
+Mono.Globalization.Unicode/SimpleCollator.cs
+Mono.Globalization.Unicode/SortKey.cs
+Mono.Globalization.Unicode/SortKeyBuffer.cs
 Mono/Runtime.cs
 Mono.Math/BigInteger.cs
 Mono.Math.Prime/ConfidenceFactor.cs
@@ -300,7 +306,6 @@
 System.Globalization/NumberFormatInfo.cs
 System.Globalization/NumberStyles.cs
 System.Globalization/RegionInfo.cs
-System.Globalization/SortKey.cs
 System.Globalization/StringInfo.cs
 System.Globalization/TaiwanCalendar.cs
 System.Globalization/TextElementEnumerator.cs
Index: System.Globalization/CompareInfo.cs
===
--- System.Globalization/CompareInfo.cs	(revision 46284)
+++ System.Globalization/CompareInfo.cs	(working copy)
@@ -34,12 +34,17 @@
 using System.Reflection;
 using System.Runtime.Serialization;
 using System.Runtime.CompilerServices;
+using Mono.Globalization.Unicode;
 
 namespace System.Globalization
 {
 	[Serializable]
 	public class CompareInfo : IDeserializationCallback
 	{
+		public static readonly bool UseManagedCollation =
+			Environment.GetEnvironmentVariable ("MONO_USE_MANAGED_COLLATION")
+			== "yes";
+
 		// Keep in synch with MonoCompareInfo in the runtime.
 		private int culture;
 		[NonSerialized]
@@ -47,6 +52,8 @@
 		[NonSerialized]
 		private IntPtr ICU_collator;
 		private int win32LCID;	// Unused, but MS.NET serializes this
+
+		SimpleCollator collator;
 
 		/* Hide the .ctor() */
 		CompareInfo() {}
@@ -57,25 +64,50 @@
 		internal CompareInfo (CultureInfo ci)
 		{
 			this.culture = ci.LCID;
-			this.icu_name = ci.IcuName;
-			this.construct_compareinfo (icu_name);
+			if (UseManagedCollation)
+				collator = new SimpleCollator (ci);
+			else {
+				this.icu_name = ci.IcuName;
+				this.construct_compareinfo (icu_name);
+			}
 		}
 
 		[MethodImplAttribute (MethodImplOptions.InternalCall)]
 		private extern void free_internal_collator ();
-
+
 		~CompareInfo ()
 		{
-			free_internal_collator ();
+			if