
Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-23 Thread Atsushi Eno

Hi,

>> Cant the same be said about the table that Atsushi is creating?  Do we
>> really worry about the 200k replicated between Mono and libmono.so.
>
> Not too much ;-). Well, it might make a difference if on the same box
> somebody is using mono and libmono.so -- but not a big issue.

The tables are not just 200KB (that figure is only for CJK
support, which is rarely required); they total about 500KB.

> I've no idea what kind of data is in the table, and if it is endian
> dependent. If it's just a byte array, we'd be fine...

Well, let me clarify what they contain.

- For standard collation support (numbers are array lengths):
  - byte[] : 153360.
  - int[] : 13552.
  - some extent of additional byte[] : maybe less than 5000
- for CJK support which is rarely used:
  - ushort[] : 2 * 3 + 3
  - byte[] : 2
- for String.Normalize() which is only for .NET 2.0:
  - byte[] : 3
  - int[] : 2700
  - short[] : 17600
  - maybe additional int[] : 4000

So the collation data is mostly byte arrays, except for CJK support,
which is not referenced unless we use CompareInfo methods from
CJK cultures. So I think managed resources won't be bad here.

For String Normalization related stuff, I still think it would be
better to have them in mscorlib.dll since they are unused in the
default profile.

> If we want to include the stuff in the C runtime, we pretty much have to
> check in the generated file to SVN. We can't run C# code before the
> runtime is compiled. OTOH, if we include the stuff as a managed resource
> to corlib, we could run a mono app at that point to generate the file.
>
> While this stuff is in active development, checking in a large file to
> svn is probably going to make mono-patches-list a bit annoying.

I agree.

> So maybe the best plan would be a managed resource for now, and once the
> table is stable, moving it into C if that makes a substantial
> performance difference.

Ok, then right now I need to hack on corlib Makefile to run make
under Mono.Globalization.Unicode.

Atsushi Eno
___
Mono-devel-list mailing list
Mono-devel-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-devel-list

Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Ben Maurer

On Wed, 2005-06-22 at 04:26 +0900, Atsushi Eno wrote:
> 3. run make. It will automatically downloads some files
>    from some sites. For now without this step the build
>    b0rks.

Of course, this will need to be changed ;-).



> Here is a serious problem. In step 3 it makes 1.2MB of a C#
> source file that results in 500KB increase of mscorlib.dll.

If you are generating a file in C#, you are going to be managing memory
badly. C# has no notion of a const array. When you say:

static readonly int [] x = {
	...
};

This array is actually allocated on the GC heap *at runtime*.

Doing it in a header file would be an option. Not really ideal because
it gets into our package three times (once for the statically linked
mono binary, another for libmono.so, another for libmono.a).

The best option is to have it as a resource in a dll. The runtime memory
maps that.

> And for about 200KB of data, they are just for CJK cultures
> so they won't be used unless we use those cultures to handle
> culture-sensitive CJK collation. That is mostly waste of memory.

Not if the data doesn't get paged in ;-).

> - CompareInfo or whatever holds those tables as static
>   variables.
> - If the variable is null, then it tries to load the
>   internally stored table via runtime icall_1. However
>   at this stage it returns null, since nothing is stored.
> - Then, CompareInfo or whatever loads table-only assembly
>   via reflection and loads table into memory, and
>   then invokes an icall_2 that sets the table as runtime
>   internal table.
> - Next time CompareInfo tries to fill the table, icall_1
>   will return the table.

The memory system essentially does that via the mmap call, however it is
hidden from view.

-- Ben


Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Atsushi Eno

Hey,

Ben Maurer wrote:
> On Wed, 2005-06-22 at 04:26 +0900, Atsushi Eno wrote:
>> 3. run make. It will automatically downloads some files
>>    from some sites. For now without this step the build
>>    b0rks.
>
> Of course, this will need to be changed ;-).

duh ;-)  It will be checked in when we decide how to handle it.

>> Here is a serious problem. In step 3 it makes 1.2MB of a C#
>> source file that results in 500KB increase of mscorlib.dll.
>
> If you are generating a file in C#, you are going to be managing memory
> badly. C# has no sense of a const array. When you say:
>
> static readonly int [] x = {
>   ...
> }
>
> This array is actually allocated in the GC *at runtime*.
>
> Doing it in a header file would be an option. Not really ideal because
> it gets into our package three times (once for the statically linked
> mono binary, another for libmono.so, another for libmono.a).
>
> The best option is to have it as a resource in a dll. The runtime memory
> maps that.

Oh, I didn't know that resources are mmapped. Yeah, then that
sounds the best way.

BTW doesn't that mean all culture resources of that kind had better
become managed resources, unless they are required at runtime?
We also have huge culture-info-table.h and char-conversions.h
in metadata.

>> And for about 200KB of data, they are just for CJK cultures
>> so they won't be used unless we use those cultures to handle
>> culture-sensitive CJK collation. That is mostly waste of memory.
>
> Not if the data doesn't get paged in ;-).

> The memory system essentially does that via the mmap call, however it is
> hidden from view.

Well, they will be hidden from view, but don't they still eat
memory when mscorlib.dll is loaded? Don't they still get paged?

Atsushi Eno


Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Ben Maurer

On Thu, 2005-06-23 at 12:41 +0900, Atsushi Eno wrote:
> Ben Maurer wrote:
>>> Here is a serious problem. In step 3 it makes 1.2MB of a C#
>>> source file that results in 500KB increase of mscorlib.dll.
>>
>> If you are generating a file in C#, you are going to be managing memory
>> badly. C# has no sense of a const array. When you say:
>>
>> static readonly int [] x = {
>> ...
>> }
>>
>> This array is actually allocated in the GC *at runtime*.
>>
>> Doing it in a header file would be an option. Not really ideal because
>> it gets into our package three times (once for the statically linked
>> mono binary, another for libmono.so, another for libmono.a).
>>
>> The best option is to have it as a resource in a dll. The runtime memory
>> maps that.
>
> Oh, I didn't know that resources are mmapped. Yeah, then that
> sounds the best way.
>
> BTW doesn't that mean all that kind of culture resources had better
> become managed resources, unless they are required at runtime?
> We also have huge culture-info-table.h and char-conversions.h
> in metadata.

They are in C, where they are a const array. One advantage of having
them there is that we don't have to do conversions for different endian
systems.
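
To illustrate the endian point: a const array in C is compiled in the host's byte order, while a table shipped as a raw byte stream has one fixed byte order, so every multi-byte element has to be decoded when it is read. A minimal C sketch of that decoding follows. It assumes, purely for illustration, that the stream stores values little-endian; the helper names are hypothetical and not part of the Mono runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: decode fixed little-endian values from a raw
 * byte table. Only shifts on individual bytes are used, so the result
 * is the same on big-endian and little-endian hosts. */
static uint16_t table_read_u16_le(const unsigned char *table, size_t index)
{
    const unsigned char *p;

    assert(table != NULL);
    p = table + index * 2;          /* each element occupies 2 bytes */
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t table_read_u32_le(const unsigned char *table, size_t index)
{
    const unsigned char *p;

    assert(table != NULL);
    p = table + index * 4;          /* each element occupies 4 bytes */
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

A pure byte[] table needs none of this, which is why the element types of the tables matter for the choice.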

 
>>> And for about 200KB of data, they are just for CJK cultures
>>> so they won't be used unless we use those cultures to handle
>>> culture-sensitive CJK collation. That is mostly waste of memory.
>>
>> Not if the data doesn't get paged in ;-).
>
>> The memory system essentially does that via the mmap call, however it is
>> hidden from view.
>
> Well, they will be hidden from view, but don't they still eat
> memory when mscorlib.dll is loaded? Don't they still get paged?

Memory in a mapped file that is never touched never gets read from the
disk, nor is physical memory allocated for it.
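
The behaviour described above can be sketched with a plain POSIX mmap call. This is only an illustration under stated assumptions: the function names are hypothetical, it is not how the Mono runtime maps assemblies, and it targets POSIX systems only. The mapping reserves address space for the whole file, and pages are normally read from disk only when they are first touched.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only.
 * Returns NULL on failure; on success stores the length in *len_out. */
static const unsigned char *map_table_file(const char *path, size_t *len_out)
{
    struct stat st;
    void *base;
    int fd;

    assert(path != NULL && len_out != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return NULL;
    }
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* No data is copied here: untouched pages are not read from disk. */
    base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);      /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }

    *len_out = (size_t)st.st_size;
    return (const unsigned char *)base;
}

/* Release a mapping obtained from map_table_file(). */
static void unmap_table_file(const unsigned char *base, size_t len)
{
    munmap((void *)base, len);
}
```

Reading a byte through the returned pointer is what triggers the page fault that brings in that one page; the CJK part of a table would stay on disk until a CJK culture actually indexes into it.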

-- Ben


Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Miguel de Icaza

Hello,

>> BTW doesn't that mean all that kind of culture resources had better
>> become managed resources, unless they are required at runtime?
>> We also have huge culture-info-table.h and char-conversions.h
>> in metadata.
>
> They are in C, where they are a const array. One advantage of having
> them there is that we don't have to do conversions for different endian
> systems.

Can't the same be said about the table that Atsushi is creating?  Do we
really worry about the 200k replicated between Mono and libmono.so?

And the libmono.a is only shipped to those developing. 

The resource approach also has the downside that access to it would be
through the Stream interface (even if mmapped) while getting a pointer
to a C-statically defined array would allow the corlib code to access it
without any wrapper code.

Both approaches enjoy the mmap benefits.

Miguel.

Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Miguel de Icaza

Hello,

   Just to follow up some more: since the generated tables contain
arrays of both short and int elements, I would rather go down the path of
embedding them in the C code, so we get the automatic endian
adjustment rather than forcing the managed code to deal with it.

   We can add an internal call to get the pointers to the various tables
and take it from there.
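
A rough C sketch of what such a lookup could look like on the runtime side. Everything here is an assumption for illustration: the table contents are placeholders, and the names (table_id, collation_get_table) are hypothetical rather than actual Mono symbols. Because the arrays are compiled as const data, their elements are already in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder contents; the real tables would be generated. */
static const uint8_t  collation_bytes[]  = { 0x01, 0x02, 0x03 };
static const uint16_t collation_shorts[] = { 0x1234, 0xABCD };
static const uint32_t collation_ints[]   = { 0xDEADBEEFu };

enum table_id { TABLE_BYTES, TABLE_SHORTS, TABLE_INTS };

/* Hypothetical entry point that an internal call could wrap: returns a
 * pointer to the first element of the requested table and stores the
 * element count in *count_out. Unknown ids yield NULL and a count of 0. */
static const void *collation_get_table(enum table_id id, size_t *count_out)
{
    assert(count_out != NULL);

    switch (id) {
    case TABLE_BYTES:
        *count_out = sizeof collation_bytes / sizeof collation_bytes[0];
        return collation_bytes;
    case TABLE_SHORTS:
        *count_out = sizeof collation_shorts / sizeof collation_shorts[0];
        return collation_shorts;
    case TABLE_INTS:
        *count_out = sizeof collation_ints / sizeof collation_ints[0];
        return collation_ints;
    }

    *count_out = 0;
    return NULL;
}
```

The managed side would then index through the returned pointer directly, with no Stream wrapper and no byte swapping.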

Re: [Mono-devel-list] How to handle huge string collation resources?

2005-06-22 Thread Ben Maurer

On Thu, 2005-06-23 at 01:05 -0400, Miguel de Icaza wrote:
> Hello,
>
>>> BTW doesn't that mean all that kind of culture resources had better
>>> become managed resources, unless they are required at runtime?
>>> We also have huge culture-info-table.h and char-conversions.h
>>> in metadata.
>>
>> They are in C, where they are a const array. One advantage of having
>> them there is that we don't have to do conversions for different endian
>> systems.
>
> Cant the same be said about the table that Atsushi is creating?  Do we
> really worry about the 200k replicated between Mono and libmono.so.

Not too much ;-). Well, it might make a difference if on the same box
somebody is using mono and libmono.so -- but not a big issue.

I've no idea what kind of data is in the table, and if it is endian
dependent. If it's just a byte array, we'd be fine...

> The resource approach has also the downside that access to it would be
> through the Stream interface (even if mmapped) while getting a pointer
> to a C-statically defined array would allow the corlib code to access it
> without any wrapper code.

Well, since we are in corlib, we can get the void* ;-).


Anyways, I think the two approaches don't make much difference in terms
of performance (except possibly the endian stuff). Where there is a
difference is in terms of development model:

If we want to include the stuff in the C runtime, we pretty much have to
check in the generated file to SVN. We can't run C# code before the
runtime is compiled. OTOH, if we include the stuff as a managed resource
to corlib, we could run a mono app at that point to generate the file.

While this stuff is in active development, checking in a large file to
svn is probably going to make mono-patches-list a bit annoying.

So maybe the best plan would be a managed resource for now, and once the
table is stable, moving it into C if that makes a substantial
performance difference.

-- Ben


[Mono-devel-list] How to handle huge string collation resources?

2005-06-21 Thread Atsushi Eno

Hello,

Finally I got my managed collation engine working, though it is far
from the complete form I am aiming for and is mostly a proof of concept
for now (it does not handle many things and performs badly). For now it
handles ASCII case sensitivity, a large part of the CompareOptions flags,
and a large part of diacritical mark processing.

Here are the steps to make it available:

1. apply the attached patch against mcs/class/corlib.
2. go to mcs/class/corlib/Mono.Globalization.Unicode
3. run make. It will automatically download some files
   from some sites. For now without this step the build
   b0rks.
4. make corlib as usual.
5. set the MONO_USE_MANAGED_COLLATION environment variable
   to yes.

Here is a serious problem. Step 3 generates a 1.2MB C# source
file that increases the size of mscorlib.dll by 500KB.
It could be made a C header, i.e. runtime source, like the existing
culture-info-table.h. But it is still huge.
About 200KB of the data is only for CJK cultures, so it
won't be used unless we use those cultures to handle
culture-sensitive CJK collation. That is mostly a waste of memory.

One possible solution is to create a separate assembly and
load the tables like this:

- CompareInfo or whatever holds those tables as static
  variables.
- If the variable is null, then it tries to load the
  internally stored table via runtime icall_1. However
  at this stage it returns null, since nothing is stored.
- Then, CompareInfo or whatever loads table-only assembly
  via reflection and loads table into memory, and
  then invokes an icall_2 that sets the table as runtime
  internal table.
- Next time CompareInfo tries to fill the table, icall_1
  will return the table.
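
The get/set protocol in the list above can be sketched on the runtime side roughly as follows. This is a minimal illustration under assumptions: the function names are hypothetical stand-ins for icall_1 and icall_2, the managed-side logic is written in C only to keep the example in one language, demo_load stands in for loading the table-only assembly, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

/* Runtime-internal slot; empty until the managed side fills it. */
static const unsigned char *stored_table;
static size_t stored_table_len;

/* Stand-in for icall_1: returns the stored table, or NULL if none is set. */
static const unsigned char *collation_table_get(size_t *len_out)
{
    assert(len_out != NULL);
    *len_out = stored_table_len;
    return stored_table;
}

/* Stand-in for icall_2: stores the table; the first caller wins. */
static void collation_table_set(const unsigned char *table, size_t len)
{
    if (stored_table == NULL) {
        stored_table = table;
        stored_table_len = len;
    }
}

/* Stand-in for what CompareInfo would do: ask the runtime first, and
 * only load and register the table when nothing is stored yet. */
static const unsigned char *collation_table_ensure(
    const unsigned char *(*load)(size_t *len_out), size_t *len_out)
{
    const unsigned char *table = collation_table_get(len_out);

    if (table == NULL) {
        table = load(len_out);
        if (table != NULL) {
            collation_table_set(table, *len_out);
            table = collation_table_get(len_out);
        }
    }
    return table;
}

/* Demo loader with placeholder data; counts how often it is invoked. */
static int demo_load_calls;

static const unsigned char *demo_load(size_t *len_out)
{
    static const unsigned char demo_data[4] = { 10, 20, 30, 40 };

    demo_load_calls++;
    *len_out = sizeof demo_data;
    return demo_data;
}
```

Calling collation_table_ensure(demo_load, &len) twice runs the loader only once; the second call is served from the stored slot.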

In fact the same discussion also applies to string Normalization
tables (to support String.Normalize() introduced in .NET 2.0).

Any good ideas for this problem?

Thanks,
Atsushi Eno
Index: corlib.dll.sources
===
--- corlib.dll.sources  (revision 46284)
+++ corlib.dll.sources  (working copy)
@@ -8,6 +8,12 @@
 Microsoft.Win32/Win32RegistryApi.cs
 Microsoft.Win32/Win32ResultCode.cs
 Microsoft.Win32.SafeHandles/SafeFileHandle.cs
+Mono.Globalization.Unicode/CodePointIndexer.cs
+Mono.Globalization.Unicode/MSCompatUnicodeTable.cs
+Mono.Globalization.Unicode/MSCompatUnicodeTableUtil.cs
+Mono.Globalization.Unicode/SimpleCollator.cs
+Mono.Globalization.Unicode/SortKey.cs
+Mono.Globalization.Unicode/SortKeyBuffer.cs
 Mono/Runtime.cs
 Mono.Math/BigInteger.cs
 Mono.Math.Prime/ConfidenceFactor.cs
@@ -300,7 +306,6 @@
 System.Globalization/NumberFormatInfo.cs
 System.Globalization/NumberStyles.cs
 System.Globalization/RegionInfo.cs
-System.Globalization/SortKey.cs
 System.Globalization/StringInfo.cs
 System.Globalization/TaiwanCalendar.cs
 System.Globalization/TextElementEnumerator.cs
Index: System.Globalization/CompareInfo.cs
===
--- System.Globalization/CompareInfo.cs (revision 46284)
+++ System.Globalization/CompareInfo.cs (working copy)
@@ -34,12 +34,17 @@
 using System.Reflection;
 using System.Runtime.Serialization;
 using System.Runtime.CompilerServices;
+using Mono.Globalization.Unicode;
 
 namespace System.Globalization
 {
[Serializable]
public class CompareInfo : IDeserializationCallback
{
+   public static readonly bool UseManagedCollation =
+   Environment.GetEnvironmentVariable ("MONO_USE_MANAGED_COLLATION")
+   == "yes";
+
// Keep in synch with MonoCompareInfo in the runtime. 
private int culture;
[NonSerialized]
@@ -47,6 +52,8 @@
[NonSerialized]
private IntPtr ICU_collator;
private int win32LCID;  // Unused, but MS.NET serializes this
+
+   SimpleCollator collator;

/* Hide the .ctor() */
CompareInfo() {}
@@ -57,25 +64,50 @@
internal CompareInfo (CultureInfo ci)
{
this.culture = ci.LCID;
-   this.icu_name = ci.IcuName;
-   this.construct_compareinfo (icu_name);
+   if (UseManagedCollation) 
+   collator = new SimpleCollator (ci);
+   else {
+   this.icu_name = ci.IcuName;
+   this.construct_compareinfo (icu_name);
+   }
}

[MethodImplAttribute (MethodImplOptions.InternalCall)]
private extern void free_internal_collator ();
-   
+
~CompareInfo ()
{
-   free_internal_collator ();
+   if
