Re: GSoC Weekly Report
Hi,

On 10/16/07, D Bera [EMAIL PROTECTED] wrote:
> A followup question: I did not find any API documentation for
> Mono.Data.Sqlite :( #mono was also sleeping when I asked the question
> there.

My understanding is that both M.D.SqliteClient and M.D.Sqlite follow the general ADO.Net API patterns, and that the latter is more or less a drop-in replacement for the former. A few things may need to be tweaked, but in general just changing the using statements at the top of each source file should be all that's needed.

> If M.D.Sqlite does not have a way to return rows on demand, I am
> against the migration. In the worst case, we can ship with a modified
> copy of M.D.Sqlite, but I am not sure what that would buy us.

You've always been able to get rows on demand via ADO.Net; it's just a matter of the implementation underneath. The old one (not modified by us) would load all of them into memory. I'm not sure how the new one performs memory-wise. If the Mono guys don't have any idea, the right thing to do here would be to create a large test database (or use an existing TextCache or FAStore db), do a SELECT * using the 3 implementations, and walk the results, using heap-buddy and/or heap-shot to analyze their memory usage.

> In the same breath, what is the benefit of M.D.Sqlite over
> M.D.SqliteClient for beagle? I figured out there are some ADO.Net
> advantages, but other than that ...?

It's maintained, for one, which our modified copy essentially isn't. It has the backing of the Mono team. The code is much cleaner and easier to understand, largely because it doesn't have two separate codepaths (one for sqlite v2 and one for v3). I am sure the Mono guys have other good reasons too. :)

Joe
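To make the suggested comparison concrete, here is a minimal sketch of the SELECT * walk described above. The table name is a placeholder (not the real TextCache schema), and the connection-string formats differ between the bindings; the same loop, pointed at each binding in turn, is what heap-buddy/heap-shot would profile:

// Minimal sketch of the SELECT * walk described above. Swap the using
// directive for Mono.Data.SqliteClient to compare the old binding.
using System;
using System.Data;
using Mono.Data.Sqlite;

class ReaderWalk {
    static void Main ()
    {
        // Connection string formats differ between the bindings: the
        // old one expects "URI=file:path,version=3", the new one
        // understands "Data Source=path". "textcache" is a placeholder
        // table name, not the actual schema.
        IDbConnection conn = new SqliteConnection ("Data Source=TextCache.db");
        conn.Open ();

        IDbCommand cmd = conn.CreateCommand ();
        cmd.CommandText = "SELECT * FROM textcache";

        // GC.GetTotalMemory is only a rough in-process proxy;
        // heap-buddy/heap-shot give the real picture.
        long before = GC.GetTotalMemory (true);
        int rows = 0;

        using (IDataReader reader = cmd.ExecuteReader ()) {
            // If rows really are fetched on demand, memory stays flat
            // inside this loop; a binding that slurps everything up
            // front allocates it all in ExecuteReader instead.
            while (reader.Read ())
                rows++;
        }

        long after = GC.GetTotalMemory (true);
        Console.WriteLine ("{0} rows walked, ~{1} bytes retained", rows, after - before);
        conn.Close ();
    }
}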
Re: GSoC Weekly Report
> > A followup question: I did not find any API documentation for
> > Mono.Data.Sqlite :( #mono was also sleeping when I asked the
> > question there.
>
> My understanding is that both M.D.SqliteClient and M.D.Sqlite follow
> the general ADO.Net API patterns and that the latter is more or less a
> drop-in replacement for the former. A few things may need to be
> tweaked, but in general just changing the using statements at the top
> of each source file should be all that's needed.

I was looking for some method of row-by-row retrieval, on demand. Real on-demand, where the implementation does not retrieve all the rows at once but returns them one by one.

> You've always been able to get rows on demand via ADO.Net; it's just a
> matter of the implementation underneath. The old one (not modified by
> us) would load all of them into memory. I'm not sure how the new one
> performs memory-wise. If the Mono guys don't have any idea, the right

I checked the source out of curiosity:
http://anonsvn.mono-project.com/viewcvs/trunk/mcs/class/Mono.Data.Sqlite/Mono.Data.Sqlite/

The code for the DataReader looks exactly the same (didn't do a diff, just visually) as the one in Mono.Data.SqliteClient. So even if we migrate (the migration would be easy), we still have to ship a modified in-house M.D.Sqlite and keep syncing it with upstream. *sigh*

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoC Weekly Report
Ignore my previous email ... I was looking at the wrong place :( This is the right place for the new M.D.Sqlite:
http://anonsvn.mono-project.com/viewcvs/trunk/mcs/class/Mono.Data.Sqlite/Mono.Data.Sqlite_2.0/SQLiteDataReader.cs

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Migrate to Mono.Data.Sqlite (Was: Re: GSoC Weekly Report)
> Ignore my previous email ... I was looking at the wrong place :( This
> is the right place for the new M.D.Sqlite:
> http://anonsvn.mono-project.com/viewcvs/trunk/mcs/class/Mono.Data.Sqlite/Mono.Data.Sqlite_2.0/SQLiteDataReader.cs

Migration from Mono.Data.SqliteClient to Mono.Data.Sqlite completed (rev 4061).

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoC Weekly Report
> > What to do with our local changes to Mono.Data.SqliteClient? I
> > always get confused by them. I don't even know what those changes
> > are or why they are there :-/ (it has something to do with threading
> > and locking)?
>
> The work done locally was mainly for memory usage reasons. IIRC, the
> upstream bindings pull all of the results into memory at once, whereas
> our locally modified ones do so only as needed. I don't think
> threading/locking was ever an issue -- you might be confusing it with
> the fact that we couldn't use early sqlite 3.x versions because of the
> library's broken locking policy.

Probably you are right. I still had to verify ... beagle:/source=mind?query=sqlite+beagle+lock returned nothing :-D but google returned
http://lists.ximian.com/pipermail/mono-devel-list/2005-November/015977.html
which mentions Lock ... yay! My faith in my memory is restored ;-)

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoC Weekly Report
Hi,

On 10/16/07, Debajyoti Bera [EMAIL PROTECTED] wrote:
> > > What to do with our local changes to Mono.Data.SqliteClient? I
> > > always get confused by them. I don't even know what those changes
> > > are or why they are there :-/ (it has something to do with
> > > threading and locking)?
> >
> > The work done locally was mainly for memory usage reasons. IIRC, the
> > upstream bindings pull all of the results into memory at once,
> > whereas our locally modified ones do so only as needed. I don't
> > think threading/locking was ever an issue -- you might be confusing
> > it with the fact that we couldn't use early sqlite 3.x versions
> > because of the library's broken locking policy.
>
> Probably you are right. I still had to verify ...
> beagle:/source=mind?query=sqlite+beagle+lock returned nothing :-D but
> google returned
> http://lists.ximian.com/pipermail/mono-devel-list/2005-November/015977.html
> which mentions Lock ... yay! My faith in my memory is restored ;-)

Indeed you're right, but those changes did get merged upstream. So memory usage, I believe, is the only outstanding reason.

Joe
Re: GSoC Weekly Report
> Indeed you're right, but those changes did get merged upstream. So
> memory usage, I believe, is the only outstanding reason.

Sweet. A followup question: I did not find any API documentation for Mono.Data.Sqlite :( #mono was also sleeping when I asked the question there.

If M.D.Sqlite does not have a way to return rows on demand, I am against the migration. In the worst case, we can ship with a modified copy of M.D.Sqlite, but I am not sure what that would buy us.

In the same breath, what is the benefit of M.D.Sqlite over M.D.SqliteClient for beagle? I figured out there are some ADO.Net advantages, but other than that ...?

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoC Weekly Report
Hi,

On 10/13/07, Debajyoti Bera [EMAIL PROTECTED] wrote:
> What to do with our local changes to Mono.Data.SqliteClient? I always
> get confused by them. I don't even know what those changes are or why
> they are there :-/ (it has something to do with threading and
> locking)?

The work done locally was mainly for memory usage reasons. IIRC, the upstream bindings pull all of the results into memory at once, whereas our locally modified ones do so only as needed. I don't think threading/locking was ever an issue -- you might be confusing it with the fact that we couldn't use early sqlite 3.x versions because of the library's broken locking policy.

I'm not sure what the memory side effects of the newer upstream bindings are.

Joe
Re: GSoC Weekly Report
> Sorry, I was unclear. By removing sqlite2 I meant simply removing it
> as an option from configure.in and requiring only sqlite3, not
> removing the codepaths from the cut-and-pasted code. Then, at some
> point in the future, porting over to Mono's own Mono.Data.Sqlite.

What to do with our local changes to Mono.Data.SqliteClient? I always get confused by them. I don't even know what those changes are or why they are there :-/ (it has something to do with threading and locking)?

- dBera

PS: Mannn... I love these Liberation fonts... can't stop reading the same mail ten times :P

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoC Weekly Report
Hi,

On 10/9/07, D Bera [EMAIL PROTECTED] wrote:
> > At this point, I'm in favor of dropping support for sqlite2 entirely
> > anyway. That will make a migration to the new Mono sqlite bindings
> > smoother, and drop a nasty chunk of cut-and-paste-and-patch code in
> > the tree.
>
> Me too, me too ... But I see no point in the double effort of first
> removing sqlite-2 support and then changing the code to use
> Mono.Data.Sqlite. Any volunteers for the cleanup?

Sorry, I was unclear. By removing sqlite2 I meant simply removing it as an option from configure.in and requiring only sqlite3, not removing the codepaths from the cut-and-pasted code. Then, at some point in the future, porting over to Mono's own Mono.Data.Sqlite.

Joe
Re: GSoC Weekly Report
Hi,

On 10/8/07, Debajyoti Bera [EMAIL PROTECTED] wrote:
> One thing I forgot to test was support for sqlite-2. Could anyone with
> sqlite-2 sync svn trunk and see if things work as expected? .beagle/
> might need to be deleted and files/emails re-indexed.

At this point, I'm in favor of dropping support for sqlite2 entirely anyway. That will make a migration to the new Mono sqlite bindings smoother, and drop a nasty chunk of cut-and-paste-and-patch code in the tree.

Joe
Re: GSoC Weekly Report
Hi,

First, the context of this discussion: better storing of cached data (aka the textcache).

> Very cool, and good to hear. If Arun could share a patch for his
> implementation, that would be awesome in terms of preventing wheel
> reinvention ;) If Arun is unable, or doesn't have the time to look
> into a hybrid solution, I wouldn't mind doing some investigative work.
> I think the biggest decision comes when it's time to determine what
> our cutoff is (size-wise). While there is a little extra complication
> introduced by a hybrid system, I don't see it being a major issue to
> implement. My thought would just be to have a table in the
> TextCache.db which denotes if a uri is stored in the db or on disk.
> The major concern is the cost of 2 sqlite queries per cache item.
>
> Just my thoughts on the subject. DBera: are you saying that you want
> to just work/look into the language stemming, or both the language
> stemming and the text cache? Depending on what you want to work on, I
> can help out with this, if it's something we really want to see in
> 0.3.0. Lemme know.

> > completely sure that such a loose typing system will greatly benefit
> > us when working with TEXT/STRING types, however, the gzipped blobs
> > might benefit from less disk usage thanks to being stored in a
> > single file. In addition, I know that incremental i/o is a
> > possibility with blobs in sqlite 3.4, which could potentially be
> > utilized to optimize work like this. Anyways, please send a patch to
> > the list if that's not too much to ask, or just give us an update as
> > to how things are going.
>
> Arun and I had some discussion about this, and we were trying to
> balance the performance and size issues. He already has the sqlite
> idea implemented; however, I would also like to see how a hybrid idea
> works, i.e. store the huge number of extremely small files in sqlite
> and store the really large ones on disk. Implementing this is tricky.

I just checked in some changes implementing the above hybrid idea. Currently, any file less than 4K gzipped is an "extremely small" file (stored in the db) and anything more is a "really large" one (stored on disk). The cutoff is hardcoded in TextCache.cs (BLOB_SIZE_LIMIT); a sketch of the idea follows below. The number of files and the disk size of .beagle/TextCache are reduced significantly. Performance and memory should not suffer noticeably unless I did something stupid.

One thing I forgot to test was support for sqlite-2. Could anyone with sqlite-2 sync svn trunk and see if things work as expected? .beagle/ might need to be deleted and files/emails re-indexed.

In the past, I emailed about how this feature relates to language determination. It still does, but that would require some more work (hint: somehow merge TextCacheWriteStream and PullingReader) and a significant bit of testing. I have no plans to work on it now.

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
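A minimal sketch of the hybrid cutoff described above, assuming hypothetical StoreAsBlob/StoreAsFile helpers; the real logic lives in TextCache.cs with the hardcoded BLOB_SIZE_LIMIT, and beagle itself uses SharpZipLib's GZipOutputStream rather than System.IO.Compression:

// Sketch of the hybrid decision: gzip the extracted text first, then
// store small results as a blob in the sqlite db and large ones as a
// file on disk. BLOB_SIZE_LIMIT mirrors the 4K cutoff from the mail.
using System;
using System.IO;
using System.IO.Compression;

static class HybridTextCacheSketch {
    const int BLOB_SIZE_LIMIT = 4096; // 4K gzipped, per the mail

    public static void Store (Uri uri, byte[] rawText)
    {
        // Gzip first -- the cutoff applies to the *compressed* size.
        byte[] gzipped;
        using (MemoryStream ms = new MemoryStream ()) {
            using (GZipStream gz = new GZipStream (ms, CompressionMode.Compress))
                gz.Write (rawText, 0, rawText.Length);
            gzipped = ms.ToArray ();
        }

        if (gzipped.Length <= BLOB_SIZE_LIMIT)
            StoreAsBlob (uri, gzipped);  // "extremely small": row in TextCache.db
        else
            StoreAsFile (uri, gzipped);  // "really large": file under .beagle/TextCache
    }

    // Hypothetical backends, stubbed out; not the actual TextCache.cs API.
    static void StoreAsBlob (Uri uri, byte[] data) { /* INSERT INTO sqlite ... */ }
    static void StoreAsFile (Uri uri, byte[] data) { /* write an on-disk cache file */ }
}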
Re: GSoC Weekly Report
Very cool, and good to hear. If Arun could share a patch for his implementation, that would be awesome in terms of preventing wheel reinvention ;) If Arun is unable, or doesn't have the time to look into a hybrid solution, I wouldn't mind doing some investigative work. I think the biggest decision comes when it's time to determine what our cutoff is (size-wise). While there is a little extra complication introduced by a hybrid system, I don't see it being a major issue to implement. My thought would just be to have a table in the TextCache.db which denotes if a uri is stored in the db or on disk. The major concern is the cost of 2 sqlite queries per cache item.

Just my thoughts on the subject. DBera: are you saying that you want to just work/look into the language stemming, or both the language stemming and the text cache? Depending on what you want to work on, I can help out with this, if it's something we really want to see in 0.3.0. Lemme know.

Cheers,
Kevin Kubasik

On 10/2/07, Debajyoti Bera [EMAIL PROTECTED] wrote:
> > completely sure that such a loose typing system will greatly benefit
> > us when working with TEXT/STRING types, however, the gzipped blobs
> > might benefit from less disk usage thanks to being stored in a
> > single file. In addition, I know that incremental i/o is a
> > possibility with blobs in sqlite 3.4, which could potentially be
> > utilized to optimize work like this. Anyways, please send a patch to
> > the list if that's not too much to ask, or just give us an update as
> > to how things are going.
>
> Arun and I had some discussion about this, and we were trying to
> balance the performance and size issues. He already has the sqlite
> idea implemented; however, I would also like to see how a hybrid idea
> works, i.e. store the huge number of extremely small files in sqlite
> and store the really large ones on disk. Implementing this is tricky
> (*).
>
> - dBera
>
> (*) One of my recent efforts has been to add language detection
> support (based on a patch in bugzilla). This will enable us to use the
> right stemmers and analyzers depending on the language. The hard part
> is stealing some initial text for language detection and doing it in a
> transparent way. Incidentally, one implementation of the hybrid
> approach mentioned above and the language detection cross paths. I am
> waiting for some free time to get going after them.
>
> --
> - Debajyoti Bera @ http://dtecht.blogspot.com
>   beagle / KDE fan
>   Mandriva / Inspiron-1100 user

--
Cheers,
Kevin Kubasik
http://kubasik.net/blog
Re: GSoC Weekly Report
> Just my thoughts on the subject. DBera: are you saying that you want
> to just work/look into the language stemming, or both the language
> stemming and the text cache? Depending on what you want to work on, I
> can help out with this, if it's something we really want to see in
> 0.3.0. Lemme know.

1. I definitely don't have the time, else it would have been done by now :)

2. I will locate Arun's patch and send it out; it's a good implementation and can act as a reference.

3. The problem is less about the number of queries. It is more about sending the data to the textcache (which can store it gzipped either in sqlite or on disk), to the language determination class, and to lucene, without (repeat: without) storing all the data in one huge store/string in memory. I thought a cutoff size of disk_block_size would be a good starting point; it will reduce external fragmentation to a good degree, since most textcache files are less than 1 block. So the decision to store on disk or in sqlite can only come after we have read, say, 4KB of data. Language determination, I think, requires about 1K of text. In our filter/lucene interface, lucene asks for data and then the filters go and extract a little more data from the file and send it back; this loops until there is no more data to extract. There is no storing of data in memory! So to do the whole thing correctly, as lucene asks for more data, the filters return the data, and transparently someone in the middle decides whether to store the data in sqlite or on disk (and does so); furthermore, even before lucene asks for data, about 1K of data is extracted from the file, the language is detected, the appropriate stemmer is hooked up, and the data is kept around until lucene asks for it (a sketch of this follows below). The obvious approach is to extract all the data in advance, store it in memory, decide where to store the textcache, decide the language, and then comfortably feed lucene from the stored data. That's not desired. I hope you also see where the connection between language determination and the text cache comes in.

Go for them if you or anyone else wants to. Just let others know so there is no duplication of effort.

N. Let's not target a release and cram features in :) Instead, if you want to work on something, work on it. If it is done and release-ready by 0.3, it will be included. Otherwise there is always another release. There is little sense in including lots of half-complete, poorly implemented features just to make the release notes look yummy :-) Of course I am restating the obvious. (*)

- dBera

(*) When I sent out a to-come feature list in one of my earlier emails, I was stressing the fact that testing is becoming very important and difficult with all these different features, and less the fact that "Wow! Now we can do XXX too." Now I think I was misread.

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
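One way to arrange the "someone in the middle" from point 3 is a tee reader, sketched minimally below; the LanguageDetector hook is hypothetical, and a full implementation would also override the other Read variants and hold the sampled text for the consumer rather than just copying it:

// A TextReader wrapper that passes data through to the consumer
// (lucene) while (a) copying everything to the text cache sink and
// (b) siphoning off the first ~1K of text for language detection --
// all without buffering the full document.
using System;
using System.IO;
using System.Text;

public class TeeReader : TextReader {
    const int LANG_SAMPLE_SIZE = 1024; // ~1K of text, per the mail

    readonly TextReader source;    // the filter's pull-based reader
    readonly TextWriter cacheSink; // text cache (sqlite blob or disk file)
    readonly StringBuilder sample = new StringBuilder ();
    bool languageDecided = false;

    public TeeReader (TextReader source, TextWriter cacheSink)
    {
        this.source = source;
        this.cacheSink = cacheSink;
    }

    public override int Read (char[] buffer, int index, int count)
    {
        int n = source.Read (buffer, index, count);
        if (n <= 0)
            return n;

        cacheSink.Write (buffer, index, n); // tee to the cache as we go

        if (!languageDecided) {
            sample.Append (buffer, index, n);
            if (sample.Length >= LANG_SAMPLE_SIZE) {
                // Hypothetical hook: pick the stemmer/analyzer once.
                // LanguageDetector.Detect (sample.ToString ());
                languageDecided = true;
            }
        }
        return n;
    }
}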
Re: GSoC Weekly Report
On 02/10/2007, Kevin Kubasik [EMAIL PROTECTED] wrote:
> Very cool, and good to hear. If Arun could share a patch for his
> implementation, that would be awesome in terms of preventing wheel
> reinvention ;) If Arun is unable, or doesn't have the time to look
> into a hybrid solution, I wouldn't mind doing some investigative work.

I've been completely swamped with work here in the first half of the semester, and I've spent a little time getting the xesam-adaptor updated to the latest spec. Do let me know if you're taking this up, so there's no duplication of effort. The patch against r4013 is attached.

> I think the biggest decision comes when it's time to determine what
> our cutoff is (size-wise). While there is a little extra complication
> introduced by a hybrid system, I don't see it being a major issue to
> implement. My thought would just be to have a table in the
> TextCache.db which denotes if a uri is stored in the db or on disk.
> The major concern is the cost of 2 sqlite queries per cache item.

Might it not be easier to have a boolean field denoting whether the field is an on-disk URI or the blob itself? Or better, if this is possible, to just examine the first few bytes to see if they are some ASCII text (or !(the Zip magic bytes))?

Best,
--
Arun Raghavan (http://nemesis.accosted.net)
v2sw5Chw4+5ln4pr6$OFck2ma4+9u8w3+1!m?l7+9GSCKi056
e6+9i4b8/9HTAen4+5g4/8APa2Xs8r1/2p5-8 hackerkey.com

Index: beagled/FileSystemQueryable/FileSystemQueryable.cs
===================================================================
--- beagled/FileSystemQueryable/FileSystemQueryable.cs	(revision 4013)
+++ beagled/FileSystemQueryable/FileSystemQueryable.cs	(working copy)
@@ -1810,17 +1810,12 @@
 		// is stored in a property.
 		Uri uri = UriFu.EscapedStringToUri (hit ["beagle:InternalUri"]);
 
-		string path = TextCache.UserCache.LookupPathRaw (uri);
+		Stream text = TextCache.UserCache.LookupText (uri, hit.Uri.LocalPath);
 
-		if (path == null)
+		if (text == null)
 			return null;
 
-		// If this is self-cached, use the remapped Uri
-		if (path == TextCache.SELF_CACHE_TAG)
-			return SnippetFu.GetSnippetFromFile (query_terms, hit.Uri.LocalPath, full_text);
-
-		path = Path.Combine (TextCache.UserCache.TextCacheDir, path);
-		return SnippetFu.GetSnippetFromTextCache (query_terms, path, full_text);
+		return SnippetFu.GetSnippet (query_terms, new StreamReader (text), full_text);
 	}
 
 	override public void Start ()

Index: beagled/TextCache.cs
===================================================================
--- beagled/TextCache.cs	(revision 4013)
+++ beagled/TextCache.cs	(working copy)
@@ -37,6 +37,53 @@
 
 namespace Beagle.Daemon {
 
+	// We only have this class because GZipOutputStream doesn't let us
+	// retrieve the baseStream
+	public class TextCacheStream : GZipOutputStream {
+
+		private Stream stream;
+
+		public Stream BaseStream {
+			get { return stream; }
+		}
+
+		public TextCacheStream () : this (new MemoryStream ())
+		{
+		}
+
+		public TextCacheStream (Stream stream) : base (stream)
+		{
+			this.stream = stream;
+			this.IsStreamOwner = false;
+		}
+	}
+
+	public class TextCacheWriter : StreamWriter {
+
+		private Uri uri;
+		private TextCache parent_cache;
+		private TextCacheStream tcStream;
+
+		public TextCacheWriter (TextCache cache, Uri uri, TextCacheStream tcStream) : base (tcStream)
+		{
+			parent_cache = cache;
+			this.uri = uri;
+			this.tcStream = tcStream;
+		}
+
+		override public void Close ()
+		{
+			base.Close ();
+
+			Stream stream = tcStream.BaseStream;
+
+			byte[] text = new byte [stream.Length];
+			stream.Seek (0, SeekOrigin.Begin);
+			stream.Read (text, 0, (int) stream.Length);
+
+			parent_cache.Insert (uri, text);
+			tcStream.BaseStream.Close ();
+		}
+	}
+
 	// FIXME: This class isn't multithread safe! This class does not
 	// ensure that different threads don't utilize a transaction started
 	// in a certain thread at the same time. However, since all the
@@ -50,7 +97,7 @@
 
 	static public bool Debug = false;
 
-	public const string SELF_CACHE_TAG = "*self*";
+	private const string
Re: GSoC Weekly Report
On 02/10/2007, Kevin Kubasik [EMAIL PROTECTED] wrote:
> A quick followup: some reading here:
> http://www.sqlite.org/datatype3.html provides some insight into how
> exactly sqlite3 stores values. I'm not completely sure that such a
> loose typing system will greatly benefit us when working with
> TEXT/STRING types; however, the gzipped blobs might benefit from less
> disk usage thanks to being stored in a single file. In addition, I
> know that incremental i/o is a possibility with blobs in sqlite 3.4,
> which could potentially be utilized to optimize work like this.

If the bindings wrap a Stream around this, that would be ideal. There doesn't seem to be much documentation on the new bindings. From what I can see in the mono-1.2.5.1 code, the new bindings (like the old bindings) just return the entire contents of the field. Maybe we should make a feature request?

--
Arun Raghavan (http://nemesis.accosted.net)
v2sw5Chw4+5ln4pr6$OFck2ma4+9u8w3+1!m?l7+9GSCKi056
e6+9i4b8/9HTAen4+5g4/8APa2Xs8r1/2p5-8 hackerkey.com
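For reference, the stock ADO.Net interface does expose a chunked-read call, IDataRecord.GetBytes, which is presumably what such a Stream wrapper would sit on top of. The caveat above still applies: if the binding materializes the whole field before serving GetBytes, this changes the API shape but not the memory profile. A minimal sketch:

// Walk one blob column in fixed-size chunks via the stock ADO.Net
// GetBytes call. "column" is the ordinal of the (hypothetical)
// gzipped-text field in the current row.
using System;
using System.Data;

static class BlobChunkReader {
    public static long ReadInChunks (IDataRecord record, int column)
    {
        byte[] buffer = new byte[8192];
        long offset = 0;

        while (true) {
            long n = record.GetBytes (column, offset, buffer, 0, buffer.Length);
            if (n <= 0)
                break;
            // ... feed buffer[0..n) to the gzip decompressor here ...
            offset += n;
        }
        return offset; // total bytes read
    }
}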
Re: GSoC Weekly Report
Updated patch attached -- some of the older code was not building.

Cheers,
Arun

On 02/10/2007, Arun Raghavan [EMAIL PROTECTED] wrote:
> On 02/10/2007, Kevin Kubasik [EMAIL PROTECTED] wrote:
> > Very cool, and good to hear. If Arun could share a patch for his
> > implementation, that would be awesome in terms of preventing wheel
> > reinvention ;) If Arun is unable, or doesn't have the time to look
> > into a hybrid solution, I wouldn't mind doing some investigative
> > work.
>
> I've been completely swamped with work here in the first half of the
> semester, and I've spent a little time getting the xesam-adaptor
> updated to the latest spec. Do let me know if you're taking this up,
> so there's no duplication of effort. The patch against r4013 is
> attached.
>
> > I think the biggest decision comes when it's time to determine what
> > our cutoff is (size-wise). While there is a little extra
> > complication introduced by a hybrid system, I don't see it being a
> > major issue to implement. My thought would just be to have a table
> > in the TextCache.db which denotes if a uri is stored in the db or on
> > disk. The major concern is the cost of 2 sqlite queries per cache
> > item.
>
> Might it not be easier to have a boolean field denoting whether the
> field is an on-disk URI or the blob itself? Or better, if this is
> possible, to just examine the first few bytes to see if they are some
> ASCII text (or !(the Zip magic bytes))?
>
> Best,
> --
> Arun Raghavan (http://nemesis.accosted.net)
> v2sw5Chw4+5ln4pr6$OFck2ma4+9u8w3+1!m?l7+9GSCKi056
> e6+9i4b8/9HTAen4+5g4/8APa2Xs8r1/2p5-8 hackerkey.com

Index: beagled/FileSystemQueryable/FileSystemQueryable.cs
===================================================================
--- beagled/FileSystemQueryable/FileSystemQueryable.cs	(revision 4016)
+++ beagled/FileSystemQueryable/FileSystemQueryable.cs	(working copy)
@@ -1810,17 +1810,12 @@
 		// is stored in a property.
 		Uri uri = UriFu.EscapedStringToUri (hit ["beagle:InternalUri"]);
 
-		string path = TextCache.UserCache.LookupPathRaw (uri);
+		Stream text = TextCache.UserCache.LookupText (uri, hit.Uri.LocalPath);
 
-		if (path == null)
+		if (text == null)
 			return null;
 
-		// If this is self-cached, use the remapped Uri
-		if (path == TextCache.SELF_CACHE_TAG)
-			return SnippetFu.GetSnippetFromFile (query_terms, hit.Uri.LocalPath, full_text);
-
-		path = Path.Combine (TextCache.UserCache.TextCacheDir, path);
-		return SnippetFu.GetSnippetFromTextCache (query_terms, path, full_text);
+		return SnippetFu.GetSnippet (query_terms, new StreamReader (text), full_text);
 	}
 
 	override public void Start ()

Index: beagled/TextCache.cs
===================================================================
--- beagled/TextCache.cs	(revision 4016)
+++ beagled/TextCache.cs	(working copy)
@@ -37,6 +37,53 @@
 
 namespace Beagle.Daemon {
 
+	// We only have this class because GZipOutputStream doesn't let us
+	// retrieve the baseStream
+	public class TextCacheStream : GZipOutputStream {
+
+		private Stream stream;
+
+		public Stream BaseStream {
+			get { return stream; }
+		}
+
+		public TextCacheStream () : this (new MemoryStream ())
+		{
+		}
+
+		public TextCacheStream (Stream stream) : base (stream)
+		{
+			this.stream = stream;
+			this.IsStreamOwner = false;
+		}
+	}
+
+	public class TextCacheWriter : StreamWriter {
+
+		private Uri uri;
+		private TextCache parent_cache;
+		private TextCacheStream tcStream;
+
+		public TextCacheWriter (TextCache cache, Uri uri, TextCacheStream tcStream) : base (tcStream)
+		{
+			parent_cache = cache;
+			this.uri = uri;
+			this.tcStream = tcStream;
+		}
+
+		override public void Close ()
+		{
+			base.Close ();
+
+			Stream stream = tcStream.BaseStream;
+
+			byte[] text = new byte [stream.Length];
+			stream.Seek (0, SeekOrigin.Begin);
+			stream.Read (text, 0, (int) stream.Length);
+
+			parent_cache.Insert (uri, text);
+			tcStream.BaseStream.Close ();
+		}
+	}
+
 	// FIXME: This class isn't multithread safe! This class does not
 	// ensure that different threads don't utilize a transaction started
 	// in a certain thread at the same time. However, since all the
@@ -50,7 +97,7
Re: GSoC Weekly Report
On Tuesday 02 October 2007 19:13, you wrote:
> Thinking quickly, one way to do this would be to add an option to
> query to specify the language.

That's a nice option, but the default should be to search all languages, I think. People are used to just typing a word without setting another option.

Regards
 Daniel

--
http://www.danielnaber.de
Re: GSoC Weekly Report
On 8/19/07, Arun Raghavan [EMAIL PROTECTED] wrote:
> Hello All,
> This week I've been working on the new TextCache implementation that
> I'd mentioned the last time (replacing the bunch of files with an
> Sqlite db).
>
> Making an Sqlite db with just the uri and raw text caused an almost 3x
> increase in the text cache size (3.6 MB (on-disk) vs. almost 15 MB in
> my test case). This despite the fact that the size of the raw text was
> only 7.9 MB. I need to figure out why this happens. In the meantime, I
> also implemented another version of this which stores (uri, gzipped
> text) pairs in the Sqlite db instead of (uri, raw text). Surprisingly,
> this actually seems to work very well (the db for the test case
> mentioned shrunk down to 2.6 MB, which is just a little more than the
> actual size of the compressed data itself).

My first impression of this is that Sqlite is probably building an index for the raw text data, whereas the compressed data is simply treated as a binary 'blob'. I'm not 100% sure of the table definitions that you're using, or exactly how much (in terms of indexes) sqlite does automatically, but that seems like the most likely culprit. As we already have our own system for searching text ;) if you could find a way to force sqlite to not index the table's raw text column, you could probably get more sane numbers regarding the database size. However, it's possible it's just how sqlite handles text content, and the gzipped text is the best way to go. (A sketch of the kind of schema in question follows below.)

The other thing to test is how this is handled in far larger situations. Is it possible that the first 1000 rows are very expensive, but when we scale to 5 rows, we see only a minute increase in size?

> Performance numbers on a search which returns 1205 results are below.
> I basically ran the measurements twice -- once after flushing the
> inode, dentry and page caches, and another time taking advantage of
> the disk caches.
>
> Current TextCache:
> no-disk-cache: ~1m
> with-disk-cache: ~9s
>
> New TextCache (raw and gzipped versions had similar numbers):
> no-disk-cache: ~42s
> with-disk-cache: ~10s

Very cool / interesting. One of the important cases to test here is multiple successive queries. Think deskbar, as-a-user-types completion: how does such a system fare when it gets 15 or 20 queries back to back? Does the compression difference factor in then?

> One very important factor remains to be seen -- memory usage. I am
> working on figuring out what the impact of the new code on memory
> usage is. Numbers should be available soon.
>
> On the Xesam front, I will be updating the code tomorrow/day-after to
> reflect the latest changes to the spec.

I know the Google SoC is over, and it's completely OK if you're too busy to complete these tests, but it would be awesome if you could provide a patch to the list so we can not only see exactly what you were doing, but also so that someone else might finish up your work and/or get it merged in and ready for 0.3.0.

> --
> Arun Raghavan

--
Cheers,
Kevin Kubasik
http://kubasik.net/blog
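For concreteness, here is a guess at the kind of (uri, gzipped blob) table being discussed; these are not Arun's actual table definitions. Note that with this layout sqlite only builds an index for the uri PRIMARY KEY, and the BLOB column itself is never indexed, which is consistent with the gzipped variant staying close to the raw compressed size:

// Hypothetical (uri, gzipped text) table for the TextCache, using the
// new Mono.Data.Sqlite bindings' parameter syntax. BLOB affinity
// stores the bytes untouched; TEXT affinity may attempt conversions
// (see the datatype3 page cited in the thread).
using System.Data;
using Mono.Data.Sqlite;

static class TextCacheSchemaSketch {
    public static void CreateAndInsert (IDbConnection conn, string uri, byte[] gzipped)
    {
        using (IDbCommand cmd = conn.CreateCommand ()) {
            cmd.CommandText = "CREATE TABLE IF NOT EXISTS textcache (" +
                              " uri  TEXT NOT NULL PRIMARY KEY," +
                              " data BLOB NOT NULL)";
            cmd.ExecuteNonQuery ();
        }

        using (IDbCommand cmd = conn.CreateCommand ()) {
            cmd.CommandText = "INSERT OR REPLACE INTO textcache (uri, data) VALUES (@uri, @data)";

            IDbDataParameter p = cmd.CreateParameter ();
            p.ParameterName = "@uri";
            p.Value = uri;
            cmd.Parameters.Add (p);

            p = cmd.CreateParameter ();
            p.ParameterName = "@data";
            p.Value = gzipped; // stored as a BLOB; no index is built over it
            cmd.Parameters.Add (p);

            cmd.ExecuteNonQuery ();
        }
    }
}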
Re: GSoC Weekly Report
A quick followup: some reading here: http://www.sqlite.org/datatype3.html provides some insight into how exactly sqlite3 stores values. I'm not completely sure that such a loose typing system will greatly benefit us when working with TEXT/STRING types; however, the gzipped blobs might benefit from less disk usage thanks to being stored in a single file. In addition, I know that incremental i/o is a possibility with blobs in sqlite 3.4, which could potentially be utilized to optimize work like this.

Anyways, please send a patch to the list if that's not too much to ask, or just give us an update as to how things are going.

Cheers,
Kevin Kubasik

On 10/1/07, Kevin Kubasik [EMAIL PROTECTED] wrote:
> On 8/19/07, Arun Raghavan [EMAIL PROTECTED] wrote:
> > Hello All,
> > This week I've been working on the new TextCache implementation that
> > I'd mentioned the last time (replacing the bunch of files with an
> > Sqlite db).
> >
> > Making an Sqlite db with just the uri and raw text caused an almost
> > 3x increase in the text cache size (3.6 MB (on-disk) vs. almost
> > 15 MB in my test case). This despite the fact that the size of the
> > raw text was only 7.9 MB. I need to figure out why this happens. In
> > the meantime, I also implemented another version of this which
> > stores (uri, gzipped text) pairs in the Sqlite db instead of (uri,
> > raw text). Surprisingly, this actually seems to work very well (the
> > db for the test case mentioned shrunk down to 2.6 MB, which is just
> > a little more than the actual size of the compressed data itself).
>
> My first impression of this is that Sqlite is probably building an
> index for the raw text data, whereas the compressed data is simply
> treated as a binary 'blob'. I'm not 100% sure of the table definitions
> that you're using, or exactly how much (in terms of indexes) sqlite
> does automatically, but that seems like the most likely culprit. As we
> already have our own system for searching text ;) if you could find a
> way to force sqlite to not index the table's raw text column, you
> could probably get more sane numbers regarding the database size.
> However, it's possible it's just how sqlite handles text content, and
> the gzipped text is the best way to go.
>
> The other thing to test is how this is handled in far larger
> situations. Is it possible that the first 1000 rows are very
> expensive, but when we scale to 5 rows, we see only a minute increase
> in size?
>
> > Performance numbers on a search which returns 1205 results are
> > below. I basically ran the measurements twice -- once after flushing
> > the inode, dentry and page caches, and another time taking advantage
> > of the disk caches.
> >
> > Current TextCache:
> > no-disk-cache: ~1m
> > with-disk-cache: ~9s
> >
> > New TextCache (raw and gzipped versions had similar numbers):
> > no-disk-cache: ~42s
> > with-disk-cache: ~10s
>
> Very cool / interesting. One of the important cases to test here is
> multiple successive queries. Think deskbar, as-a-user-types
> completion: how does such a system fare when it gets 15 or 20 queries
> back to back? Does the compression difference factor in then?
>
> > One very important factor remains to be seen -- memory usage. I am
> > working on figuring out what the impact of the new code on memory
> > usage is. Numbers should be available soon.
> >
> > On the Xesam front, I will be updating the code tomorrow/day-after
> > to reflect the latest changes to the spec.
>
> I know the Google SoC is over, and it's completely OK if you're too
> busy to complete these tests, but it would be awesome if you could
> provide a patch to the list so we can not only see exactly what you
> were doing, but also so that someone else might finish up your work
> and/or get it merged in and ready for 0.3.0.
>
> > --
> > Arun Raghavan
>
> --
> Cheers,
> Kevin Kubasik
> http://kubasik.net/blog

--
Cheers,
Kevin Kubasik
http://kubasik.net/blog
Re: GSoC Weekly Report
> Making an Sqlite db with just the uri and raw text caused an almost 3x
> increase in the text cache size (3.6 MB (on-disk) vs. almost 15 MB in
> my test case). This despite the fact that the size of the raw text was
> only 7.9 MB. I need to figure out why this happens. In the meantime, I
> also implemented another version of this which stores (uri, gzipped
> text) pairs in the Sqlite db instead of (uri, raw text). Surprisingly,
> this actually seems to work very well (the db for the test case
> mentioned shrunk down to 2.6 MB, which is just a little more than the
> actual size of the compressed data itself).
>
> Current TextCache:
> no-disk-cache: ~1m
> with-disk-cache: ~9s
>
> New TextCache (raw and gzipped versions had similar numbers):
> no-disk-cache: ~42s
> with-disk-cache: ~10s

The numbers look pretty good. Size on disk is the main focus here. The disk cache will come into heavy play on a machine constantly serving queries, so even if that suffers a little bit (but only a little bit), I think it's still OK if we gain in other places. The speedup with no disk cache is an added bonus.

Does performance degrade when looking up small result sets? In the current implementation that involves fewer disk seeks, whereas for the sqlite-based approach the I/O overhead will probably be similar.

- dBera

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoc weekly report (Browser Extension Rewrite)
Tao,

I was testing the extension when I noticed this (with browser.dump enabled):

[beagle] [beaglPref.get beagle.bookmark.active] [Exception... Component returned failure code: 0x8000ffff (NS_ERROR_UNEXPECTED) [nsIPrefBranch.getBoolPref] nsresult: 0x8000ffff (NS_ERROR_UNEXPECTED) location: JS frame :: chrome://newbeagle/content/utils.js :: anonymous :: line 53 data: no]

This was getting thrown on the terminal multiple times. I'm not quite sure what was triggering it. (I didn't set any option explicitly in the preferences.)

Also, in beagleoverlay.js:writeMetadata, "uniddexed" should be "unindexed" (typo).

Could you store the URLs as text and not keyword? People should be able to query part of the url too :)

--
- Debajyoti Bera @ http://dtecht.blogspot.com
  beagle / KDE fan
  Mandriva / Inspiron-1100 user
Re: GSoc weekly report (Browser Extension Rewrite)
2007/8/7, Joe Shaw [EMAIL PROTECTED]:
> Hi,
>
> On 8/6/07, Tao Fei [EMAIL PROTECTED] wrote:
> > 2007/8/7, Joe Shaw [EMAIL PROTECTED]:
> > > I've been playing around with the new extension, and I'm seeing a
> > > little inconsistent behavior with it. I wonder if it's related to
> > > me having the old Beagle extension installed as well (although I
> > > disabled that one).
> >
> > Yes, that's the problem. I used the same preference name
> > beagle.enabled as the old extension. Fixed now: it uses
> > beagle.autoindex.active, and the tooltip is also updated.
> > beagle.enabled was the wrong name anyway, as it doesn't affect
> > on-demand indexing.
>
> Cool, I'll give it a test later today.
>
> We should keep in mind a migration path for the old extension.
> Ideally the new one will just be a drop-in replacement, and if we
> could migrate the basic settings (i.e., enabled/disabled and a
> whitelist/blacklist) that would be ideal.

Oh, you can import the preferences from the old extension (just open the preferences window, and you will see the button). Maybe I should import them silently when the new extension is installed.

> We may also want to use the same UUID so that upgrades are done
> cleanly, if there's no method for obsoleting other extensions.

The same UUID? I guess we only need to modify install.rdf to change the UUID.

--
Tao Fei (陶飞)
My Blog: blog.filia.cn
My Summer Of Code Blog: filiasoc.blogspot.com
Re: GSoc weekly report (Browser Extension Rewrite)
Hey,

I've been playing around with the new extension, and I'm seeing a little inconsistent behavior with it. I wonder if it's related to me having the old Beagle extension installed as well (although I disabled that one).

Whenever I open a site, I get the little dog icon with an X over it, indicating that it's not indexing that page. The page is not served over HTTPS, and when I open the preferences dialog, the Default Action has "Index" selected. If I click on the icon to toggle it, the page gets indexed fine, but I'm not sure why that isn't the default. After I toggle the icon, any subsequent page opens are indexed.

Joe
Re: GSoc weekly report (Browser Extension Rewrite)
2007/8/7, Joe Shaw [EMAIL PROTECTED]:
> Hey,
>
> I've been playing around with the new extension, and I'm seeing a
> little inconsistent behavior with it. I wonder if it's related to me
> having the old Beagle extension installed as well (although I disabled
> that one).

Yes, that's the problem. I used the same preference name beagle.enabled as the old extension. Fixed now: it uses beagle.autoindex.active, and the tooltip is also updated. beagle.enabled was the wrong name anyway, as it doesn't affect on-demand indexing.

> Whenever I open a site, I get the little dog icon with an X over it,
> indicating that it's not indexing that page. The page is not served
> over HTTPS, and when I open the preferences dialog, the Default Action
> has "Index" selected. If I click on the icon to toggle it, the page
> gets indexed fine, but I'm not sure why that isn't the default. After
> I toggle the icon, any subsequent page opens are indexed.
>
> Joe

--
Tao Fei (陶飞)
My Blog: blog.filia.cn
My Summer Of Code Blog: filiasoc.blogspot.com
Re: GSoc Weekly Report (Browser Extension Rewrite)
Hi,

On 7/14/07, Tao Fei [EMAIL PROTECTED] wrote:
> I've noticed that Epiphany extensions can be written in C or in
> Python. The old extension is written in C. I'm wondering whether it is
> acceptable if I write the extension in Python?

It's a possibility, although I'm not crazy about adding a Python dependency to Beagle (as opposed to libbeagle, which already has an optional Python dep for the bindings). It's probably not unreasonable to assume that anyone with Epiphany installed will also have Python, however.

Joe
RE: GSoc Weekly Report (Browser Extension Rewrite)
Hey, a quick note on the subject: I made a haphazard attempt at this rewrite some time ago and faced the same issue you have now. I think the deciding factor should be your personal experience with the languages. If you have never really worked with C but have used Python, a well-designed and well-written Python plugin is much better than a haphazard "My First C" program.

The second concern/thought is that a lot of users will leave their browsers open for hours (if not days) at a time. I'm not 100% sure this applies in the plugin context, but a garbage-collected system probably offers some safety net for memory use.

Just a quick $0.02,
Kevin Kubasik

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Joe Shaw
Sent: Friday, July 20, 2007 10:49 AM
To: Tao Fei
Cc: dashboard-hackers@gnome.org
Subject: Re: GSoc Weekly Report (Browser Extension Rewrite)

Hi,

On 7/14/07, Tao Fei [EMAIL PROTECTED] wrote:
> I've noticed that Epiphany extensions can be written in C or in
> Python. The old extension is written in C. I'm wondering whether it is
> acceptable if I write the extension in Python?

It's a possibility, although I'm not crazy about adding a Python dependency to Beagle (as opposed to libbeagle, which already has an optional Python dep for the bindings). It's probably not unreasonable to assume that anyone with Epiphany installed will also have Python, however.

Joe
Re: GSoc Weekly Report (Browser Extension Rewrite)
Sorry for the late reply; I have just gotten back home. I had some network problems and couldn't get access to the network until today.

> There was a recent one opened against the old extension about
> internationalization. I think that's a pretty important task that this
> one should address. There is even a patch attached to that bug,
> although I haven't looked at it closely.

Yes, I had noticed that (Debajyoti has cc-ed this bug to me). I'd like to say that the new extension will be translatable: I have put all the UI strings in a .dtd file and all the javascript strings in a .properties file (except some debug information). And I will keep doing that.

Thanks.

--
Tao Fei (陶飞)
My Blog: blog.filia.cn
My Summer Of Code Blog: filiasoc.blogspot.com
Re: GSoc Weekly Report (Browser Extension Rewrite)
Hi,

On 7/6/07, Tao Fei [EMAIL PROTECTED] wrote:
> I did a little searching in http://bugzilla.gnome.org/; there are some
> bug reports for the extension, e.g. Bug 317605:
> http://bugzilla.gnome.org/show_bug.cgi?id=317605
>
> In fact, I use the status bar label to indicate whether the page is
> indexed, and use the beagle icon to indicate whether beagle is
> enabled, disabled, or in an error state. The icon is global. I think
> that partly fixes the bug.

Yeah, I think this is a good idea. I didn't like that the icon was previously overloaded for two questions: "is this page indexed?" and "is the extension enabled for this page?" Separating those concepts is a good idea.

> What to do next:
> * fix the bugs in bugzilla (or avoid introducing them)

There was a recent one opened against the old extension about internationalization. I think that's a pretty important task that this one should address. There is even a patch attached to that bug, although I haven't looked at it closely.

Thanks,
Joe