Hi Paul,

>> Is this something you will use yourself or for a client, or propose to
make available one way or another?

This is work that I did myself as a side project some years ago to cement
my Silverlight and C# knowledge.  I tried to find some commercial interest
in it, but it just was not there in 2009/2010 when I was looking.  I am
very open to suggestions.



Some of my original notes on the project are:



I first saw this style of graph in one of the TED talks
(http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html)
by Hans Rosling (http://en.wikipedia.org/wiki/Hans_Rosling) of the
Karolinska Institute, where he presented the work done at GapMinder
(http://gapminder.org).  I was very impressed and assumed that the graphics
system behind the graph was some extensive university project that would be
hard to reproduce.



When I saw the graph again some months later during a presentation by
Tristan Kurniawan (then at SSW) on good user interface design, it occurred
to me that this could be done as a Silverlight project.  At the time Adam
Cogan said "yeah, sure Greg, you do that this weekend"… While it was clear
that it would be a lot more than a weekend job, I started on it as my
'background project', which took up about 18 months of background work to
complete (say the equivalent of three to four months of full-time work).



While this work is strongly influenced by the GapMinder project, all the
code in this version is my own; I draw every pixel on the screen!



The data sources I used are from GapMinder.org; specifically see:

Life expectancy at birth:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA

GDP per capita:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj1jiMAkmq1iMg

Population: http://spreadsheets.google.com/pub?key=phAwcNAVuyj0XOoBL_n5tAQ



The data needed extensive massaging to get it into a more usable format
and to interpolate missing values between known data points.  See the data
tabs on the left-hand side of the graph for the raw data I ended up with.



Where data is missing for some years for a country, it is estimated by
drawing a straight line between the two nearest known data points; this
line is then used to derive values for the missing years in between.
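
As a minimal sketch (illustrative names, not the actual project code), the
interpolation is just a straight line between the two surrounding known
years:

  // Estimate a value for 'year', given known values at year0 and year1.
  static double Interpolate(int year, int year0, double value0,
                            int year1, double value1)
  {
      double fraction = (double)(year - year0) / (year1 - year0);
      return value0 + fraction * (value1 - value0);
  }

  // Example: life expectancy known for 1820 (34.1) and 1830 (35.1);
  // the 1824 estimate is 34.1 + 0.4 * (35.1 - 34.1) = 34.5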



The data displayed is not complete and may have errors and omissions;
where there was a problem with part of the data set, that part was left
out rather than represented incorrectly.  Merging separate data sets was a
problem where countries appeared under different names, so a direct merge
was not always possible; where no clear merge presented itself, the data
was excluded.

Other errors may have been introduced while preparing the data for
representation in this format (I would welcome someone doing a thorough
data validation).



Once I had all of the data, I worked on getting the graph drawn.  The
graph is built from many lines, circles and rectangles drawn on a
Silverlight canvas.  With the sheer volume of data and updates needed, it
was a trial-and-error process to find an approach that performed
acceptably.
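
In Silverlight terms, each data point is just a shape positioned on a
Canvas.  A minimal sketch (assuming a Canvas named chartCanvas in the
XAML; illustrative, not the actual project code):

  using System.Windows.Controls;
  using System.Windows.Media;
  using System.Windows.Shapes;

  // Draw one country's bubble centred at (x, y) on the chart canvas.
  static void DrawBubble(Canvas chartCanvas, double x, double y,
                         double radius, Color colour)
  {
      var bubble = new Ellipse
      {
          Width = radius * 2,
          Height = radius * 2,
          Fill = new SolidColorBrush(colour)
      };
      Canvas.SetLeft(bubble, x - radius);
      Canvas.SetTop(bubble, y - radius);
      chartCanvas.Children.Add(bubble);
  }

A typical performance lever at this volume is reusing the shape objects
between updates rather than re-creating them for every frame.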


Regards

Greg Harris

On Wed, Aug 7, 2013 at 8:46 AM, Paul Evrat <p...@paulevrat.com> wrote:

> Greg,
>
> I saw the TED talk that you note was the inspiration for this. I thought
> at the time it was a brilliant way to present and understand data. Plus it
> and the presenter had the audience totally amused, but it really made the
> data talk.
>
> Is this something you will use yourself or for a client, or propose to
> make available one way or another?
>
> Regards,
>
> From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com]
> On Behalf Of Greg Harris
> Sent: Wednesday, 7 August 2013 1:30 AM
> To: ozDotNet
> Subject: Re: Lots of data over a service
>
> Hi Greg,
>
> What I did with my Motion Chart software
> (http://www.eshiftlog.com/Silverlight/MotionGraphTestPage.html) to get
> better download performance was:
>
> • Move away from many small WCF data transfers to transferring a single
> large encoded, compressed text file
>
> • Only transfer raw data (no JSON/XML structure, which adds a LOT OF FAT)
>
> • Minor use of CSV format, otherwise fixed format
>
> • Define my own number formats to reduce size (remove unneeded decimal
> places)
>
> • Use a zip file to transfer the data (a client-side sketch follows below)
>
> This improved data load time by a factor of ~50-100 (sorry, no hard
> numbers).
>
> My data ended up being 430KB for ~32K rows, just over 13 bytes/row.
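>
> A rough client-side sketch of that single download (assuming DotNetZip's
> Ionic.Zip is available to the Silverlight client; the URL and entry name
> are illustrative):
>
> // Fetch the one zip file and read the single text entry out of it.
> // Needs System.Net, System.IO and Ionic.Zip.
> var client = new WebClient();
> client.OpenReadCompleted += (s, e) =>
> {
>     using (var zip = Ionic.Zip.ZipFile.Read(e.Result))
>     using (var ms = new MemoryStream())
>     {
>         zip["data.txt"].Extract(ms);           // the single compressed entry
>         ms.Position = 0;
>         string text = new StreamReader(ms).ReadToEnd();
>         // ... parse header and detail lines from text ...
>     }
> };
> client.OpenReadAsync(new Uri("MotionData.zip", UriKind.Relative));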
>
> Example data:
>
> C,007,Australia,Oceania,1820,2007
> 3413340017010
> 3413310017070
> 3413290017280
> 3413290017530
> 3413320017950
> 3413330018330
>
> As traditional CSV text, this would look like:
>
> CountryID,Year,LifeExpect,Population,GDP,CountryName,RegionCode,RegionName
> 007,1820,34.1,0000334000,000701.0,Australia,4S,Oceania
> 007,1821,34.1,0000331000,000707.0,Australia,4S,Oceania
> 007,1822,34.1,0000329000,000728.0,Australia,4S,Oceania
> 007,1823,34.1,0000329000,000753.0,Australia,4S,Oceania
> 007,1824,34.1,0000332000,000795.0,Australia,4S,Oceania
> 007,1825,34.1,0000333000,000833.0,Australia,4S,Oceania
>
> There are three row types in the file.
>
> Lines beginning with "C" are CSV country header lines, like:
>
>   C,007,Australia,Oceania,1820,2007
>
> The values being:
>
>   - C: Header
>   - 007: Country number
>   - Australia: Country name
>   - Oceania: Country region
>   - 1820: First year for which there is data
>   - 2007: Last year for which there is data
>
> Lines starting with 0-9 are data for one individual year of the above
> country:
>
>   - The year is assumed to increment for every detail line
>
>   - Detail lines are always 13 digits wide, fixed-width fields, no field
>     separator, like:
>
>            341 334001 7010 (spaces added for clarity, not in the actual file)
>
>   - Life expectancy (x10), example: 341 = 34.1 years
>
>   - Population (last digit is an exponent multiplier): 334001 = 334,000;
>     334002 = 3,340,000.  The last digit is effectively the number of
>     zeros to add at the right-hand side.
>
>   - GDP per person (last digit is an exponent multiplier): 7010 = $701;
>     7011 = $7,010 (matching the example rows above).  Again, the last
>     digit is effectively the number of zeros to add at the right-hand
>     side.
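>
> Putting that together, a minimal C# decoder sketch for one detail line
> (illustrative names, not the actual project code):
>
> // Decode one 13-digit detail line, e.g. "3413340017010".
> // Layout: 3 digits life expectancy (x10), 5-digit population mantissa +
> // 1 exponent digit, 3-digit GDP mantissa + 1 exponent digit.
> static void DecodeDetailLine(string line,
>     out double lifeExpectancy, out long population, out double gdp)
> {
>     lifeExpectancy = int.Parse(line.Substring(0, 3)) / 10.0;    // 341 -> 34.1
>     long popMantissa = long.Parse(line.Substring(3, 5));        // 33400
>     int popZeros = line[8] - '0';                               // 1
>     population = popMantissa * (long)Math.Pow(10, popZeros);    // 334,000
>     int gdpMantissa = int.Parse(line.Substring(9, 3));          // 701
>     int gdpZeros = line[12] - '0';                              // 0
>     gdp = gdpMantissa * Math.Pow(10, gdpZeros);                 // $701
> }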
>
> You need to be careful with this technique: how much precision can you
> afford to “lose” to data rounding?
>
> You were looking for “getting the data across with the least suffering
> and complexity”; my complexity was continually refining towards simpler
> and simpler data structures, which ended up looking more and more like
> the data structures of a 1960s COBOL program, when storage was expensive
> and processing was slow.
>
> In hindsight, I feel that I still sent more data down the wire than I
> needed to.  I could have taken one digit off the age range, two digits
> off the population and one digit off the GDP, saving another 4 bytes per
> row.  I could also have used base-64 numbers, which would have saved
> roughly another 4 bytes per row (a sketch follows below).  But the
> performance was fine with this structure, so I did no more to cut it
> back.
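>
> A minimal sketch of the base-64 idea (illustrative, not code from the
> project): with 64 symbols per digit, 334000 fits in 4 characters instead
> of 6 decimal digits.
>
> // Hypothetical: write a non-negative int using 64 symbols per digit.
> const string Symbols =
>     "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";
>
> static string ToBase64Digits(int value)
> {
>     if (value == 0) return "0";
>     var sb = new System.Text.StringBuilder();
>     while (value > 0)
>     {
>         sb.Insert(0, Symbols[value % 64]);  // prepend next digit
>         value /= 64;
>     }
>     return sb.ToString();                   // 334000 -> "1HYm" (4 chars)
> }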
>
> WARNING: This worked fine with my specific, smallish, well-known data
> set; if I were putting this out into customer land, I would allow for a
> wider range of values.  For example, if we needed to express the values
> in Indonesian Rupiah rather than US Dollars, the amounts would go up by a
> factor of 10,000 and my values would no longer fit.  My values only work
> for large positive numbers; there is no room for a negative sign in front
> of the number or the exponent.
>
> So you need to design a file format that will work for your specific
> situation and data, and keep an eye on it to make sure it stays working.
>
> After having done all of this, I am tempted to see what the performance
> would be like with just simple raw CSV; if I were to re-code this today,
> that is what I would start with.
>
> Regards
>
> Greg #2 Harris
>
> On Tue, Aug 6, 2013 at 6:00 PM, Greg Keogh <g...@mira.net> wrote:
>
> Folks, I have to send several thousand database entities of different
> types to both a Silverlight 5 and WPF app for display in a grid. I can't
> "page" the data because it's all got to be loaded to allow a snappy
> response to filtering it. I'm fishing for ways of getting the data across
> with the least suffering and complexity ... don't forget that Silverlight
> is involved.
>
> Does a WCF service with http binding allow streaming? That would be the
> ideal technique if it comes out of the box and isn't too complex.
>
> I ran an experiment to convert ~6000 entities into XML and the size is a
> hefty 6MB (no surprise!); however, Ionic.Zlib deflates it down to a 500KB
> buffer which transmits acceptably fast. I'm unhappy with my code to round
> trip the entities to XML as it's a bit messy and has special-case logic
> to skip association properties.
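>
> (For reference, the Ionic.Zlib round trip here is roughly a pair of
> buffer calls; a minimal sketch, with the xml string variable assumed:)
>
> byte[] xmlBytes = System.Text.Encoding.UTF8.GetBytes(xml);
> byte[] packed = Ionic.Zlib.DeflateStream.CompressBuffer(xmlBytes);     // ~6MB -> ~500KB
> byte[] restored = Ionic.Zlib.DeflateStream.UncompressBuffer(packed);   // round trip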
>
> Then I thought of JSON, which I haven't needed to use before. Would the
> JSON libraries make round-tripping easier? Are the built-in Framework
> classes good enough, or would I need to use something like Newtonsoft?
> Can I control which properties are processed? Any general ideas would be
> welcome.
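>
> (As a sketch of the property-control part: with Newtonsoft's Json.NET,
> attributes pick what gets serialised; the entity types below are made up
> for illustration.)
>
> using Newtonsoft.Json;
>
> public class CustomerEntity              // illustrative, not the real model
> {
>     public int Id { get; set; }
>     public string Name { get; set; }
>
>     [JsonIgnore]                         // skips association properties
>     public OrderEntity[] Orders { get; set; }
> }
>
> public class OrderEntity { public int OrderId { get; set; } }
>
> // string json = JsonConvert.SerializeObject(entities);
> // var back = JsonConvert.DeserializeObject<CustomerEntity[]>(json);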
>
> Greg K
