In this age of 'big data' you'd think there would be a big commercialisation
opportunity for visualising both small and large data sets in that way.
Standardise the input data formats so people can prepare their own data and
interpolate missing points, and it would surely be huge for management and
presentation software, particularly if nothing like it is commercially
available already.

 

 

From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com]
On Behalf Of Greg Harris
Sent: Wednesday, 7 August 2013 11:11 AM
To: ozDotNet
Subject: Re: Lots of data over a service

 

Hi Paul,

 

>> Is this something you will use yourself or for a client, or propose to
make available one way or another?

This is work that I did myself as a side project some years ago to cement my
Silverlight and C# knowledge.  I tried to find some commercial interest in
it, but it just was not there in 2009/2010 when I was looking.  I am very
open to suggestions.

 

Some of my original notes on the project are:

 

The web site where I first saw this style of graph was one of the TED talks
(http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html)
by Hans Rosling (http://en.wikipedia.org/wiki/Hans_Rosling) of the Karolinska
Institute, where he presented the work done at Gapminder
(http://gapminder.org). I was very impressed and assumed that the graphics
system behind the graph was some extensive university project that would be
hard to reproduce.

 

When I saw the graph again some months later during a presentation by
Tristan Kurniawan (then at SSW) on good user interface design, it occurred
to me that this could be done as a Silverlight project.  At the time Adam
Cogan said, "Yeah, sure Greg, you do that this weekend."  While it was clear
that it would be a lot more than a weekend job, I started on it as my
'background project', which took up about 18 months of background work to
complete (say the equivalent of three to four months of full-time work).

 

While this work is strongly influenced by the Gapminder project, all the
code in this version is my own; I draw every pixel on the screen!

 

The data sources I used are from Gapminder.org; specifically, see:

Life expectancy at birth:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA 

GDP per capita:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj1jiMAkmq1iMg

Population : http://spreadsheets.google.com/pub?key=phAwcNAVuyj0XOoBL_n5tAQ

 

The data needed extensive massaging to get it into a more usable format and
to interpolate missing data between known values.  See the data tabs on the
left-hand side of the graph for the raw data I ended up with.

 

Where data is missing for some years for a country, the values are estimated
by drawing a straight line between two known data points; this line is then
used to derive data for the missing years in between.
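That straight-line estimate is just linear interpolation between the two known years. A minimal sketch of the idea (in Python, purely illustrative, not the original Silverlight/C# code):

```python
def interpolate_years(y0, v0, y1, v1):
    """Fill the years between two known data points (y0, v0) and (y1, v1)
    by drawing a straight line between them."""
    step = (v1 - v0) / (y1 - y0)
    return {year: v0 + step * (year - y0) for year in range(y0, y1 + 1)}

# e.g. life expectancy known for 1820 (34.1) and 1824 (34.5);
# the missing years 1821-1823 fall on the connecting line:
estimates = interpolate_years(1820, 34.1, 1824, 34.5)  # 1822 -> ~34.3
```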

 

The data displayed is not complete and may have errors and omissions.  Where
there was a problem with part of the data set, that part was left out rather
than representing incorrect data.  There was also a problem merging separate
data sets where countries appeared under different names, so a direct merge
was not possible; in those cases, if a clear merge did not present itself,
the data was excluded.

Other errors may have been introduced while preparing the data for
representation in this format (I welcome someone doing a thorough data
validation).

 

Once I had all of the data, I worked on getting the graph drawn.  The graph
is drawn with many lines, circles and rectangles on a Silverlight canvas.
With the sheer volume of data and updates needed, it was a bit of a
trial-and-error process to find approaches that performed acceptably.

 

Regards

Greg Harris

 

On Wed, Aug 7, 2013 at 8:46 AM, Paul Evrat <p...@paulevrat.com> wrote:

Greg,

 

I saw the TED talk that you note was the inspiration for this. I thought at
the time it was a brilliant way to present and understand data. The
presenter had the audience totally amused, but it really made the data talk.

 

Is this something you will use yourself or for a client, or propose to make
available one way or another?

 

Regards,

 

 

From: ozdotnet-boun...@ozdotnet.com [mailto:ozdotnet-boun...@ozdotnet.com]
On Behalf Of Greg Harris
Sent: Wednesday, 7 August 2013 1:30 AM
To: ozDotNet
Subject: Re: Lots of data over a service

 

Hi Greg,

 

What I did with my Motion Chart software
(http://www.eshiftlog.com/Silverlight/MotionGraphTestPage.html) to get
better download performance was:

. Move away from small WCF data transfers to transferring a single large
encoded compressed text file

. Only transfer raw data (no JSON/XML structure, which adds a LOT OF FAT)

. Minor use of CSV format, otherwise fixed format

. Define my own number formats to reduce size (remove unneeded decimal
places)

. Use zip file to transfer data

This improved data load time by a factor of ~50-100 (sorry, no hard
numbers).

My data ended up being 430KB for ~32K rows, just over 13 bytes/row.

 

Example data:

C,007,Australia,Oceania,1820,2007

3413340017010

3413310017070

3413290017280

3413290017530

3413320017950

3413330018330

 

As traditional CSV text, this would look like:

CountryID,Year,LifeExpect,Population,GDP,CountryName,RegionCode,RegionName

007,1820,34.1,0000334000,000701.0,Australia,4S,Oceania

007,1821,34.1,0000331000,000707.0,Australia,4S,Oceania

007,1822,34.1,0000329000,000728.0,Australia,4S,Oceania

007,1823,34.1,0000329000,000753.0,Australia,4S,Oceania

007,1824,34.1,0000332000,000795.0,Australia,4S,Oceania

007,1825,34.1,0000333000,000833.0,Australia,4S,Oceania

 

There are three row types in the file:

Lines beginning with "C" are CSV country header lines - Like:

  C,007,Australia,Oceania,1820,2007

The values being:

  - C: Header

  - 007: Country number

  - Australia: Country name

  - Oceania: Country region

  - 1820: First year there is data

  - 2007: Last year there is data

 

Lines starting with 0-9 are data for one individual year for the above
country

  - The year is assumed to increment for every detail line

  - These detail lines are always 13 digits wide, fixed width fields, no
field separator, like:

           341 334001 7010 (spaces added for clarity, not in actual file)

  - Life expectancy (x10), example: 341 = 34.1 years

  - Population (last digit is exponent multiplier) 334001 = 334,000; 334002
= 3,340,000. 

    The last digit is effectively the number of zeros to add at the right
hand side.

  - GDP (per person, last digit is exponent multiplier) 7010 = $701; 7011
= $7,010. 

     Again, the last digit is effectively the number of zeros to add at the
right hand side.
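The decoding rules above can be sketched as follows (in Python rather than the original Silverlight/C#, purely for illustration). It applies the same "number of zeros to add" rule to GDP as to population, which matches the sample detail line and its CSV rendering earlier in this message:

```python
def decode_detail(line):
    """Decode one 13-digit detail line into (life expectancy, population, GDP).

    Layout: 3 digits for life expectancy x10, then population as a
    5-digit mantissa plus a 1-digit exponent, then GDP as a 3-digit
    mantissa plus a 1-digit exponent.  The exponent digit is the
    number of zeros to append to the mantissa.
    """
    life = int(line[0:3]) / 10.0                  # "341"          -> 34.1 years
    pop  = int(line[3:8]) * 10 ** int(line[8])    # "33400"+1 zero -> 334,000
    gdp  = int(line[9:12]) * 10 ** int(line[12])  # "701"+0 zeros  -> 701
    return life, pop, gdp

# First detail line of the Australia example above:
life, pop, gdp = decode_detail("3413340017010")  # 34.1, 334000, 701
```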

 

You need to be careful with this technique: how much data can you afford to
"lose" to rounding?

 

You were looking for "getting the data across with the least suffering and
complexity"; my complexity was the continual refinement towards simpler and
simpler data structures, which ended up looking like something out of a
1960s COBOL program, from when storage was expensive and processing was slow.

 

In hindsight, I feel that I still sent more data down the wire than I needed
to.  I could have taken one digit off the age range, two digits off the
population and one digit off the GDP, saving another 4 bytes per row.  I
could also have used base-64 numbers, which would have saved another ~4
bytes per row.  But the performance was fine with this structure, so I did
no more to cut it back.

 

WARNING: This worked fine with my specific, smallish, well-known data set.
If I were putting this out into customer land, I would allow for a wider
range of values.  For example, if we needed to express the values in
Indonesian Rupiah rather than US Dollars, the amounts would go up by a
factor of 10,000 and my values would no longer fit.  My format only works
for large positive numbers; there is no room for a negative sign in front
of the number or the exponent.

 

So you need to design a file format that will work for your specific
situation and data and keep an eye on it to make sure it stays working.

 

After having done all of this, I am tempted to see what the performance
would be like with just simple raw CSV; if I were going to re-code this
today, that is where I would start.

 

Regards

Greg #2 Harris

 

 

On Tue, Aug 6, 2013 at 6:00 PM, Greg Keogh <g...@mira.net> wrote:

Folks, I have to send several thousand database entities of different types
to both a Silverlight 5 and WPF app for display in a grid. I can't "page"
the data because it's all got to be loaded to allow a snappy response to
filtering it. I'm fishing for ways of getting the data across with the least
suffering and complexity ... don't forget that Silverlight is involved.

 

Does a WCF service with http binding allow streaming? That would be the
ideal technique if it comes out of the box and isn't too complex.

 

I ran an experiment converting ~6000 entities into XML and the size is a
hefty 6MB (no surprise!); however, Ionic.Zlib deflates it down to a 500KB
buffer which transmits acceptably fast. I'm unhappy with my code to
round-trip the entities to XML, as it's a bit messy and has special-case
logic to skip association properties.
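For what it's worth, that order of compression is easy to reproduce: serialized entity XML is extremely repetitive, so deflate achieves dramatic ratios. A rough sketch (Python's zlib standing in for Ionic.Zlib, illustrative only):

```python
import zlib

# Highly repetitive XML, like thousands of serialized entities.
xml = "<entity><id>42</id><name>Example</name><value>3.14</value></entity>" * 6000
raw = xml.encode("utf-8")

compressed = zlib.compress(raw, 9)   # deflate at maximum compression level
restored = zlib.decompress(compressed).decode("utf-8")

# The compressed buffer is a small fraction of the original size.
print(len(raw), len(compressed))
```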

 

Then I thought of JSON, which I haven't needed to use before. Would the JSON
libraries make round-tripping easier? Are the built-in Framework classes
good enough, or would I need to use something like Newtonsoft? Can I control
which properties are processed? Any general ideas would be welcome.

 

Greg K

 

 

 


 

