Re: Lots of data over a service

2013-08-06 Thread Greg Harris
Hi Greg,

What I did with my Motion Chart software (
http://www.eshiftlog.com/Silverlight/MotionGraphTestPage.html) to get
better download performance was:
• Move away from small WCF data transfers to transferring a single large
encoded compressed text file
• Only transfer raw data (no JSON/XML structure, which adds a LOT OF FAT)
• Minor use of CSV format, otherwise fixed format
• Define my own number formats to reduce size (remove unneeded decimal
places)
• Use zip file to transfer data
This has improved data load time by a factor of ~50-100 (sorry, no hard
numbers).
My data ended up being 430KB for ~32K rows, just over 13 bytes/row.

Example data:
C,007,Australia,Oceania,1820,2007
3413340017010
3413310017070
3413290017280
3413290017530
3413320017950
3413330018330

As traditional CSV text, this would look like:
CountryID,Year,LifeExpect,Population,GDP,CountryName,RegionCode,RegionName
007,1820,34.1,334000,000701.0,Australia,4S,Oceania
007,1821,34.1,331000,000707.0,Australia,4S,Oceania
007,1822,34.1,329000,000728.0,Australia,4S,Oceania
007,1823,34.1,329000,000753.0,Australia,4S,Oceania
007,1824,34.1,332000,000795.0,Australia,4S,Oceania
007,1825,34.1,333000,000833.0,Australia,4S,Oceania

There are two row types in the file:
Lines beginning with C are CSV country header lines, like:
  C,007,Australia,Oceania,1820,2007
The values being:
  - C: Header
  - 007: Country number
  - Australia: Country name
  - Oceania: Country region
  - 1820: First year there is data
  - 2007: Last year there is data

Lines starting with 0-9 are data for one individual year for the above
country
  - The year is assumed to increment for every detail line
  - These detail lines are always 13 digits wide, fixed width fields, no
field separator, like:
   341 334001 7010 (spaces added for clarity, not in the actual file)
  - Life expectancy (x10), example: 341 = 34.1 years
  - Population (last digit is exponent multiplier) 334001 = 334,000; 334002
= 3,340,000.
The last digit is effectively the number of zeros to add at the right
hand side.
  - GDP (per person, last digit is exponent multiplier) 7010 = $701; 7011
= $7,010.
 Again, the last digit is effectively the number of zeros to add at the
right hand side.
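
As a minimal C# sketch (the class, method and variable names are mine,
with the field widths taken from the worked example above), decoding one
detail line looks like:

using System;

static class MotionChartFormat
{
    // Decode one 13-digit detail line, e.g. "3413340017010".
    public static void DecodeDetailLine(string line,
        out double lifeExpectancy, out long population, out long gdpPerCapita)
    {
        // Life expectancy is stored times ten: "341" -> 34.1 years.
        lifeExpectancy = int.Parse(line.Substring(0, 3)) / 10.0;

        // Population: five-digit mantissa plus one exponent digit, the
        // exponent being the number of zeros to append: "334001" -> 334,000.
        population = long.Parse(line.Substring(3, 5))
                     * (long)Math.Pow(10, line[8] - '0');

        // GDP per person: three-digit mantissa plus one exponent digit:
        // "7010" -> 701, "7011" -> 7,010.
        gdpPerCapita = long.Parse(line.Substring(9, 3))
                       * (long)Math.Pow(10, line[12] - '0');
    }
}

Decoding "3413340017010" this way gives 34.1 years, 334,000 people and
$701, matching the first CSV row above.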

You need to be careful with this technique: how much data can you afford to
“lose” to rounding?

You were looking for “getting the data across with the least suffering and
complexity”; my complexity was the continual refining towards simpler and
simpler data structures, which ended up looking like something from a
1960’s COBOL program, when storage was expensive and processing was slow.

In hindsight, I feel that I still sent more data down the wire than I
needed to: I could have taken one digit off the age range, two digits off
the population and one digit off the GDP, saving another 4 bytes per row. I
could also have used base-64 numbers, which would have saved another ~4
bytes per row.  But the performance was fine with this structure, so I did
no more to cut it back.

WARNING: This worked fine with my specific, smallish, well-known data set;
if I were putting this out into customer land, I would allow for a wider
range of values.  For example, if we needed to express the values in
Indonesian Rupiah rather than US Dollars, the amounts would go up by a
factor of about 10,000 and my values would no longer fit.  My values only
work for large positive numbers; there is no room for a negative sign in
front of the number or the exponent.

So you need to design a file format that will work for your specific
situation and data, and keep an eye on it to make sure it stays working.

After having done all of this, I am tempted to see what the performance
would be like with just simple raw CSV; if I were going to re-code this
today, that is what I would start with.

Regards
Greg #2 Harris


On Tue, Aug 6, 2013 at 6:00 PM, Greg Keogh g...@mira.net wrote:

 Folks, I have to send several thousand database entities of different
 types to both a Silverlight 5 and WPF app for display in a grid. I can't
 page the data because it's all got to be loaded to allow a snappy
 response to filtering it. I'm fishing for ways of getting the data across
 with the least suffering and complexity ... don't forget that Silverlight
 is involved.

 Does a WCF service with http binding allow streaming? That would be the
 ideal technique if it comes out of the box and isn't too complex.

 I ran an experiment to convert ~6000 entities into XML and the size is a
 hefty 6MB (no surprise!), however Ionic.Zlib deflates it down to a 500KB
 buffer which transmits acceptably fast. I'm unhappy with my code to round
 trip the entities-to-XML as it's a bit messy and has special case logic to
 skip association properties.

 Then I thought of Json, which I haven't needed to use before. Would the
 Json libraries make round-tripping easier? Are the built-in Framework
 classes good enough, or would I need to use something like Newtonsoft? Can
 I control which properties are processed? Any general ideas would be
 welcome.

 Greg K

RE: Lots of data over a service

2013-08-06 Thread Rob Andrew
Hi Greg,

 

We've used a technique called chunking to move large quantities of binary
data around. We move ~50 MB files (LAN only) this way and it works pretty
well.


 

http://msdn.microsoft.com/en-us/library/aa717050.aspx
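
For a rough idea of the shape of this (a simplified sketch of the general
chunking idea, not the chunking channel from that article; the contract and
names are illustrative):

using System;
using System.ServiceModel;

[ServiceContract]
public interface IChunkedUpload
{
    [OperationContract]
    void Begin(string transferId, long totalSize);  // server pre-sizes a buffer

    [OperationContract]
    void PutChunk(string transferId, int index, byte[] data);
}

public static class ChunkSender
{
    const int ChunkSize = 64 * 1024;

    // Split the payload into fixed-size chunks and send them in order.
    public static void Send(IChunkedUpload proxy, string transferId, byte[] payload)
    {
        proxy.Begin(transferId, payload.Length);
        for (int offset = 0, index = 0; offset < payload.Length;
             offset += ChunkSize, index++)
        {
            int count = Math.Min(ChunkSize, payload.Length - offset);
            var chunk = new byte[count];
            Buffer.BlockCopy(payload, offset, chunk, 0, count);
            proxy.PutChunk(transferId, index, chunk);
        }
    }
}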

 

Rob

 




Re: Lots of data over a service

2013-08-06 Thread Greg Keogh
Howdy Greg #2 (or 3?)

Haven't seen you since the Silverlight weekend in Docklands a few years ago.

Very interesting! You have implemented your own data compression, and we
used to do very similar things back in the late 70s and 80s when mainframe
disk space was precious. Compression algorithms and software were not
available or widely known then. In fact, Wikipedia says the LZ algorithms
were only published in 1977/78 (not long ago in coding years).

However, I have this uneasy feeling that all of your manual work is made
mostly redundant by what zipping does for transmission. Zip will
aggressively remove redundancy from your data, so well in fact that I
suspect it might reduce the benefits of your pre-processing to a hair's
width, although your pre-processing will still save space for the raw data
if that's a problem on the client side.
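
If you wanted to test that hunch, a quick desktop sketch along these lines
would do it (assuming System.IO.Compression rather than Ionic.Zlib; the
sample rows are yours, repeated so the compressor has something to chew on):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class DeflateComparison
{
    static int DeflatedSize(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(raw, 0, raw.Length);
            // ToArray still works after the deflate stream closes.
            return output.ToArray().Length;
        }
    }

    static void Main()
    {
        var verbose = new StringBuilder();
        var compact = new StringBuilder();
        for (int i = 0; i < 1000; i++)
        {
            verbose.AppendLine("007,1820,34.1,334000,000701.0,Australia,4S,Oceania");
            compact.AppendLine("3413340017010");
        }
        Console.WriteLine("verbose: {0} chars -> {1} bytes deflated",
            verbose.Length, DeflatedSize(verbose.ToString()));
        Console.WriteLine("compact: {0} chars -> {1} bytes deflated",
            compact.Length, DeflatedSize(compact.ToString()));
    }
}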

I was quite amazed that ~6000 of my entities as XML took 6.06MB, but
deflated down to 492KB, which is 8% of the original size and quite suitable
for transmission as a single blob. I reckon your data as plain CSV would
also reduce incredibly well.

Given that I also deflate for transmission as a blob, I think my problem is
now reduced to a pure coding problem: What format is easiest to round-trip
my entities?

Importantly, I'm looking for a general purpose way of transforming (most
of) the entity class properties. XML needs manual coding, Json I'm not
sure about. I can't use pure binary serialization because it's not
supported in Silverlight clients.
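
For the record, one candidate I'm looking at is the built-in
DataContractJsonSerializer, which I believe is supported on Silverlight;
with opt-in [DataContract]/[DataMember] marking it skips unmarked
association properties. A minimal sketch (the entity shape is illustrative,
not my real classes):

using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Json;

[DataContract]
public class CustomerEntity
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string Name { get; set; }

    // No [DataMember]: this association property is skipped entirely.
    public CustomerEntity Parent { get; set; }
}

public static class JsonRoundTrip
{
    public static byte[] Serialize(CustomerEntity entity)
    {
        var serializer = new DataContractJsonSerializer(typeof(CustomerEntity));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, entity);
            return stream.ToArray();
        }
    }

    public static CustomerEntity Deserialize(byte[] json)
    {
        var serializer = new DataContractJsonSerializer(typeof(CustomerEntity));
        using (var stream = new MemoryStream(json))
            return (CustomerEntity)serializer.ReadObject(stream);
    }
}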

Greg K


RE: Lots of data over a service

2013-08-06 Thread Paul Evrat
Greg,

 

I saw the TED talk that you note was the inspiration for this. I thought at
the time it was a brilliant way to present and understand data. It and the
presenter had the audience totally amused, but it really made the data
talk.

 

Is this something you will use yourself or for a client, or propose to make
available one way or another?

 

Regards,

 

 


Re: Lots of data over a service

2013-08-06 Thread Greg Harris
Hi Greg #N+1,



That Silverlight weekend in Docklands was a great event, thank you to the
guys that organised it!



I would not say that I implemented my own data compression; it was more
that I avoided any extra fat in the data.  I agree that I would not have
been getting much extra mileage out of pushing it as far as I did. To be
truthful, I had such bad performance before doing this that I just threw
everything realistic at reducing the size of the data.



If I did it again, I would start with a CSV file with two record types,
shown by the first character on the line.  This would avoid the cost of the
XML/JSON format and the extra complexity of Greg-format numbers.  So my
detail line would change from 3413340017010 to D,34.1,334000,701, just
4 more characters in this example.
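
A minimal sketch of writing that format (the class, method and parameter
names are mine):

using System;
using System.Globalization;
using System.IO;

static class TwoRecordCsv
{
    public static void WriteCountry(TextWriter w, string id, string name,
        string region, int firstYear,
        double[] lifeExpect, long[] population, long[] gdp)
    {
        // "C" header line: country metadata plus the year range.
        w.WriteLine("C,{0},{1},{2},{3},{4}",
            id, name, region, firstYear, firstYear + lifeExpect.Length - 1);

        // One "D" detail line per year; the year stays implicit,
        // incrementing from the header's first year.
        for (int i = 0; i < lifeExpect.Length; i++)
            w.WriteLine("D,{0},{1},{2}",
                lifeExpect[i].ToString("0.0", CultureInfo.InvariantCulture),
                population[i], gdp[i]);
    }
}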



The 430KB for ~32K rows is for the uncompressed data; it compresses to
168KB. I am not going to do an experiment to show how much more space it
would take up without Greg-format numbers and with the extra commas, but my
gut tells me that you are right that zip would make the differences very
small.



When I did the original work on the Optus soft copy bills, we were getting
~90% zip data compression, but there was a lot of white space in that data,
which would have compressed down to almost nothing.



Regards

Greg Harris




Re: Lots of data over a service

2013-08-06 Thread Greg Harris
Hi Paul,


 Is this something you will use yourself or for a client, or propose to
make available one way or another?

This is work that I did myself as a side project some years ago to cement
my Silverlight and C# knowledge.  I tried to find some commercial interest
in it, but it just was not there in 2009/2010 when I was looking.  I am
very open to suggestions.



Some of my original notes on the project are:



The web site where I first saw this style of graph was one of the TED talks (
http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html)
by Hans Rosling (http://en.wikipedia.org/wiki/Hans_Rosling) of the
Karolinska Institute, where he presented the work done at Gap Minder (
http://gapminder.org). I was very impressed and assumed that the graphics
system behind the graph was some extensive university project that would be
hard to reproduce.



When I saw the graph again some months later during a presentation by
Tristan Kurniawan (then at SSW) on good user interface design, it occurred
to me that this could be done as a Silverlight project.  At the time Adam
Cogan said "yeah, sure Greg, you do that this weekend…" While it was clear
that it would be a lot more than a weekend job, I started on it as my
'background project', which took up about 18 months of background work to
complete (say the equivalent of three to four months of full-time work).



While this work is strongly influenced by the GapMinder project, all the
code in this version is my own; I draw every pixel on the screen!



The data sources I used are from GapMinder.org; specifically see:

Life expectancy at birth:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj2tPLxKvvnNPA

GDP per capita:
http://spreadsheets.google.com/pub?key=phAwcNAVuyj1jiMAkmq1iMg

Population: http://spreadsheets.google.com/pub?key=phAwcNAVuyj0XOoBL_n5tAQ



The data needed extensive massaging to get it into a more usable format
and to interpolate missing data between known values.  See the data tabs on
the left-hand side of the graph for the raw data I ended up with.



Where data is missing for some years for a country, it is estimated by
drawing a straight line between two known data points; this is then used to
derive data for the missing years in between.
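
As a sketch, that straight-line estimate is just linear interpolation
between the two nearest known points (the class and method names are mine):

static class Interpolation
{
    // Value for a missing year, read off the straight line between
    // two known data points.
    public static double InterpolateYear(int year, int yearA, double valueA,
                                         int yearB, double valueB)
    {
        double t = (double)(year - yearA) / (yearB - yearA); // 0 at yearA, 1 at yearB
        return valueA + t * (valueB - valueA);
    }
}

For example, with made-up end points of 334,000 in 1820 and 344,000 in
1830, the 1825 estimate comes out as 339,000.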



The data displayed is not complete and may have errors and omissions;
where there was a problem with part of the data set, that part was left out
rather than representing incorrect data.  There was a problem merging
separate data sets where countries appeared under different names, so a
direct merge was not always possible; where no clear merge presented
itself, the data was excluded.

Other errors may have been introduced into the data while preparing it for
representation in this format (I welcome someone doing a thorough data
validation).



Once I had all of the data, I worked on getting the graph drawn; the graph
is made up of many lines, circles and rectangles drawn on a Silverlight
canvas.  With the sheer volume of data and updates needed, it was a bit of
a trial-and-error process to find approaches that worked effectively at an
acceptable performance.
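
As an illustrative fragment (not my actual rendering code; sizes and
scaling are made up), one data point on a Silverlight canvas is just:

using System.Windows.Controls;
using System.Windows.Media;
using System.Windows.Shapes;

static class PointPlotter
{
    // Draw one data point as a filled circle at pixel position (x, y).
    public static void Plot(Canvas canvas, double x, double y,
                            double radius, Color fill)
    {
        var dot = new Ellipse
        {
            Width = radius * 2,
            Height = radius * 2,
            Fill = new SolidColorBrush(fill)
        };
        // Canvas positions children via attached properties; offset so
        // the circle is centred on the point.
        Canvas.SetLeft(dot, x - radius);
        Canvas.SetTop(dot, y - radius);
        canvas.Children.Add(dot);
    }
}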


Regards

Greg Harris


Re: Lots of data over a service

2013-08-06 Thread Greg Keogh
Hi Rob, I actually implemented a similar chunking technique from scratch
last year between WCF over http and an SL4 app. I simulated a simple
torrent idea where numbered chunks are sent to the server, which assembles
them in a receiving array of the correct size. The only tricky code was on
the SL sending side, where a worker thread loops and waits for the WCF call
on the UI thread to complete, creating a kind of gated send loop. It's
often used to upload ~40MB files to the server side.
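
A simplified sketch of that gated send loop (the proxy class and its async
members are illustrative stand-ins for a generated Silverlight WCF proxy,
not my actual code):

using System;
using System.Threading;

// Stand-in for a generated WCF proxy; names are illustrative.
public class UploadServiceClient
{
    public event EventHandler PutChunkCompleted;

    public void PutChunkAsync(int index, byte[] data)
    {
        // A real proxy would issue the async WCF call here and raise
        // PutChunkCompleted when the response comes back.
        ThreadPool.QueueUserWorkItem(delegate
        {
            var handler = PutChunkCompleted;
            if (handler != null) handler(this, EventArgs.Empty);
        });
    }
}

public class GatedSender
{
    // A worker thread sends one chunk at a time, blocking on an event
    // until the async completion fires: the gated send loop.
    public void SendAll(byte[][] chunks)
    {
        var gate = new AutoResetEvent(false);
        var proxy = new UploadServiceClient();
        proxy.PutChunkCompleted += delegate { gate.Set(); };

        ThreadPool.QueueUserWorkItem(delegate
        {
            for (int i = 0; i < chunks.Length; i++)
            {
                proxy.PutChunkAsync(i, chunks[i]); // begin the call...
                gate.WaitOne();                    // ...wait for completion
            }
        });
    }
}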

I didn't want to do that sort of thing this time, as I was just inside the
limits of sending it all as one deflated blob in a reasonable time. Much
bigger and I would have to do some chunking.

Greg K


RE: Lots of data over a service

2013-08-06 Thread Paul Evrat
 

In this age of 'big data' you'd think there would be a big
commercialisation opportunity for visualising both small and large data
sets in that way. Standardise the input data formats so people can prepare
their own data and interpolate missing points, and it would have to be huge
for management and presentation software, particularly if it is not already
available commercially.

 

 
