On Sun, Sep 18, 2011 at 2:15 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> Cool, I've pushed my changes to ClusterDumper to Lucid's github account
> (lucidimagination) and am planning on pushing all of it to Mahout this week.
>  It is now possible to output CSV, Text (the current option) and GraphML.
>  Easy enough to extend to output JSON or whatever.  I would imagine it would
> be dead simple to hook this into the Google Chart stuff.
>

Cool.  There is a special set of code to facilitate using CSV data.  See the
github project at:

https://github.com/tdunning/visualization-data-source

Note particularly the class CsvDataSourceServlet in the examples sub-dir.

I am pretty sure that class will let you do SQL-like queries against the
underlying CSV data. Thus, you should be able to layer a variant of it on
your CSV output and get the sort of down-sampling, projection and pivoting
that you typically want in a visualization layer.
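
To make that concrete, something along these lines (an untested sketch, not
the actual CsvDataSourceServlet from the examples; the file name and the
three columns are placeholders, and it builds the DataTable by hand) would
expose a cluster dump to the library, which then applies whatever
select/where/group-by arrives on the tq parameter:

    // Untested sketch: file name, column set and class name are placeholders.
    // The library applies the incoming query (tq=...) to whatever DataTable
    // this returns, so the servlet itself can stay dumb.
    import com.google.visualization.datasource.DataSourceServlet;
    import com.google.visualization.datasource.datatable.ColumnDescription;
    import com.google.visualization.datasource.datatable.DataTable;
    import com.google.visualization.datasource.datatable.value.ValueType;
    import com.google.visualization.datasource.query.Query;

    import javax.servlet.http.HttpServletRequest;
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ClusterCsvServlet extends DataSourceServlet {
      @Override
      public DataTable generateDataTable(Query query, HttpServletRequest request) {
        DataTable data = new DataTable();
        data.addColumn(new ColumnDescription("x1", ValueType.NUMBER, "x1"));
        data.addColumn(new ColumnDescription("x2", ValueType.NUMBER, "x2"));
        data.addColumn(new ColumnDescription("cluster", ValueType.NUMBER, "cluster"));
        try {
          BufferedReader in = new BufferedReader(new FileReader("/tmp/cluster-dump.csv"));
          for (String line = in.readLine(); line != null; line = in.readLine()) {
            String[] f = line.split(",");
            data.addRowFromValues(Double.parseDouble(f[0]),
                                  Double.parseDouble(f[1]),
                                  Double.parseDouble(f[2]));
          }
          in.close();
        } catch (Exception e) {
          throw new RuntimeException("could not read cluster CSV", e);
        }
        return data;
      }

      // lets pages served from somewhere else query this servlet locally
      @Override
      protected boolean isRestrictedAccessMode() {
        return false;
      }
    }

A locally served page could then hit it with a URL-encoded query along the
lines of tq=select x1, x2 where cluster = 3.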


> For instance, I was able to output CSV and then load it into Gephi.  Granted,
> Gephi nearly choked on the 350K points, but it was a start.
>
> Along those lines, I would imagine, for viz. purposes, many people collapse
> points that are within some distance of each other to a single point, at
> least when "zoomed out", right?  I'll have to think about how to add that to
> the output format.
>

The way that I would recommend is to export canned data for different zoom
levels.

This could be different tables, or it could simply amount to putting extra
columns on the main data table to indicate on which zoom level the point
should be visible.  You might also add an additional indicator to show how
many raw data points this point represents.

If you have 5 dimensional data, the CSV table might have columns like this:

     x1, x2, x3, x4, x5, svd1, svd2, svd3, cluster, zoom-mask, #points

All your raw data would have original coordinates, SVD or PCA coordinates,
and a cluster number.  These points would be visible at zoom level 0 (at
least) and thus the zoom-mask would have a 1 in the LSB position.  Each raw
point also represents just one raw point, so #points = 1.
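
To make the mask concrete, the renderer would just test one bit per zoom
level, along these lines (the class and method names here are made up):

    // Made-up helper for the zoom-mask column described above: bit i of the
    // mask says whether a point is drawn at zoom level i.  Raw points always
    // have bit 0 set; a centroid visible only at levels 2 and 3 would get
    // mask = 0b1100 = 12.
    public final class ZoomMask {
      private ZoomMask() {}

      public static boolean visibleAt(int zoomMask, int zoomLevel) {
        return (zoomMask & (1 << zoomLevel)) != 0;
      }

      public static int maskForLevels(int lowestLevel, int highestLevel) {
        int mask = 0;
        for (int level = lowestLevel; level <= highestLevel; level++) {
          mask |= 1 << level;
        }
        return mask;
      }
    }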

You could then add points for the centroids.  These would only be visible at
a higher zoom level.  You could also sub-cluster each of the clusters to get
additional groupings.  The zoom levels at which these are visible should
probably be based either on local density or on the number of levels of
decomposition below the centroid.  This would allow the UI to show the
centroids and then allow selective drill-down into each cluster.  Data could
be presented in the original space or in any projection of the SVD or PCA
space.

You can also present a scatter map of the data that shows the cluster
centroids and then grey shadows of no more than a few hundred data points
per cluster.  This can be done by selectively sampling each cluster at
different rates so that the number of members of each cluster is something
like the log of the true number.  Using semi-transparent dots would still
give you the gist of the distribution without vats and vats of points.
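
Roughly like this (the name and scale factor are arbitrary; with a scale of
about 30, a million-point cluster keeps roughly 400 points and a
thousand-point cluster about 200):

    // Rough per-cluster down-sampler: a cluster of n members contributes
    // about scale * log(n) points to the plot.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public final class LogSampler {
      private static final Random RAND = new Random();

      public static <T> List<T> sample(List<T> clusterMembers, double scale) {
        int n = clusterMembers.size();
        int keep = (int) Math.min(n, Math.max(1, Math.round(scale * Math.log(n + 1))));
        List<T> shuffled = new ArrayList<T>(clusterMembers);
        Collections.shuffle(shuffled, RAND);
        return shuffled.subList(0, keep);
      }
    }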

The treemap is also reputed to be useful for clustered data of this sort.
It can handle quite a bit of data, but I might try the log(n)/n sampling
trick there as well so that the huge dynamic range is handled gracefully.  As
you drill into the different clusters, you can pick the zoom or sampling
level according to how many points would be visible without sampling.
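
That choice could be as simple as something like this (the fixed on-screen
budget and the names are just an illustration):

    // Made-up rule of thumb for picking the sampling level during drill-down:
    // never put more than maxVisible points on screen for the current
    // selection.
    public final class DrillDownSampling {
      public static double samplingFraction(long pointsInSelection, int maxVisible) {
        if (pointsInSelection <= maxVisible) {
          return 1.0;        // few enough points, show them all
        }
        return (double) maxVisible / pointsInSelection;
      }
    }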


>
>
> On Sep 18, 2011, at 2:17 PM, Ted Dunning wrote:
>
> > You have to make one hack to make sure that the JS downloads from your
> > local server, but that is easy.
> >
> > On Sun, Sep 18, 2011 at 12:17 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Yes.  The old stuff from Google used to require their servers and was
> >> very limited in the size of data.
> >>
> >> This newer stuff is not.
> >>
> >>
> >> On Sun, Sep 18, 2011 at 4:46 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>
> >>>
> >>> On Sep 17, 2011, at 9:22 PM, Ted Dunning wrote:
> >>>
> >>>> I strongly recommend Google's visualization API.
> >>>
> >>> Cool.  Here I thought it required using Goog's servers, but I guess not.
> >>> So you can run the server and hit it locally?
> >>>
> >>>>
> >>>> This is divided into two parts, the reporting half and the data source half.
> >>>> The reporting half is pretty good and very easy to use from JavaScript.  It
> >>>> is the library that underlies pretty much all of Google's internal and
> >>>> external web visualizations.
> >>>>
> >>>> The data source half might actually be of more use for Mahout.  It provides
> >>>> a simplified query language, query parsers, standard provisions for having
> >>>> data sources that handle only a subset of the possible query language, and
> >>>> shims that help provide the remaining bits of query semantics.
> >>>>
> >>>> The great virtue of this layer is that it provides a very clean abstraction
> >>>> layer that separates data and presentation.  That separation lets you be very
> >>>> exploratory at the visualization layer while reconstructing the data layer
> >>>> as desired for performance.
> >>>>
> >>>> Together these layers make it quite plausible to handle millions of data
> >>>> points by the very common strategy of handling lots of data at the data
> >>>> layer, but only transporting modest amounts of summary data to the
> >>>> presentation layer.
> >>>>
> >>>> The data layer is also general enough that you could almost certainly use it
> >>>> with alternative visualization layers.  For instance, you can specify that
> >>>> data be returned in CSV format, which would make R usable for visualization.
> >>>> Or JSON makes Google's visualization code easy to use.  JSON would also make
> >>>> Processing or Processing.js quite usable.
> >>>>
> >>>> I have ported the Java version of the data source stuff to use Maven in a
> >>>> standardized build directory and have added a version of the MySQL support
> >>>> code to allow integration with standard web service frameworks.  That can be
> >>>> found on github here:
> >>>>
> >>>> https://github.com/tdunning/visualization-data-source
> >>>
> >>>
> >>>
> >>>>
> >>>> The original Google site on the subject is here:
> >>>>
> >>>> http://code.google.com/apis/chart/
> >>>>
> >>>> http://code.google.com/apis/chart/interactive/docs/dev/dsl_about.html
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Sep 17, 2011 at 1:23 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>>>
> >>>>> I'll be checking in an abstraction; people can implement writers as they
> >>>>> see fit.
> >>>>>
> >>>>> FWIW, I'm mostly looking for something that can be used in a visualization
> >>>>> toolkit, such as Gephi (although I'll be impressed if any of them can handle
> >>>>> 7M points)
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Sep 16, 2011, at 7:14 PM, Ted Dunning wrote:
> >>>>>
> >>>>>> Indeed.
> >>>>>>
> >>>>>> I strongly prefer the other two for expressivity.
> >>>>>>
> >>>>>> On Fri, Sep 16, 2011 at 4:37 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> >>>>>>
> >>>>>>> On Fri, Sep 16, 2011 at 3:30 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I think that Avro and protobufs are the current best options for large
> >>>>>>>> data assets like this.
> >>>>>>>>
> >>>>>>>
> >>>>>>> (or serialized Thrift)
> >>>>>>>
>
>
