Cool, I've pushed my changes to ClusterDumper to Lucid's GitHub account (lucidimagination) and am planning on pushing all of it to Mahout this week. It can now output CSV, Text (the current option), and GraphML, and it's easy enough to extend to output JSON or whatever. I would imagine it would be dead simple to hook this into the Google Chart stuff.
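To give a flavor of it, the abstraction looks roughly like this (a sketch with hypothetical names, not necessarily what I'll end up committing):

import java.io.IOException;
import java.io.Writer;

// Hypothetical interface; each output format supplies its own implementation.
public interface ClusterDumperWriter {
  void write(int clusterId, double[] centroid, long numPoints) throws IOException;
}

// A CSV implementation; the Text and GraphML writers would follow the same pattern.
class CsvClusterWriter implements ClusterDumperWriter {
  private final Writer out;

  CsvClusterWriter(Writer out) {
    this.out = out;
  }

  @Override
  public void write(int clusterId, double[] centroid, long numPoints) throws IOException {
    StringBuilder line = new StringBuilder().append(clusterId).append(',').append(numPoints);
    for (double d : centroid) {
      line.append(',').append(d);
    }
    out.write(line.append('\n').toString());
  }
}

The dumper would then just pick a writer based on an output-format option.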
For instance, I was able to output CSV and then load it into Gephi. Granted, Gephi nearly choked on the 350K points, but it was a start. Along those lines, I would imagine that for viz. purposes many people collapse points that are within some distance of each other into a single point, at least when "zoomed out", right? I'll have to think about how to add that to the output format.
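One cheap approach would be grid-based binning: snap each point to a cell and emit one centroid per occupied cell, with the cell size tied to the zoom level. Purely a sketch, nothing like this is in the patch yet:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Collapse 2-D points into one representative point per grid cell.
// Larger cellSize = more aggressive collapsing (a coarser "zoom level").
public class PointCollapser {

  public static Collection<double[]> collapse(Iterable<double[]> points, double cellSize) {
    Map<Long, double[]> cells = new HashMap<Long, double[]>();
    for (double[] p : points) {
      long cx = (long) Math.floor(p[0] / cellSize);
      long cy = (long) Math.floor(p[1] / cellSize);
      long key = (cx << 32) ^ (cy & 0xffffffffL); // pack the two cell coordinates into one key
      double[] acc = cells.get(key);
      if (acc == null) {
        acc = new double[3]; // {sumX, sumY, count}
        cells.put(key, acc);
      }
      acc[0] += p[0];
      acc[1] += p[1];
      acc[2]++;
    }
    // Turn the sums into per-cell centroids; acc[2] stays as the point count.
    for (double[] acc : cells.values()) {
      acc[0] /= acc[2];
      acc[1] /= acc[2];
    }
    return cells.values();
  }
}

The count carried in each returned entry could drive the size of the collapsed point when rendering.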
On Sep 18, 2011, at 2:17 PM, Ted Dunning wrote:

> You have to make one hack to make sure that the JS downloads from your local server, but that is easy.
>
> On Sun, Sep 18, 2011 at 12:17 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Yes. The old stuff from Google used to require their servers and was very limited on size of data.
>>
>> This newer stuff is not.
>>
>> On Sun, Sep 18, 2011 at 4:46 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>>> On Sep 17, 2011, at 9:22 PM, Ted Dunning wrote:
>>>
>>>> I strongly recommend Google's visualization API.
>>>
>>> Cool. Here I thought it required using Goog's servers, but I guess not. So you can run the server and hit it locally?
>>>
>>>> This is divided into two parts, the reporting half and the data source half. The reporting half is pretty good and very easy to use from JavaScript. It is the library that underlies pretty much all of Google's internal and external web visualizations.
>>>>
>>>> The data source half might actually be of more use for Mahout. It provides a simplified query language, query parsers, standard provisions for data sources that handle only a subset of the possible query language, and shims that help provide the remaining bits of query semantics.
>>>>
>>>> The great virtue of this layer is that it provides a very clean abstraction layer that separates data and presentation. That separation lets you be very exploratory at the visualization layer while reconstructing the data layer as desired for performance.
>>>>
>>>> Together these layers make it quite plausible to handle millions of data points by the very common strategy of handling lots of data at the data layer, but only transporting modest amounts of summary data to the presentation layer.
>>>>
>>>> The data layer is also general enough that you could almost certainly use it with alternative visualization layers. For instance, you can specify that data be returned in CSV format, which would make R usable for visualization. Or JSON makes Google's visualization code easy to use. JSON would also make Processing or Processing.js quite usable.
>>>>
>>>> I have ported the Java version of the data source stuff to use Maven in a standardized build directory and have added a version of the MySQL support code to allow integration with standard web service frameworks. That can be found on GitHub here:
>>>>
>>>> https://github.com/tdunning/visualization-data-source
>>>>
>>>> The original Google site on the subject is here:
>>>>
>>>> http://code.google.com/apis/chart/
>>>>
>>>> http://code.google.com/apis/chart/interactive/docs/dev/dsl_about.html
>>>>
>>>> On Sat, Sep 17, 2011 at 1:23 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>
>>>>> I'll be checking in an abstraction; people can implement writers as they see fit.
>>>>>
>>>>> FWIW, I'm mostly looking for something that can be used in a visualization toolkit, such as Gephi (although I'll be impressed if any of them can handle 7M points).
>>>>>
>>>>> -Grant
>>>>>
>>>>> On Sep 16, 2011, at 7:14 PM, Ted Dunning wrote:
>>>>>
>>>>>> Indeed.
>>>>>>
>>>>>> I strongly prefer the other two for expressivity.
>>>>>>
>>>>>> On Fri, Sep 16, 2011 at 4:37 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Sep 16, 2011 at 3:30 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I think that Avro and protobufs are the current best options for large data assets like this.
>>>>>>>
>>>>>>> (or serialized Thrift)
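P.S. To make Ted's data source point concrete: my understanding is that serving Mahout cluster output through his library would look something like the following. This is based on a quick read of the google-visualization-java API, so treat the class and method names as unverified:

import javax.servlet.http.HttpServletRequest;

import com.google.visualization.datasource.DataSourceServlet;
import com.google.visualization.datasource.base.TypeMismatchException;
import com.google.visualization.datasource.datatable.ColumnDescription;
import com.google.visualization.datasource.datatable.DataTable;
import com.google.visualization.datasource.datatable.value.ValueType;
import com.google.visualization.datasource.query.Query;

// Expose cluster centroids as a visualization data source. The library layers
// the query semantics (the "shims" Ted mentions) on top of the DataTable we build.
public class ClusterDataSourceServlet extends DataSourceServlet {

  @Override
  public DataTable generateDataTable(Query query, HttpServletRequest request) {
    DataTable table = new DataTable();
    table.addColumn(new ColumnDescription("id", ValueType.TEXT, "Cluster"));
    table.addColumn(new ColumnDescription("x", ValueType.NUMBER, "X"));
    table.addColumn(new ColumnDescription("y", ValueType.NUMBER, "Y"));
    table.addColumn(new ColumnDescription("n", ValueType.NUMBER, "Points"));
    try {
      // Placeholder rows; the real ones would come from the ClusterDumper output.
      table.addRowFromValues("C-0", 1.5, 2.5, 120000.0);
      table.addRowFromValues("C-1", -0.3, 4.1, 230000.0);
    } catch (TypeMismatchException e) {
      throw new RuntimeException(e);
    }
    return table;
  }
}

A client could then, I believe, hit this servlet with tqx=out:csv to get CSV for R, or take the default JSON response for the JavaScript charts.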