[ 
https://issues.apache.org/jira/browse/MAHOUT-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14014298#comment-14014298
 ] 

Andrew Musselman edited comment on MAHOUT-1505 at 5/30/14 10:28 PM:
--------------------------------------------------------------------

I'm attaching a patch for this and will open a pull request.

Here's what the output of ClusterDumper looks like, up to whitespace. Could 
someone please try it out and let me know what comments you have?  The change 
affects every use of AbstractCluster.formatVector(); I checked, and those calls 
only occur in printlns and logging.

{code:javascript}
{ 
  "top_terms": [
    {"all":3.0149030685424805},
    {"english":3.0149030685424805},
    {"best":3.0149030685424805},
    {"spaniel":3.0149030685424805},
    {"springer":3.0149030685424805},
    {"dogs":1.9162907600402832}
  ],
  "cluster_id": 7,
  "cluster": {
    "r": [],
    "c": [
      {"all":3.015},
      {"best":3.015},
      {"dogs":1.916},
      {"english":3.015},
      {"spaniel":3.015},
      {"springer":3.015}
    ],
    "n": 1,
    "identifier": "C-7"
  },
  "points": [
    { 
      "point": [ 
        {"all":3.015},
        {"best":3.015},
        {"dogs":1.916},
        {"english":3.015},
        {"spaniel":3.015},
        {"springer":3.015}
      ],
      "vector_name": "P(14)",
      "weight": "1.0"
    }
  ]
}
{code}
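
For anyone trying the new structure from the consumer side, here is a minimal Python sketch (not part of the patch) that loads one cluster object and merges the arrays of single-entry objects into plain dictionaries. The sample is abbreviated from the output above, and the helper names `flatten` and `summarize` are hypothetical. It assumes each cluster is emitted as one standalone JSON object.

```python
import json

# Abbreviated from the ClusterDumper sample above (one cluster, one point).
SAMPLE = """
{
  "top_terms": [{"all": 3.0149030685424805}, {"dogs": 1.9162907600402832}],
  "cluster_id": 7,
  "cluster": {
    "r": [],
    "c": [{"all": 3.015}, {"best": 3.015}, {"dogs": 1.916}],
    "n": 1,
    "identifier": "C-7"
  },
  "points": [
    {
      "point": [{"all": 3.015}, {"dogs": 1.916}],
      "vector_name": "P(14)",
      "weight": "1.0"
    }
  ]
}
"""


def flatten(pairs):
    """Merge a list of single-entry objects into one dict (hypothetical helper)."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


def summarize(cluster_json):
    """Return the center, radius and points of one cluster as plain dicts."""
    doc = json.loads(cluster_json)
    cluster = doc["cluster"]
    return {
        "id": cluster["identifier"],
        "n": cluster["n"],
        "center": flatten(cluster["c"]),
        "radius": flatten(cluster["r"]),
        "points": {p["vector_name"]: flatten(p["point"]) for p in doc["points"]},
    }


summary = summarize(SAMPLE)
print(summary["id"], summary["n"], summary["center"]["dogs"])
```

No string parsing is needed for the term weights any more. Note that "weight" is still emitted as the string "1.0" in the sample, so a consumer would have to convert it with `float()`.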


> structure of clusterdump's JSON output
> --------------------------------------
>
>                 Key: MAHOUT-1505
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1505
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Terry Blankers
>            Assignee: Andrew Musselman
>              Labels: json
>             Fix For: 1.0
>
>
> Hi all, I'm working on some automated analysis of the clusterdump output 
> using '-of = JSON'. While digging into how the data is represented, I've 
> noticed something that seems a little odd to me.
> For a particular cluster, the 'cluster', 'n', 'c' and 'r' values are all 
> packed into one continuous string. For example:
> {noformat}
> {"cluster":"VL-10515{n=5924 c=[action:0.023, adherence:0.223, 
> administration:0.011 r=[action:0.446, adherence:1.501, 
> administration:0.306]}"}
> {noformat}
> This is also the case for the "point":
> {noformat}
> {"point":"013FFD34580BA31AECE5D75DE65478B3D691D138 = [body:6.904, 
> harm:10.101]","vector_name":"013FFD34580BA31AECE5D75DE65478B3D691D138","weight":"1.0"}
> {noformat}
> This leads me to believe that the only way I can get to the individual data 
> in these items is by string parsing. For JSON deserialization I would have 
> expected to see something along the lines of:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     [
>         {"action":0.023},
>         {"adherence":0.223},
>         {"administration":0.011}
>     ],
>     "r":
>     [
>         {"action":0.446},
>         {"adherence":1.501},
>         {"administration":0.306}
>     ]
> }
> {noformat}
> and:
> {noformat}
> {
>     "point": {
>         "body": 6.904,
>         "harm": 10.101
>     },
>     "vector_name": "013FFD34580BA31AECE5D75DE65478B3D691D138",
>     "weight": 1.0
> } 
> {noformat}
> Andrew Musselman replied:
> {quote}
> Looks like a bug to me as well; I would have expected something similar to
> what you were expecting except maybe something like this which puts the "c"
> and "r" values in objects rather than arrays of single-element objects:
> {noformat}
> {
>     "cluster":"VL-10515",
>     "n":5924,
>     "c":
>     {
>         "action":0.023,
>         "adherence":0.223,
>         "administration":0.011
>     },
>     "r":
>     {
>        "action":0.446,
>        "adherence":1.501,
>        "administration":0.306
>     }
> }
> {noformat}
> {quote}
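
For reference, the string parsing that the 0.9 output forces on a consumer might look like the Python sketch below, which turns the legacy "cluster" string into the object-valued shape suggested in the reply. This is an illustration only: the regular expression and the names `LEGACY`, `parse_terms` and `parse_legacy_cluster` are hypothetical, and the layout is inferred from the single sample in the description. The sample there shows no closing bracket after the "c" values, so the pattern treats that bracket as optional.

```python
import json
import re

# Pattern inferred from the one sample in the issue description (an assumption).
# The "]" that closes the c vector is optional because the sample omits it.
LEGACY = re.compile(
    r"^(?P<id>[^{]+)\{n=(?P<n>\d+) c=\[(?P<c>.*?)\]? r=\[(?P<r>.*?)\]\}$"
)


def parse_terms(text):
    """Turn 'action:0.023, adherence:0.223' into {'action': 0.023, ...}."""
    terms = {}
    for item in filter(None, (part.strip() for part in text.split(","))):
        # Split on the last colon so a term that contains ':' still parses.
        name, weight = item.rsplit(":", 1)
        terms[name] = float(weight)
    return terms


def parse_legacy_cluster(value):
    """Parse the legacy one-string cluster value into a structured dict."""
    match = LEGACY.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized cluster string: {value!r}")
    return {
        "cluster": match["id"],
        "n": int(match["n"]),
        "c": parse_terms(match["c"]),
        "r": parse_terms(match["r"]),
    }


old = (
    "VL-10515{n=5924 c=[action:0.023, adherence:0.223, administration:0.011] "
    "r=[action:0.446, adherence:1.501, administration:0.306]}"
)
parsed = parse_legacy_cluster(old)
print(json.dumps(parsed, indent=2))
```

A parser like this is fragile: a term containing a comma, or any change to the log-style formatting, would break it, which is the argument for emitting real JSON structure instead.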



--
This message was sent by Atlassian JIRA
(v6.2#6252)