Hello,


I am new to this list. I have been trying to solve this problem for the last 
48 hours, but I am stuck. I hope someone here can point me in the right direction.



I am having problems with the Pig JsonLoader and wonder whether I am doing 
something wrong or have run into a different problem.



The first half of this post is meant to show that I know at least something 
about what I am talking about and that I did my homework. During my research I 
found a lot about elephant-bird, but there seems to be a conflict with Cloudera, 
so I am stuck on that route as well. If you trust me already, you can jump 
directly to the second half of my post ;-).



The solution should work both on Cloudera and on Amazon EMR.



Proof that something works

--------------------------



I have this data file:

```
$ cat a.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
$ ./jq '.' a.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "B1": "1",
        "B2": "1"
      },
      {
        "B1": "2",
        "B2": "2"
      }
    ]
  }
}
$
```



I am using this Pig script to load it:



``` Pig
a = load 'a.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```



In grunt everything seems ok.



```
grunt> describe a;
a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
grunt> dump a;
((1,4,{(1,1),(2,2)}))
grunt>
```



So far so good.



Real Problem

------------



My real data (gigabytes of it) looks slightly different. Each element of the 
array is an object that wraps another object under the key DataBSet.



```
$ ./jq '.' b.json
{
  "DataASet": {
    "A1": 1,
    "A2": 4,
    "DataBSets": [
      {
        "DataBSet": {
          "B1": "1",
          "B2": "1"
        }
      },
      {
        "DataBSet": {
          "B1": "2",
          "B2": "2"
        }
      }
    ]
  }
}
$ cat b.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
$
```



I am trying to load this JSON with the following schema:



``` Pig
b = load 'b.json' using JsonLoader('
     DataASet: (
       A1:int,
       A2:int,
       DataBSets: {
         DataBSet: (
           B1:chararray,
           B2:chararray
         )
       }
     )
');
```



Again it looks good so far in grunt.



```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}
```



I expect something like this when dumping b:



```

((1,4,{((1,1)),((2,2))}))

```
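
To spell out why I expect that shape, here is a minimal Python sketch. It is not Pig and says nothing about how JsonLoader is actually implemented. It only assumes the mapping I have in mind: a JSON object becomes a tuple of its values, and a JSON array becomes a bag. The helper name `render` is made up for this illustration.

```python
import json

# The two input lines from a.json and b.json, copied verbatim.
A_LINE = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}'
B_LINE = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}'


def render(value):
    """Print parsed JSON in a dump-like notation (assumed mapping, illustrative only)."""
    if isinstance(value, dict):
        # JSON object -> tuple of its values, in document order
        return "(" + ",".join(render(v) for v in value.values()) + ")"
    if isinstance(value, list):
        # JSON array -> bag of its elements
        return "{" + ",".join(render(v) for v in value) + "}"
    # Scalars are printed as-is
    return str(value)


print(render(json.loads(A_LINE)))  # ((1,4,{(1,1),(2,2)}))
print(render(json.loads(B_LINE)))  # ((1,4,{((1,1)),((2,2))}))
```

Under this assumption, the extra DataBSet wrapper in b.json adds exactly one more tuple level around each bag element, which is where the doubled parentheses in my expected output come from.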



But I get this:



```
grunt> dump b;
()
grunt>
```



Obviously I am doing something wrong. The empty tuple suggests that the schema 
does not match the input line.



Any hints? Thanks in advance.



Kind regards.

Ralf
