Hello,
I am new to this list. I have been trying to solve this problem for the last
48 hours, but I am stuck. I hope someone here can point me in the right direction.
I am having problems using the Pig JsonLoader and am wondering whether I am doing
something wrong or have run into a different problem.
The first half of this post is meant to show that I know at least something about
what I am talking about and that I did my homework. During my research I found a
lot about elephant-bird, but there seems to be a conflict with Cloudera, so I am
stuck on that route as well. If you already trust me, you can jump directly to the
second half of my post ;-).
The desired solution should work both on Cloudera and on Amazon EMR.
Proof that something works
--------------------------
I have this data file:
```
$ cat a.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}
$ ./jq '.' a.json
{
"DataASet": {
"A1": 1,
"A2": 4,
"DataBSets": [
{
"B1": "1",
"B2": "1"
},
{
"B1": "2",
"B2": "2"
}
]
}
}
$
```
I am using this Pig Script to load it.
``` Pig
a = load 'a.json' using JsonLoader('
DataASet: (
A1:int,
A2:int,
DataBSets: {
(
B1:chararray,
B2:chararray
)
}
)
');
```
In grunt everything seems ok.
```
grunt> describe a;
a: {DataASet: (A1: int,A2: int,DataBSets: {(B1: chararray,B2: chararray)})}
grunt> dump a;
((1,4,{(1,1),(2,2)}))
grunt>
```
So far so good.
Real Problem
------------
My real data (gigabytes of it) looks slightly different. Each element of the
array is an object that wraps the actual record under a DataBSet key.
```
$ ./jq '.' b.json
{
"DataASet": {
"A1": 1,
"A2": 4,
"DataBSets": [
{
"DataBSet": {
"B1": "1",
"B2": "1"
}
},
{
"DataBSet": {
"B1": "2",
"B2": "2"
}
}
]
}
}
$ cat b.json
{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}
$
```
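To make the structural difference between the two files concrete, here is a minimal Python sketch that prints the nesting of both sample records. It is separate from my Pig setup: the helper name shape and its output notation are my own invention, only loosely modeled on the output of describe, and it says nothing about how JsonLoader itself interprets a schema.

```python
import json

# The two sample records from a.json and b.json, copied verbatim.
A_JSON = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"B1":"1","B2":"1"},{"B1":"2","B2":"2"}]}}'
B_JSON = '{"DataASet":{"A1":1,"A2":4,"DataBSets":[{"DataBSet":{"B1":"1","B2":"1"}},{"DataBSet":{"B1":"2","B2":"2"}}]}}'


def shape(value):
    """Return a compact description of the nesting of a parsed JSON value.

    Hypothetical helper: objects are written as (...), arrays as {...},
    and scalars by their Python type name.
    """
    if isinstance(value, dict):
        inner = ",".join(f"{key}:{shape(val)}" for key, val in value.items())
        return "(" + inner + ")"
    if isinstance(value, list):
        # Describe the array by the distinct shapes of its elements.
        shapes = []
        for item in value:
            item_shape = shape(item)
            if item_shape not in shapes:
                shapes.append(item_shape)
        return "{" + "|".join(shapes) + "}"
    return type(value).__name__


shape_a = shape(json.loads(A_JSON))
shape_b = shape(json.loads(B_JSON))
print(shape_a)
print(shape_b)
```

For a.json the array elements come out as (B1:str,B2:str), while for b.json they come out as (DataBSet:(B1:str,B2:str)), so b.json has one extra level of nesting inside each array element.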
I am trying to load this JSON with the following schema:
``` Pig
b = load 'b.json' using JsonLoader('
DataASet: (
A1:int,
A2:int,
DataBSets: {
DataBSet: (
B1:chararray,
B2:chararray
)
}
)
');
```
Again it looks good so far in grunt.
```
grunt> describe b;
b: {DataASet: (A1: int,A2: int,DataBSets: {DataBSet: (B1: chararray,B2: chararray)})}
```
I expect something like this when dumping b:
```
((1,4,{((1,1)),((2,2))}))
```
But I get this:
```
grunt> dump b;
()
grunt>
```
Obviously I am doing something wrong. The empty tuple suggests that the schema
does not match the input line.
Any hints? Thanks in advance.
Kind regards.
Ralf