try_parsed_json is not trivial imho :)

Start with the very, very basic, for example,  { "jobs" : "foo" }.  Get
that to work first. :)   When that works, add a level of nesting and see
what happens.  Keep building on it until you either break it (and then you
know the last thing you added broke it and can concentrate on that) or
you'll have worked out all the bugs and your final example will work.
Nothing fancy here, just old-school trial and error.
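To make that concrete, a first iteration might look something like this (table names and paths are just placeholders, and I'm assuming the same Cloudera JSON SerDe you're already using):

```sql
-- step 1: point at a file containing just {"jobs": "foo"}
create external table json_step1 (jobs STRING)
row format serde 'com.cloudera.hive.serde.JSONSerDe'
location '/user/sunita/tables/json_step1';

select jobs from json_step1;
-- if even this comes back NULL, the problem is fundamental, not the nesting

-- step 2: one level deeper, a file containing {"jobs": {"_total": 1812}}
create external table json_step2 (jobs STRUCT<_total : INT>)
row format serde 'com.cloudera.hive.serde.JSONSerDe'
location '/user/sunita/tables/json_step2';

select jobs._total from json_step2;
```

Each step adds exactly one thing, so the first NULL tells you which construct the SerDe choked on.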

An alternative I keep bringing up when native semantics don't go one's way
is the transform() function.  Use Python, Perl, Ruby or whatever to parse
the JSON and go nuts with the rich features of said language.  Just write
your output to stdout as a delimited serialization of what you want to
store, and that's it.  That would be another way to get your scalars,
arrays and structs to work.
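For instance, here's a sketch of what the Python side could look like, keyed to the field names in your create table statement (illustrative only, not tested against your actual file):

```python
#!/usr/bin/env python
# Rough sketch of a transform() helper: read the raw JSON line(s) on
# stdin, pull out whichever fields you care about, and write one
# tab-delimited row per job posting to stdout for Hive to pick up.
import json
import sys

def flatten(doc):
    """Turn one parsed response into (id, company, title) tuples."""
    rows = []
    for job in doc.get("jobs", {}).get("values", []):
        rows.append((
            str(job.get("id", "")),
            job.get("company", {}).get("name", ""),
            job.get("position", {}).get("title", ""),
        ))
    return rows

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:
        line = line.strip()
        if line:
            for row in flatten(json.loads(line)):
                print("\t".join(row))
```

On the Hive side you'd drive it with something like select transform(...) using 'flatten_jobs.py' as (id, company, title) ... (script name made up, of course). The field paths are the ones from your schema; adjust to whatever you actually want to store.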

Don't give up yet though on the JsonSerde! :)  It's probably something very
easy that we just can't see.


On Wed, Jun 19, 2013 at 10:00 AM, Sunita Arvind <sunitarv...@gmail.com>wrote:

> Thanks for looking into it Ramki.
> Yes I had tried these options. Here is what I get (renamed the table to
> have a meaningful name):
>
> hive> select jobs.values[1].id from linkedinjobsearch;
> ......mapreduce task details....
> OK
> NULL
> Time taken: 9.586 seconds
>
>
> hive> select jobs.values[0].position.title from linkedinjobsearch;
>  Total MapReduce jobs = 1
> Launching Job 1 out of 1
>
> OK
> NULL
> Time taken: 9.617 seconds
>
>
> I am trying to connect btrace to the process to be able to trace the code
> but can't get it to respond. Here is what I tried:
>
> [sunita@node01 ~]$ hive --debug, recursive=y, port=7000,mainSuspend=y,
> childSuspend=y
> ERROR: Cannot load this JVM TI agent twice, check your java command line
> for duplicate jdwp options.
> Error occurred during initialization of VM
> agent library failed to init: jdwp
>
> Tried changing the port also. Any ideas regarding the debuggers that can
> be used? I also tried explain on the query, and that does not show any
> issues either.
>
> regards
> Sunita
>
> On Wed, Jun 19, 2013 at 12:11 PM, Ramki Palle <ramki.pa...@gmail.com>wrote:
>
>> Can you run some other queries from job1 table and see if any query
>> returns some data?
>>
>> I am guessing your query "select jobs.values.position.title from jobs1;"
>> may have some issue. May be it should be as
>>
>> select jobs.values[0].position.title from jobs1;
>>
>>
>> Regards,
>> Ramki.
>>
>>
>> On Wed, Jun 19, 2013 at 8:24 AM, Sunita Arvind <sunitarv...@gmail.com>wrote:
>>
>>> Thanks Stephen,
>>>
>>> That's just what I tried with the try_parsed table. It is exactly the
>>> same data with less nesting in the structure and fewer entries.
>>> Do you mean to say that highly nested JSONs can lead to issues? What are
>>> the typical solutions to such issues? Writing UDFs in Hive, or parsing
>>> the JSON into a delimited file?
>>> I have heard of custom SerDes also. Not sure if UDFs and custom SerDes
>>> are one and the same.
>>>
>>> regards
>>> Sunita
>>>
>>>
>>> On Wed, Jun 19, 2013 at 10:38 AM, Stephen Sprague <sprag...@gmail.com>wrote:
>>>
>>>> I think you might have to start small here instead of going for the
>>>> home run on the first swing.  When all else fails, start with a trivial
>>>> JSON object and then build up from there and see what additional step
>>>> breaks it.  That way, if the trivial example fails, you know it's
>>>> something fundamental and not the complexity of your JSON object that's
>>>> throwing things off.
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:34 AM, Sunita Arvind 
>>>> <sunitarv...@gmail.com>wrote:
>>>>
>>>>> Thanks for sharing your experience Richa.
>>>>> I do have timestamps but in the format of year : INT, day : INT, month
>>>>> : INT.
>>>>> As per your suggestion, I changed them all to string, but still get
>>>>> null as the output.
>>>>>
>>>>> regards
>>>>> Sunita
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 2:17 AM, Richa Sharma <
>>>>> mailtorichasha...@gmail.com> wrote:
>>>>>
>>>>>> Do you have any timestamp fields in the table that might contain null
>>>>>> values?
>>>>>>
>>>>>> I faced a similar situation sometime back - changing the data type to
>>>>>> string made it work.
>>>>>>
>>>>>> But I was working on delimited text files.
>>>>>> Not sure if it applies to JSON, but it's still worth giving it a try!
>>>>>>
>>>>>> Richa
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 7:28 AM, Sunita Arvind <sunitarv...@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Having a column name the same as the table name is a problem, due to
>>>>>>> which I was not able to reference jobs.values.id from jobs.
>>>>>>> Changing the table name to jobs1 resolved the semantic error.
>>>>>>> However, the query still returns null:
>>>>>>>
>>>>>>> hive> select jobs.values.position.title from jobs1;
>>>>>>> Total MapReduce jobs = 1
>>>>>>> Launching Job 1 out of 1
>>>>>>> Number of reduce tasks is set to 0 since there's no reduce operator
>>>>>>> Starting Job = job_201306080116_0036, Tracking URL =
>>>>>>> http://node01.expressanalytics.net:50030/jobdetails.jsp?jobid=job_201306080116_0036
>>>>>>> Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill
>>>>>>> job_201306080116_0036
>>>>>>> Hadoop job information for Stage-1: number of mappers: 1; number of
>>>>>>> reducers: 0
>>>>>>> 2013-06-18 18:55:52,381 Stage-1 map = 0%,  reduce = 0%
>>>>>>> 2013-06-18 18:55:56,394 Stage-1 map = 100%,  reduce = 0%, Cumulative
>>>>>>> CPU 0.88 sec
>>>>>>> 2013-06-18 18:55:57,400 Stage-1 map = 100%,  reduce = 0%, Cumulative
>>>>>>> CPU 0.88 sec
>>>>>>> 2013-06-18 18:55:58,407 Stage-1 map = 100%,  reduce = 100%,
>>>>>>> Cumulative CPU 0.88 sec
>>>>>>> MapReduce Total cumulative CPU time: 880 msec
>>>>>>> Ended Job = job_201306080116_0036
>>>>>>> MapReduce Jobs Launched:
>>>>>>> Job 0: Map: 1   Cumulative CPU: 0.88 sec   HDFS Read: 35374 HDFS
>>>>>>> Write: 3 SUCCESS
>>>>>>> Total MapReduce CPU Time Spent: 880 msec
>>>>>>> OK
>>>>>>> null
>>>>>>> Time taken: 9.591 seconds
>>>>>>>
>>>>>>>
>>>>>>> regards
>>>>>>> Sunita
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 18, 2013 at 9:35 PM, Sunita Arvind <
>>>>>>> sunitarv...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok.
>>>>>>>> The data files are quite small. Around 35 KB and 1 KB each.
>>>>>>>>
>>>>>>>> [sunita@node01 tables]$  hadoop fs -ls /user/sunita/tables/jobs
>>>>>>>> Found 1 items
>>>>>>>> -rw-r--r--   3 sunita hdfs      35172 2013-06-18 18:31
>>>>>>>> /user/sunita/tables/jobs/jobs_noSite_parsed.json
>>>>>>>>
>>>>>>>>
>>>>>>>> [sunita@node01 tables]$ hadoop fs -text
>>>>>>>> /user/sunita/tables/jobs/jobs_noSite_parsed.json |more
>>>>>>>> {"jobs": {"_total": 1812, "_count": 20, "_start": 0, "values":
>>>>>>>> [{"company": {"id": 21836, "name": "CyberCoders"}, "postingDate":
>>>>>>>> {"year": 2013, "day": 10, "month": 6}, "descriptionSnippet":
>>>>>>>> "Software Engineer-Hadoop, HDFS, HBase, Pig- Vertica Analytics
>>>>>>>> Senior Hadoop Engineer - Skills Required - Hadoop, HDFS, HBase,
>>>>>>>> Pig, SQL, Industrial Software Development, System Integration,
>>>>>>>> Java, high performance, multi-threading, VerticaWe are a well
>>>>>>>> known consumer product development company and we are looking to
>>>>>>>> add a Hadoop Engineer to our Engineering team.  You will be
>>>>>>>> working with the latest ", "expirationDate": {"year": 2013, "day"
>>>>>>>>
>>>>>>>> It's a single line, so I used 'more' rather than 'head'. But
>>>>>>>> effectively, the file exists and has the data.
>>>>>>>>
>>>>>>>> regards
>>>>>>>> Sunita
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jun 18, 2013 at 8:38 PM, Stephen Sprague <
>>>>>>>> sprag...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> As Nitin alluded to, it's best to confirm the data is definitely in
>>>>>>>>> HDFS using HDFS semantics rather than Hive for the first step.
>>>>>>>>>
>>>>>>>>> 1. how big is it?   hadoop fs -ls <your hdfs dir>
>>>>>>>>> 2. cat a bit of it and see if anything is there.   hadoop fs -text
>>>>>>>>> <your hdfs dir>/<filename> | head -10
>>>>>>>>>
>>>>>>>>> do you see any data from step #2?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 18, 2013 at 3:58 PM, Sunita Arvind <
>>>>>>>>> sunitarv...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I ran some complex queries. Something to the extent of
>>>>>>>>>>                     select jobs from jobs;
>>>>>>>>>> which triggers MapReduce jobs but does not show errors and
>>>>>>>>>> produces the same output, "null". If I try referencing the struct
>>>>>>>>>> elements, I get an error, which seems to be the root cause.
>>>>>>>>>>
>>>>>>>>>> Attached are the select statement outputs with the corresponding
>>>>>>>>>> hive logs.
>>>>>>>>>>
>>>>>>>>>> I have also attached my usage details of another table -
>>>>>>>>>> try_parsed which has a subset of the same data which seems to work 
>>>>>>>>>> fine.
>>>>>>>>>> Also attached is the input file for this table - try_parsed.json
>>>>>>>>>> Thanks for your help
>>>>>>>>>>
>>>>>>>>>> Sunita
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 18, 2013 at 4:35 PM, Nitin Pawar <
>>>>>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Can you run a little more complex query?
>>>>>>>>>>>
>>>>>>>>>>> Select unique values across columns or do some maths, so we know
>>>>>>>>>>> what happens when it fires up a MapReduce job.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 19, 2013 at 1:59 AM, Sunita Arvind <
>>>>>>>>>>> sunitarv...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for responding, Nitin. Yes, I am sure that the SerDe is
>>>>>>>>>>>> working fine and the JSON file is being picked up, based on all
>>>>>>>>>>>> the errors that showed up till this stage. What sort of error are
>>>>>>>>>>>> you suspecting: file not present, or the SerDe not parsing it?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tuesday, June 18, 2013, Nitin Pawar wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> select * from table is as good as hdfs -cat
>>>>>>>>>>>>>
>>>>>>>>>>>>> are you sure there is any data in the table?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jun 18, 2013 at 11:54 PM, Sunita Arvind <
>>>>>>>>>>>>> sunitarv...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am able to parse the input JSON file and load it into hive.
>>>>>>>>>>>>>> I do not see any errors with create table, so I am assuming 
>>>>>>>>>>>>>> that. But when
>>>>>>>>>>>>>> I try to read the data, I get null
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> hive> select * from jobs;
>>>>>>>>>>>>>> OK
>>>>>>>>>>>>>> null
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have validated the JSON with JSONLint and Notepad++ JSON
>>>>>>>>>>>>>> plugin and it is a valid JSON. Here is my create table
>>>>>>>>>>>>>> statement and attached is the json input file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> create external table jobs (
>>>>>>>>>>>>>>   jobs STRUCT<
>>>>>>>>>>>>>>     values : ARRAY<STRUCT<
>>>>>>>>>>>>>>       company : STRUCT<
>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>         name : STRING>,
>>>>>>>>>>>>>>       postingDate : STRUCT<
>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>       descriptionSnippet : STRING,
>>>>>>>>>>>>>>       expirationDate : STRUCT<
>>>>>>>>>>>>>>         year : INT,
>>>>>>>>>>>>>>         day : INT,
>>>>>>>>>>>>>>         month : INT>,
>>>>>>>>>>>>>>       position : STRUCT<
>>>>>>>>>>>>>>         title : STRING,
>>>>>>>>>>>>>>         jobFunctions : ARRAY<STRUCT<
>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>         industries : ARRAY<STRUCT<
>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>           id : STRING,
>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>         jobType : STRUCT<
>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>           name : STRING>,
>>>>>>>>>>>>>>         experienceLevel : STRUCT<
>>>>>>>>>>>>>>           code : STRING,
>>>>>>>>>>>>>>           name : STRING>>,
>>>>>>>>>>>>>>       id : STRING,
>>>>>>>>>>>>>>       customerJobCode : STRING,
>>>>>>>>>>>>>>       skillsAndExperience : STRING,
>>>>>>>>>>>>>>       salary : STRING,
>>>>>>>>>>>>>>       jobPoster : STRUCT<
>>>>>>>>>>>>>>         id : STRING,
>>>>>>>>>>>>>>         firstName : STRING,
>>>>>>>>>>>>>>         lastName : STRING,
>>>>>>>>>>>>>>         headline : STRING>,
>>>>>>>>>>>>>>       referralBonus : STRING,
>>>>>>>>>>>>>>       locationDescription : STRING>>>
>>>>>>>>>>>>>> )
>>>>>>>>>>>>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>>>>>>>>>>>>> LOCATION '/user/sunita/tables/jobs';
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The table creation works fine, but when I attempt to query, I
>>>>>>>>>>>>>> get null as the result.
>>>>>>>>>>>>>> I tried adding input/output formats and SerDe properties;
>>>>>>>>>>>>>> nothing seems to have any impact.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am of the opinion that the libraries cannot handle this
>>>>>>>>>>>>>> level of nesting, and I will probably have to write a custom
>>>>>>>>>>>>>> SerDe or a parser myself. Just wanted to seek guidance before
>>>>>>>>>>>>>> I get into that.
>>>>>>>>>>>>>> Appreciate your help and guidance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> regards
>>>>>>>>>>>>>> Sunita
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Nitin Pawar
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
