Re: Python + Hive + Load Data

Suhail Doshi Tue, 07 Apr 2009 12:53:07 -0700

Seems like you have to do something like this?

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
    STORED AS TEXTFILE
    LOCATION '/user/data/stagging/page_view';


    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
null, null, pvs.ip
    WHERE pvs.country = 'US';



On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]>wrote:

> So evidentially LOAD DATA actually just copies a file to hdfs. What is the
> solution if you have thousands of files and attempt a hive query because my
> understanding is that this will be dead slow later.
>
> Suhail
>
>
> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]>wrote:
>
>> Ragu,
>>
>> I managed to get it working, seems there was just inconsistencies I guess
>> with metastore_db I was using in the client and the python one.
>>
>> I should just always use python from now on to make changes to
>> metastore_db, instead of copying it around and using the hive client.
>>
>> Suhail
>>
>>
>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi 
>> <[email protected]>wrote:
>>
>>> Oh nevermind, of course python is using the metastore_db that the hive
>>> service is using.
>>>
>>> Suhail
>>>
>>>
>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi 
>>> <[email protected]>wrote:
>>>
>>>> This is kind of odd, it's like it's not using the same metastore_db:
>>>>
>>>> li57-125 ~/test: ls
>>>> derby.log  hive_test.py  hive_test.pyc    metastore_db  page_view.log.2
>>>>
>>>> li57-125 ~/test: hive
>>>> Hive history
>>>> file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>> hive> select count(1) from page_views;
>>>> Total MapReduce jobs = 1
>>>> Number of reduce tasks determined at compile time: 1
>>>> In order to change the average load for a reducer (in bytes):
>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>> In order to limit the maximum number of reducers:
>>>>   set hive.exec.reducers.max=<number>
>>>> In order to set a constant number of reducers:
>>>>   set mapred.reduce.tasks=<number>
>>>> Job need not be submitted: no output: Success
>>>> OK
>>>> Time taken: 4.909 seconds
>>>>
>>>> li57-125 ~/test: python hive_test.py
>>>> Connecting to HiveServer....
>>>> Opening transport...
>>>> select count(1) from page_views
>>>> Number of rows:  ['20297']
>>>>
>>>>
>>>>
>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]
>>>> > wrote:
>>>>
>>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>>
>>>>> Suhail
>>>>>
>>>>>
>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]>wrote:
>>>>>
>>>>>> Is there no entry in the server logs about the error?
>>>>>>
>>>>>>
>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>
>>>>>> > I am running the hive server and hadoop on the same server as the
>>>>>> file. I am
>>>>>> > also running the python script and hive server under the same user
>>>>>> and the
>>>>>> > file is located in a directory this user owns.
>>>>>> >
>>>>>> > I am not sure why it's not loading it still.
>>>>>> >
>>>>>> > Suhail
>>>>>> >
>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]>
>>>>>> wrote:
>>>>>> >> Is the file accessible to the HiveServer? We currently don't ship
>>>>>> the file
>>>>>> >> from the client machine to the server machine.
>>>>>> >>
>>>>>> >>
>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>> >>
>>>>>> >>>> I seem to be having problems with LOAD DATA with a file on my
>>>>>> local system
>>>>>> >>>> trying get it into hive:
>>>>>> >>>>
>>>>>> >>>> li57-125 ~/test: python hive_test.py
>>>>>> >>>> Connecting to HiveServer....
>>>>>> >>>> Opening transport...
>>>>>> >>>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO
>>>>>> TABLE
>>>>>> >>>> page_views
>>>>>> >>>> Traceback (most recent call last):
>>>>>> >>>>   File "hive_test.py", line 36, in <module>
>>>>>> >>>>     c.client.execute(query)
>>>>>> >>>>   File
>>>>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>> >>> line
>>>>>> >>>> 42, in execute
>>>>>> >>>>     self.recv_execute()
>>>>>> >>>>   File
>>>>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>> >>> line
>>>>>> >>>> 63, in recv_execute
>>>>>> >>>>     raise result.ex
>>>>>> >>>> hive_service.ttypes.HiveServerException: {}
>>>>>> >>>>
>>>>>> >>>> The same query works fine through the hive client but doesn't
>>>>>> seem to work
>>>>>> >>>> through the python file. Executing a query through the python
>>>>>> client works
>>>>>> >>>> fine if it's not a LOAD DATA. Unfortunately, I wish there was a
>>>>>> better
>>>>>> >>> message
>>>>>> >>>> to describe why the exception is occurring.
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://mixpanel.com
>>>>> Blog: http://blog.mixpanel.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://mixpanel.com
>>>> Blog: http://blog.mixpanel.com
>>>>
>>>
>>>
>>>
>>> --
>>> http://mixpanel.com
>>> Blog: http://blog.mixpanel.com
>>>
>>
>>
>>
>> --
>> http://mixpanel.com
>> Blog: http://blog.mixpanel.com
>>
>
>
>
> --
> http://mixpanel.com
> Blog: http://blog.mixpanel.com
>



-- 
http://mixpanel.com
Blog: http://blog.mixpanel.com

Re: Python + Hive + Load Data

Reply via email to