So evidently LOAD DATA actually just copies a file to HDFS. What is the solution if you have thousands of files and then attempt a Hive query? My understanding is that this will be dead slow later.
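[Editor's note: one common workaround for the thousands-of-small-files concern is to merge the small local files into a few large chunks before issuing LOAD DATA, so the table ends up backed by a handful of big HDFS files instead of thousands of tiny ones (each tiny file otherwise tends to become its own map task). The sketch below is illustrative, not from the thread; the function name and the 64 MB default are assumptions.]

```python
import glob
import os

def merge_logs(pattern, out_prefix, max_bytes=64 * 1024 * 1024):
    """Concatenate many small log files (matched by a glob pattern)
    into larger chunk files of roughly max_bytes each, so that a
    subsequent LOAD DATA loads a few big files instead of thousands
    of tiny ones. Returns the list of chunk paths written."""
    chunk_paths = []
    out = None
    written = 0
    for path in sorted(glob.glob(pattern)):
        # Start a new chunk when there is none yet or the current
        # one has reached the size threshold.
        if out is None or written >= max_bytes:
            if out:
                out.close()
            chunk = "%s.%d" % (out_prefix, len(chunk_paths))
            out = open(chunk, "wb")
            chunk_paths.append(chunk)
            written = 0
        with open(path, "rb") as f:
            data = f.read()
            out.write(data)
            written += len(data)
    if out:
        out.close()
    return chunk_paths
```

Each resulting chunk can then be loaded with its own LOAD DATA LOCAL INPATH statement.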
Suhail

On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <digitalwarf...@gmail.com> wrote:
> Ragu,
>
> I managed to get it working; it seems there were just inconsistencies
> between the metastore_db I was using in the client and the Python one.
>
> I should just always use Python from now on to make changes to
> metastore_db, instead of copying it around and using the Hive client.
>
> Suhail
>
> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <digitalwarf...@gmail.com> wrote:
>> Oh, never mind: of course Python is using the metastore_db that the
>> Hive service is using.
>>
>> Suhail
>>
>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <digitalwarf...@gmail.com> wrote:
>>> This is kind of odd; it's as if it's not using the same metastore_db:
>>>
>>> li57-125 ~/test: ls
>>> derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
>>>
>>> li57-125 ~/test: hive
>>> Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>> hive> select count(1) from page_views;
>>> Total MapReduce jobs = 1
>>> Number of reduce tasks determined at compile time: 1
>>> In order to change the average load for a reducer (in bytes):
>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>> In order to limit the maximum number of reducers:
>>>   set hive.exec.reducers.max=<number>
>>> In order to set a constant number of reducers:
>>>   set mapred.reduce.tasks=<number>
>>> Job need not be submitted: no output: Success
>>> OK
>>> Time taken: 4.909 seconds
>>>
>>> li57-125 ~/test: python hive_test.py
>>> Connecting to HiveServer....
>>> Opening transport...
>>> select count(1) from page_views
>>> Number of rows: ['20297']
>>>
>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <digitalwarf...@gmail.com> wrote:
>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>
>>>> Suhail
>>>>
>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <rmur...@facebook.com> wrote:
>>>>> Is there no entry in the server logs about the error?
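[Editor's note: the metastore_db inconsistency described above is characteristic of Hive's default embedded Derby metastore, which lives in a `metastore_db` directory relative to whatever working directory the client was launched from, so two clients started in different directories silently see different metadata. One common fix, sketched here as an assumption rather than taken from the thread (the path is an example), is to pin the metastore to an absolute location in hive-site.xml:]

```xml
<!-- hive-site.xml: point the embedded Derby metastore at one absolute
     path so every client sees the same metadata regardless of the
     directory it was launched from. The path below is an example. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/hadoop/metastore_db;create=true</value>
</property>
```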
>>>>>
>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <digitalwarf...@gmail.com> wrote:
>>>>>
>>>>>> I am running the Hive server and Hadoop on the same server as the file. I am
>>>>>> also running the Python script and Hive server under the same user, and the
>>>>>> file is located in a directory this user owns.
>>>>>>
>>>>>> I am not sure why it's still not loading.
>>>>>>
>>>>>> Suhail
>>>>>>
>>>>>> On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <rmur...@facebook.com> wrote:
>>>>>>> Is the file accessible to the HiveServer? We currently don't ship the file
>>>>>>> from the client machine to the server machine.
>>>>>>>
>>>>>>> On 4/3/09 10:26 PM, "Suhail Doshi" <suh...@mixpanel.com> wrote:
>>>>>>>
>>>>>>>> I seem to be having problems with LOAD DATA with a file on my local system,
>>>>>>>> trying to get it into Hive:
>>>>>>>>
>>>>>>>> li57-125 ~/test: python hive_test.py
>>>>>>>> Connecting to HiveServer....
>>>>>>>> Opening transport...
>>>>>>>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
>>>>>>>> Traceback (most recent call last):
>>>>>>>>   File "hive_test.py", line 36, in <module>
>>>>>>>>     c.client.execute(query)
>>>>>>>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
>>>>>>>>     self.recv_execute()
>>>>>>>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
>>>>>>>>     raise result.ex
>>>>>>>> hive_service.ttypes.HiveServerException: {}
>>>>>>>>
>>>>>>>> The same query works fine through the Hive client but doesn't seem to work
>>>>>>>> through the Python one. Executing a query through the Python client works
>>>>>>>> fine as long as it's not a LOAD DATA. I wish there were a better error message
>>>>>>>> describing why the exception is occurring.
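[Editor's note: since, per Raghu's reply above, the file is not shipped from the client to the server, the INPATH must be readable where the HiveServer process runs. When the script runs on the same host as the server, as in this thread, a local pre-flight check can turn the opaque empty HiveServerException into a clear error. This is an illustrative sketch; the function name is an assumption, not part of the Hive API.]

```python
import os

def preflight_load_path(path):
    """Sanity-check a LOAD DATA LOCAL INPATH argument before sending
    it to HiveServer. 'LOCAL' means local to the *server* process, so
    these checks are only meaningful when run on the HiveServer host.
    Raises IOError with a descriptive message on failure."""
    if not os.path.exists(path):
        raise IOError("no such file on the HiveServer host: %r" % path)
    if not os.access(path, os.R_OK):
        raise IOError("file not readable by the HiveServer user: %r" % path)
    return path
```

For example, calling preflight_load_path('/home/hadoop/test/page_view.log.2') before client.execute(...) would fail fast with the actual reason instead of the bare HiveServerException.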
--
http://mixpanel.com
Blog: http://blog.mixpanel.com