>>So evidently LOAD DATA actually just copies a file to HDFS. What is the
>>solution if you have thousands of files and attempt a Hive query? My
>>understanding is that this will be dead slow later.

Loading thousands of files is very slow. I have an application that reads
9000 small text files from a web server. When I tried writing them out with
a BufferedWriter over an FSDataOutputStream, my load time was over 9 hours.
I can't say whether a DFS copy is much faster. I took a look at how Nutch
handles this situation: it allocates single-threaded MapRunners on several
nodes and emits NutchDatum records. Kind of a crazy way to load in 9000
files :(

Merging the smaller files locally might help as well.
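A minimal sketch of that local merge, assuming plain-text logs that can simply be concatenated (the directory names and the 64 MB target size are made up for illustration, not from the thread):

```python
import glob
import os

def merge_small_files(src_dir, dest_dir, target_bytes=64 * 1024 * 1024):
    """Concatenate many small text files into a few large ones, so a
    single hadoop dfs -put (or LOAD DATA) moves a handful of files
    instead of thousands."""
    os.makedirs(dest_dir, exist_ok=True)
    chunk, size = 0, 0
    out = open(os.path.join(dest_dir, "part-%05d" % chunk), "wb")
    for path in sorted(glob.glob(os.path.join(src_dir, "*"))):
        if size >= target_bytes:  # current output is big enough; start a new one
            out.close()
            chunk += 1
            size = 0
            out = open(os.path.join(dest_dir, "part-%05d" % chunk), "wb")
        with open(path, "rb") as f:
            data = f.read()
            out.write(data)
            size += len(data)
    out.close()
    return chunk + 1  # number of merged files produced
```

After merging, one `hadoop dfs -put` of the output directory gets you a few large files in HDFS rather than thousands of tiny ones, which is also friendlier to the namenode and to later map tasks.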

On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
> Seems like you have to do something like this?
>
> CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
>                     page_url STRING, referrer_url STRING,
>                     ip STRING COMMENT 'IP Address of the User',
>
>                     country STRING COMMENT 'country of origination')
>     COMMENT 'This is the staging page view table'
>     ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
>
>     STORED AS TEXTFILE
>     LOCATION '/user/data/staging/page_view';
>
>     hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
>
>     FROM page_view_stg pvs
>     INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08',
> country='US')
>
>     SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null,
> null, pvs.ip
>     WHERE pvs.country = 'US';
>
> On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]>
> wrote:
>>
>> So evidently LOAD DATA actually just copies a file to HDFS. What is the
>> solution if you have thousands of files and attempt a Hive query? My
>> understanding is that this will be dead slow later.
>>
>> Suhail
>>
>> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]>
>> wrote:
>>>
>>> Raghu,
>>>
>>> I managed to get it working; it seems there were just inconsistencies
>>> between the metastore_db I was using in the client and the one the
>>> python script was using.
>>>
>>> I should just always use python from now on to make changes to
>>> metastore_db, instead of copying it around and using the hive client.
>>>
>>> Suhail
>>>
>>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]>
>>> wrote:
>>>>
>>>> Oh, never mind: of course python is using the metastore_db that the hive
>>>> service is using.
>>>>
>>>> Suhail
>>>>
>>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]>
>>>> wrote:
>>>>>
>>>>> This is kind of odd, it's like it's not using the same metastore_db:
>>>>>
>>>>> li57-125 ~/test: ls
>>>>> derby.log  hive_test.py  hive_test.pyc    metastore_db  page_view.log.2
>>>>>
>>>>> li57-125 ~/test: hive
>>>>> Hive history
>>>>> file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>>> hive> select count(1) from page_views;
>>>>> Total MapReduce jobs = 1
>>>>> Number of reduce tasks determined at compile time: 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapred.reduce.tasks=<number>
>>>>> Job need not be submitted: no output: Success
>>>>> OK
>>>>> Time taken: 4.909 seconds
>>>>>
>>>>> li57-125 ~/test: python hive_test.py
>>>>> Connecting to HiveServer....
>>>>> Opening transport...
>>>>> select count(1) from page_views
>>>>> Number of rows:  ['20297']
>>>>>
>>>>>
>>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>>>
>>>>>> Suhail
>>>>>>
>>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Is there no entry in the server logs about the error?
>>>>>>>
>>>>>>>
>>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>
>>>>>>> > I am running the hive server and hadoop on the same server as the
>>>>>>> > file. I am
>>>>>>> > also running the python script and hive server under the same user
>>>>>>> > and the
>>>>>>> > file is located in a directory this user owns.
>>>>>>> >
>>>>>>> > I am not sure why it's not loading it still.
>>>>>>> >
>>>>>>> > Suhail
>>>>>>> >
>>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy
>>>>>>> > <[email protected]> wrote:
>>>>>>> >> Is the file accessible to the HiveServer? We currently don't ship
>>>>>>> >> the file
>>>>>>> >> from the client machine to the server machine.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >>>> I seem to be having problems with LOAD DATA with a file on my
>>>>>>> >>>> local system, trying to get it into hive:
>>>>>>> >>>>
>>>>>>> >>>> li57-125 ~/test: python hive_test.py
>>>>>>> >>>> Connecting to HiveServer....
>>>>>>> >>>> Opening transport...
>>>>>>> >>>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO
>>>>>>> >>>> TABLE
>>>>>>> >>>> page_views
>>>>>>> >>>> Traceback (most recent call last):
>>>>>>> >>>>   File "hive_test.py", line 36, in <module>
>>>>>>> >>>>     c.client.execute(query)
>>>>>>> >>>>   File
>>>>>>> >>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>>> >>>> line 42, in execute
>>>>>>> >>>>     self.recv_execute()
>>>>>>> >>>>   File
>>>>>>> >>>> "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py",
>>>>>>> >>>> line 63, in recv_execute
>>>>>>> >>>>     raise result.ex
>>>>>>> >>>> hive_service.ttypes.HiveServerException: {}
>>>>>>> >>>>
>>>>>>> >>>> The same query works fine through the hive client but doesn't
>>>>>>> >>>> seem to work through the python file. Executing a query through
>>>>>>> >>>> the python client works fine if it's not a LOAD DATA. I wish
>>>>>>> >>>> there were a better message to describe why the exception is
>>>>>>> >>>> occurring.
>>>>>>> >>
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> http://mixpanel.com
>>>>>> Blog: http://blog.mixpanel.com
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://mixpanel.com
>>>>> Blog: http://blog.mixpanel.com
>>>>
>>>>
>>>>
>>>> --
>>>> http://mixpanel.com
>>>> Blog: http://blog.mixpanel.com
>>>
>>>
>>>
>>> --
>>> http://mixpanel.com
>>> Blog: http://blog.mixpanel.com
>>
>>
>>
>> --
>> http://mixpanel.com
>> Blog: http://blog.mixpanel.com
>
>
>
> --
> http://mixpanel.com
> Blog: http://blog.mixpanel.com
>