A little digging into problem #1: if you don't specify the partitions, it won't work.
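To make the partition-based append workaround concrete, here is a minimal sketch that just builds the HiveQL statement client-side, deriving the partition value from a timestamp so each load lands in its own partition instead of overwriting existing data. The table names (`page_views_stage`, `page_views_prod`), the `dt` partition column, and the `build_append_query` helper are assumptions for illustration, not part of Hive:

```python
import time

def build_append_query(created_ts,
                       stage="page_views_stage",
                       prod="page_views_prod"):
    """Build an INSERT OVERWRITE statement that targets a partition named
    after the given timestamp. Because each run overwrites only its own
    freshly named partition, existing partitions are left untouched,
    which gives append-like behavior."""
    # Partition value derived from the timestamp, e.g. dt='2009-04-14'
    dt = time.strftime("%Y-%m-%d", time.gmtime(created_ts))
    return (
        "FROM {stage} pvs "
        "INSERT OVERWRITE TABLE {prod} PARTITION (dt='{dt}') "
        "SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created "
        "WHERE pvs.created > {ts}"
    ).format(stage=stage, prod=prod, dt=dt, ts=created_ts)
```

The partition spec has to be computed on the client because the Hive of this era cannot derive partition values from the rows being selected.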
Digging into question #1: it seems you can't get past this, but the workaround is to use partitions and INSERT OVERWRITE into a partition that is dynamically created based on a timestamp, which effectively appends data to the table. Question #2: probably impossible at the moment :-(

Suhail

On Tue, Apr 14, 2009 at 1:08 PM, Suhail Doshi <[email protected]> wrote:

> Just a few issues:
>
> Problem #1:
>
>   Ended Job = job_200904142000_0003
>   Loading data to table page_views_prod
>   10688 Rows loaded to page_views_prod
>   OK
>   Time taken: 8.736 seconds
>   hive> select * from page_views_prod;
>   OK
>   Time taken: 0.126 seconds
>
> It seems a bit odd that there are no rows even though it said 10k were loaded.
>
> Question #1: for
>
>   FROM page_views_stage pvs
>   INSERT OVERWRITE TABLE page_views_prod
>   SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
>   WHERE pvs.created > 1235948360
>
> is it possible to not use OVERWRITE? I want to load data from the staging
> table into the production table, but I want it to append the data, not
> erase existing data.
>
> Question #2: for
>
>   FROM page_views_stage pvs
>   INSERT TABLE page_views_prod PARTITION(p_code=pvs.project_code,
>       dt=to_date(from_unixtime(pvs.created)))
>
> is there any way to do the partitions dynamically if you reference a table?
>
> Thank you in advance,
> Suhail
>
> On Tue, Apr 7, 2009 at 1:06 PM, Edward Capriolo <[email protected]> wrote:
>
>>> So evidently LOAD DATA actually just copies a file to HDFS. What is
>>> the solution if you have thousands of files and attempt a Hive query?
>>> My understanding is that this will be dead slow later.
>>
>> Loading thousands of files is very slow. I have an application that reads
>> 9000 small text files from a web server. I was planning to write them
>> with BufferedWriter - FSDataOutputStream; my load time was over 9 hours.
>> I can't say whether DFS copy is much faster. I took a look at what Nutch
>> is doing to handle the situation.
>> Nutch allocates single-threaded MapRunners on several nodes and emits
>> NutchDataum. Kind of a crazy way to load in 9000 files :(
>>
>> Merging the smaller files locally might help as well.
>>
>> On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
>>
>>> It seems like you have to do something like this?
>>>
>>>   CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
>>>       page_url STRING, referrer_url STRING,
>>>       ip STRING COMMENT 'IP Address of the User',
>>>       country STRING COMMENT 'country of origination')
>>>   COMMENT 'This is the staging page view table'
>>>   ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
>>>   STORED AS TEXTFILE
>>>   LOCATION '/user/data/staging/page_view';
>>>
>>>   hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
>>>
>>>   FROM page_view_stg pvs
>>>   INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
>>>   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null,
>>>       null, pvs.ip
>>>   WHERE pvs.country = 'US';
>>>
>>> On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]> wrote:
>>>
>>>> So evidently LOAD DATA actually just copies a file to HDFS. What is
>>>> the solution if you have thousands of files and attempt a Hive query?
>>>> My understanding is that this will be dead slow later.
>>>>
>>>> Suhail
>>>>
>>>> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]> wrote:
>>>>
>>>>> Raghu,
>>>>>
>>>>> I managed to get it working; it seems there were just inconsistencies
>>>>> between the metastore_db I was using in the client and the Python one.
>>>>>
>>>>> I should just always use Python from now on to make changes to
>>>>> metastore_db, instead of copying it around and using the hive client.
>>>>>
>>>>> Suhail
>>>>>
>>>>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]> wrote:
>>>>>
>>>>>> Oh, never mind, of course Python is using the metastore_db that the
>>>>>> hive service is using.
>>>>>>
>>>>>> Suhail
>>>>>>
>>>>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]> wrote:
>>>>>>
>>>>>>> This is kind of odd; it's like it's not using the same metastore_db:
>>>>>>>
>>>>>>>   li57-125 ~/test: ls
>>>>>>>   derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
>>>>>>>
>>>>>>>   li57-125 ~/test: hive
>>>>>>>   Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>>>>>   hive> select count(1) from page_views;
>>>>>>>   Total MapReduce jobs = 1
>>>>>>>   Number of reduce tasks determined at compile time: 1
>>>>>>>   In order to change the average load for a reducer (in bytes):
>>>>>>>     set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>   In order to limit the maximum number of reducers:
>>>>>>>     set hive.exec.reducers.max=<number>
>>>>>>>   In order to set a constant number of reducers:
>>>>>>>     set mapred.reduce.tasks=<number>
>>>>>>>   Job need not be submitted: no output: Success
>>>>>>>   OK
>>>>>>>   Time taken: 4.909 seconds
>>>>>>>
>>>>>>>   li57-125 ~/test: python hive_test.py
>>>>>>>   Connecting to HiveServer....
>>>>>>>   Opening transport...
>>>>>>>   select count(1) from page_views
>>>>>>>   Number of rows: ['20297']
>>>>>>>
>>>>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]> wrote:
>>>>>>>
>>>>>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>>>>>
>>>>>>>> Suhail
>>>>>>>>
>>>>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Is there no entry in the server logs about the error?
>>>>>>>>>
>>>>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am running the hive server and hadoop on the same server as the
>>>>>>>>>> file. I am also running the python script and hive server under
>>>>>>>>>> the same user, and the file is located in a directory this user
>>>>>>>>>> owns.
>>>>>>>>>>
>>>>>>>>>> I am not sure why it's still not loading it.
>>>>>>>>>>
>>>>>>>>>> Suhail
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is the file accessible to the HiveServer? We currently don't
>>>>>>>>>>> ship the file from the client machine to the server machine.
>>>>>>>>>>>
>>>>>>>>>>> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I seem to be having problems with LOAD DATA with a file on my
>>>>>>>>>>>> local system, trying to get it into Hive:
>>>>>>>>>>>>
>>>>>>>>>>>>   li57-125 ~/test: python hive_test.py
>>>>>>>>>>>>   Connecting to HiveServer....
>>>>>>>>>>>>   Opening transport...
>>>>>>>>>>>>   LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
>>>>>>>>>>>>   Traceback (most recent call last):
>>>>>>>>>>>>     File "hive_test.py", line 36, in <module>
>>>>>>>>>>>>       c.client.execute(query)
>>>>>>>>>>>>     File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
>>>>>>>>>>>>       self.recv_execute()
>>>>>>>>>>>>     File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
>>>>>>>>>>>>       raise result.ex
>>>>>>>>>>>>   hive_service.ttypes.HiveServerException: {}
>>>>>>>>>>>>
>>>>>>>>>>>> The same query works fine through the hive client but doesn't
>>>>>>>>>>>> seem to work through the python file. Executing a query through
>>>>>>>>>>>> the python client works fine if it's not a LOAD DATA. I wish
>>>>>>>>>>>> there were a better message to describe why the exception is
>>>>>>>>>>>> occurring.

--
http://mixpanel.com
Blog: http://blog.mixpanel.com
