Just a few issues:

Problem #1:

Ended Job = job_200904142000_0003
Loading data to table page_views_prod
10688 Rows loaded to page_views_prod
OK
Time taken: 8.736 seconds
hive> select * from page_views_prod;
OK
Time taken: 0.126 seconds
Seems a bit odd that there are no rows even though it said 10k were loaded.

Question #1: For:

FROM page_views_stage pvs
INSERT OVERWRITE TABLE page_views_prod
SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
 WHERE pvs.created > 1235948360

Can you not use OVERWRITE? I want to load the data in the staging table into the production table, but I want it to append the data, not erase the existing data.

Question #2: For:

FROM page_views_stage pvs
INSERT TABLE page_views_prod
 PARTITION(p_code=pvs.project_code, dt=to_date(from_unixtime(pvs.created)))

Is there any way to do the partitions dynamically if you reference a table?

Thank you in advance,
Suhail

On Tue, Apr 7, 2009 at 1:06 PM, Edward Capriolo <[email protected]> wrote:
>
> >> So evidently LOAD DATA actually just copies a file to HDFS. What is
> >> the solution if you have thousands of files and attempt a hive query?
> >> My understanding is that this will be dead slow later.
>
> Loading thousands of files is very slow. I have an application that
> reads 9000 small text files from a web server. I was planning on
> writing them with a BufferedWriter over an FSDataOutputStream; my load
> time was over 9 hours. I can't say if DFS copy is much faster. I took a
> look at what Nutch is doing to handle the situation. Nutch allocates
> single-threaded MapRunners on several nodes and emits NutchDatum. Kind
> of a crazy way to load in 9000 files :(
>
> Merging the smaller files locally might help as well.
>
> On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
> > Seems like you have to do something like this?
> >
> > CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
> >     page_url STRING, referrer_url STRING,
> >     ip STRING COMMENT 'IP Address of the User',
> >     country STRING COMMENT 'country of origination')
> > COMMENT 'This is the staging page view table'
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
> > STORED AS TEXTFILE
> > LOCATION '/user/data/staging/page_view';
> >
> > hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
> >
> > FROM page_view_stg pvs
> > INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
> > SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
> >     null, null, pvs.ip
> > WHERE pvs.country = 'US';
> >
> > On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]> wrote:
> >>
> >> So evidently LOAD DATA actually just copies a file to HDFS. What is
> >> the solution if you have thousands of files and attempt a hive query?
> >> My understanding is that this will be dead slow later.
> >>
> >> Suhail
> >>
> >> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]> wrote:
> >>>
> >>> Ragu,
> >>>
> >>> I managed to get it working; it seems there were just inconsistencies
> >>> between the metastore_db I was using in the client and the python one.
> >>>
> >>> I should just always use python from now on to make changes to
> >>> metastore_db, instead of copying it around and using the hive client.
> >>>
> >>> Suhail
> >>>
> >>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]> wrote:
> >>>>
> >>>> Oh nevermind, of course python is using the metastore_db that the
> >>>> hive service is using.
> >>>>
> >>>> Suhail
> >>>>
> >>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]> wrote:
> >>>>>
> >>>>> This is kind of odd, it's like it's not using the same metastore_db:
> >>>>>
> >>>>> li57-125 ~/test: ls
> >>>>> derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
> >>>>>
> >>>>> li57-125 ~/test: hive
> >>>>> Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
> >>>>> hive> select count(1) from page_views;
> >>>>> Total MapReduce jobs = 1
> >>>>> Number of reduce tasks determined at compile time: 1
> >>>>> In order to change the average load for a reducer (in bytes):
> >>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
> >>>>> In order to limit the maximum number of reducers:
> >>>>>   set hive.exec.reducers.max=<number>
> >>>>> In order to set a constant number of reducers:
> >>>>>   set mapred.reduce.tasks=<number>
> >>>>> Job need not be submitted: no output: Success
> >>>>> OK
> >>>>> Time taken: 4.909 seconds
> >>>>>
> >>>>> li57-125 ~/test: python hive_test.py
> >>>>> Connecting to HiveServer....
> >>>>> Opening transport...
> >>>>> select count(1) from page_views
> >>>>> Number of rows: ['20297']
> >>>>>
> >>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]> wrote:
> >>>>>>
> >>>>>> No logs are generated when I run the python file in /tmp/hadoop/
> >>>>>>
> >>>>>> Suhail
> >>>>>>
> >>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Is there no entry in the server logs about the error?
> >>>>>>>
> >>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
> >>>>>>>
> >>>>>>> > I am running the hive server and hadoop on the same server as
> >>>>>>> > the file. I am also running the python script and the hive
> >>>>>>> > server under the same user, and the file is located in a
> >>>>>>> > directory this user owns.
> >>>>>>> >
> >>>>>>> > I am not sure why it's not loading it still.
> >>>>>>> >
> >>>>>>> > Suhail
> >>>>>>> >
> >>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]> wrote:
> >>>>>>> >> Is the file accessible to the HiveServer? We currently don't
> >>>>>>> >> ship the file from the client machine to the server machine.
> >>>>>>> >>
> >>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
> >>>>>>> >>
> >>>>>>> >>> I seem to be having problems with LOAD DATA with a file on my
> >>>>>>> >>> local system, trying to get it into hive:
> >>>>>>> >>>
> >>>>>>> >>> li57-125 ~/test: python hive_test.py
> >>>>>>> >>> Connecting to HiveServer....
> >>>>>>> >>> Opening transport...
> >>>>>>> >>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
> >>>>>>> >>> Traceback (most recent call last):
> >>>>>>> >>>   File "hive_test.py", line 36, in <module>
> >>>>>>> >>>     c.client.execute(query)
> >>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
> >>>>>>> >>>     self.recv_execute()
> >>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
> >>>>>>> >>>     raise result.ex
> >>>>>>> >>> hive_service.ttypes.HiveServerException: {}
> >>>>>>> >>>
> >>>>>>> >>> The same query works fine through the hive client but doesn't
> >>>>>>> >>> seem to work through the python client. Executing a query
> >>>>>>> >>> through the python client works fine if it's not a LOAD DATA.
> >>>>>>> >>> I wish there were a better message describing why the
> >>>>>>> >>> exception is occurring.
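The empty HiveServerException above gives no hint that the input file could not be found. As a minimal sketch (a hypothetical helper, not code from this thread), a pre-check before issuing the statement makes that failure explicit; note `LOCAL INPATH` is resolved on the machine running the HiveServer, so a client-side check like this only helps when the client and server share a host, as they do here:

```python
import os

def load_data_local(path, table):
    """Build a LOAD DATA LOCAL statement, failing fast if the file is
    missing. 'LOCAL INPATH' is resolved on the HiveServer's machine, not
    the Thrift client's, so this check is only meaningful when the
    script runs on the server host (as in this thread)."""
    if not os.path.exists(path):
        raise IOError("file not visible to HiveServer: %s" % path)
    return "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (path, table)

# os.devnull always exists, so this builds a statement; a missing path
# raises IOError instead of surfacing as an opaque HiveServerException.
print(load_data_local(os.devnull, "page_views"))
```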
--
http://mixpanel.com
Blog: http://blog.mixpanel.com
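For what it's worth, both questions at the top of this thread were addressed by later Hive releases: INSERT INTO (append rather than overwrite) arrived around Hive 0.8, and dynamic partition inserts around Hive 0.6. The sketch below uses that later syntax against the tables named in the thread; it is an illustration of the eventual feature, not something the Hive version discussed here supports:

```sql
-- Appending rather than overwriting: newer Hive supports INSERT INTO,
-- which adds rows to page_views_prod instead of replacing them.
FROM page_views_stage pvs
INSERT INTO TABLE page_views_prod
SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
 WHERE pvs.created > 1235948360;

-- Dynamic partitions: newer Hive can derive partition values from the
-- query itself, provided the partition columns come last in the SELECT
-- list and dynamic partitioning is enabled.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM page_views_stage pvs
INSERT OVERWRITE TABLE page_views_prod PARTITION (p_code, dt)
SELECT pvs.page, pvs.referrer, pvs.ip, pvs.created,
       pvs.project_code, to_date(from_unixtime(pvs.created));
```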
