Just a few issues:

Problem #1:

Ended Job = job_200904142000_0003
Loading data to table page_views_prod
10688 Rows loaded to page_views_prod
OK
Time taken: 8.736 seconds
hive> select * from page_views_prod;
OK
Time taken: 0.126 seconds
Seems a bit odd that there are no rows even though it said 10k were loaded.

Question #1: For:

FROM page_views_stage pvs
INSERT OVERWRITE TABLE page_views_prod
SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
 WHERE pvs.created > 1235948360

Can you not use OVERWRITE? I want to load the data in the staging table into the production table, but I want it to append the data, not erase the existing data.

Question #2: For:

FROM page_views_stage pvs
INSERT TABLE page_views_prod
 PARTITION(p_code=pvs.project_code, dt=to_date(from_unixtime(pvs.created)))

Is there any way to do the partitions dynamically if you reference a table?

Thank you in advance,
Suhail

On Tue, Apr 7, 2009 at 1:06 PM, Edward Capriolo <[email protected]> wrote:
>
> >> So evidently LOAD DATA actually just copies a file to HDFS. What is
> >> the solution if you have thousands of files and attempt a hive query?
> >> My understanding is that this will be dead slow later.
>
> Loading thousands of files is very slow. I have an application that
> reads 9000 small text files from a web server. I was planning on
> writing them with a BufferedWriter over an FSDataOutputStream; my load
> time was over 9 hours. I can't say if DFS copy is much faster. I took a
> look at what Nutch is doing to handle the situation. Nutch allocates
> single-threaded MapRunners on several nodes and emits NutchDatum. Kind
> of a crazy way to load in 9000 files :(
>
> Merging the smaller files locally might help as well.
>
> On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
> > Seems like you have to do something like this?
> >
> > CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
> >     page_url STRING, referrer_url STRING,
> >     ip STRING COMMENT 'IP Address of the User',
> >     country STRING COMMENT 'country of origination')
> > COMMENT 'This is the staging page view table'
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
> > STORED AS TEXTFILE
> > LOCATION '/user/data/staging/page_view';
> >
> > hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
> >
> > FROM page_view_stg pvs
> > INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
> > SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
> >     null, null, pvs.ip
> > WHERE pvs.country = 'US';
> >
> > On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]> wrote:
> >>
> >> So evidently LOAD DATA actually just copies a file to HDFS. What is
> >> the solution if you have thousands of files and attempt a hive query?
> >> My understanding is that this will be dead slow later.
> >>
> >> Suhail
> >>
> >> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]> wrote:
> >>>
> >>> Ragu,
> >>>
> >>> I managed to get it working; it seems there were just inconsistencies
> >>> between the metastore_db I was using in the client and the python one.
> >>>
> >>> I should just always use python from now on to make changes to
> >>> metastore_db, instead of copying it around and using the hive client.
> >>>
> >>> Suhail
> >>>
> >>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]> wrote:
> >>>>
> >>>> Oh nevermind, of course python is using the metastore_db that the
> >>>> hive service is using.
> >>>>
> >>>> Suhail
> >>>>
> >>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]> wrote:
> >>>>>
> >>>>> This is kind of odd, it's like it's not using the same metastore_db:
> >>>>>
> >>>>> li57-125 ~/test: ls
> >>>>> derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
> >>>>>
> >>>>> li57-125 ~/test: hive
> >>>>> Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
> >>>>> hive> select count(1) from page_views;
> >>>>> Total MapReduce jobs = 1
> >>>>> Number of reduce tasks determined at compile time: 1
> >>>>> In order to change the average load for a reducer (in bytes):
> >>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
> >>>>> In order to limit the maximum number of reducers:
> >>>>>   set hive.exec.reducers.max=<number>
> >>>>> In order to set a constant number of reducers:
> >>>>>   set mapred.reduce.tasks=<number>
> >>>>> Job need not be submitted: no output: Success
> >>>>> OK
> >>>>> Time taken: 4.909 seconds
> >>>>>
> >>>>> li57-125 ~/test: python hive_test.py
> >>>>> Connecting to HiveServer....
> >>>>> Opening transport...
> >>>>> select count(1) from page_views
> >>>>> Number of rows: ['20297']
> >>>>>
> >>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]> wrote:
> >>>>>>
> >>>>>> No logs are generated when I run the python file in /tmp/hadoop/
> >>>>>>
> >>>>>> Suhail
> >>>>>>
> >>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Is there no entry in the server logs about the error?
> >>>>>>>
> >>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
> >>>>>>>
> >>>>>>> > I am running the hive server and hadoop on the same server as
> >>>>>>> > the file. I am also running the python script and the hive
> >>>>>>> > server under the same user, and the file is located in a
> >>>>>>> > directory this user owns.
> >>>>>>> >
> >>>>>>> > I am not sure why it's not loading it still.
> >>>>>>> >
> >>>>>>> > Suhail
> >>>>>>> >
> >>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]> wrote:
> >>>>>>> >> Is the file accessible to the HiveServer? We currently don't
> >>>>>>> >> ship the file from the client machine to the server machine.
> >>>>>>> >>
> >>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
> >>>>>>> >>
> >>>>>>> >>> I seem to be having problems with LOAD DATA with a file on my
> >>>>>>> >>> local system, trying to get it into hive:
> >>>>>>> >>>
> >>>>>>> >>> li57-125 ~/test: python hive_test.py
> >>>>>>> >>> Connecting to HiveServer....
> >>>>>>> >>> Opening transport...
> >>>>>>> >>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
> >>>>>>> >>> Traceback (most recent call last):
> >>>>>>> >>>   File "hive_test.py", line 36, in <module>
> >>>>>>> >>>     c.client.execute(query)
> >>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
> >>>>>>> >>>     self.recv_execute()
> >>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
> >>>>>>> >>>     raise result.ex
> >>>>>>> >>> hive_service.ttypes.HiveServerException: {}
> >>>>>>> >>>
> >>>>>>> >>> The same query works fine through the hive client but doesn't
> >>>>>>> >>> seem to work through the python client. Executing a query
> >>>>>>> >>> through the python client works fine if it's not a LOAD DATA.
> >>>>>>> >>> I wish there were a better message describing why the
> >>>>>>> >>> exception is occurring.
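The empty HiveServerException above gives no hint that the input file could not be found. As a minimal sketch (a hypothetical helper, not code from this thread), a pre-check before issuing the statement makes that failure explicit; note `LOCAL INPATH` is resolved on the machine running the HiveServer, so a client-side check like this only helps when the client and server share a host, as they do here:

```python
import os

def load_data_local(path, table):
    """Build a LOAD DATA LOCAL statement, failing fast if the file is
    missing. 'LOCAL INPATH' is resolved on the HiveServer's machine, not
    the Thrift client's, so this check is only meaningful when the
    script runs on the server host (as in this thread)."""
    if not os.path.exists(path):
        raise IOError("file not visible to HiveServer: %s" % path)
    return "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s" % (path, table)

# os.devnull always exists, so this builds a statement; a missing path
# raises IOError instead of surfacing as an opaque HiveServerException.
print(load_data_local(os.devnull, "page_views"))
```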
--
http://mixpanel.com
Blog: http://blog.mixpanel.com
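For what it's worth, both questions at the top of this thread were addressed by later Hive releases: INSERT INTO (append rather than overwrite) arrived around Hive 0.8, and dynamic partition inserts around Hive 0.6. The sketch below uses that later syntax against the tables named in the thread; it is an illustration of the eventual feature, not something the Hive version discussed here supports:

```sql
-- Appending rather than overwriting: newer Hive supports INSERT INTO,
-- which adds rows to page_views_prod instead of replacing them.
FROM page_views_stage pvs
INSERT INTO TABLE page_views_prod
SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
 WHERE pvs.created > 1235948360;

-- Dynamic partitions: newer Hive can derive partition values from the
-- query itself, provided the partition columns come last in the SELECT
-- list and dynamic partitioning is enabled.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

FROM page_views_stage pvs
INSERT OVERWRITE TABLE page_views_prod PARTITION (p_code, dt)
SELECT pvs.page, pvs.referrer, pvs.ip, pvs.created,
       pvs.project_code, to_date(from_unixtime(pvs.created));
```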
