A little digging into problem #1: if you don't specify the partitions, it won't work.
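To make the partition-based append workaround concrete, here is a minimal sketch that just builds the HiveQL statement client-side, deriving the partition value from a timestamp so each load lands in its own partition instead of overwriting existing data. The table names (`page_views_stage`, `page_views_prod`), the `dt` partition column, and the `build_append_query` helper are assumptions for illustration, not part of Hive:

```python
import time

def build_append_query(created_ts,
                       stage="page_views_stage",
                       prod="page_views_prod"):
    """Build an INSERT OVERWRITE statement that targets a partition named
    after the given timestamp. Because each run overwrites only its own
    freshly named partition, existing partitions are left untouched,
    which gives append-like behavior."""
    # Partition value derived from the timestamp, e.g. dt='2009-04-14'
    dt = time.strftime("%Y-%m-%d", time.gmtime(created_ts))
    return (
        "FROM {stage} pvs "
        "INSERT OVERWRITE TABLE {prod} PARTITION (dt='{dt}') "
        "SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created "
        "WHERE pvs.created > {ts}"
    ).format(stage=stage, prod=prod, dt=dt, ts=created_ts)
```

The partition spec has to be computed on the client because the Hive of this era cannot derive partition values from the rows being selected.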
Digging into question #1: it seems you can't get past this, but the workaround is to use partitions and INSERT OVERWRITE into a partition that is dynamically created based on a timestamp, which effectively appends data to the table. Question #2: probably impossible at the moment :-(

Suhail

On Tue, Apr 14, 2009 at 1:08 PM, Suhail Doshi <[email protected]> wrote:

> Just a few issues:
>
> Problem #1:
>
>   Ended Job = job_200904142000_0003
>   Loading data to table page_views_prod
>   10688 Rows loaded to page_views_prod
>   OK
>   Time taken: 8.736 seconds
>   hive> select * from page_views_prod;
>   OK
>   Time taken: 0.126 seconds
>
> It seems a bit odd that there are no rows even though it said 10k were loaded.
>
> Question #1: for
>
>   FROM page_views_stage pvs
>   INSERT OVERWRITE TABLE page_views_prod
>   SELECT pvs.project_code, pvs.page, pvs.referrer, pvs.ip, pvs.created
>   WHERE pvs.created > 1235948360
>
> is it possible to not use OVERWRITE? I want to load data from the staging
> table into the production table, but I want it to append the data, not
> erase existing data.
>
> Question #2: for
>
>   FROM page_views_stage pvs
>   INSERT TABLE page_views_prod PARTITION(p_code=pvs.project_code,
>       dt=to_date(from_unixtime(pvs.created)))
>
> is there any way to do the partitions dynamically if you reference a table?
>
> Thank you in advance,
> Suhail
>
> On Tue, Apr 7, 2009 at 1:06 PM, Edward Capriolo <[email protected]> wrote:
>
>>> So evidently LOAD DATA actually just copies a file to HDFS. What is
>>> the solution if you have thousands of files and attempt a Hive query?
>>> My understanding is that this will be dead slow later.
>>
>> Loading thousands of files is very slow. I have an application that reads
>> 9000 small text files from a web server. I was planning to write them
>> with BufferedWriter - FSDataOutputStream; my load time was over 9 hours.
>> I can't say whether DFS copy is much faster. I took a look at what Nutch
>> is doing to handle the situation.
>> Nutch allocates single-threaded MapRunners on several nodes and emits
>> NutchDataum. Kind of a crazy way to load in 9000 files :(
>>
>> Merging the smaller files locally might help as well.
>>
>> On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
>>
>>> It seems like you have to do something like this?
>>>
>>>   CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
>>>       page_url STRING, referrer_url STRING,
>>>       ip STRING COMMENT 'IP Address of the User',
>>>       country STRING COMMENT 'country of origination')
>>>   COMMENT 'This is the staging page view table'
>>>   ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
>>>   STORED AS TEXTFILE
>>>   LOCATION '/user/data/staging/page_view';
>>>
>>>   hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
>>>
>>>   FROM page_view_stg pvs
>>>   INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
>>>   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null,
>>>       null, pvs.ip
>>>   WHERE pvs.country = 'US';
>>>
>>> On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]> wrote:
>>>
>>>> So evidently LOAD DATA actually just copies a file to HDFS. What is
>>>> the solution if you have thousands of files and attempt a Hive query?
>>>> My understanding is that this will be dead slow later.
>>>>
>>>> Suhail
>>>>
>>>> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]> wrote:
>>>>
>>>>> Raghu,
>>>>>
>>>>> I managed to get it working; it seems there were just inconsistencies
>>>>> between the metastore_db I was using in the client and the Python one.
>>>>>
>>>>> I should just always use Python from now on to make changes to
>>>>> metastore_db, instead of copying it around and using the hive client.
>>>>>
>>>>> Suhail
>>>>>
>>>>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]> wrote:
>>>>>
>>>>>> Oh, never mind, of course Python is using the metastore_db that the
>>>>>> hive service is using.
>>>>>>
>>>>>> Suhail
>>>>>>
>>>>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]> wrote:
>>>>>>
>>>>>>> This is kind of odd; it's like it's not using the same metastore_db:
>>>>>>>
>>>>>>>   li57-125 ~/test: ls
>>>>>>>   derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
>>>>>>>
>>>>>>>   li57-125 ~/test: hive
>>>>>>>   Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>>>>>   hive> select count(1) from page_views;
>>>>>>>   Total MapReduce jobs = 1
>>>>>>>   Number of reduce tasks determined at compile time: 1
>>>>>>>   In order to change the average load for a reducer (in bytes):
>>>>>>>     set hive.exec.reducers.bytes.per.reducer=<number>
>>>>>>>   In order to limit the maximum number of reducers:
>>>>>>>     set hive.exec.reducers.max=<number>
>>>>>>>   In order to set a constant number of reducers:
>>>>>>>     set mapred.reduce.tasks=<number>
>>>>>>>   Job need not be submitted: no output: Success
>>>>>>>   OK
>>>>>>>   Time taken: 4.909 seconds
>>>>>>>
>>>>>>>   li57-125 ~/test: python hive_test.py
>>>>>>>   Connecting to HiveServer....
>>>>>>>   Opening transport...
>>>>>>>   select count(1) from page_views
>>>>>>>   Number of rows: ['20297']
>>>>>>>
>>>>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]> wrote:
>>>>>>>
>>>>>>>> No logs are generated when I run the python file in /tmp/hadoop/
>>>>>>>>
>>>>>>>> Suhail
>>>>>>>>
>>>>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Is there no entry in the server logs about the error?
>>>>>>>>>
>>>>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am running the hive server and hadoop on the same server as the
>>>>>>>>>> file. I am also running the python script and hive server under
>>>>>>>>>> the same user, and the file is located in a directory this user
>>>>>>>>>> owns.
>>>>>>>>>>
>>>>>>>>>> I am not sure why it's still not loading it.
>>>>>>>>>>
>>>>>>>>>> Suhail
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is the file accessible to the HiveServer? We currently don't
>>>>>>>>>>> ship the file from the client machine to the server machine.
>>>>>>>>>>>
>>>>>>>>>>> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I seem to be having problems with LOAD DATA with a file on my
>>>>>>>>>>>> local system, trying to get it into Hive:
>>>>>>>>>>>>
>>>>>>>>>>>>   li57-125 ~/test: python hive_test.py
>>>>>>>>>>>>   Connecting to HiveServer....
>>>>>>>>>>>>   Opening transport...
>>>>>>>>>>>>   LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
>>>>>>>>>>>>   Traceback (most recent call last):
>>>>>>>>>>>>     File "hive_test.py", line 36, in <module>
>>>>>>>>>>>>       c.client.execute(query)
>>>>>>>>>>>>     File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
>>>>>>>>>>>>       self.recv_execute()
>>>>>>>>>>>>     File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
>>>>>>>>>>>>       raise result.ex
>>>>>>>>>>>>   hive_service.ttypes.HiveServerException: {}
>>>>>>>>>>>>
>>>>>>>>>>>> The same query works fine through the hive client but doesn't
>>>>>>>>>>>> seem to work through the python file. Executing a query through
>>>>>>>>>>>> the python client works fine if it's not a LOAD DATA. I wish
>>>>>>>>>>>> there were a better message to describe why the exception is
>>>>>>>>>>>> occurring.

--
http://mixpanel.com
Blog: http://blog.mixpanel.com
