>> So evidently LOAD DATA actually just copies a file to HDFS. What is the
>> solution if you have thousands of files and then attempt a Hive query?
>> My understanding is that this will be dead slow later.
Loading thousands of files is very slow. I have an application that reads
9,000 small text files from a web server. I was planning to write them with
a BufferedWriter over an FSDataOutputStream; my load time was over 9 hours.
I can't say whether a DFS copy is much faster. I took a look at how Nutch
handles this situation: it allocates single-threaded MapRunners on several
nodes and emits NutchDatum objects. Kind of a crazy way to load 9,000
files :( Merging the smaller files locally might help as well.

On Tue, Apr 7, 2009 at 3:52 PM, Suhail Doshi <[email protected]> wrote:
> Seems like you have to do something like this?
>
> CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '54' LINES TERMINATED BY '12'
> STORED AS TEXTFILE
> LOCATION '/user/data/staging/page_view';
>
> hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
>
> FROM page_view_stg pvs
> INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
> SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null,
>     null, pvs.ip
> WHERE pvs.country = 'US';
>
> On Tue, Apr 7, 2009 at 12:37 PM, Suhail Doshi <[email protected]> wrote:
>>
>> So evidently LOAD DATA actually just copies a file to HDFS. What is the
>> solution if you have thousands of files and then attempt a Hive query?
>> My understanding is that this will be dead slow later.
>>
>> Suhail
>>
>> On Sun, Apr 5, 2009 at 10:52 AM, Suhail Doshi <[email protected]> wrote:
>>>
>>> Raghu,
>>>
>>> I managed to get it working; it seems there were just inconsistencies
>>> between the metastore_db I was using in the client and the Python one.
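The "merge the smaller files locally" suggestion above can be sketched roughly like this. This is a minimal, hypothetical helper (directory names and the chunk size are illustrative, not from the thread): concatenate many small text files into a few large chunk files on the local disk, so the subsequent `hadoop dfs -put` or `LOAD DATA` only has to handle a handful of files instead of thousands.

```python
import os


def merge_small_files(src_dir, dest_dir, max_bytes=64 * 1024 * 1024):
    """Concatenate many small text files from src_dir into a few larger
    chunk files in dest_dir, rolling to a new chunk once roughly
    max_bytes have been written. Returns the list of chunk paths."""
    os.makedirs(dest_dir, exist_ok=True)
    chunk_idx = 0
    written = 0
    out = None
    merged = []
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        # Start a new chunk on the first file or once the current one is full.
        if out is None or written >= max_bytes:
            if out is not None:
                out.close()
            chunk_path = os.path.join(dest_dir, "chunk-%05d.txt" % chunk_idx)
            merged.append(chunk_path)
            out = open(chunk_path, "wb")
            chunk_idx += 1
            written = 0
        with open(path, "rb") as f:
            data = f.read()
        out.write(data)
        written += len(data)
    if out is not None:
        out.close()
    return merged
```

After merging, a single `hadoop dfs -put dest_dir/* /user/data/staging/page_view` (path hypothetical) moves a few large files instead of thousands of small ones, which is also friendlier to the NameNode.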
>>>
>>> From now on I should just use Python to make changes to metastore_db,
>>> instead of copying it around and using the Hive client.
>>>
>>> Suhail
>>>
>>> On Sun, Apr 5, 2009 at 10:44 AM, Suhail Doshi <[email protected]> wrote:
>>>>
>>>> Oh, never mind: of course Python is using the metastore_db that the
>>>> Hive service is using.
>>>>
>>>> Suhail
>>>>
>>>> On Sun, Apr 5, 2009 at 10:42 AM, Suhail Doshi <[email protected]> wrote:
>>>>>
>>>>> This is kind of odd; it's like it's not using the same metastore_db:
>>>>>
>>>>> li57-125 ~/test: ls
>>>>> derby.log  hive_test.py  hive_test.pyc  metastore_db  page_view.log.2
>>>>>
>>>>> li57-125 ~/test: hive
>>>>> Hive history file=/tmp/hadoop/hive_job_log_hadoop_200904051740_1405686854.txt
>>>>> hive> select count(1) from page_views;
>>>>> Total MapReduce jobs = 1
>>>>> Number of reduce tasks determined at compile time: 1
>>>>> In order to change the average load for a reducer (in bytes):
>>>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>>> In order to limit the maximum number of reducers:
>>>>>   set hive.exec.reducers.max=<number>
>>>>> In order to set a constant number of reducers:
>>>>>   set mapred.reduce.tasks=<number>
>>>>> Job need not be submitted: no output: Success
>>>>> OK
>>>>> Time taken: 4.909 seconds
>>>>>
>>>>> li57-125 ~/test: python hive_test.py
>>>>> Connecting to HiveServer....
>>>>> Opening transport...
>>>>> select count(1) from page_views
>>>>> Number of rows: ['20297']
>>>>>
>>>>> On Sat, Apr 4, 2009 at 11:02 PM, Suhail Doshi <[email protected]> wrote:
>>>>>>
>>>>>> No logs are generated when I run the Python file in /tmp/hadoop/.
>>>>>>
>>>>>> Suhail
>>>>>>
>>>>>> On Sat, Apr 4, 2009 at 10:38 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>>
>>>>>>> Is there no entry in the server logs about the error?
>>>>>>>
>>>>>>> On 4/4/09 10:24 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>>
>>>>>>> > I am running the Hive server and Hadoop on the same server as the
>>>>>>> > file. I am also running the Python script and the Hive server
>>>>>>> > under the same user, and the file is located in a directory this
>>>>>>> > user owns.
>>>>>>> >
>>>>>>> > I am not sure why it's still not loading it.
>>>>>>> >
>>>>>>> > Suhail
>>>>>>> >
>>>>>>> > On Sat, Apr 4, 2009 at 10:14 PM, Raghu Murthy <[email protected]> wrote:
>>>>>>> >> Is the file accessible to the HiveServer? We currently don't
>>>>>>> >> ship the file from the client machine to the server machine.
>>>>>>> >>
>>>>>>> >> On 4/3/09 10:26 PM, "Suhail Doshi" <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >>> I seem to be having problems with LOAD DATA with a file on my
>>>>>>> >>> local system, trying to get it into Hive:
>>>>>>> >>>
>>>>>>> >>> li57-125 ~/test: python hive_test.py
>>>>>>> >>> Connecting to HiveServer....
>>>>>>> >>> Opening transport...
>>>>>>> >>> LOAD DATA LOCAL INPATH '/home/hadoop/test/page_view.log.2' INTO TABLE page_views
>>>>>>> >>> Traceback (most recent call last):
>>>>>>> >>>   File "hive_test.py", line 36, in <module>
>>>>>>> >>>     c.client.execute(query)
>>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 42, in execute
>>>>>>> >>>     self.recv_execute()
>>>>>>> >>>   File "/home/hadoop/hive/build/dist/lib/py/hive_service/ThriftHive.py", line 63, in recv_execute
>>>>>>> >>>     raise result.ex
>>>>>>> >>> hive_service.ttypes.HiveServerException: {}
>>>>>>> >>>
>>>>>>> >>> The same query works fine through the Hive client but doesn't
>>>>>>> >>> seem to work through the Python file.
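Raghu's point in the quoted exchange is the crux: with `LOAD DATA LOCAL INPATH` issued through a remote HiveServer, "local" means local to the *server* process, since the Thrift client does not ship the file over. A minimal client-side sanity check might look like the sketch below (the helper name is hypothetical, and it only helps when the Python client runs on the same host as HiveServer, as in this thread):

```python
import os


def check_load_data_path(path):
    """Before issuing LOAD DATA LOCAL INPATH through HiveServer, verify
    that the path exists and is readable *on this machine*. This only
    catches problems when the client and HiveServer share a filesystem;
    a remote HiveServer resolves the path on its own host instead."""
    if not os.path.exists(path):
        return False, "no such file: %s" % path
    if not os.access(path, os.R_OK):
        return False, "not readable by this user: %s" % path
    return True, "ok"
```

Running such a check before `c.client.execute(query)` would at least turn the opaque `HiveServerException: {}` into a concrete "no such file" or "not readable" message when the problem is on the client side.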
>>>>>>> >>> Executing a query through the Python client works fine if it's
>>>>>>> >>> not a LOAD DATA. Unfortunately, I wish there were a better
>>>>>>> >>> message describing why the exception is occurring.

>>>>>> --
>>>>>> http://mixpanel.com
>>>>>> Blog: http://blog.mixpanel.com
