Re: questions about statistics in 0.7

2011-05-26 Thread Ning Zhang
On May 26, 2011, at 1:28 PM, Guy Bayes wrote: Crap sorry hit send too early questions 1: Job overhead of generating statistics on the fly with set hive.stats.autogather=true;? The overhead is minimal. The only noticeable overhead is inserting a row into an RDBMS/HBase at the end of a task. At the

Re: An issue with Hive on hadoop cluster

2011-05-23 Thread Ning Zhang
AFAIK, fs.default.name should be set in both the client-side and server-side .xml files, and the two should be consistent (the URI scheme, the hostname and the port number). The server side config (also called fs.default.name) should be read by the namenod

Re: OOM after upgrade to last weeks 0.7

2011-05-17 Thread Ning Zhang
There is a new patch for optimizing partition pruning, covering both CPU and memory. I think it is not in 0.7 yet. Can you try trunk and see how much memory you need? BTW 72k partitions is indeed quite a large number. In my experiments with the new patch, you need about 300MB for 20k p

Re: Inconsistent results from INSERT OVERWRITE TABLE

2011-05-11 Thread Ning Zhang
Hive queries are compiled into different types of tasks (MapReduce, MoveTask, etc.), so a successful MR task as indicated in the JT doesn't mean the whole query succeeded. You need to examine the status of the Hive query to see whether it succeeded or not. You can also check Hive's log file under /

Re: What does 'TempStatsStore' do ?

2011-05-10 Thread Ning Zhang
TempStatsStore is a Derby database for stats gathering (intermediate stats). You can turn off stats gathering by setting hive.stats.autogather=false. On May 10, 2011, at 1:23 PM, Christopher, Pat wrote: I don’t know what TempStatsStore is, but derby.log is an artifact of using the default metastore

Re: can I use hive dynamic partition while loading data into tables?

2011-04-15 Thread Ning Zhang
ase of EXTERNAL tables I guess. Jasper 2011/4/15 Ning Zhang <nzh...@fb.com> The INSERT OVERWRITE command will not overwrite the whole table. If you specify a partition in that table, it will only overwrite that partition. If you specify dynamic partitions, it will only create/ov
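The partition-level behaviour described in this message can be pictured with a small toy model. This is only a sketch in Python, not Hive code: the table is assumed to be a dict keyed by partition values, and insert_overwrite, page_view and hourly_batch are hypothetical names used for illustration.

```python
# Toy model only: a partitioned table as a dict mapping partition key -> rows.
# insert_overwrite is a hypothetical helper, not a Hive API.

def insert_overwrite(table, new_rows_by_partition):
    """Replace only the partitions that receive new rows; leave the rest alone."""
    result = dict(table)  # shallow copy so the input table is not modified
    for partition, rows in new_rows_by_partition.items():
        result[partition] = list(rows)  # existing partition is replaced, new one is created
    return result


page_view = {
    ("2008-06-08", "US"): ["row1", "row2"],
    ("2008-06-08", "CA"): ["row3"],
}
hourly_batch = {
    ("2008-06-08", "US"): ["row4"],  # partition already exists: its rows are replaced
    ("2008-06-08", "UK"): ["row5"],  # partition does not exist yet: it is created
}

updated = insert_overwrite(page_view, hourly_batch)
print(updated)
```

In this model the CA partition is untouched, the UK partition is created, and the earlier US rows are lost, which is why repeated loads into the same partition replace what was there before.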

Re: can I use hive dynamic partition while loading data into tables?

2011-04-15 Thread Ning Zhang
to page_view more than once? Let us say we import the data hourly; with the current dynamic partition implementation, the existing country partition will be overwritten! Is there any other way to avoid this without telling me to import the data once per day? 2011/4/15 Ning Zhang mai

Re: can I use hive dynamic partition while loading data into tables?

2011-04-15 Thread Ning Zhang
_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country but the insert must overwrite the whole table partition. Can I insert without the OVERWRITE keyword? 2011/4/15 Ning Zhang mailto:nzh...@fb

Re: can I use hive dynamic partition while loading data into tables?

2011-04-15 Thread Ning Zhang
The LOAD DATA command only copies the files to the destination directory. It doesn't read the records of the input file, so it cannot do partitioning based on record values. On Apr 14, 2011, at 10:52 PM, Erix Yao wrote: hi, all The dynamic partition function is amazing, but only works in inser

Re: Stats Gathering Problems

2011-03-05 Thread Ning Zhang
is exception. Anja On 03/04/2011 01:48 AM, Ning Zhang wrote: The Hive CLI interprets ';' as the end of a command. You should put this property in hive-site.xml: hive.stats.dbconnectionstring jdbc:mysql://localhost/mstore The JDBC connection URL. For example, jdbc:mysql:localho

Re: Stats Gathering Problems

2011-03-03 Thread Ning Zhang
The Hive CLI interprets ';' as the end of a command. You should put this property in hive-site.xml: hive.stats.dbconnectionstring jdbc:mysql://localhost/mstore The JDBC connection URL. For example, jdbc:mysql:localhost/stats_db?createDatabaseIfNotExist=true&user=stat_u;password=pass On Mar

Re: Thrift Java Client - TTransportException (SocketException: Connection reset)

2011-02-25 Thread Ning Zhang
I tried the latest trunk (through the CLI connecting to Hive Server) and there is no disconnection after 10 mins for a long query. @Ayush, is this Java client using a JDBC connection? If so, the client may have set a timeout for JDBC queries. I suspect the disconnection is from the Java client

Re: Dynamic partition set to null

2011-02-13 Thread Ning Zhang
Khaled, which version of Hive are you running? I tried a similar query in trunk (0.7.0-SNAPSHOT) and it worked. The error doesn't mean the data is wrong (ds=null); it means the compiled query plan doesn't indicate it is a dynamic partition (which is very unlikely for this simple query) or the M

Re: Query Optimization in Hive

2011-01-31 Thread Ning Zhang
Hi Anja, As you noticed, Hive has only limited support for cost-based optimization. One of the reasons is that Hive used to have a very small number of alternative execution plans to choose from. One exception is mapjoin vs common joins. Liying Tang did some work during his last internship to convert commo

Re: HIVE ODBC test fails at testing with isql

2011-01-06 Thread Ning Zhang
>libpthread.so.0 => /lib/libpthread.so.0 (0x0054e000) >libc.so.6 => /lib/libc.so.6 (0x0011) >/lib/ld-linux.so.2 (0x003b9000) > [r...@vmlinux3 ~]# > > Is there any missing library here? > > Thanks and regards > > Vaibhav Negi > >

Re: HIVE ODBC test fails at testing with isql

2011-01-05 Thread Ning Zhang
It looks like isql cannot find its dynamically linked libraries. Can you run ldd on isql and check whether all dynamically linked libraries resolve correctly? On Jan 5, 2011, at 5:29 AM, vaibhav negi wrote: > Hi Carl, > > Downloaded the patched version of unixODBC from the given link and > installed successfully

RE: Safe to upgrade to Hive 0.7.0?

2010-12-08 Thread Ning Zhang
These are DataNucleus parameters found in hive-default.xml/hive-site.xml. If you change them to the following, the metastore schema won't be changed automatically (which may be good for preventing accidents): datanucleus.autoCreateSchema=false, datanucleus.fixedDatastore=true

Re: Failure when using "insert overwrite" after upgrading to Hive 0.6.0

2010-12-07 Thread Ning Zhang
Ryan, I wonder why setting 'hive.merge.mapfiles=false' could solve the issue. The issue seems to be metastore-related (drop table could not find default.test_table). This is probably due to the database support newly introduced in 0.6 (see JIRA HIVE-675). On Dec 7, 2010, at 10:21 AM, Ryan LeC

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-23 Thread Ning Zhang
ing should be > sensitive to the output table format. > P.S: I have hive.merge.maponly=true and am using LZO compression > > On Fri, Nov 19, 2010 at 5:20 PM, Ning Zhang wrote: >> It makes sense. CombineHiveInputFormat does not work with compressed text >> fil

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-19 Thread Ning Zhang
>> expr: ds >>>>> type: string >>>>> outputColumnNames: _col0, _col1, _col2, _col3, >>>>> _col4, _col5, _col6, _col7, _col8, _col9, _col10 >>>>>File Output O

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Ning Zhang
a relatively fresh Hive from trunk (built maybe a month ago). > > --Leo > > On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang wrote: >> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is >> used to determine at run time if a merge should be triggere

Re: Hive produces very small files despite hive.merge...=true settings

2010-11-18 Thread Ning Zhang
The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used to determine at run time whether a merge should be triggered: if the average size of the files in the partition is SMALLER than the parameter and there is more than one file, the merge should be scheduled. Can you try to
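The trigger rule described in this message can be sketched as a small predicate. This is an illustration in Python, not Hive's implementation: should_merge is a hypothetical helper, and the 16,000,000-byte threshold is an assumed example value rather than a setting taken from this thread.

```python
# Sketch of the rule stated above: merge when there is more than one file and
# the average file size is smaller than hive.merge.size.smallfiles.avgsize.
# should_merge is a hypothetical helper, not part of Hive.

def should_merge(file_sizes, avgsize_threshold):
    """Return True when a merge of the partition's files should be scheduled."""
    if len(file_sizes) <= 1:
        return False  # a single file (or none) is never merged
    average = sum(file_sizes) / len(file_sizes)
    return average < avgsize_threshold


ASSUMED_THRESHOLD = 16_000_000  # bytes; example value only

print(should_merge([1_000_000] * 10, ASSUMED_THRESHOLD))          # many small files
print(should_merge([1_000_000], ASSUMED_THRESHOLD))               # one small file
print(should_merge([64_000_000, 64_000_000], ASSUMED_THRESHOLD))  # large files
```

Because the test uses the average, a partition with one large file and many tiny ones may still fall above the threshold and be left unmerged.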

Re: [VOTE] Bylaws for Apache Hive Project

2010-10-22 Thread Ning Zhang
+1 On Oct 22, 2010, at 2:51 PM, Ashish Thusoo wrote: Hi Folks, I propose that we adopt the following bylaws for the Apache Hive Project https://cwiki.apache.org/HIVE/bylaws.html These are basically a cut-and-paste job of the Apache Pig bylaws that were recently proposed by Alan Gates. We wil

Re: Merging small files with dynamic partitions

2010-10-15 Thread Ning Zhang
The output file shows it has only 2 jobs (the MapReduce job and the move task). This indicates that the plan does not have merge enabled. Merge should consist of a ConditionalTask and 2 subtasks (an MR task and a move task). Can you send the plan of the query? One thing I noticed is that you

Re: Multiple insert statement and levels of aggregation

2010-10-15 Thread Ning Zhang
In a multi-insert statement, you cannot put another FROM clause. What you can do is put both UDTFs in the FROM clause: FROM foo lateral view someUDTF(foo.a) as t1_a lateral view anotherUDTF(foo.a) as t2_a INSERT ... SELECT a,b,c,count(1), t1_a .. SELECT a,b,c,count(1), t2_a .. On Oct 15, 2

Re: Help with last 30 day unique user query

2010-10-15 Thread Ning Zhang
wrote: Thanks, Ning! Finding the date which is 30 days before/later was easy enough but my problem is beyond that. I need to find unique users based on these last 30 days for a range of days. Does that make sense? On Fri, Oct 15, 2010 at 12:10 AM, Ning Zhang <nzh...@facebook.com> wrote

Re: Help with last 30 day unique user query

2010-10-15 Thread Ning Zhang
There are some UDFs that convert a string to epoch time and back to a string, e.g., select from_unixtime(unix_timestamp('2010-10-10', 'yyyy-MM-dd') + 60*60*24*30, 'yyyy-MM-dd') from src limit 1; will give you the date which is 30 days later than 2010-10-10. On Oct 14, 2010, at 11:36 PM, Vij
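The same string-to-epoch-and-back arithmetic can be checked outside Hive. The sketch below is in Python and assumes UTC throughout; add_days is a hypothetical helper, and "%Y-%m-%d" is the Python counterpart of the 'yyyy-MM-dd' pattern in the query.

```python
from datetime import datetime, timezone

# Mirrors from_unixtime(unix_timestamp(date, pattern) + 60*60*24*days, pattern),
# assuming UTC. add_days is a hypothetical helper for illustration only.

def add_days(date_string, days, pattern="%Y-%m-%d"):
    """Parse a date string, shift it by a fixed number of seconds, and format it back."""
    parsed = datetime.strptime(date_string, pattern).replace(tzinfo=timezone.utc)
    epoch_seconds = int(parsed.timestamp())        # string -> epoch seconds
    shifted = epoch_seconds + 60 * 60 * 24 * days  # add whole days as seconds
    return datetime.fromtimestamp(shifted, tz=timezone.utc).strftime(pattern)


print(add_days("2010-10-10", 30))  # prints 2010-11-09
```

Adding a fixed number of seconds treats every day as exactly 24 hours, so in a time zone that changes its clocks within the range the formatted date may come out one day off.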