Re: data question

2011-01-31 Thread Jonathan Natkins
Hi Cam, I couldn't find a function that achieved precisely what you were looking for, but there is a function that gets pretty close to what you want. select id, collect_set(date_hour), collect_set(count), sum(count) from test group by id; The problem with using collect_set is that it removes du
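
A minimal HiveQL sketch of this suggestion, assuming the test(id, count, date_hour) layout from the original question; the concat() variant is only a hypothetical workaround for collect_set's deduplication, not something proposed in the thread:

    SELECT id,
           collect_set(date_hour),
           collect_set(count),    -- caveat: duplicate counts collapse to one entry
           SUM(count)
    FROM test
    GROUP BY id;

    -- hypothetical workaround: collect strings that stay unique per row
    SELECT id,
           collect_set(concat(date_hour, ':', CAST(count AS STRING))),
           SUM(count)
    FROM test
    GROUP BY id;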

Re: Query Optimization in Hive

2011-01-31 Thread bharath vissapragada
Thanks for replying, Namit. It is motivating to receive a mail from the authors of Hive :). I filed the jira based on the discussion: https://issues.apache.org/jira/browse/HIVE-1938 I will try to update my idea asap. Thanks Bharath,V 4th year Undergrad,IIIT Hyderabad. w: http://research.iiit.a

Re: Query Optimization in Hive

2011-01-31 Thread Namit Jain
Bharath, This would be great. Why don't you write up something about how you are planning to proceed? File a new jira and load some design notes/spec there. We can definitely sync up from there. This feature would be very useful to the community - we, at Facebook, would definitely like to us

Re: tons of bugs and problem found

2011-01-31 Thread Aaron Kimball
In MapReduce, filenames that begin with an underscore are "hidden" files and are not enumerated by FileInputFormat (Hive, I believe, processes tables with TextInputFormat and SequenceFileInputFormat, both descendants of this class). Using "_foo" as a hidden/ignored filename is conventional in the

Re: Query Optimization in Hive

2011-01-31 Thread bharath vissapragada
Hi Ning, Anja, I am doing my Master's thesis on this topic. I have implemented all SQL features like joins, selects, etc. on top of Hadoop (before knowing about Hive), and we have derived some basic cost models for join re-ordering which seem to be working fine on some basic scales of TPC-H datasets.

Re: Please read if you plan to use Hive 0.7.0 on Hadoop 0.20.0

2011-01-31 Thread Ajo Fod
I am new to hive and hadoop and I got the packaged version from Cloudera. So, personally, I'd be happy if the new package is mutually consistent. -Ajo On Mon, Jan 31, 2011 at 5:14 PM, Carl Steinbach wrote: > Hi, > > I'm trying to get an idea of how many people plan on running Hive > 0.7.0 on to

Re: Query Optimization in Hive

2011-01-31 Thread Ning Zhang
Hi Anja, As you noticed, Hive only has limited support for cost-based optimization. One of the reasons is that Hive used to have a very small number of alternative execution plans to choose from. One exception is mapjoin vs. common joins. Liying Tang did some work during his last internship to convert commo
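
For reference, the mapjoin-vs-common-join choice mentioned here can also be forced by hand with the MAPJOIN hint; a small sketch with illustrative table names, not from the thread:

    -- ask Hive to load the smaller side (alias d) into memory and join map-side
    SELECT /*+ MAPJOIN(d) */ f.id, d.name
    FROM fact f
    JOIN dim d ON (f.dim_id = d.id);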

Please read if you plan to use Hive 0.7.0 on Hadoop 0.20.0

2011-01-31 Thread Carl Steinbach
Hi, I'm trying to get an idea of how many people plan on running Hive 0.7.0 on top of Hadoop 0.20.0 (as opposed to 0.20.1 or 0.20.2), and are in a position where they can't upgrade to one of the more recent releases of the 0.20.x branch. I'm asking because there is a ticket open (HIVE-1817) that block

Re: Query Optimization in Hive

2011-01-31 Thread Ajo Fod
I think there is a developer mailing list ... that is probably the best place for this question. Also, I think there is a cost-based query optimizer in the works somewhere. -Ajo On Mon, Jan 31, 2011 at 2:04 PM, Anja Gruenheid wrote: > Hi! > > I'm a graduate student from Georgia Tech and I'm wor

Query Optimization in Hive

2011-01-31 Thread Anja Gruenheid
Hi! I'm a graduate student from Georgia Tech and I'm working with Hive for a research project. I am interested in query optimization and the Hive MetaStore in that context. Working through the documentation and code, I noticed that the implementation right now is using a rule-based optimizati

data question

2011-01-31 Thread Cam Bazz
Hello, After doing some aggregate counting, I now have data in a table like this:

id  count  date_hour (this is a partition name)
1   3      2011310115
1   1      2011310116
2   1      2011310117
2   1      2011310118

and I need to turn this into: 1 [2011310115,2011310

Re: always insert overwrite, so how do we collect data?

2011-01-31 Thread Cam Bazz
Thank you very much, exactly what I needed. On Mon, Jan 31, 2011 at 9:57 PM, Adam O'Donnell wrote: > I would create separate partitions, one for each day worth of data, > and then drop the partitions that are no longer needed. > > On Mon, Jan 31, 2011 at 11:56 AM, Cam Bazz wrote: >> Hello, >> >>

Re: always insert overwrite, so how do we collect data?

2011-01-31 Thread Adam O'Donnell
I would create separate partitions, one for each day worth of data, and then drop the partitions that are no longer needed. On Mon, Jan 31, 2011 at 11:56 AM, Cam Bazz wrote: > Hello, > > I understand there is no way to delete data stored in a table. like a > `delete from table_name` in the hive l
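
A sketch of the partition-per-day pattern suggested here; the table and column names are made up for illustration:

    CREATE TABLE events (event_time BIGINT, payload STRING)
    PARTITIONED BY (dt STRING);

    -- write each day's results into its own partition
    INSERT OVERWRITE TABLE events PARTITION (dt='2011-01-31')
    SELECT event_time, payload FROM events_staging;

    -- later, drop the days that are no longer needed
    ALTER TABLE events DROP PARTITION (dt='2011-01-01');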

always insert overwrite, so how do we collect data?

2011-01-31 Thread Cam Bazz
Hello, I understand there is no way to delete data stored in a table, like a `delete from table_name` in the Hive language. All this is fine, because all the query results are inserted into another table, and the previous data in it is overwritten. When we need to store data collectively into a

Re: problem with hadoop or hive

2011-01-31 Thread Cam Bazz
Hello, I have fixed the problem. I had to use the previous Hadoop release, hadoop-0.20.2, with hive-0.6.0-bin, rather than the latest version - thank you very much. I have already started importing my web logs and computing basic statistics on them. -C.B. On Mon, Jan 31, 2011 at 7:32 PM, Jean-Daniel

Re: tons of bugs and problem found

2011-01-31 Thread yongqiang he
You can first try setting io.skip.checksum.errors to true, which will ignore bad checksums. >> In Facebook, we also had a requirement to ignore corrupt/bad data - but it has not been committed yet. Yongqiang, what is the jira number? There seems to be no jira for this issue. thanks yongqiang 2011/1/31
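
In the Hive CLI the suggested property can be set per session; a sketch (whether it actually skips the bad blocks depends on the underlying input format):

    SET io.skip.checksum.errors=true;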

Re: small files with hive and hadoop

2011-01-31 Thread Ajo Fod
I've noticed that it takes a while for each map task to be set up in Hive ... and the way I set up the job, I noticed that there were as many map tasks as files/buckets. I read a recommendation somewhere to design jobs such that they take at least a minute. Cheers, -Ajo. On Mon, Jan 31, 2011 at 8:08 AM
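
One set of knobs often used for the many-small-files case - an assumption on my part, not advice given in this thread - is to let Hive combine small input files into fewer splits and merge small output files:

    -- combine many small input files into fewer map tasks
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    SET mapred.max.split.size=268435456;
    -- merge the small files produced by map-only jobs at write time
    SET hive.merge.mapfiles=true;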

Re: problem with hadoop or hive

2011-01-31 Thread Jean-Daniel Cryans
It seems the hadoop version you're running isn't the same as the one that hive is using. Check the lib/ folder and if it's not the same, replace the hadoop jars with the ones from the version you're running. J-D On Mon, Jan 31, 2011 at 6:29 AM, Cam Bazz wrote: > Hello, > > I have written a probl

Re: tons of bugs and problem found

2011-01-31 Thread Namit Jain
On 1/31/11 7:46 AM, "Laurent Laborde" wrote: >On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde >wrote: >> On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain wrote: >>> Hi Laurent, >>> >>> 1. Are you saying that _top.sql did not exist in the home directory. >>> Or that, _top.sql existed, but hive was

Re: small files with hive and hadoop

2011-01-31 Thread Edward Capriolo
On Mon, Jan 31, 2011 at 11:08 AM, wrote: > Hello, > > I would like to do reporting with Hive on something like tracking data. > I want to query the raw data, which is about 2 gigs or more a day, with Hive. > This already works for me, no problem. > I also want to cascade the reporting data down to som

small files with hive and hadoop

2011-01-31 Thread hive1
Hello, I would like to do reporting with Hive on something like tracking data. I want to query the raw data, which is about 2 gigs or more a day, with Hive. This already works for me, no problem. I also want to cascade the reporting data down to something like client, date - something in Hive like part
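
A hedged sketch of the client/date layout described above, with made-up table and column names:

    CREATE TABLE report (metric STRING, value BIGINT)
    PARTITIONED BY (client STRING, dt STRING);

    -- one partition per client and day, filled from the raw tracking table
    INSERT OVERWRITE TABLE report PARTITION (client='acme', dt='2011-01-31')
    SELECT metric, COUNT(*)
    FROM tracking_raw
    WHERE client_id = 'acme' AND log_date = '2011-01-31'
    GROUP BY metric;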

Re: tons of bugs and problem found

2011-01-31 Thread Laurent Laborde
On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde wrote: > On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain wrote: >> Hi Laurent, >> >> 1. Are you saying that _top.sql did not exist in the home directory. >> Or that, _top.sql existed, but hive was not able to read it after loading > > It exists, it's lo

problem with hadoop or hive

2011-01-31 Thread Cam Bazz
Hello, I described a problem in my previous email. I now tried: select item_view_raw.* from item_view_raw WHERE log_level = 'INFO'; and I get the same error. select * from item_view_raw works just fine, but when I add a WHERE clause on any column I get the same exception: Total MapReduce jobs

trouble loading from raw data table to data table

2011-01-31 Thread Cam Bazz
Hello, I just started with Hive today. Following the instructions, I set it up and got it working so I could play with my web server log files. I created two tables: CREATE TABLE item_view(view_time BIGINT, ip_number STRING, session_id STRING, session_cookie STRING, referrer_url STRING, eser_sid INT, sale_stat
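
A generic sketch of loading a typed table from a raw string-typed table - not the poster's actual tables or columns, just the usual raw-to-typed pattern:

    CREATE TABLE raw_sketch (view_time STRING, ip_number STRING);
    CREATE TABLE typed_sketch (view_time BIGINT, ip_number STRING);

    -- cast the string columns while copying from the raw table
    INSERT OVERWRITE TABLE typed_sketch
    SELECT CAST(view_time AS BIGINT), ip_number
    FROM raw_sketch;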