RE: [hive-devel] Implicit type conversion in Hive2

2008-11-16 Thread Joydeep Sen Sarma
cebook.com/mailman/listinfo/hive-devel>] On Behalf Of Joydeep Sen Sarma Sent: Tuesday, October 14, 2008 12:17 AM To: Zheng Shao; hive Subject: Re: [hive-devel] A question about implicit type conversions Dunno. So I guess the number type hierarchy is pretty clear. From whatever

RE: Trouble Loading Into External Table

2008-11-25 Thread Joydeep Sen Sarma
Can you please send the output of 'describe extended activity_test'. This will help us understand what's happening with all the create table parameters. Also - as a sanity check - can you please check hadoop dfs -cat /data/sample/* (to make sure data got loaded/moved into that dir) -Origi

RE: External tables and existing directory structure

2008-11-28 Thread Joydeep Sen Sarma
Hi Johann, Creating an external table with the 'location' clause set to your data would be the way to go. However - Hive has its own directory naming scheme for partitions ('='). So just pointing to a directory with subdirectories would not work. So right now one would have to move or copy the

RE: External tables and existing directory structure

2008-11-28 Thread Joydeep Sen Sarma
dered? Josh On Nov 28, 2008, at 3:00 PM, Joydeep Sen Sarma wrote: > Hi Johann, > > Create external table with the 'location' clause set to ur data > would be the way to go. However - Hive has it's own directory naming > scheme for partitions ('='). So

RE: Compression

2008-12-02 Thread Joydeep Sen Sarma
Yes - from the jiras - bz2 is splittable in hadoop-0.19. Hive doesn't have to do anything to support this (although we haven't tested it). Please mark your tables as 'stored as textfile' (not sure if that's the default). As long as the file has a bz2 extension and hadoop has the codec that matches th
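A minimal sketch of the setup described above. The table name and file path are placeholders chosen for illustration, not from the original thread:

```sql
-- Plain text table; with hadoop-0.19 the bz2 codec is picked up
-- at read time from the file extension, so Hive needs no special config.
CREATE TABLE logs (line STRING)
STORED AS TEXTFILE;

-- Load an already-compressed file (hypothetical HDFS path):
LOAD DATA INPATH '/data/logs/part-00000.bz2' INTO TABLE logs;
```

Because bz2 is splittable, a single large .bz2 file can still be processed by multiple mappers, unlike gzip.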

RE: Re:RE: [hive-users] "LOAD DATA" From hdfs can'r work under hadoop 0.19

2008-12-02 Thread Joydeep Sen Sarma
Hi Paradisehi The issue is that the default file system uri obtained from the hadoop config variable fs.default.name (from hadoop-default/site.xml) does not match the uri that you are loading from. As Zheng mentioned - can you please use hdfs://namenode:x/test/shixing/log - where 'namenode:x
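Concretely, the fix is to spell out the full filesystem URI in the load statement. The hostname and port below are placeholders; they must match whatever fs.default.name is set to in the cluster's hadoop configuration:

```sql
-- 'namenode' and port 9000 are hypothetical; substitute the
-- actual values from fs.default.name in hadoop-site.xml.
LOAD DATA INPATH 'hdfs://namenode:9000/test/shixing/log'
INTO TABLE log_table;
```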

RE: Compression

2008-12-02 Thread Joydeep Sen Sarma
rking. Josh On Dec 2, 2008, at 10:30 AM, Joydeep Sen Sarma wrote: Yes - from the jiras - bz2 is splitable in hadoop-0.19. Hive doesn't have to do anything to support this (although we haven't tested it). please mark ur tables as 'stored as textfile' (not sure if that

RE: RE: did hive support the udf now?

2008-12-02 Thread Joydeep Sen Sarma
This is done already. Use: add file This is the same as the -file argument in hadoop streaming. You can refer to this file by its last component in the 'USING' clause. list file will show the list of currently added files. delete file will delete from the current session. From: Zh
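Put together, a session using a custom script might look like the sketch below. The script name and columns are hypothetical, assumed only for illustration:

```sql
-- Ship a local script to the cluster; 'my_script.py' is a
-- hypothetical user transform script.
ADD FILE /home/user/my_script.py;

-- Refer to it by its last path component in the USING clause:
SELECT TRANSFORM (col1, col2)
USING 'my_script.py'
AS (out1, out2)
FROM src;

-- Inspect or remove files added to the session:
LIST FILE;
DELETE FILE /home/user/my_script.py;
```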

RE: Apache Access Log Table example in Hive user guide

2008-12-04 Thread Joydeep Sen Sarma
Please use hive from http://svn.apache.org/repos/asf/hadoop/hive/trunk/ This should work with hadoop-0.19. Will update UserGuide with this info .. From: Bill Au [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 1:59 PM To: hive-user@hadoop.apache.org S

RE: Serde and Record I/O

2008-12-08 Thread Joydeep Sen Sarma
Hi Johan - so the key and value class types are RecordIO classes? This may need some dev work. A few things: - traditionally our SerDes have ignored the keys altogether (the row is embedded in the value). What are the semantics for your case? - the jute code was written for an older version of the se

RE: Serde and Record I/O

2008-12-08 Thread Joydeep Sen Sarma
/jira/browse/HIVE-126 Thanks in advance! /Johan Joydeep Sen Sarma wrote: > Hi Johan - so keys and value class types are RecordIO classes? > > This may need some dev work. A few things: > - traditionally our serde's have ignored the keys altogether (the row is > embedded in

RE: Hadoop JobStatus

2008-12-08 Thread Joydeep Sen Sarma
The jobid is printed out for non-silent session execution mode. Since there's no structured interface - I had tried to have structured data emitted as key=value in the output stream. The relevant output emitted here is from: console.printInfo("Starting Job = " + rj.getJobID() + ", Trackin

RE: Metadata in Multiuser DB

2008-12-09 Thread Joydeep Sen Sarma
We use mysql as metadb server. Prasad can give a more detailed response when he's back - but here are the relevant entries from our hive-default.xml: javax.jdo.option.ConnectionURL jdbc:mysql://xxx.yyy.facebook.com/hms_during_upgrade?createDatabaseIfNotExist=true javax.jdo.option.Conn

RE: Problem with queries

2008-12-14 Thread Joydeep Sen Sarma
Hive should work with 0.18. However - it needs to be specifically compiled with 0.18 to work. Please do 'ant -Dhadoop.version=0.18.0 package' from the source tree root to get jar files that work with 0.18. From: Martin Matula [mailto:matu...@gmail.com] Sent: Sunday
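For reference, the build step mentioned above, run from a checkout of the Hive source tree:

```shell
# From the root of the Hive source tree; produces jars under
# build/dist that are linked against hadoop-0.18.0.
ant -Dhadoop.version=0.18.0 package
```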

RE: OLAP with Hive

2008-12-14 Thread Joydeep Sen Sarma
We have done some preliminary work with indexing - but that's not the focus right now and no code is available in the open source trunk for this purpose. I think it's fair to say that hive is not optimized for online processing right now. (and we are quite some ways off from columnar storage).

RE: OLAP with Hive

2008-12-14 Thread Joydeep Sen Sarma
ee hive go toward hbase or katta. What is the long term vision for hive? Josh On Dec 14, 2008, at 1:06 PM, Joydeep Sen Sarma wrote: We have done some preliminary work with indexing - but that's not the focus right now and no code is available in the open source trunk for this purpose. I

RE: OLAP with Hive

2008-12-14 Thread Joydeep Sen Sarma
erent fields of the same rows, but it's not very clear what's the best way to do that. Zheng On Sun, Dec 14, 2008 at 3:51 PM, Josh Ferguson mailto:j...@besquared.net>> wrote: What would columnar organization look like and what are the benefits and drawbacks to this? Josh O

RE: Hadoop JobStatus

2008-12-15 Thread Joydeep Sen Sarma
be to implement this using a message queue (publish/subscribe system). We could leverage ActiveMQ or something similar - that would be a bit more heavyweight, but potentially people can develop advanced monitoring applications around it. Ashish ____________

RE: Number of Mappers

2009-01-11 Thread Joydeep Sen Sarma
We should be able to control this (specify exact mapper count) once hadoop-4565 and hive-74 are resolved (these are being worked on actively). From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Sunday, January 11, 2009 9:16 PM To: hive-user@hadoop.apache.org Subject

RE: Can hive load a table from a SequenceFile?

2009-01-12 Thread Joydeep Sen Sarma
If you have a file of this type already - loading it into Hive is trivial. - create table xxx () ... stored as sequencefile - load data inpath yyy into table xxx assuming yyy is already in hdfs. See the wiki for additional create table documentation: http://wiki.apache.org/hadoop
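Spelled out, the two steps above might look like this. The table name, columns, and path are placeholders (the original message elides them), assuming standard LOAD DATA INPATH syntax:

```sql
-- Hypothetical schema; replace with the actual record layout.
CREATE TABLE seq_table (key STRING, value STRING)
STORED AS SEQUENCEFILE;

-- The source path must already be in HDFS; LOAD DATA moves
-- (not copies) the file into the table's directory.
LOAD DATA INPATH '/user/data/input.seq' INTO TABLE seq_table;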

RE: Can hive load a table from a SequenceFile?

2009-01-12 Thread Joydeep Sen Sarma
Please give a full uri - like hdfs://xxx.yyy.zzz:9000/user/... Where xxx.yyy.zzz is the same namenode/hdfs instance where you are planning to store the hive tables. From: Jeremy Chow [mailto:coderp...@gmail.com] Sent: Monday, January 12, 2009 6:17 PM To: hive-user@

RE: Can hive load a table from a SequenceFile?

2009-01-14 Thread Joydeep Sen Sarma
Hey Jeremy - Looks like this was more trouble than it should have been. Can you help us by filing a couple of Jiras on expected behavior: 1. should the 'location ..' clause in create table force people to specify a uri? Or should it use fs.default.name from the hadoop configuration and tell the user that i

RE: Error loading data from HDFS into Hive

2009-01-20 Thread Joydeep Sen Sarma
Can you do a describe extended on the ip_locations table? It will have a location string. It's possible that the location spec in it does not have a full uri (perhaps the table was created before warehouse.dir was filled in?) some of these issues were fixed in a jira fixed by Prasad a couple of

RE: Problem loading data from local file

2009-01-23 Thread Joydeep Sen Sarma
There was a small change to the Load command a couple of days back (to fix a different problem) and it's triggering this. Can you apply the attached patch and check that it works. There's no extra logging here - so looking at the code was the only option .. From

RE: equijoin with multiple columns?

2009-01-23 Thread Joydeep Sen Sarma
Moral of the story - don't google around too much before writing code. -Original Message- From: Raghu Murthy [mailto:ra...@facebook.com] Sent: Friday, January 23, 2009 4:01 PM To: hive-user@hadoop.apache.org Subject: Re: equijoin with multiple columns? We could add trim to hive load, but

RE: Hive w/o hadoop installation

2009-01-25 Thread Joydeep Sen Sarma
I would say package everything up in hadoop/lib to be sure. (Even the jetty stuff is now required by the hive web server I think) From: Prasad Chakka [mailto:pra...@facebook.com] Sent: Sunday, January 25, 2009 10:08 AM To: hive-user@hadoop.apache.org Subject: Re:

RE: Job Speed

2009-01-27 Thread Joydeep Sen Sarma
Hi Josh, Copying a large number of small map outputs can take a while. Can't say why the tasktracker is not running more than one mapper. We are working on this. hadoop-4565 tracks a jira to create splits that cross files while preserving locality. Hive-74 will use 4565 on the hive side to control numb

RE: Yet another join issue

2009-02-16 Thread Joydeep Sen Sarma
Searching my computer, I find Namit quoting: "ansi sql semantics are that the filter is executed after the join." So there you go .. In the same mail he suggested putting the filter condition for the table inside the ON clause for execution before the join. So I guess you might want to try: SELECT
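The contrast above can be sketched as follows. Table and column names are hypothetical stand-ins, since the original query is truncated:

```sql
-- WHERE filter: per ANSI semantics, applied AFTER the outer join,
-- which can silently turn it into an inner join on NULL-extended rows.
SELECT a.*, b.*
FROM a LEFT OUTER JOIN b ON (a.k = b.k)
WHERE b.ds = '2009-02-16';

-- Filter folded into the ON clause: evaluated BEFORE the join,
-- so non-matching rows of b are dropped while rows of a survive.
SELECT a.*, b.*
FROM a LEFT OUTER JOIN b
  ON (a.k = b.k AND b.ds = '2009-02-16');
```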

RE: Is there a way to hint Hive the reduce key will be evenly distributed?

2009-02-18 Thread Joydeep Sen Sarma
Only for count(1) though. For others it still does 2 map-reduce jobs. See hive-223 - it does what Qing is asking for. Still not committed - so you can try out the patch. 1 map-reduce job with the option mentioned below. Will also do 1 map-reduce job with hive.groupby.skewindata=false for non map-side aggregates as well. __

RE: Error input handling in Hive

2009-02-20 Thread Joydeep Sen Sarma
There are certain classes of errors (out-of-memory types) that cannot be handled within Hive. For such cases - doing it in Hadoop would make sense. The other case is handling errors in user scripts. This is especially tricky - and we would need to borrow/use hadoop techniques for retry during the

RE: How to simplify our development flow under the means of using Hive?

2009-02-22 Thread Joydeep Sen Sarma
Hi Min, One possibility is to have your data sets stored in Hive - but for your map-reduce programs - use the Hive Java APIs (to find input files for a table, to extract rows from a table - etc.). That way at least the metadata about all data is standardized in Hive. If you want to go down this ro

RE: Error input handling in Hive

2009-02-23 Thread Joydeep Sen Sarma
Unfortunately - #1 is not current Hive behavior. We are in a weird in-between state where the deserializer exceptions are ignored but execution exceptions are not. (there's a counter that keeps track of deserializer errors). There's a related question of whether we should verify the schema of t

RE: How to simplify our development flow under the means of using Hive?

2009-02-23 Thread Joydeep Sen Sarma
. What is your solution then? BTW, is it Hive only run as a thrift service in Facebook? On Mon, Feb 23, 2009 at 12:23 PM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: Hi Min, One possibility is to have ur data sets stored in Hive - but for ur map-reduce programs - use the

RE: how to store UDFs in Hive system?

2009-02-23 Thread Joydeep Sen Sarma
We already pick up all jars from auxlib/ (both for client side and execution). Also modifiable via -auxpath switch From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Monday, February 23, 2009 8:29 PM To: hive-user@hadoop.apache.org Subject: Re: how to store UDFs in

RE: how to store UDFs in Hive system?

2009-02-23 Thread Joydeep Sen Sarma
lback.) From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] Sent: Monday, February 23, 2009 10:16 PM To: hive-user@hadoop.apache.org Subject: RE: how to store UDFs in Hive system? We already pick up all jars from auxlib/ (both for client side and execution). Also modifiable via -au

RE: How to simplify our development flow under the means of using Hive?

2009-02-23 Thread Joydeep Sen Sarma
hive-user@hadoop.apache.org Subject: Re: How to simplify our development flow under the means of using Hive? Hi Joydeep, What drive your batch-processing jobs to work? Data? or a crontab script? or your shell script? On Tue, Feb 24, 2009 at 9:56 AM, Joydeep Sen Sarma mailto:jssa...@faceboo

RE: how to store UDFs in Hive system?

2009-02-23 Thread Joydeep Sen Sarma
r users to manager UDFs. If Hive takes over all UDF registration, then it might be a pain for users to upgrade the jars containing UDFs. Zheng On Mon, Feb 23, 2009 at 10:40 PM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: My bad. Obviously this doesn't work (need to call

RE: How to simplify our development flow under the means of using Hive?

2009-02-24 Thread Joydeep Sen Sarma
We can write a small example program to get files for a table/partition. To open a table using deserializer and get rows from it etc. This would help people write java map-reduce on hive tables. From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Tuesday, February 2

RE: How can I use DistributedCache in Hive programs?

2009-02-27 Thread Joydeep Sen Sarma
add file adds the files to the distributed cache. It's the same as the -files option in hadoop streaming (and hadoop in general). So you can use this option. From: Min Zhou [coderp...@gmail.com] Sent: Thursday, February 26, 2009 5:53 PM To: hive-user@hadoop.apache.

RE: Combine() optimization

2009-02-27 Thread Joydeep Sen Sarma
Yeah - we definitely want to convert it to a MFU type flush algorithm. If someone wants to take a crack at it before we can get to it - that would be awesome From: Namit Jain [mailto:nj...@facebook.com] Sent: Friday, February 27, 2009 1:59 PM To: hive-user@hadoop

RE: Malformed Rows

2009-03-01 Thread Joydeep Sen Sarma
There's also a jira open to ignore (up to a threshold) exceptions from the execution engine. That would be easy to implement and help fix this particular scenario as well. From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Sunday, March 01, 2009 1:35 AM To: hive-user@

RE: Combine() optimization

2009-03-03 Thread Joydeep Sen Sarma
there already a JIRA for this improvement? On 2/27/09 2:22 PM, "Joydeep Sen Sarma" wrote: Yeah - we definitely want to convert it to a MFU type flush algorithm. If someone wants to take a crack at it before we can get to it - that would be awesome __

RE: Querying JSON/Thrift data?

2009-03-06 Thread Joydeep Sen Sarma
can you describe a bit more on the format of the input file? is it a set of serialized thrift records of the same class type? the current ThriftDeserializer expects serialized records to be embedded inside a BytesWritable (we make sure of this during the loading process) - but probably not the

RE: Querying JSON/Thrift data?

2009-03-06 Thread Joydeep Sen Sarma
and loading it into Hive.. I just can't figure out how to tell hive that the input data is a bunch of serialized thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully this makes sense... -Steve ____________ From:

RE: Querying JSON/Thrift data?

2009-03-07 Thread Joydeep Sen Sarma
does Hive throw an error? I saw the JSON function but I think that the delimited maps/lists is a better solution because we don't need nested maps/lists. Thanks again! Steve Corona ________ From: Joydeep Sen Sarma [jssa...@facebook.com] Sent: Saturday, Ma

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
(also been reading up on this code a bit just now) That's weird. It seems to be using TThreadPoolServer and that seems to just service all requests from a single connection in one thread. (and uses the same processor I assume that seems to initialize the session state in the interface construct

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
ed to use a new thread for each connection. From: Joydeep Sen Sarma Reply-To: Date: Mon, 9 Mar 2009 20:16:22 -0700 To: Subject: RE: thread cofinement session state (also been reading up on this code a bit just now) That's weird. It seems to be using TThread

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
t work here? From: Joydeep Sen Sarma Reply-To: Date: Mon, 9 Mar 2009 20:44:02 -0700 To: Subject: RE: thread cofinement session state Min is right. this seems a little screwed up. The Thrift Interface handler is constructed just once for the lifetime of the HiveServer. The sessio

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
uming he is using the same code as MetaStore server. AFAIK, TThreadPoolServer is supposed to use a new thread for each connection. ________ From: Joydeep Sen Sarma http://jssa...@facebook.com>> Reply-To: http://hive-user@hadoop.apache.org>> Date: Mon, 9 Mar

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
bject: Re: thread cofinement session state The server was keeping stay at the start point. On Tue, Mar 10, 2009 at 1:36 PM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: Attaching a small patch. Can u try and see if this works? (it compiles and passes the hiveserver test) It doe

RE: thread cofinement session state

2009-03-09 Thread Joydeep Sen Sarma
high. I guess it will cause a StackOverflowError when connection reaching a certain amount. On Tue, Mar 10, 2009 at 2:16 PM, Min Zhou mailto:coderp...@gmail.com>> wrote: No connection right now, the server can not start well. On Tue, Mar 10, 2009 at 2:07 PM, Joydeep Sen Sarma

RE: thread cofinement session state

2009-03-11 Thread Joydeep Sen Sarma
ht now, the server can not start well. On Tue, Mar 10, 2009 at 2:07 PM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: Hey - not able to understand - does this mean it didn't work. Can u explain in more detail what u did (how many connect

RE: Keeping Data compressed

2009-03-18 Thread Joydeep Sen Sarma
Hey - not sure if anyone responded. Sequencefiles are the way to go if you want parallelism on the files as well (since gz compressed files cannot be split). One simple way to do this is to start with text files, build (potentially an external) table on them - and load them into another table th
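The text-to-sequencefile conversion described above might look like this sketch. Table names, columns, and the HDFS location are placeholders, not from the thread:

```sql
-- External table over the existing raw text files
-- ('/data/raw' is a hypothetical location):
CREATE EXTERNAL TABLE raw_text (line STRING)
STORED AS TEXTFILE
LOCATION '/data/raw';

-- Target table in a splittable, compressible format:
CREATE TABLE compacted (line STRING)
STORED AS SEQUENCEFILE;

-- Write compressed sequencefile blocks:
SET hive.exec.compress.output=true;
INSERT OVERWRITE TABLE compacted SELECT * FROM raw_text;
```

Block-compressed sequencefiles remain splittable, so downstream queries keep their map-side parallelism.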

RE: Keeping Data compressed

2009-03-19 Thread Joydeep Sen Sarma
with the hive setting (hive.exec.compress.output=true)? > > Beside that I wonder how Hive deals with the key/value records in a > sequence file. > > Bob > > Joydeep Sen Sarma schrieb: > > Hey - not sure if anyone responded. > > >

RE: Keeping Data compressed

2009-03-19 Thread Joydeep Sen Sarma
eping Data compressed Joydeep Sen Sarma schrieb: > Can't reproduce this. can u run explain on the insert query and post the > results? > I'll do this but meanwhile I figured out that it doesnt need sequence files to get compression. I just stay with textfiles: 1. hadoop p

RE: getting different row counts on each import

2009-03-20 Thread Joydeep Sen Sarma
Yeah - that's really really surprising. The row count is reported using hadoop counters - we haven't seen any discrepancies so far (we use hadoop-17) - but that's one possibility. But the count(1) is the more important one to resolve - that should definitely be correct. Are the count results no

RE: Keeping Data compressed

2009-03-20 Thread Joydeep Sen Sarma
ble: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.mapred.SequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: t15 Joydeep Sen Sarma schrieb: > Can't reprod

RE: getting different row counts on each import

2009-03-20 Thread Joydeep Sen Sarma
: Friday, March 20, 2009 9:50 AM To: hive-user@hadoop.apache.org Subject: Re: getting different row counts on each import Joydeep Sen Sarma schrieb: > Yeah - that's really really surprising. > > The row count is reported using hadoop counters - we haven't seen any > discrepancies

RE: getting different row counts on each import

2009-03-20 Thread Joydeep Sen Sarma
PM To: hive-user@hadoop.apache.org Subject: Re: getting different row counts on each import Could this be related to Hadoop counters for compressed data being wonky? On Fri, Mar 20, 2009 at 10:03 AM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: Ok - is this correct summary?:

Re: SerDe with a binary formatted file.

2009-04-14 Thread Joydeep Sen Sarma
Hey - take a look at the patch for hive-333. In general this kind of file cannot be split by hadoop (since record boundaries are unknown). I would suggest converting these files into sequencefiles with binary records stuffed inside BytesWritables. Hive-333 has an example program that does this fo

RE: Execution Error: ExecDriver

2009-05-13 Thread Joydeep Sen Sarma
This should work. What version of hive are you running? (it almost seems that the add functionality is not implemented - which it has been forever. Hope you aren't using hive from the contrib section of hadoop-19) From: Manhee Jo [mailto:j...@nttdocomo.com] Sent:

RE: Execution Error: ExecDriver

2009-05-13 Thread Joydeep Sen Sarma
t seems that the file name is not quoted. We need to use either single or double quotation mark (' or ") to quote the whole path. Zheng On Wed, May 13, 2009 at 8:40 PM, Joydeep Sen Sarma mailto:jssa...@facebook.com>> wrote: This should work. What version of hive are u run

Hive using EC2/S3

2009-05-19 Thread Joydeep Sen Sarma
Hi folks, I have put up a short tutorial on running SQL queries on EC2 against files in S3 using Hive and Hadoop. Please find it here: http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely Some example data and queries (from TPCH benchmark) are also made available in S3. Cc'ing core-us

RE: Query execution error on cast w/ lazyserde w/ join ...

2009-06-22 Thread Joydeep Sen Sarma
Yeah - we will get the 0.20 patch committed before 0.4 From: Zheng Shao [mailto:zsh...@gmail.com] Sent: Sunday, June 14, 2009 7:20 PM To: hive-user@hadoop.apache.org Subject: Re: Query execution error on cast w/ lazyserde w/ join ... There is currently no way to g

RE: Can't start hive after setting HIVE_AUX_JARS_PATH ...

2009-07-08 Thread Joydeep Sen Sarma
Sorry - this is also needed as part of hive-487: In hadoop-20 - the -libjars has to come after the jar file/class Please try applying this patch to bin/ext/cli.sh --- cli.sh (revision 789726) +++ cli.sh (working copy) @@ -10,7 +10,7 @@ exit 3; fi - exec $HADOOP jar $AUX_JARS_CMD_LIN

RE: NullPointerException on commit

2009-08-13 Thread Joydeep Sen Sarma
hey - not sure there was a reply. there's likely to be a fuller stack trace in the hive log file .. (whose path should be mentioned somewhere in the config files). that info would help debugging this further. From: Neal Richter [nrich...@gmail.com] Sent: S

RE: How HIVE manages a join

2010-08-12 Thread Joydeep Sen Sarma
i hate this message: 'THIS PAGE WAS MOVED TO HIVE XDOCS ! DO NOT EDIT!Join Syntax' why must edits to the wiki be banned if there are xdocs? hadoop has both. there will always be things that are not captured in xdocs. it's pretty sad to discourage free form edits by people who want to contribute

RE: what is difference hive local model and standalone model.

2010-08-13 Thread Joydeep Sen Sarma
Lei - not sure I understand the question. I tried to document the relationship between hive, MR and local-mode at http://wiki.apache.org/hadoop/Hive/GettingStarted#Hive.2C_Map-Reduce_and_Local-Mode recently. perhaps you have already read it. Regarding whether local mode can be run on windows or
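For readers who land here without the wiki page, the local-mode switches of that era were session settings, roughly as sketched below (settings recalled from the referenced GettingStarted documentation; verify against your Hive version):

```sql
-- Force all jobs in this session to run as local map-reduce,
-- bypassing the cluster's jobtracker:
SET mapred.job.tracker=local;

-- Or let Hive choose local execution automatically for small inputs:
SET hive.exec.mode.local.auto=true;
```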