Re: how to let hive support lzo
Hi, Along with the mapred.compress* properties, try setting hive.exec.compress.output to true.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: ch huang justlo...@gmail.com
Date: Mon, 22 Jul 2013 13:41:01
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: how to let hive support lzo

# hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://CH22:9000/alex/my.txt lzo
13/07/22 13:27:58 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
13/07/22 13:27:59 INFO util.ChecksumType: Checksum using org.apache.hadoop.util.PureJavaCrc32
13/07/22 13:27:59 INFO util.ChecksumType: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
13/07/22 13:27:59 ERROR metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false
13/07/22 13:27:59 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 13:27:59 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 13:27:59 INFO compress.CodecPool: Got brand-new compressor [.lzo_deflate]
13/07/22 13:28:00 INFO compress.CodecPool: Got brand-new decompressor [.lzo_deflate]
SUCCESS

# hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /alex
13/07/22 09:39:04 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/22 09:39:04 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 6bb1b7f8b9044d8df9b4d2b6641db7658aab3cf8]
13/07/22 09:39:04 INFO lzo.LzoIndexer: LZO Indexing directory /alex...
13/07/22 09:39:04 INFO lzo.LzoIndexer: LZO Indexing directory hdfs://CH22:9000/alex/alex_t...
13/07/22 09:39:04 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file hdfs://CH22:9000/alex/sqoop-1.99.2-bin-hadoop200.tar.gz.lzo, size 0.02 GB...
13/07/22 09:39:05 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
13/07/22 09:39:06 INFO lzo.LzoIndexer: Completed LZO Indexing in 1.16 seconds (13.99 MB/s). Index size is 0.52 KB.
13/07/22 09:39:06 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file hdfs://CH22:9000/alex/test1.lzo, size 0.00 GB...
13/07/22 09:39:06 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.08 seconds (0.00 MB/s). Index size is 0.01 KB.

On Mon, Jul 22, 2013 at 1:37 PM, ch huang justlo...@gmail.com wrote:
hi, all: I have already installed and tested LZO in Hadoop and HBase, both successfully, but it fails when I try it in Hive. What can I do so that Hive recognizes LZO?
hive> set mapred.map.output.compression.codec;
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
hive> select count(*) from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number>
In order to set a constant number of reducers: set mapred.reduce.tasks=<number>
Starting Job = job_1374463239553_0003, Tracking URL = http://CH22:8088/proxy/application_1374463239553_0003/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1374463239553_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2013-07-22 13:33:27,243 Stage-1 map = 0%, reduce = 0%
2013-07-22 13:33:45,403 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1374463239553_0003 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://CH22:8088/proxy/application_1374463239553_0003/
Examining task ID: task_1374463239553_0003_m_00 (and more) from job job_1374463239553_0003
Task with the most failures(4):
Task ID: task_1374463239553_0003_m_00
URL: http://CH22:8088/taskdetails.jsp?jobid=job_1374463239553_0003&tipid=task_1374463239553_0003_m_00
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: native-lzo library not available
at com.hadoop.compression.lzo.LzoCodec.getCompressorType(LzoCodec.java:155)
at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:104)
at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:118)
at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:115)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1580)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1457)
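A minimal sketch of the settings suggested in this thread, as hive session statements. Note the stack trace above says the native-lzo library is not available on the task node, so these settings only take effect once hadoop-lzo's native libraries are deployed on every node in the cluster (codec class names are the standard hadoop-lzo ones):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;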
Re: Hive CLI
Hi Rahul, The same shortcuts Ctrl+A and Ctrl+E work in the hive shell for me (hive 0.9).

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: rahul kavale kavale.ra...@gmail.com
Date: Tue, 9 Jul 2013 11:00:49
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive CLI

Hey there, I have been using HIVE (0.7) for a while now, using the CLI and bash scripts. But it's a pain to move the cursor in the CLI, i.e. once you enter a very long query you can't go to the start of the query (like you can using Ctrl+A/Ctrl+E in a terminal). Does anyone know how to do it?
Thanks & Regards,
Rahul
Re: Strange error in hive
Hi Jerome, Can you send the error log of the MapReduce task that failed? That should have some pointers which can help you troubleshoot the issue.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Jérôme Verdier verdier.jerom...@gmail.com
Date: Mon, 8 Jul 2013 11:25:34
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Strange error in hive

Hi everybody, I faced a strange error in hive this morning. The error message is this one:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
After a quick search on Google, it appears that this is a Hive bug: https://issues.apache.org/jira/browse/HIVE-4650
Is there a way to get past this error? Thanks.
NB: my hive script is in the attachment.
--
Jérôme VERDIER
06.72.19.17.31
verdier.jerom...@gmail.com
Re: integration issure about hive and hbase
Hi, Can you try including the zookeeper quorum and port in your hive configuration as shown below:

hive --auxpath .../hbase-handler.jar,.../hbase.jar,.../zookeeper.jar,.../guava.jar -hiveconf hbase.zookeeper.quorum=<zk server names separated by comma> -hiveconf hbase.zookeeper.property.clientPort=<your custom port>

Substitute the above command with actual values. Also ensure that the zk and hbase jars specified above are the ones used in your hbase cluster, to avoid any version mismatches.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: ch huang justlo...@gmail.com
Date: Mon, 8 Jul 2013 16:40:59
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: integration issue about hive and hbase

I replaced the zookeeper jar; the error is different now.

hive> CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
FAILED: Error in metadata: MetaException(message:org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:160)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1265)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:526)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:516)
at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:173)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:93)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.getHBaseAdmin(HBaseStorageHandler.java:74)
at org.apache.hadoop.hive.hbase.HBaseStorageHandler.preCreateTable(HBaseStorageHandler.java:158)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:344)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:470)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:3176)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:213)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:131)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1063)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:900)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:748)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:209)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:286)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /hbase
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:815)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:843)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:930)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:138)
... 24 more
)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

On Mon, Jul 8, 2013 at 2:52 PM, Cheng Su scarcer...@gmail.com wrote:
Did your hbase cluster start up? The error message looks more like something is wrong with the classpath, so maybe you'd better also check that.

On Mon, Jul 8, 2013 at 1:54 PM, ch huang justlo...@gmail.com wrote:
I get an error when I try to create a table on hbase using hive; can anyone help?
hive> CREATE TABLE hive_hbasetable_demo(key int, value string) STORED BY 'ora.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" =
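A filled-in sketch of the command Bejoy describes above. The jar paths, jar versions, zookeeper host names, and port are illustrative assumptions, not values from this thread; substitute the jars actually used by your hbase cluster:

hive --auxpath /usr/lib/hive/lib/hive-hbase-handler-0.9.0.jar,/usr/lib/hbase/hbase-0.92.1.jar,/usr/lib/zookeeper/zookeeper-3.4.3.jar,/usr/lib/hive/lib/guava-r09.jar -hiveconf hbase.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com -hiveconf hbase.zookeeper.property.clientPort=2181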
Re: Need help in Hive
Hi Maheedhar, As I understand it, you have a column with data of the form MM:SS in your input data set. AFAIK this is not the standard java.sql.Timestamp format, and it doesn't even have a date part, hence you may not be able to use the Timestamp data type here. You can define it as a string and then develop custom UDFs for any further processing.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Matouk IFTISSEN matouk.iftis...@ysance.com
Date: Mon, 8 Jul 2013 09:47:11
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Need help in Hive

Hello, Try this approach in a hive query:
1- transform your data (integer type) into a (unix) timestamp, then do this:
2- from_unixtime(your_date_timestamp, 'mm:ss') AS time
Hope this helps.

2013/7/8 Maheedhar Reddy maheedhar...@gmail.com
Hi All, I have Hive 0.8.0 installed in my single-node Apache Hadoop cluster. I have a time column which is in the format MM:SS (minutes:seconds). I tried the date functions to get the value in MM:SS format, but it's not working out. Below is my column for your reference:
Active Time
12:01
0:20
2:18
In the first record, 12:01, 12 is the number of minutes and 01 is the seconds. So when I'm creating a table in Hive, I have to give a data type for this column Active Time; I have tried various date type columns but none of them worked out for me. Please guide me. What function should I use to get the time in MM:SS format?
You only live once, but if you do it right, once is enough.
Cheers!!
Maheedhar Reddy K V
http://about.me/maheedhar.kv/#
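For simple cases, the built-ins split() and CAST may already be enough, avoiding a custom UDF. A sketch, where the table and column names (my_table, active_time) are illustrative; the column is stored as STRING as Bejoy suggests:

SELECT active_time,
       CAST(split(active_time, ':')[0] AS INT) * 60
     + CAST(split(active_time, ':')[1] AS INT) AS total_seconds
FROM my_table;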
Re: When to use bucketed tables with/instead of partitioned tables
Hi Stephen, In addition to join optimization, bucketing helps a lot with sampling as well. It lets you choose the sample space (i.e., n buckets out of m).

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Stephen Boesch java...@gmail.com
Date: Sun, 16 Jun 2013 11:20:49
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: When to use bucketed tables with/instead of partitioned tables

I am accustomed to using partitioned tables to obtain separate directories for data files in each partition. When looking at the documentation for bucketed tables, it seems they are typically used in conjunction with distribute by/sort by and an appropriate partitioning key, and thus provide the ability to do map-side joins. An explanation of when to use bucketed tables by themselves (in lieu of partitioned tables), as well as in conjunction with partitioned tables, would be appreciated.
thanks!
stephenb
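A minimal sketch of the bucket sampling Bejoy refers to, assuming a table bucketed into 32 buckets on userid (the table and column names are illustrative). TABLESAMPLE then reads only the selected bucket's file rather than scanning the whole table:

SELECT * FROM page_views TABLESAMPLE(BUCKET 3 OUT OF 32 ON userid);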
Re: How to delete Specific date data using hive QL?
Adding my two cents: If you have unpartitioned data/an unpartitioned table and would like to partition it on some specific columns of the source table, use a dynamic partition insert; see the sketch after this thread. That gets the source data into separate partitions of a partitioned target table.
http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad hamza.asa...@gmail.com
Date: Tue, 4 Jun 2013 12:52:49
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: How to delete Specific date data using hive QL?

Thank you so much Nitin for your help.. :)

On Tue, Jun 4, 2013 at 12:18 PM, Nitin Pawar nitinpawar...@gmail.com wrote:
1- Does partitioning improve performance? -- Only if you make use of partitions in your queries (mostly in the where clause, to limit data to a specific value of the partitioned column).
2- Do I have to create the partitioned table new, or can I create a partition on the existing table? -- You cannot create partitions on already existing data unless the data is in partitioned directories on hdfs. I would recommend: create a new table with partition columns, load data from the old table into the partitioned table, then dump the old table.
3- Can I import data directly into the partitioned table using a sqoop command? -- You can import data directly into a partition. For exported data you don't have to worry; it remains as it is.

On Tue, Jun 4, 2013 at 12:41 PM, Hamza Asad hamza.asa...@gmail.com wrote:
No, I don't want to change my queries. I want my queries to work on the same table, and for partitioning not to change its schema (and by schema I mean the schema on mysql, i.e. the exported data). A few more things:
1- Does partitioning improve performance?
2- Do I have to create the partitioned table new, or can I create a partition on the existing table by renaming that date column and adding a partition column event_date (the actual column name)?
3- Can I import data directly into the partitioned table using a sqoop command?

On Tue, Jun 4, 2013 at 11:40 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
Partitioning of data in hive is more about laying out data in a well-defined manner, so that when you access your data you request only specific data by specifying the partition columns in the where clause. To answer your questions: Do you have to change your queries? Out of the box the queries should work as-is, unless you change the table schema by removing/adding columns. Does the format change when you export data? If your select statement is not changing, it will not change. Will the table schema change? Do you mean the schema on hive or on mysql?

On Tue, Jun 4, 2013 at 11:37 AM, Hamza Asad hamza.asa...@gmail.com wrote:
That's far better :) .. Please tell me a few more things. Do I have to change my query if I create the table with a partition on date? The rest of the columns would stay the same as they are? Also, if I export that partitioned table to mysql, would the schema of that table be the same as it was before partitioning?

On Tue, Jun 4, 2013 at 12:09 AM, Stephen Sprague sprag...@gmail.com wrote:
There is no delete semantic. You either partition on the data you want to drop and use drop partition (or drop table for the whole shebang), or you can do as Nitin suggests by selecting the inverse of the data you want to delete and storing it back into the table itself. Not ideal, but maybe it could work for your situation. Now here's another idea.
This was just _recently_ discussed on this group, as coincidence would have it. If you had scanned just a little of the group's messages you would have seen that and could then have added to the discussion! :)

On Mon, Jun 3, 2013 at 2:19 AM, Hamza Asad hamza.asa...@gmail.com wrote:
Thanks for your response Nitin. Anybody else have any better solution?

On Mon, Jun 3, 2013 at 1:27 PM, Nitin Pawar nitinpawar...@gmail.com wrote:
Hive does not give you record-level deletion as of now. So unless you have partitioned, the other option is to overwrite the table with the data you want to keep. Please wait for others to suggest more options; this one is just mine, and it can be costly too.

On Mon, Jun 3, 2013 at 12:36 PM, Hamza Asad hamza.asa...@gmail.com wrote:
No, it's not partitioned by date.

On Mon, Jun 3, 2013 at 11:19 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
How is the data laid out? Is it partitioned by date?

On Mon, Jun 3, 2013 at 11:20 AM, Hamza Asad hamza.asa...@gmail.com wrote:
Dear all, How can I remove data of specific dates from HDFS using hive query language?
--
Muhammad Hamza Asad
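A sketch of the dynamic partition insert Bejoy mentions above, tied to this thread's use case of dropping specific dates. The table and column names are illustrative assumptions:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE events_partitioned (col1 STRING, col2 STRING)
PARTITIONED BY (event_date STRING);

-- the partition column must come last in the SELECT list:
INSERT OVERWRITE TABLE events_partitioned PARTITION (event_date)
SELECT col1, col2, event_date FROM events;

-- deleting a specific date then becomes a cheap partition drop:
ALTER TABLE events_partitioned DROP PARTITION (event_date = '2013-06-01');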
Re: how does hive find where is MR job tracker
Hive gets the JobTracker from the mapred-site.xml specified within your $HADOOP_HOME/conf. Does the $HADOOP_HOME/conf/mapred-site.xml on the node that runs hive have the correct value for the jobtracker? If not, changing it to the right one might resolve your issue.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Frank Luo j...@merkleinc.com
Date: Tue, 28 May 2013 16:49:01
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: how does hive find where is MR job tracker

I have a cloudera cluster, version 4.2.0. In the hive configuration, I have MapReduce Service set to mapreduce1, which is my MR service. However, without setting mapred.job.tracker, whenever I run a hive command it always sends the job to a wrong job tracker. Here is the error:
java.net.ConnectException: Call From hqhd01ed01.pclc0.merkle.local/10.129.2.52 to hqhd01ed01.pclc0.merkle.local:8021 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
And the Cloudera Manager doesn't allow me to manually set mapred.job.tracker. So my question is how to make Hive point to the right job tracker without setting mapred.job.tracker every time.
PS. Not sure it matters, but I did move the job tracker from machine A to machine B.
Thx!
Re: Hive on Oracle
Hi Raj, Which jar to use depends on which version of Oracle you are using; the jar version corresponding to each Oracle release is in the Oracle documentation online. The JDBC jars should be available from the Oracle website for free download.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Raj Hadoop hadoop...@yahoo.com
Date: Fri, 17 May 2013 20:43:46
To: bejoy...@yahoo.com; user@hive.apache.org; User u...@hadoop.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive on Oracle

Thanks for the reply. Can you specify which jar file needs to be used? Where can I get the jar file? Does Oracle provide one for free? Let me know please.
Thanks, Raj

From: bejoy...@yahoo.com
To: user@hive.apache.org; Raj Hadoop hadoop...@yahoo.com; User u...@hadoop.apache.org
Sent: Friday, May 17, 2013 11:42 PM
Subject: Re: Hive on Oracle

Hi, The procedure is the same as setting up a mysql metastore. You need to use the jdbc driver/jar corresponding to the Oracle version/release you intend to use.
Regards
Bejoy KS
Sent from remote device, Please excuse typos

From: Raj Hadoop hadoop...@yahoo.com
Date: Fri, 17 May 2013 17:10:07 -0700 (PDT)
To: Hive user@hive.apache.org; User u...@hadoop.apache.org
ReplyTo: user@hive.apache.org
Subject: Hive on Oracle

Hi, I am planning to install Hive and want to set up the metastore on Oracle. What is the procedure? Which (JDBC) driver do I need to use?
Thanks, Raj
Re: Getting Slow Query Performance!
Hi, Since you are on a pseudo-distributed/single node environment, the hadoop mapreduce parallelism is limited. You might have just a few map slots, and map tasks may be queued waiting for others to complete. On a larger cluster your job should be faster. Also, certain SQL queries that utilize indexing will be faster in a SQL server than in hive.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Gobinda Paul gobi...@live.com
Date: Tue, 12 Mar 2013 15:09:31
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Getting Slow Query Performance!

I used sqoop to import 30GB of data (two tables: employee (approx. 21 GB) and salary (approx. 9 GB)) into hadoop (single node) via hive. I run a sample query like:
SELECT EMPLOYEE.ID, EMPLOYEE.NAME, EMPLOYEE.DEPT, SALARY.AMOUNT FROM EMPLOYEE JOIN SALARY WHERE EMPLOYEE.ID=SALARY.EMPLOYEE_ID AND SALARY.AMOUNT > 90;
In Hive it takes 15 min (approx.) whereas MySQL takes 4.5 min (approx.) to execute that query.
CPU: Pentium(R) Dual-Core CPU E5700 @ 3.00GHz, RAM: 2GB, HDD: 500GB
Here is my hive-site.xml conf:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will listen on</description>
</property>
<property>
<name>hive.hwi.listen.port</name>
<value></value>
<description>This is the port the Hive Web Interface will listen on</description>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>/lib/hive-hwi-0.9.0.war</value>
<description>This is the WAR file with the jsp content for Hive Web Interface</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
<description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is local. Hadoop sets this to 1 by default, whereas hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.</description>
</property>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>10</value>
<description>size per reducer. The default is 1G, i.e if the input size is 10G, it will use 10 reducers.</description>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>999</value>
<description>max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is negative, hive will use this one as the max number of reducers when automatically determine number of reducers.</description>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive-${user.name}</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
</configuration>

Any IDEA??
Re: hive issue with sub-directories
Hi Suresh, AFAIK, as of now a partition cannot contain sub-directories; it can contain only files. You may have to move the sub-dirs out of the parent dir 'a' and create separate partitions for those.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Suresh Krishnappa suresh.krishna...@gmail.com
Date: Mon, 11 Mar 2013 10:58:05
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: hive issue with sub-directories

Hi Mark, I am using an external table in HIVE. This is how I am adding the partition:
alter table mytable add partition (pt=1) location '/test/a/';
I am able to run HIVE queries only if the '/test/a/b' folder is deleted. How can I retain this folder structure and still issue queries?
Thanks
Suresh

On Sun, Mar 10, 2013 at 12:48 AM, Mark Grover grover.markgro...@gmail.com wrote:
Suresh, By default, the partition column name has to appear in the HDFS directory structure, e.g.
/user/hive/warehouse/<table name>/<partition col name>=<partition col value>/data1.txt
/user/hive/warehouse/<table name>/<partition col name>=<partition col value>/data2.txt

On Thu, Mar 7, 2013 at 7:20 AM, Suresh Krishnappa suresh.krishna...@gmail.com wrote:
Hi All, I have the following directory structure in hdfs:
/test/a/
/test/a/1.avro
/test/a/2.avro
/test/a/b/
/test/a/b/3.avro
I created an external HIVE table using the Avro Serde and added /test/a as a partition to this table. I am not able to run a select query; I always get the error 'not a file' on '/test/a/b'. Is this by design, a bug, or am I missing some configuration? I am using HIVE 0.10.
Thanks
Suresh
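A sketch of the restructuring Bejoy suggests, assuming the nested directory is moved to a sibling path (the target path /test/b and the partition values are illustrative):

-- after moving /test/a/b out (e.g. hdfs dfs -mv /test/a/b /test/b) so each location holds only files:
ALTER TABLE mytable ADD PARTITION (pt=1) LOCATION '/test/a/';
ALTER TABLE mytable ADD PARTITION (pt=2) LOCATION '/test/b/';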
Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil
Hi Sai, Local mode is just for trials; for any pre-prod/production environment you need MR jobs. Hive under the hood stores data in HDFS (mostly), and we definitely use hadoop/hive for larger data volumes, so MR should be in there to process them.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Ramki Palle ramki.pa...@gmail.com
Date: Sun, 10 Mar 2013 06:58:57
To: user@hive.apache.org; Sai Sai saigr...@yahoo.in
Reply-To: user@hive.apache.org
Subject: Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

Well, you get the results faster. Please check this: https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-Runtimeconfiguration
Under the section "Hive, Map-Reduce and Local-Mode", it says: "This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster."
-Ramki.

On Sun, Mar 10, 2013 at 5:26 AM, Sai Sai saigr...@yahoo.in wrote:
Ramki/John, Many Thanks, that really helped. I have run the add jars in the new session and it appears to be running. However, I was wondering about bypassing MR: why would we do it, and what is the use of it? Will appreciate any input.
Thanks
Sai

From: Ramki Palle ramki.pa...@gmail.com
To: user@hive.apache.org; Sai Sai saigr...@yahoo.in
Sent: Sunday, 10 March 2013 4:22 AM
Subject: Re: java.lang.NoClassDefFoundError: com/jayway/jsonpath/PathUtil

When you execute the following query:
hive> select * from twitter limit 5;
Hive runs it in local mode and does not use MapReduce. For the query:
hive> select tweet_id from twitter limit 5;
I think you need to add the JSON jars to overcome this error. You might have added these in a previous session. If you want these jars available for all sessions, insert the add jar statements into your $HOME/.hiverc file.
To bypass MapReduce, "set hive.exec.mode.local.auto = true;" to suggest Hive use local mode to execute the query. If it still uses MR, try "set hive.fetch.task.conversion = more;".
-Ramki.

On Sun, Mar 10, 2013 at 12:19 AM, Sai Sai saigr...@yahoo.in wrote:
Just wondering if anyone has any suggestions:
This executes successfully:
hive> select * from twitter limit 5;
This does not work:
hive> select tweet_id from twitter limit 5; // I have given the exception info below
Here is the output of this:
hive> select * from twitter limit 5;
OK
tweet_id   created_at   text   user_id   user_screen_name   user_lang
122106088022745088   Fri Oct 07 00:28:54 +0000 2011   wkwkw -_- ayo saja mba RT @yullyunet: Sepupuuu, kita lanjalan yok.. Kita karokoe-an.. Ajak mas galih jg kalo dia mau..
@Dindnf: doremifas   124735434   Dindnf   en
122106088018558976   Fri Oct 07 00:28:54 +0000 2011   @egg486 특별히 준비했습니다!   252828803   CocaCola_Korea   ko
122106088026939392   Fri Oct 07 00:28:54 +0000 2011   My offer of free gobbies for all if @amityaffliction play Blair snitch project still stands.   168590073   SarahYoungBlood   en
122106088035328001   Fri Oct 07 00:28:54 +0000 2011   the girl nxt to me in the lib got her headphones in dancing and singing loud af like she the only one here haha   267296295   MONEYyDREAMS_   en
122106088005971968   Fri Oct 07 00:28:54 +0000 2011   @KUnYoong_B2UTY Bị lsao đấy   269182160   b2st_b2utyhp   en
Time taken: 0.154 seconds

This does not work:
hive> select tweet_id from twitter limit 5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201303050432_0094, Tracking URL = http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
Kill Command = /home/satish/work/hadoop-1.0.4/libexec/../bin/hadoop job -kill job_201303050432_0094
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-03-10 00:14:44,509 Stage-1 map = 0%, reduce = 0%
2013-03-10 00:15:14,613 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303050432_0094 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://ubuntu:50030/jobdetails.jsp?jobid=job_201303050432_0094
Examining task ID: task_201303050432_0094_m_02 (and more) from job job_201303050432_0094
Task with the most failures(4):
Task ID: task_201303050432_0094_m_00
URL: http://ubuntu:50030/taskdetails.jsp?jobid=job_201303050432_0094&tipid=task_201303050432_0094_m_00
Diagnostic Messages for this Task:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432)
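A sketch combining the suggestions in this thread: register the JSON jars once in $HOME/.hiverc so every session gets them, and opt into local mode for small queries. The jar path is an illustrative assumption, not the poster's actual path:

ADD JAR /home/satish/jars/json-path.jar;  -- hypothetical path to the jayway json-path jar
SET hive.exec.mode.local.auto = true;     -- let Hive run small queries in local mode
SET hive.fetch.task.conversion = more;    -- allow simple selects to skip MR entirely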
Re: Accessing sub column in hive
Hi Sai, You can do it as:
SELECT address.country FROM employees;

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Bennie Schut bsc...@ebuddy.com
Date: Fri, 8 Mar 2013 09:09:49
To: user@hive.apache.org; 'Sai Sai' saigr...@yahoo.in
Reply-To: user@hive.apache.org
Subject: RE: Accessing sub column in hive

Perhaps worth posting the error; some might know what it means. Also, a bit unrelated to hive, but please do yourself a favor and don't use float to store monetary values like salary. You will get rounding issues at some point when you do arithmetic on them. Considering you are using hadoop, you probably have a lot of data, so adding it all up will get you there really fast.
http://stackoverflow.com/questions/3730019/why-not-use-double-or-float-to-represent-currency

From: Sai Sai [mailto:saigr...@yahoo.in]
Sent: Thursday, March 07, 2013 12:54 PM
To: user@hive.apache.org
Subject: Re: Accessing sub column in hive

I have a table created like this successfully:
CREATE TABLE IF NOT EXISTS employees (name STRING, salary FLOAT, subordinates ARRAY<STRING>, deductions MAP<STRING,FLOAT>, address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT, country:STRING>)
I would like to access/display the country column from my address struct. I have tried this:
select address[country] from employees;
I get an error. Please help.
Thanks
Sai
Re: Finding maximum across a row
Hi Sachin, You can get the detailed steps from the hive wiki itself: https://cwiki.apache.org/Hive/hiveplugins.html

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 1 Mar 2013 22:37:54
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: Finding maximum across a row

Hi Bejoy, I am new to UDFs in Hive. Could you send me any links/tutorials where I can learn about writing UDFs?
Thanks!

On Fri, Mar 1, 2013 at 10:22 PM, bejoy...@yahoo.com wrote:
Hi Sachin, AFAIK there isn't one at the moment. But you can easily achieve this using a custom UDF.
Regards
Bejoy KS
Sent from remote device, Please excuse typos

From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 1 Mar 2013 22:16:37 +0530
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: Finding maximum across a row

Hi, Is there any function/method to find the maximum across a row in hive? Suppose I have a table like this:
ColA ColB ColC
2    5    7
3    2    1
I want the function to return:
7
1
It's urgently required. Any help would be greatly appreciated!
--
Thanks and Regards,
Sachin Sudarshana
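Without writing a UDF, a row-wise maximum over a fixed, known set of columns can also be expressed with plain CASE expressions. A sketch, assuming the three-column table from the question (the table name my_table is illustrative):

SELECT CASE
         WHEN cola >= colb AND cola >= colc THEN cola
         WHEN colb >= colc THEN colb
         ELSE colc
       END AS row_max
FROM my_table;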
Re: Hive queries
Hi Cyril, I believe you are using the derby metastore, in which case it should be an issue with the hive configs. Derby tries to create a metastore in the current dir from which you start hive. The tables imported by sqoop would be inside HIVE_HOME, and hence you are not able to see the tables when starting the hive CLI from other locations. To have a universal metastore db, configure a specific dir in javax.jdo.option.ConnectionURL in hive-site.xml; in your connection url, configure the db name as databaseName=/home/hive/metastore_db.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Cyril Bogus cyrilbo...@gmail.com
Date: Mon, 25 Feb 2013 10:34:29
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hive queries

I do not get any errors. It is only when I run hive and try to query the tables I imported. Let's say I want to get only numeric tuples for a given table. I cannot find the table (show tables; is empty) unless I go into the hive home folder and run hive again. I would expect the state of hive to be the same everywhere I call it, but so far that is not the case.

On Mon, Feb 25, 2013 at 10:22 AM, Nitin Pawar nitinpawar...@gmail.com wrote:
Any errors you see?

On Mon, Feb 25, 2013 at 8:48 PM, Cyril Bogus cyrilbo...@gmail.com wrote:
Hi everyone, My setup is Hadoop 1.0.4, Hive 0.9.0, Sqoop 1.4.2-hadoop1.0.0, Mahout 0.7. I have imported tables from a remote database directly into Hive using Sqoop. Somehow, when I run Hive from outside Hadoop, Hive's bookkeeping of where the imported tables are located gives me trouble. I have a single node setup. Thank you for any answer, and you can ask questions if I was not specific enough about my issue.
Cyril
--
Nitin Pawar
Re: Security for Hive
Hi Austin, AFAIK, at the moment you can control permissions gracefully only at the data level, not at the metadata level; i.e. you can play with the hdfs permissions.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Austin Chungath austi...@gmail.com
Date: Fri, 22 Feb 2013 23:11:51
To: bejoy...@yahoo.com; user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: RE: Security for Hive

So that means any user can revoke or give permissions to any user for any table in the metastore?
Sent from my Phone, please ignore typos

From: bejoy...@yahoo.com
Sent: 22-02-2013 11:30 PM
To: user@hive.apache.org
Subject: Re: Security for Hive

Hi Sachin, Currently there is no such admin user concept in hive.
Regards
Bejoy KS
Sent from remote device, Please excuse typos

From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 22 Feb 2013 16:40:49 +0530
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: Re: Security for Hive

Hi, I have read about roles, user privileges, group privileges etc. But these roles can be created by any user for any database/table. I would like to know if there is a specific 'administrator' for hive who can log on with his credentials and is the only one entitled to create roles, grant privileges etc.
Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh jagatsi...@gmail.com wrote:
You might want to read this: https://cwiki.apache.org/Hive/languagemanual-auth.html

On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana sachin.sudarsh...@gmail.com wrote:
Hi, I have just started learning about hive. I have configured Hive to use mysql as the metastore instead of derby. If I wish to use the GRANT and REVOKE commands, I can use them with any user. A user can issue GRANT or REVOKE commands on any other user's table, since both users' tables are present in the same warehouse. Isn't there a concept of a superuser/admin in hive who alone has the authority to issue these commands?
Any answer is greatly appreciated!
--
Thanks and Regards,
Sachin Sudarshana
Re: Security for Hive
Hi Sachin, Currently there is no such admin user concept in hive.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Sachin Sudarshana sachin.sudarsh...@gmail.com
Date: Fri, 22 Feb 2013 16:40:49
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Security for Hive

Hi, I have read about roles, user privileges, group privileges etc. But these roles can be created by any user for any database/table. I would like to know if there is a specific 'administrator' for hive who can log on with his credentials and is the only one entitled to create roles, grant privileges etc.
Thank you.

On Fri, Feb 22, 2013 at 4:19 PM, Jagat Singh jagatsi...@gmail.com wrote:
You might want to read this: https://cwiki.apache.org/Hive/languagemanual-auth.html

On Fri, Feb 22, 2013 at 9:44 PM, Sachin Sudarshana sachin.sudarsh...@gmail.com wrote:
Hi, I have just started learning about hive. I have configured Hive to use mysql as the metastore instead of derby. If I wish to use the GRANT and REVOKE commands, I can use them with any user. A user can issue GRANT or REVOKE commands on any other user's table, since both users' tables are present in the same warehouse. Isn't there a concept of a superuser/admin in hive who alone has the authority to issue these commands?
Any answer is greatly appreciated!
--
Thanks and Regards,
Sachin Sudarshana
Re: Running Hive on multi node
Hi, Hive uses the hadoop installation specified in HADOOP_HOME. If your hadoop home is configured for fully distributed operation, it will utilize the cluster itself.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Hamza Asad hamza.asa...@gmail.com
Date: Thu, 21 Feb 2013 14:26:40
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Running Hive on multi node

Does hive automatically run on multiple nodes since I configured hadoop on multiple nodes, or do I have to configure it explicitly?
--
Muhammad Hamza Asad
Re: Adding comment to a table for columns
Hi Gupta, Try out DESCRIBE EXTENDED FORMATTED <table-name>. I vaguely recall an operation like this; please check the hive wiki for the exact syntax.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Chunky Gupta chunky.gu...@vizury.com
Date: Thu, 21 Feb 2013 17:15:37
To: user@hive.apache.org; bejoy...@yahoo.com; snehalata_bhas...@syntelinc.com
Reply-To: user@hive.apache.org
Subject: Re: Adding comment to a table for columns

Hi Bejoy, Bhaskar, I tried using FORMATTED, but it does not give me the comments which I put in while creating the table. Its output is like:
col_name   data_type   comment
c          string      from deserializer
time       string      from deserializer
Thanks,
Chunky.

On Thu, Feb 21, 2013 at 4:50 PM, bejoy...@yahoo.com wrote:
Hi Gupta, You can get the describe output in a formatted way using DESCRIBE FORMATTED <table name>;
Regards
Bejoy KS
Sent from remote device, Please excuse typos

From: Chunky Gupta chunky.gu...@vizury.com
Date: Thu, 21 Feb 2013 16:46:30 +0530
To: user@hive.apache.org
ReplyTo: user@hive.apache.org
Subject: Adding comment to a table for columns

Hi, I am using this syntax to add comments for all columns:
CREATE EXTERNAL TABLE test (
c STRING COMMENT 'Common class',
time STRING COMMENT 'Common time',
url STRING COMMENT 'Site URL'
)
PARTITIONED BY (dt STRING)
LOCATION 's3://BucketName/'
The output of DESCRIBE EXTENDED is like this (the output is just an example copied from the internet):
hive> DESCRIBE EXTENDED table_name;
Detailed Table Information Table(tableName:table_name, dbName:benchmarking, owner:root, createTime:1309480053, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:session_key, type:string, comment:null), FieldSchema(name:remote_address, type:string, comment:null), FieldSchema(name:canister_lssn, type:string, comment:null), FieldSchema(name:canister_session_id, type:bigint, comment:null), FieldSchema(name:tltsid, type:string, comment:null), FieldSchema(name:tltuid, type:string, comment:null), FieldSchema(name:tltvid, type:string, comment:null), FieldSchema(name:canister_server, type:string, comment:null), FieldSchema(name:session_timestamp, type:string, comment:null), FieldSchema(name:session_duration, type:string, comment:null), FieldSchema(name:hit_count, type:bigint, comment:null), FieldSchema(name:http_user_agent, type:string, comment:null), FieldSchema(name:extractid, type:bigint, comment:null), FieldSchema(name:site_link, type:string, comment:null), FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:hour, type:int, comment:null)], location:hdfs://hadoop2/user/hive/warehouse/benchmarking.db/table_name, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
Is there any way to get these detailed comments and column names in a readable format, just like the output of DESCRIBE <table_name>?
Thanks,
Chunky.
Re: bucketing on a column with millions of unique IDs
Hi Li, The major consideration you should give is to the size of each bucket. One bucket corresponds to one file in hdfs, and you should ensure that every bucket is at least a block in size, or in the worst case that at least the majority of the buckets are. So you should derive the bucket count from the data size rather than from the number of rows/records.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Echo Li echo...@gmail.com
Date: Wed, 20 Feb 2013 16:19:43
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: bucketing on a column with millions of unique IDs

Hi guys, I plan to bucket a table by userid as I'm going to do intense calculation using group by userid. There are about 110 million rows, with 7 million unique userids, so my question is: what is a good number of buckets for this scenario, and how do I determine the number of buckets? Any input is appreciated :)
Echo
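A sketch of the sizing arithmetic Bejoy describes. The row width and block size below are illustrative assumptions, not numbers from the thread:

-- assume ~110M rows at ~150 bytes/row, roughly 16.5 GB, and a 128 MB HDFS block size:
-- 16.5 GB / 128 MB is about 130, so something like 128 buckets keeps each bucket near one block.
CREATE TABLE user_stats (userid BIGINT, metric DOUBLE)
CLUSTERED BY (userid) INTO 128 BUCKETS;

SET hive.enforce.bucketing = true;  -- make inserts honor the declared bucket count
INSERT OVERWRITE TABLE user_stats SELECT userid, metric FROM raw_stats;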
Re: Map join optimization issue
Hi, In later versions of hive you actually don't need a map join hint in your query; just the following suffices:
SET hive.auto.convert.join=true;

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Mayuresh Kunjir mayuresh.kun...@gmail.com
Date: Fri, 15 Feb 2013 10:37:52
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Map join optimization issue

Thanks Aniket. I actually had not specified the map-join hint, though; sorry for providing the wrong information earlier. I had only set hive.auto.convert.join=true before firing my join query.
~Mayuresh

On Thu, Feb 14, 2013 at 10:44 PM, Aniket Mokashi aniket...@gmail.com wrote:
I think the hive.mapjoin.smalltable.filesize parameter will be disregarded in that case.

On Thu, Feb 14, 2013 at 7:25 AM, Mayuresh Kunjir mayuresh.kun...@gmail.com wrote:
Yes, the hint was specified.

On Feb 14, 2013 3:11 AM, Aniket Mokashi aniket...@gmail.com wrote:
Have you specified a map-join hint in your query?

On Thu, Feb 7, 2013 at 11:39 AM, Mayuresh Kunjir mayuresh.kun...@gmail.com wrote:
Hello all, I am trying to join two tables, the smaller being of size 4GB. When I set the hive.mapjoin.smalltable.filesize parameter above 500MB, Hive tries to perform a local task to read the smaller file. This of course fails since the file size is greater, and the backup common join is then run. What I do not understand is why Hive attempted a map join when the small file size was greater than the smalltable.filesize parameter.
~Mayuresh
--
...:::Aniket:::... Quetzalco@tl
Re: CREATE EXTERNAL TABLE Fails on Some Directories
Hi Joseph, There is a difference between the following ls commands:

[cloudera@localhost data]$ hdfs dfs -ls /715
This would list out all the contents of /715 in hdfs, if it were a dir.
Found 1 items
-rw-r--r-- 1 cloudera supergroup 7853975 2013-02-14 17:03 /715
The output clearly shows it is a file, as the 'd' is missing as the first character.

[cloudera@localhost data]$ hdfs dfs -ls 715
This lists the dir 715 in your user's hdfs home dir. If your user is cloudera, your home dir is usually /user/cloudera, so in effect the dir listed is /user/cloudera/715.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: Joseph D Antoni jdant...@yahoo.com
Date: Fri, 15 Feb 2013 08:55:50
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories

Not sure--I just truncated the file list from the ls--that was the first file (just obfuscated the name). The commands I used to create the directories were:
hdfs dfs -mkdir 715
then
hdfs dfs -put myfile.csv 715

[cloudera@localhost data]$ hdfs dfs -ls /715
Found 1 items
-rw-r--r-- 1 cloudera supergroup 7853975 2013-02-14 17:03 /715
[cloudera@localhost data]$ hdfs dfs -ls 715
Found 13 items
-rw-r--r-- 1 cloudera cloudera 7853975 2013-02-15 00:41 715/40-file.csv
Thanks

From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org; Joseph D Antoni jdant...@yahoo.com
Sent: Friday, February 15, 2013 11:50 AM
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories

Something's odd about this output; why is there no / in front of 715? I always get the full path when I run a -ls command. I would expect either:
/715/file.csv
or
/user/me/715/file.csv
Or is that what you meant by (didn't leave rest of ls results)?
dean

On Fri, Feb 15, 2013 at 10:45 AM, Joseph D Antoni jdant...@yahoo.com wrote:
[cloudera@localhost data]$ hdfs dfs -ls 715
Found 13 items
-rw-r--r-- 1 cloudera cloudera 7853975 2013-02-15 00:41 715/file.csv
(didn't leave rest of ls results)
Thanks on the directory--wasn't clear on that..
Joey

From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org; Joseph D Antoni jdant...@yahoo.com
Sent: Friday, February 15, 2013 11:37 AM
Subject: Re: CREATE EXTERNAL TABLE Fails on Some Directories

You confirmed that 715 is an actual directory? It didn't become a file by accident? By the way, you don't need to include the file name in the LOCATION; it will read all the files in the directory.
dean

On Fri, Feb 15, 2013 at 10:29 AM, Joseph D Antoni jdant...@yahoo.com wrote:
I'm trying to create a series of external tables for a time series of data (using the prebuilt Cloudera VM). The directory structure in HDFS is as such:
/711
/712
/713
/714
/715
/716
/717
Each directory contains the same set of files, from a different day. They were all put into HDFS using the following script:
for i in *;do hdfs dfs -put $i in $dir;done
They all show up with the same ownership/perms in HDFS. Going into Hive to build the tables, I built a set of scripts to do the loads--then did a sed (changing 711 to 712, 713, etc.) to a file for each day. All of my loads work, EXCEPT for 715 and 716. The script is as follows:
create external table 715_table_name (col1 string, col2 string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/715/file.csv';
This is failing with:
Error in Metadata MetaException(message:Got except: org.apache.hadoop.fs.FileAlreadyExistsException Parent Path is not a directory: /715 715...
Like I mentioned, it works for all of the other directories except 715 and 716. Thoughts on a troubleshooting path?
Thanks
Joey D'Antoni
--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
Re: LOAD HDFS into Hive
Hi Venkataraman, You can just create an external table and give it a location pointing to the hdfs dir where the data resides. No need to perform an explicit LOAD operation here.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: venkatramanan venkatraman...@smartek21.com
Date: Fri, 25 Jan 2013 18:30:29
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: LOAD HDFS into Hive

Hi, I need to load hdfs data into a Hive table. For example, I have twitter data that is updated daily using the streaming API. These twitter responses are stored in an HDFS path named 'TwitterData'. After that, I try to load the data into Hive using a 'LOAD DATA' statement. My problem is that the hdfs data is gone from its original location once I load it. Is there any way to load the data without losing the hdfs data?
I create the table using the statement below:
CREATE EXTERNAL TABLE Tweets (FromUserId String, Text string, FromUserIdString String, FromUser String, Geo String, Id BIGINT, IsoLangCode string, ToUserId INT, ToUserIdString string, CreatedAt string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
I load the data using the statement below:
LOAD DATA INPATH '/twitter_sample' INTO TABLE tweets;
Thanks in advance
Thanks,
Venkat
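A sketch of Bejoy's suggestion, reusing the schema from the question: pointing an external table's LOCATION at the existing directory leaves the files in place, whereas LOAD DATA INPATH moves them into the warehouse:

CREATE EXTERNAL TABLE tweets (
  FromUserId STRING, Text STRING, FromUserIdString STRING, FromUser STRING,
  Geo STRING, Id BIGINT, IsoLangCode STRING, ToUserId INT,
  ToUserIdString STRING, CreatedAt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/twitter_sample';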
Re: An explanation of LEFT OUTER JOIN and NULL values
Hi David An explain extended would give you the exact pointer. From my understanding, this is how it could work. You have two tables then two different map reduce job would be processing those. Based on the join keys, combination of corresponding columns would be chosen as key from mapper1 and mapper2. So if the combination of columns having the same value those records from two set of mappers would go into the same reducer. On the reducer if there is a corresponding value for a key from table 1 to table 2/mapper 2 that value would be populated. If no val for mapper 2 then those columns from table 2 are made null. If there is a key-value just from table 2/mapper 2 and no corresponding value from mapper 1. That value is just discarded. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: David Morel dmore...@gmail.com Date: Thu, 24 Jan 2013 18:03:40 To: user@hive.apache.orguser@hive.apache.org Reply-To: user@hive.apache.org Subject: An explanation of LEFT OUTER JOIN and NULL values Hi! After hitting the curse of the last reducer many times on LEFT OUTER JOIN queries, and trying to think about it, I came to the conclusion there's something I am missing regarding how keys are handled in mapred jobs. The problem shows when I have table A containing billions of rows with distinctive keys, that I need to join to table B that has a much lower number of rows. I need to keep all the A rows, populated with NULL values from the B side, so that's what a LEFT OUTER is for. Now, when transforming that into a mapred job, my -naive- understanding would be that for every key on the A table, a missing key on the B table would be generated with a NULL value. If that were the case, I fail to understand why all NULL valued B keys would end up on the same reducer, since the key defines which reducer is used, not the value. So, obviously, this is not how it works. So my question is: how is this construct handled? Thanks a lot! D.Morel
Re: An explanation of LEFT OUTER JOIN and NULL values
Hi David, The default partitioner used in map reduce is the hash partitioner, so keys are sent to a particular reducer based on their hash. Maybe in your current data set the keys that have no values in table B are all falling into the same hash bucket and hence being processed by the same reducer. If you are noticing skew on a particular reducer, sometimes a simple workaround like explicitly increasing the number of reducers might help you get past the hurdle. Also please ensure you have enabled skew join optimization.

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: David Morel dmore...@gmail.com
Date: Thu, 24 Jan 2013 18:39:56
To: user@hive.apache.org; bejoy...@yahoo.com
Reply-To: user@hive.apache.org
Subject: Re: An explanation of LEFT OUTER JOIN and NULL values

On 24 Jan 2013, at 18:16, bejoy...@yahoo.com wrote:
Hi David, An explain extended would give you the exact pointer. From my understanding, this is how it could work. You have two tables, so two different sets of mappers would be processing them. Based on the join keys, the combination of corresponding columns would be chosen as the key from mapper 1 and mapper 2. So if the combination of columns has the same value, those records from the two sets of mappers would go to the same reducer. On the reducer, if a key from table 1/mapper 1 has a corresponding value from table 2/mapper 2, that value is populated. If there is no value from mapper 2, the columns from table 2 are made null. If there is a key-value just from table 2/mapper 2 and no corresponding value from mapper 1, that value is simply discarded.

Hi Bejoy, Thanks! So schematically, something like this, right?
mapper1 (bigger table):
K1-A, V1A
K2-A, V2A
K3-A, V3A
mapper2 (joined, smaller table):
K1-B, V1B
reducer1:
K1-A, V1A
K1-B, V1B
returns: K1, V1A, V1B etc
reducer2:
K2-A, V2A
*no* K2-B, V
so: K2-B, NULL is created, same for next row.
K3-A, V3A
returns:
K2, V2A, NULL etc
K3, V3A, NULL etc
I still don't understand why my reducer2 (and only this one, which apparently gets all the keys for which we don't have a row in table B) would become overloaded. Am I completely misunderstanding the whole thing?
David

Regards
Bejoy KS
Sent from remote device, Please excuse typos

-Original Message-
From: David Morel dmore...@gmail.com
Date: Thu, 24 Jan 2013 18:03:40
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: An explanation of LEFT OUTER JOIN and NULL values

Hi! After hitting the 'curse of the last reducer' many times on LEFT OUTER JOIN queries, and trying to think about it, I came to the conclusion there's something I am missing regarding how keys are handled in mapred jobs. The problem shows when I have table A containing billions of rows with distinctive keys, that I need to join to table B that has a much lower number of rows. I need to keep all the A rows, populated with NULL values from the B side, so that's what a LEFT OUTER is for. Now, when transforming that into a mapred job, my (naive) understanding would be that for every key in the A table, a missing key in the B table would generate a NULL value. If that were the case, I fail to understand why all NULL-valued B keys would end up on the same reducer, since the key defines which reducer is used, not the value. So, obviously, this is not how it works. So my question is: how is this construct handled?
Thanks a lot!
D.Morel
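A sketch of the two workarounds Bejoy lists, as hive session statements; the property names are the standard Hive/Hadoop ones, and the values are illustrative:

SET mapred.reduce.tasks = 100;       -- raise the reducer count explicitly
SET hive.optimize.skewjoin = true;   -- enable skew join optimization
SET hive.skewjoin.key = 100000;      -- rows per key beyond which a key is treated as skewed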
Re: Mapping HBase table in Hive
Hi Ibrahim. SQOOP is the tool used to import data from an rdbms into hbase in your case. Please get the schema of your corresponding table from hbase and post it here; we can then point out how your mapping should look. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Ibrahim Yakti iya...@souq.com Date: Sun, 13 Jan 2013 11:22:51 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Re: Mapping HBase table in Hive Thanks Bejoy, what do you mean by: If you need to map a full CF to a hive column, the data type of the hive column should be a Map. Suppose I used sqoop to move data from mysql to hbase and used id as a column family; all the other columns will be qualifiers then, right? The integration document is not clear, I think it needs more clarification, or maybe I am still missing something. -- Ibrahim On Tue, Jan 8, 2013 at 9:35 PM, bejoy...@yahoo.com wrote: data type of
Re: Map Reduce Local Task
Hi Santosh, As long as the smaller table's size is in the range of a few MBs, it is a good candidate for a map join. If the smaller table is larger than that, you can take a look at bucketed map joins. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Santosh Achhra santoshach...@gmail.com Date: Wed, 9 Jan 2013 00:11:37 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Re: Map Reduce Local Task Thank you Dean, One of our tables is very small, it has only 16,000 rows, and the other big table has 45 million plus records. Won't doing a local task help in this case? Good wishes, always! Santosh On Tue, Jan 8, 2013 at 11:59 PM, Dean Wampler dean.wamp...@thinkbiganalytics.com wrote: more aggressive about trying to convert a join to a local task, where it bypasses the job tracker. When you're experimenting with queries on a small data set, it can make things much faster, but won't be useful for large data sets where you need the cluster.
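For a 16,000-row table this would typically be enabled as below (a sketch with hypothetical table names; on recent Hive versions the automatic conversion picks the small side by itself):

set hive.auto.convert.join=true;
-- or force the small table explicitly with a hint:
SELECT /*+ MAPJOIN(small_t) */ big_t.f1, small_t.f2
FROM big_t JOIN small_t ON (big_t.k = small_t.k);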
Re: Mapping HBase table in Hive
Hi Ibrahim, The hive hbase integration depends entirely on the hbase table schema, not the schema of the source table in mysql. You need to provide the column family:qualifier mapping in there. Get the hbase table's schema from the hbase shell. Suppose you have the schema as Id CF1.qualifier1 CF1.qualifier2 CF1.qualifier3 You need to match each of these ColumnFamily:Qualifier pairs to the corresponding columns in hive. So in hbase.columns.mapping you need to provide these CF:QL pairs in order. If you need to map a full CF to a hive column, the data type of the hive column should be a Map. You can get a detailed hbase to hive integration document from the hive wiki. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Ibrahim Yakti iya...@souq.com Date: Tue, 8 Jan 2013 15:45:32 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Mapping HBase table in Hive Hello, suppose I have the following table (orders) in MySQL: *** 1. row *** Field: id Type: int(10) unsigned Null: NO Key: PRI Default: NULL Extra: auto_increment *** 2. row *** Field: value Type: int(10) unsigned Null: NO Key: Default: NULL Extra: *** 3. row *** Field: date_lastchange Type: timestamp Null: NO Key: Default: CURRENT_TIMESTAMP Extra: on update CURRENT_TIMESTAMP *** 4. row *** Field: date_inserted Type: timestamp Null: NO Key: Default: 0000-00-00 00:00:00 I imported it into HBase with column family id. I want to create an external table in Hive to query the HBase table, but I am not able to get the mapping parameters (*hbase.columns.mapping*) right; it is confusing, if anybody can explain it to me please. I used the following query: CREATE EXTERNAL TABLE hbase_orders(id bigint, value bigint, date_lastchange string, date_inserted string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES (hbase.columns.mapping = ? ? ? ? ? ?) TBLPROPERTIES (hbase.table.name = orders); Is there any way to build the Hive tables automatically, or should I go through the same process for each table? Thanks in advance. -- Ibrahim
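A sketch of what the finished DDL could look like, assuming the non-key columns ended up under a single column family (called cf1 here purely for illustration); :key maps the HBase row key:

CREATE EXTERNAL TABLE hbase_orders (
  id bigint, value bigint, date_lastchange string, date_inserted string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf1:value,cf1:date_lastchange,cf1:date_inserted")
TBLPROPERTIES ("hbase.table.name" = "orders");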
Re: View with map join fails
Looks like there is a bug with mapjoin + view. Please check the hive jira to see if there is an issue open against this; else file a new jira. From my understanding, when you enable map join, the hive parser creates backup jobs. These backup jobs are executed only if the map join fails. In normal cases, when the map join succeeds, these jobs are filtered out and not executed: '1116112419, job is filtered out (removed at runtime).' Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Santosh Achhra santoshach...@gmail.com Date: Tue, 8 Jan 2013 17:11:18 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: View with map join fails Hello, I have created a view as shown below. *CREATE VIEW V1 AS* *select /*+ MAPJOIN(t1) ,MAPJOIN(t2) */ t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3 from TABLE1 t1 join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 = t2.f3 and t1.f4 = t2.f4 ) group by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3* The view gets created successfully; however, when I execute the below mentioned SQL, or any SQL on the view, I get a NULLPOINTER exception error *hive select count (*) from V1;* *FAILED: NullPointerException null* *hive* Is there anything wrong with the view creation? Next I created the view without MAPJOIN hints *CREATE VIEW V1 AS* *select t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3 from TABLE1 t1 join TABLE t2 on ( t1.f2= t2.f2 and t1.f3 = t2.f3 and t1.f4 = t2.f4 ) group by t1.f1, t1.f2, t1.f3, t1.f4, t2.f1, t2.f2, t2.f3* Before executing the select SQL I execute *set hive.auto.convert.join=true; * I am getting the below mentioned warnings java.lang.InstantiationException: org.apache.hadoop.hive.ql.parse.ASTNodeOrigin Continuing ... java.lang.RuntimeException: failed to evaluate: unbound=Class.new(); Continuing ... And I see from the log that in total 5 mapreduce jobs are started; however, when I don't set auto.convert.join to true, I see only 3 mapreduce jobs getting invoked. *Total MapReduce jobs = 5* *Ended Job = 1116112419, job is filtered out (removed at runtime).* *Ended Job = -33256989, job is filtered out (removed at runtime).* *WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.* Good wishes, always! Santosh
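Until the jira is resolved, one possible workaround (a sketch, not a confirmed fix) is to query the hint-free view with automatic conversion switched off, so no backup jobs are generated:

set hive.auto.convert.join=false;
SELECT COUNT(*) FROM V1;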
Re: External table with partitions
Hi Oded, If you have created the directories manually, they become visible to the hive table only if the partitions/sub dirs are added to the metadata using 'ALTER TABLE ... ADD PARTITION'. Partitions are not picked up implicitly by a hive table even if you have a proper sub dir structure. Similarly, if you don't need a particular partition on your table permanently, you can always delete it using the alter table command. If you intend to use a particular partition alone in your query, there is no need to alter the partitions; just append a where clause to the query that has scope only over the required partitions. Hope this helps. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Oded Poncz o...@ubimo.com Date: Sun, 6 Jan 2013 16:07:26 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: External table with partitions Is it possible to instruct hive to get only specific files from a partitioned external table? For example I have the following directory structure data/dd=2012-12-31/a1.txt data/dd=2012-12-31/a2.txt data/dd=2012-12-31/a3.txt data/dd=2012-12-31/a4.txt data/dd=2012-12-31/b1.txt data/dd=2012-12-31/b2.txt data/dd=2012-12-31/b2.txt Is it possible to add 2012-12-31 as a partition and tell hive to load only the a* files to the table? Thanks,
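Registering the directory and pruning by partition would look roughly like this (the table name is hypothetical):

ALTER TABLE mytable ADD PARTITION (dd='2012-12-31') LOCATION '/data/dd=2012-12-31';
SELECT * FROM mytable WHERE dd = '2012-12-31';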
Re: External table with partitions
Sorry, I didn't understand your query on first look through. Like Jagat said, you may need to go with a temp table for this. Do a hadoop fs -cp ../../a.* destn dir Create an external table with location as 'destn dir'. CREATE EXTERNAL TABLE tmp table name LIKE src table name LOCATION '' ; NB: I just gave the syntax from memory; please check the syntax in the hive user guide. Regards Bejoy KS Sent from remote device, Please excuse typos
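Hive's CREATE TABLE ... LIKE does accept a LOCATION clause, so a working version of the sketch above, using hypothetical paths and table names and doing the copy from the Hive CLI, could be:

dfs -cp /data/dd=2012-12-31/a* /data/a_only;
CREATE EXTERNAL TABLE tmp_a_files LIKE src_table LOCATION '/data/a_only';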
Re: Map side join
Hi Souvik, Are your input files compressed using some non-splittable compression codec? Do you have enough free slots while this job is running? Make sure that the job is not running locally. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Souvik Banerjee souvikbaner...@gmail.com Date: Wed, 12 Dec 2012 14:27:27 To: user@hive.apache.org; bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Map side join Hi Bejoy, Yes I ran the pi example. It was fine. Regarding the HIVE job, what I found is that it took 4 hrs for the first map job to get completed. Those map tasks were doing their job and only reported status after completion. It is indeed taking too long to finish. I could find nothing relevant in the logs. Thanks and regards, Souvik.
Re: Map side join
Hi Souvik, To have the new hdfs block size take effect on already existing files, you need to re-copy them into hdfs. To play with the number of mappers you can set a lower value, like 64 MB, for the min and max split sizes: mapred.min.split.size and mapred.max.split.size. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Souvik Banerjee souvikbaner...@gmail.com Date: Thu, 13 Dec 2012 12:00:16 To: user@hive.apache.org; bejoy...@yahoo.com Subject: Re: Map side join Hi Bejoy, The input files are non-compressed text files. There are enough free slots in the cluster. Can you please let me know how I can increase the no of mappers? I tried reducing the HDFS block size to 32 MB from 128 MB. I was expecting to get more mappers, but it's still launching the same no of mappers as it was doing while the HDFS block size was 128 MB. I have enough map slots available, but I am not able to utilize them. Thanks and regards, Souvik.
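The split-size knobs Bejoy names would be set as below (64 MB shown, values illustrative; with a 512 MB input this should yield roughly 8 map tasks):

set mapred.min.split.size=67108864;
set mapred.max.split.size=67108864;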
Re: Map side join
Hi Souvik, Apart from hive jobs, are normal mapreduce jobs like the wordcount example running fine on your cluster? If they are working, then for the hive jobs are you seeing anything suspicious in the task, tasktracker or jobtracker logs? Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Souvik Banerjee souvikbaner...@gmail.com Date: Tue, 11 Dec 2012 17:12:20 To: user@hive.apache.org; bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Map side join Hello Everybody, Need help on a HIVE join. As we were talking about the map side join, I tried that. I set the flag set hive.auto.convert.join=true; I saw Hive converts the join to a map join while launching the job. But the problem is that none of the map tasks progress in my case. I made the dataset smaller; now it's only 512 MB cross 25 MB. I was expecting it to be done very quickly. No luck with any change of settings. Failing to progress with the default settings, I changed these settings: set hive.mapred.local.mem=1024; // Initially it was 216 I guess set hive.join.cache.size=10; // Initially it was 25000 Also on the Hadoop side I made this change: mapred.child.java.opts -Xmx1073741824 But I don't see any progress. After more than 40 minutes of run I am at 0% map completion state. Can you please throw some light on this? Thanks a lot once again. Regards, Souvik.
Re: Map side join
Hi Souvik, In earlier versions of hive you had to give the map join hint, but in later versions you just set hive.auto.convert.join = true; and Hive automatically selects the smaller table. It is better to give the smaller table as the first one in the join. You can use a map join if you are joining a small table with a large one in terms of data size; by small, it's better to have the smaller table size in the range of MBs. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Souvik Banerjee souvikbaner...@gmail.com Date: Fri, 7 Dec 2012 13:58:25 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Map side join Hello everybody, I have got a question and I didn't come across any post which says something about this. I have got two tables, let's say A and B, and I want to join A and B in HIVE. I am currently using HIVE 0.9. The join would be on a few columns, like on (A.id1 = B.id1) AND (A.id2 = B.id2) AND (A.id3 = B.id3) Can I ask HIVE to use a map side join in this scenario? Should I give a hint to HIVE by saying /*+mapjoin(B)*/ Get back to me if you want any more information in this regard. Thanks and regards, Souvik.
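To make the multi-column case concrete: map joins do work with compound equi-join conditions, so either form below should apply (a sketch using Souvik's column names):

SELECT /*+ MAPJOIN(B) */ A.id1, A.id2, A.id3
FROM A JOIN B ON (A.id1 = B.id1 AND A.id2 = B.id2 AND A.id3 = B.id3);
-- or let Hive convert it automatically:
set hive.auto.convert.join=true;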
Re: Doubt in INSERT query in Hive?
Hi Bhavesh, INSERT INTO is supported in hive 0.8; an upgrade would get things rolling. LOAD DATA inefficient? What was the performance overhead you were facing here? Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Bhavesh Shah bhavesh25s...@gmail.com Date: Wed, 15 Feb 2012 14:33:29 To: user@hive.apache.org; d...@hive.apache.org Reply-To: user@hive.apache.org Subject: Doubt in INSERT query in Hive? Hello, Whenever we want to insert into a table we use: INSERT OVERWRITE TABLE TBL_NAME (SELECT ) Because of this, the table gets overwritten every time. I don't want to overwrite the table, I want to append to it every time. I thought about LOAD TABLE, but writing the file may take more time and I don't think it will be efficient. Does Hive support INSERT INTO TABLE TAB_NAME? (I am using hive-0.7.1) Is there any patch for it? (But I don't know how to apply a patch.) Please suggest as soon as possible. Thanks. -- Regards, Bhavesh Shah
Re: Doubt in INSERT query in Hive?
Bhavesh, In this case, if you are not using INSERT INTO, you may need some tmp table: write the query output to that, then load the data from there into your target table's data dir. You are not writing that to any file while doing the LOAD DATA operation; rather, you are just moving the files (in hdfs) from the source location to the table's data dir (where the previous data files are present). An hdfs move is just a metadata operation at the file system level. Go with INSERT INTO, as it is the cleaner way from an hql perspective. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Bhavesh Shah bhavesh25s...@gmail.com Date: Wed, 15 Feb 2012 15:03:07 To: user@hive.apache.org; bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Doubt in INSERT query in Hive? Hi Bejoy K S, Thanks for your reply. The overhead is that in the select query I have nearly 85 columns. Writing these to a file and loading them again may take some time. For that reason I am thinking that it will be inefficient. -- Regards, Bhavesh Shah
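A sketch of both routes, with hypothetical table names; on 0.8+ the first statement alone is enough, and pre-0.8 the tmp-table route relies on LOAD DATA INPATH being a metadata-only move within HDFS:

-- hive 0.8+:
INSERT INTO TABLE target_tbl SELECT * FROM source_tbl;
-- pre-0.8 workaround:
INSERT OVERWRITE TABLE tmp_tbl SELECT * FROM source_tbl;
LOAD DATA INPATH '/user/hive/warehouse/tmp_tbl' INTO TABLE target_tbl;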
Re: parallel inserts ?
Hi John, Yes, inserts are parallel by default in hive. HiveQL gets transformed into mapreduce jobs and hence it is definitely parallel; the only case it is not parallel is when you have just 1 reducer. The insert just reads and processes the input files from the source table's data dir in parallel using map reduce tasks and writes the desired output files to the destination table's dir. Hive is just an abstraction over map reduce and can't be compared against a db in terms of features; almost every data processing operation is just some map reduce jobs. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: John B johnb4...@gmail.com Date: Wed, 15 Feb 2012 10:59:09 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: parallel inserts ? Other sql databases typically can parallelize selects but are unable to automatically parallelize inserts. With the most recent stable hiveql, will the following statement have the --insert-- automatically parallelized? INSERT OVERWRITE TABLE pv_gender SELECT pv_users.gender FROM pv_users I understand there is now 'insert into .. select from' syntax. Is the insert part of that statement automatically parallelized? What is the highest insert speed anybody has seen - and I am not talking about imports, I mean inserts from one table to another?
Re: external partitioned table
Hi Koert, As you are creating dirs/sub dirs using mapreduce jobs outside of hive, hive is unaware of these sub dirs. There is no way around it in such cases other than an add partition DDL to register each dir with a hive partition. If you are using oozie or shell to trigger your jobs, you can accomplish it as follows: use java to come up with the correct add partition statement(s) and write them into a file, then execute the file using hive -f fileName (see the sketch after this message). Hope it helps! Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Koert Kuipers ko...@tresata.com Date: Wed, 8 Feb 2012 11:04:18 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: external partitioned table hello all, we have an external partitioned table in hive. we add to this table by having map-reduce jobs (so not from hive) create new subdirectories with the right format (partitionid=partitionvalue). however hive doesn't pick them up automatically. we have to go into hive shell and run alter table sometable add partition (partitionid=partitionvalue). to make matters worse hive doesn't really lend itself to running such an add-partition-operation from java (or for that matter: hive doesn't lend itself to any easy programmatic manipulations... grrr. but i will stop now before i go on a rant). any suggestions how to approach this? thanks! best, koert
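A sketch of that flow, using Koert's own table and partition names; the generated file (add_parts.hql is a hypothetical name) is then run non-interactively:

-- contents of add_parts.hql, generated by the java step:
ALTER TABLE sometable ADD PARTITION (partitionid='partitionvalue');
-- executed from the shell as: hive -f add_parts.hql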
Re: Error when Creating an UDF
Hi, One of your jars is not available, and maybe that one has the required UDF or related methods. Hive was not able to locate your first jar: '/scripts/hiveMd5.jar does not exist'. Just fix this with the correct location and everything should work fine. Regards Bejoy K S From handheld, Please excuse typos. -Original Message- From: Jean-Charles Thomas jctho...@autoscout24.com Date: Mon, 6 Feb 2012 16:51:58 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Error when Creating an UDF Hi everybody, i am trying to create an UDF following the example in the Hive Wiki. Everything is fine except the CREATE statement (see below) where an error occurs: hive add jar /scripts/hiveMd5.jar; /scripts/hiveMd5.jar does not exist hive add jar /scripts/hive/udf/Md5.jar; Added /scripts/hive/udf/Md5.jar to class path Added resource: /scripts/hive/udf/Md5.jar hive CREATE TEMPORARY FUNCTION mymd5 AS 'com.autoscout24.hive.udf.Md5'; FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask hive in the Hive log, there is not much more: 2012-02-06 16:16:36,096 ERROR ql.Driver (SessionState.java:printError(343)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask Any help is welcome, thanks a lot, Jean-Charles
Re: Important Question
Real time? Definitely not hive. Go in for HBase, but don't expect HBase to be as flexible as an RDBMS; you need to choose your row key and column families wisely as per your requirements. For data mining and analytics you can mount a Hive table over the corresponding HBase table and play on with SQL-like queries. Regards Bejoy K S -Original Message- From: Dalia Sobhy dalia.mohso...@hotmail.com Date: Wed, 25 Jan 2012 17:01:08 To: u...@hbase.apache.org; user@hive.apache.org Reply-To: user@hive.apache.org Subject: Important Question Dear all, I am developing an API for medical use, i.e. hospital admissions and all about patients, thus transactions, queries and realtime data are important here... Therefore both real-time and analytical processing are a must. So which best suits my application: HBase or Hive or another method? Please reply quickly bec this is critical thxxx a million ;)
Re: Question on bucketed map join
Corrected a few typos in previous mail. Hi Avrila, AFAIK the bucketed map join is not the default in hive; it happens only when the configuration parameter hive.optimize.bucketmapjoin is set to true. You may be getting the same execution plan because hive.optimize.bucketmapjoin is set to true in the hive configuration xml file. To cross-confirm, could you explicitly set this to false (set hive.optimize.bucketmapjoin = false;) in your hive session and get the query execution plan from the explain command? Please find some pointers in line. 1. Should I see sth different in the explain extended output if I set and unset the hive.optimize.bucketmapjoin option? [Bejoy] Yes, you should be seeing different plans for both. Try EXPLAIN on your join query after setting set hive.optimize.bucketmapjoin = false; 2. Should I see something different in the output of hive while running the query if again I set and unset the hive.optimize.bucketmapjoin? [Bejoy] No, the Hive output should be the same. Whatever the execution plan for a join, optimally the end result should be the same. 3. Is it possible that even though I set bucketmapjoin to true, Hive will still perform a normal map-side join for some reason? How can I check if this has actually happened? [Bejoy] Hive would perform a plain map side join only if the following parameter is enabled (by default it is disabled): set hive.auto.convert.join = true; You need to check this value in your configuration. If it is enabled, irrespective of the table size hive would always try a map join first; it would fall back to a normal join only after the map join attempt fails. AFAIK, if the number of buckets is the same or a multiple between the two tables involved in the join, and if the join is on the same columns that are bucketed, then with bucketmapjoin enabled it shouldn't execute a plain mapside join but a bucketed map side join would be triggered. Hope it helps! Regards Bejoy K S From: Avrilia Floratou flora...@cs.wisc.edu To: user@hive.apache.org Sent: Thursday, January 19, 2012 9:23 PM Subject: Question on bucketed map join Hi, I have two tables with 8 buckets each on the same key and want to join them. I ran explain extended and get the plan produced by HIVE which shows that a map-side join is a possible plan. I then set in my script the hive.optimize.bucketmapjoin option to true and reran the explain extended query. I get the exact same plans as output. I ran the query with and without the bucketmapjoin optimization and saw no difference in the running time. I have the following questions: 1. Should I see sth different in the explain extended output if I set and unset the hive.optimize.bucketmapjoin option? 2. Should I see something different in the output of hive while running the query if again I set and unset the hive.optimize.bucketmapjoin? 3. Is it possible that even though I set bucketmapjoin to true, Hive will still perform a normal
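For reference, the layout the thread assumes would look like this sketch (table names are hypothetical); both tables must be bucketed on the join key, here into 8 buckets, for the bucketed map join to kick in:

CREATE TABLE t1 (k INT, v STRING) CLUSTERED BY (k) INTO 8 BUCKETS;
CREATE TABLE t2 (k INT, v STRING) CLUSTERED BY (k) INTO 8 BUCKETS;
set hive.enforce.bucketing=true;      -- when populating the bucketed tables
set hive.optimize.bucketmapjoin=true; -- then run the join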
Re: Insert based on whether string contains
I agree with Matt on that aspect. The solution proposed by me was purely based on the sample data provided, where there were 3-digit comma separated values. If there is a chance of 4-digit values as well in event_list, you may need to revisit the solution. Regards Bejoy K S -Original Message- From: Tucker, Matt matt.tuc...@disney.com Date: Wed, 4 Jan 2012 08:56:44 To: user@hive.apache.org; Bejoy Ks bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Insert based on whether string contains The find_in_set() UDF is a safer choice for doing a search for that value, as %239% could also match 2390, which has a different meaning in Omniture logs. On Jan 4, 2012, at 8:46 AM, Bejoy Ks bejoy...@yahoo.com wrote: Hi Dave If I get your requirement correctly, you need to load data into the video_plays_for_sept table FROM the omniture table only if omniture.event_list contains the string 239. Try the following query, it should work fine. INSERT OVERWRITE TABLE video_plays_for_sept SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region FROM omniture WHERE event_list LIKE '%239%'; Hope it helps! Regards, Bejoy.K.S From: Dave Houston r...@crankyadmin.net To: user@hive.apache.org Sent: Wednesday, January 4, 2012 6:41 PM Subject: Insert based on whether string contains Hi there, I have a string that has '239, 236, 232, 934' (not always in that order) and want to insert into another table if 239 is in the string. INSERT OVERWRITE TABLE video_plays_for_sept SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region from omniture where regexp_extract(event_list, '\d+') = 239; is what I have at the minute, but it always returns 0 Rows loaded to video_plays_for_sept Many thanks Dave Houston r...@crankyadmin.net
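Putting Matt's suggestion into the query would look like this sketch. find_in_set() matches whole comma-delimited entries, so '2390' no longer matches; and since Dave's sample carries spaces after the commas ('239, 236, ...'), they are stripped first with regexp_replace:

INSERT OVERWRITE TABLE video_plays_for_sept
SELECT concat(visid_high, visid_low), geo_city, geo_country, geo_region
FROM omniture
WHERE find_in_set('239', regexp_replace(event_list, ' ', '')) > 0;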
Re: Schemas/Databases in Hive
Ranjith, Hive does support multiple databases. If you are on one of the latest versions of hive, try: Create database testdb; Use testdb; It should give you what you are looking for. Regards Bejoy K S -Original Message- From: Raghunath, Ranjith ranjith.raghuna...@usaa.com Date: Thu, 22 Dec 2011 17:02:09 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Schemas/Databases in Hive What is the intent of having tables in different databases or schemas in Hive? Thank you, Ranjith
Re: Schemas/Databases in Hive
In addition, multiple databases have proved helpful for me in organizing tables into corresponding databases when you have quite a large number of tables to manage. I also believe it'd be helpful in providing access restrictions. Regards Bejoy K S
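A small sketch of how per-database organization plays out in queries (names hypothetical); tables can be referenced with a database qualifier from any current database:

CREATE DATABASE testdb;
USE testdb;
CREATE TABLE events (id INT, name STRING);
SELECT * FROM testdb.events;   -- qualified reference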
Re: Loading data into hive tables
Aditya, The answer is yes. SQOOP is the tool you are looking for. It has an import option to load data from any jdbc-compliant database into hive. It even creates the hive table for you by referring to the source db table. Hope it helps! Regards Bejoy K S -Original Message- From: Aditya Singh30 aditya_sing...@infosys.com Date: Fri, 9 Dec 2011 09:57:26 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Loading data into hive tables Hi, I want to know if there is any way to load data directly from some other DB, say Oracle/MySQL etc., into hive tables, without getting the data from the DB into a text/rcfile/sequence file in a specific format and then loading the data from that file into a hive table. Regards, Aditya
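A sketch of the Sqoop invocation; the connection string, credentials and table name are placeholders, and --hive-import is what creates the Hive table from the source table's metadata:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username dbuser --password dbpass \
  --table orders \
  --hive-import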
Re: Hive query failing on group by
Hi Mark, What do your map reduce job logs say? Try figuring out the error from there; from the hive CLI you can hardly find the root cause of such errors. From the job tracker web UI http://hostname:50030/jobtracker.jsp you can easily browse to the failed tasks and get the actual exception from there. If you are not able to figure it out from there, then please post those logs along with your table schema. Regards Bejoy K S -Original Message- From: Mark Kerzner mark.kerz...@shmsoft.com Date: Wed, 19 Oct 2011 09:06:13 To: Hive user user@hive.apache.org Reply-To: user@hive.apache.org Subject: Hive query failing on group by Hi, I am trying to figure out what I am doing wrong with this query and the unusual error I am getting. Also suspicious is the reduce % going up and down. select trans.property_id, day(trans.log_timestamp) from trans JOIN opts on trans.ext_booking_id[ext_booking_id] = opts.ext_booking_id group by day(trans.log_timestamp), trans.property_id; 2011-10-19 08:55:19,778 Stage-1 map = 0%, reduce = 0% 2011-10-19 08:55:22,786 Stage-1 map = 100%, reduce = 0% 2011-10-19 08:55:29,804 Stage-1 map = 100%, reduce = 33% 2011-10-19 08:55:32,811 Stage-1 map = 100%, reduce = 0% 2011-10-19 08:55:39,829 Stage-1 map = 100%, reduce = 33% 2011-10-19 08:55:43,839 Stage-1 map = 100%, reduce = 0% 2011-10-19 08:55:50,855 Stage-1 map = 100%, reduce = 33% 2011-10-19 08:55:54,864 Stage-1 map = 100%, reduce = 0% 2011-10-19 08:56:00,878 Stage-1 map = 100%, reduce = 33% 2011-10-19 08:56:04,887 Stage-1 map = 100%, reduce = 0% 2011-10-19 08:56:05,891 Stage-1 map = 100%, reduce = 100% Ended Job = job_201110111849_0024 with errors FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask Thank you, Mark
Re: Hive query failing on group by
Looks like a data problem. Were you using the GROUP BY query on the same data set? But if count(*) also throws an error, then it's back to square one: an installation/configuration problem with hive or map reduce. Regards Bejoy K S -Original Message- From: Mark Kerzner mark.kerz...@shmsoft.com Date: Wed, 19 Oct 2011 10:55:34 To: user@hive.apache.org; bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Hive query failing on group by Bejoy, I've been using this install of Hive for some time now, and simple queries and joins work fine. It's the GROUP BY that I have problems with, sometimes even with COUNT(*). I am trying to isolate the problem now, and reduce it to the smallest query possible. I am also trying to find a workaround (I noticed that sometimes rephrasing queries for Hive helps), since I need this for a project. Thank you, Mark On Wed, Oct 19, 2011 at 10:25 AM, bejoy...@yahoo.com wrote: Mark, To ensure your hive installation is fine, run two queries: SELECT * FROM trans LIMIT 10; SELECT * FROM trans WHERE ***; You can try this for a couple of different tables. If these queries return results and work fine as desired, then your hive should be working well. If so, as a second step, issue a simple join between two tables on primitive data type columns. If that also looks good, then you can mostly confirm that the bug is with your hive query; we can look in that direction then. Regards Bejoy K S From: Mark Kerzner mark.kerz...@shmsoft.com Date: Wed, 19 Oct 2011 10:02:57 -0500 To: user@hive.apache.org Subject: Re: Hive query failing on group by Vikas, I am using Cloudera CDHU1 on Ubuntu. I get the same results on RedHat CDHU0. Mark On Wed, Oct 19, 2011 at 9:47 AM, Vikas Srivastava vikas.srivast...@one97.net wrote: install hive with RPM, this one is corrupted!! On Wed, Oct 19, 2011 at 8:01 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Here is what my hive logs say hive -hiveconf hive.root.logger=DEBUG 2011-10-19 09:24:35,148 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.resources but it cannot be resolved. 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.core.runtime but it cannot be resolved. 2011-10-19 09:24:35,150 ERROR DataNucleus.Plugin (Log4JLogger.java:error(115)) - Bundle org.eclipse.jdt.core requires org.eclipse.text but it cannot be resolved.
Re: upgrading hadoop package
Hi Li, AFAIK 0.21 is not really a stable version of hadoop, so if this upgrade is on a production cluster it'd be better to go with 0.20.203. Regards Bejoy K S -Original Message- From: Shouguo Li the1plum...@gmail.com Date: Thu, 1 Sep 2011 11:41:46 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: upgrading hadoop package hey guys, i'm planning to upgrade my hadoop cluster from 0.20.1 to 0.21 to take advantage of the new bz2 splitting feature. i found a simple upgrade guide, http://wiki.apache.org/hadoop/Hadoop_Upgrade but i can't find anything that's related to hive. do we need to do anything for hive? is the migration transparent to hive? thx!
Re: Re:Re: Re: RE: Why a sql only use one map task?
Hi Daniel, In the hadoop ecosystem the number of map tasks is actually decided by the job, basically based on the number of input splits. Setting mapred.map.tasks doesn't guarantee that only that many map tasks are triggered. What worked for you here is that you specified that a map task should process a minimum data volume by setting a value for mapred.min.split.size. So in your case there really were 9 input splits, but when you imposed a constraint on the minimum data a map task should handle, the number of map tasks came down to 3. Regards Bejoy K S -Original Message- From: Daniel,Wu hadoop...@163.com Date: Thu, 25 Aug 2011 20:02:43 To: user@hive.apache.org Reply-To: user@hive.apache.org Subject: Re:Re:Re: Re: RE: Why a sql only use one map task? After I set set mapred.min.split.size=2; it kicks off 3 map tasks (the file I have is 500M). So it looks like we need to set mapred.min.split.size instead of mapred.map.tasks to control how many maps to kick off. At 2011-08-25 19:38:30, Daniel,Wu hadoop...@163.com wrote: It works after I set it as you said, but it looks like I can't control the map task count; it always uses 9 maps, even if I set set mapred.map.tasks=2; [job tracker task table: map 100.00% complete, 9 tasks, 9 complete; reduce 100.00% complete, 1 task, 1 complete] At 2011-08-25 06:35:38, Ashutosh Chauhan hashut...@apache.org wrote: This may be because CombineHiveInputFormat is combining your splits in one map task. If you don't want that to happen, do: hive set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat 2011/8/24 Daniel,Wu hadoop...@163.com I pasted the information below; the map capacity is 6. And no matter how I set mapred.map.tasks, such as 3, it doesn't work, as it always uses 1 map task (please see the completed job information). [cluster summary and completed-jobs tables lost in extraction; the recoverable facts: total map task capacity is 6, and each completed job (select count(*) from test; select period_key,count(*) from ... group by period_key) ran with 1 map task and 1 or 3 reduce tasks] At 2011-08-24 18:19:38, wd w...@wdicc.com wrote: What about your total Map Task Capacity? you may check it from http://your_jobtracker:50030/jobtracker.jsp 2011/8/24 Daniel,Wu hadoop...@163.com: I checked my settings, all have the default values. So per the book Hadoop: The Definitive Guide, the split size should be 64M, and the file size is about 500M, so that's about 8 splits. And from the map job information (after the map job is done), I can see it gets 8 splits from one node. But anyhow it starts only one map task.
At 2011-08-24 02:28:18, Aggarwal, Vaibhav vagg...@amazon.com wrote: If you actually have splittable files, you can create more splits by setting mapred.max.split.size appropriately. Thanks Vaibhav From: Daniel,Wu [mailto:hadoop...@163.com] Sent: Tuesday, August 23, 2011 6:51 AM To: hive Subject: Why a sql only use one map task? I run the following simple sql: select count(*) from sales; And the job information shows it only uses one map task. The underlying hadoop has 3 data nodes, so I expect hive to kick off 3 map tasks, one on each node. What can make hive run only one map task? Do I need to set something to kick off multiple map tasks? In my config, I didn't change the hive config.
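The arithmetic behind this thread, as a sketch using Daniel's numbers (the exact values below are illustrative): a 500 MB file with a 64 MB split size gives about 8 splits, hence 8 potential map tasks; raising the minimum split size shrinks that count, capping the maximum grows it:

set mapred.min.split.size=200000000;  -- ~200 MB minimum per split: a 500 MB file -> ~3 map tasks
set mapred.max.split.size=67108864;   -- alternatively, cap splits at 64 MB to get more map tasks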
Re: Hive crashing after an upgrade - issue with existing larger tables
A small correction to my previous post: the CDH version is CDH u1, not u0. Sorry for the confusion. Regards Bejoy K S -Original Message- From: Bejoy Ks bejoy...@yahoo.com Date: Thu, 18 Aug 2011 05:51:58 To: hive user group user@hive.apache.org Reply-To: user@hive.apache.org Subject: Hive crashing after an upgrade - issue with existing larger tables Hi Experts, I was working on hive with larger volume data on hive 0.7. Recently my hive installation was upgraded to 0.7.1. After the upgrade I'm having a lot of issues with queries that were already working fine with larger data. Queries that took seconds to return results are now taking hours, and for most larger tables even the map reduce jobs are not getting triggered. Queries like Select * and describe are working fine since they don't involve any map reduce jobs. For the jobs that didn't even get triggered, I got the following error from the job tracker: Job initialization failed: java.io.IOException: Split metadata size exceeded 1000. Aborting job job_201106061630_6993 at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48) at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:807) at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:701) at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4013) at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Looks like some metadata issue. My cluster is on CDH3-u0. Has anyone faced similar issues before? Please share your thoughts on what could be the probable cause of the error. Thank You
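That limit is governed by a jobtracker-side property; a commonly cited remedy (an assumption here, not something confirmed in this thread) is to raise or disable it in mapred-site.xml and restart the JobTracker:

<property>
  <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
  <value>-1</value>  <!-- -1 disables the split metadata size check -->
</property>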
Re: why need to copy when run a sql with a single map
Hi, Hive queries are parsed into hadoop map reduce jobs. In map reduce jobs, between the map and reduce tasks there are two phases, the copy phase and the sort phase, together known as sort and shuffle. So the copy task indicated in the hive job here should be the copy phase of map reduce: it copies the map output from the map task nodes to the corresponding reduce task nodes. Regards Bejoy K S -Original Message- From: Daniel,Wu hadoop...@163.com Date: Wed, 10 Aug 2011 20:07:48 To: hive user@hive.apache.org Reply-To: user@hive.apache.org Subject: why need to copy when run a sql with a single map I run a single query like select retailer_key,count(*) from records group by retailer_key; It uses a single map as shown below; since the file is already on HDFS, I think hadoop/hive doesn't need to copy anything. [job tracker task table: map 100.00% complete, 1 task complete; reduce 100.00% complete, 1 task complete] But the final chart in the job report shows copy takes about 33% of the total time, and the rest is sort and reduce. So why does it copy here, or does copy mean something else? oracle@oracle-MS-7623:~/test$ hadoop fs -lsr / drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user drwxr-xr-x - oracle supergroup 0 2011-08-10 19:46 /user/hive drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse drwxr-xr-x - oracle supergroup 0 2011-08-10 19:59 /user/hive/warehouse/records -rw-r--r-- 1 oracle supergroup 41600256 2011-08-10 19:59 /user/hive/warehouse/records/test.txt
Hive or pig for sequential iterations like those using foreach
Hi, I've been successfully using hive for the past few projects. Now, for a particular use case, I'm a bit confused about what to choose: Hive or Pig. My project involves a step-by-step sequential workflow. In every step I retrieve some values based on some query, use these values as input to new queries iteratively (similar to the foreach implementation in Pig), and so on. Is hive a good choice here when I have 11 operations in sequence as described? The second confusion for me is: does hive support 'foreach'-equivalent functionality? Please advise. I'm from a JAVA background, not much into db development, so I'm not sure of any such concepts in SQL. Thanks Regards Bejoy K S
Re: Hive or pig for sequential iterations like those using foreach
Thanks Amareshwari, the article gave me many valuable hints for deciding my choice. But out of curiosity, does hive support stage-by-stage iterative processing? If so, how? Thank You Regards Bejoy K S -Original Message- From: Amareshwari Sri Ramadasu amar...@yahoo-inc.com Date: Mon, 8 Aug 2011 17:14:21 To: user@hive.apache.org; bejoy...@yahoo.com Reply-To: user@hive.apache.org Subject: Re: Hive or pig for sequential iterations like those using foreach You can have a look at typical use cases of Pig and Hive here http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/ Thanks Amareshwari
Re: NPE with hive.cli.print.header=true;
Hi Ayon
AFAIK this is how Hive currently behaves. If you set hive.cli.print.header=true to enable column headers, then some commands like 'desc' are not expected to work. I'm not sure whether a patch has come out for this recently.
Regards
Bejoy K S

-Original Message-
From: Ayon Sinha ayonsi...@yahoo.com
Date: Mon, 1 Aug 2011 17:29:17
To: Hive Mailing list user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: NPE with hive.cli.print.header=true;

With set hive.cli.print.header=true; I get NPEs for desc as well as use:

Exception in thread main java.lang.NullPointerException
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:176)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:241)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:456)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Is there a patch for this?
-Ayon
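A simple session-level workaround, assuming the behavior described above, is to toggle the property off around metadata commands. This is only a sketch of the CLI flow (my_table is a placeholder), not a fix for the underlying bug:

set hive.cli.print.header=false;
desc my_table;                      -- metadata commands work again
set hive.cli.print.header=true;
select * from my_table limit 5;     -- result headers are printed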
Re: Partition by existing field?
Hi Travis
From my understanding of your requirement, Dynamic Partitions in Hive are the most suitable solution. I have written a blog post on exactly this kind of requirement; please see http://kickstarthadoop.blogspot.com/2011/06/how-to-speed-up-your-hive-queries-in.html for an overview of the implementation. You can refer to the Hive wiki as well. Please get back to me if anything needs clarification.
Regards
Bejoy K S

-Original Message-
From: Travis Powell tpow...@tealeaf.com
Date: Fri, 8 Jul 2011 13:11:58
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Partition by existing field?

Can I partition by an existing field? I have a 10 GB file with a date field and an hour-of-day field. Can I load this file into a table, then insert-overwrite into another partitioned table that uses those fields as partitions? Would something like the following work?

INSERT OVERWRITE TABLE tealeaf_event PARTITION(dt=evt.datestring, hour=evt.hour) SELECT * FROM staging_event evt;

Thanks! Travis
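For reference, a dynamic-partition insert in Hive is close to what Travis sketched, except that the PARTITION clause names the partition columns without values and the partition columns come last in the SELECT list. A minimal sketch reusing the table names from the question; the non-partition column names are placeholders, since the staging table's schema isn't shown:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE tealeaf_event PARTITION (dt, hour)
SELECT evt.col1, evt.col2,      -- the non-partition columns (placeholders)
       evt.datestring,          -- partition columns must come last,
       evt.hour                 -- in the order declared in PARTITION (dt, hour)
FROM staging_event evt;

Each distinct (datestring, hour) pair in the data then lands in its own partition automatically.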
Re: Hive create table
Hi Jinhang
I don't think Hive supports multi-character delimiters. The hassle-free option here would be to preprocess the data with MapReduce, replacing the multi-character delimiter with another, permissible one that suits your data.
Regards
Bejoy K S

-Original Message-
From: jinhang du dujinh...@gmail.com
Date: Wed, 25 May 2011 19:56:16
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Hive create table

Hi all,
I want to customize the field delimiter of a table's rows. My data format is '124', so how could I create a table (int, int, int)?
Thanks.
--
dujinhang
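One workaround that avoids a separate preprocessing job is the contrib RegexSerDe, which splits fields on an arbitrary regular expression and so can handle a multi-character delimiter. A sketch, assuming the delimiter is the two-character string '||' (the actual delimiter is not readable in the original mail) and that the hive-contrib jar is available; note this SerDe requires all columns to be declared STRING, so you would cast at query time:

ADD JAR /path/to/hive-contrib.jar;   -- path is a placeholder

CREATE TABLE t (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^|]*)\\|\\|([^|]*)\\|\\|([^|]*)"
);

SELECT CAST(a AS INT), CAST(b AS INT), CAST(c AS INT) FROM t;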
Re: Hadoop error 2 while joining two large tables
Try out CDH3b4; it has Hive 0.7 and the latest of the other Hadoop tools. When you work with open source it is definitely good practice to upgrade to the latest versions: with newer versions bugs are fewer, performance is better, and you get more functionality. Your query looks fine; an upgrade of Hive could sort things out.
Regards
Bejoy K S

-Original Message-
From: Edward Capriolo edlinuxg...@gmail.com
Date: Thu, 17 Mar 2011 08:51:05
To: user@hive.apache.org
Reply-To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

I am pretty sure the Cloudera distro has an upgrade path to a more recent Hive.

On Thursday, March 17, 2011, hadoop n00b new2h...@gmail.com wrote:

Hello All,
Thanks a lot for your responses. To clarify a few points: I am on CDH2 with Hive 0.4 (I think). We cannot move to a higher version of Hive as we have to use the Cloudera distro only. All records in the smaller table have at least one record in the larger table (of course there could be a few exceptions, but only a few). The join uses an ON clause. The query is something like:

select ... from ( (select ... from smaller_table) join (select ... from larger_table) on (smaller_table.col = larger_table.col) )

I will try setting mapred.child.java.opts -Xmx to a higher value and let you know. Is there a pattern or rule of thumb for when to add more nodes?
Thanks again!

On Thu, Mar 17, 2011 at 1:08 AM, Steven Wong sw...@netflix.com wrote:

In addition, put the smaller table on the left-hand side of a JOIN:

SELECT ... FROM small_table JOIN large_table ON ...

From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Wednesday, March 16, 2011 11:43 AM
To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

Hey hadoop n00b
I second Mark's thought. But you can definitely try reframing your query to get things rolling. I'm not sure of your Hive query, but from my experience with joins on huge tables (record counts in the range of hundreds of millions) you should put the join conditions in the JOIN ON clause rather than specifying all conditions in WHERE. Say you have a query like this:

SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;

You can reframe it as:

SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 > b.Column2;

From my understanding Hive supports only equijoins, so you can't have inequality conditions within JOIN ON; inequalities should go to WHERE. This approach worked for me when I encountered a similar situation some time ago. Try it out; hope it helps.
Regards
Bejoy.K.S

From: Sunderlin, Mark mark.sunder...@teamaol.com
To: user@hive.apache.org
Sent: Wed, March 16, 2011 11:22:09 PM
Subject: RE: Hadoop error 2 while joining two large tables

hadoop n00b asks, "Is adding more nodes the solution to such a problem?" Whatever other answers you get, you should append "... and add more nodes." More nodes is never a bad thing ;-)
---
Mark E. Sunderlin
Solutions Architect | AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way
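Another option worth trying after an upgrade, when one side of the join is much smaller, is a map-side join: the small table is loaded into memory on each mapper and the large table is streamed through, avoiding the reduce-side join that is failing here. A sketch using the MAPJOIN hint; the table and column names are placeholders, and the small table must fit in mapper memory:

SELECT /*+ MAPJOIN(s) */ s.col1, l.col2
FROM smaller_table s
JOIN larger_table l ON (s.col = l.col);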