Hive parquet vs Vertica vs Impala
Hi, Can someone help me with insights into Hive with parquet vs Vertica comparison. I need to store large xml data into one these database so please help me with query performance. Is Impala opensource and can we use it without Cloudera license. Thanks Shashi
Re: Hive parquet vs Vertica vs Impala
Hive is the only system that can store and query xml directly, with the help of different serde's or input formats. Impala and Vertical have more standard schema systems that do not support Collections like List, Map, Struct or nested collections you might need to store and process a complex XML document. Parquet (A storage format that works with Hive and Impala can support List,Map, Structs) but he the Impala engine can not access these at the moment. Last I checked impala refuses to read tables that have one of these elements ( instead of skipping them). It sounds like you want to do one of a few things: 1) Normalize your xml into a table and then you can use Vertica, Hive, or Imapa 2) Write your data using using an Parquet (to handle nested objects ) and Hive to query it.(Hopefully then when Impala adds collection support you can switch over. But mostly you need to do more research. Edward On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can someone help me with insights into Hive with parquet vs Vertica comparison. I need to store large xml data into one these database so please help me with query performance. Is Impala opensource and can we use it without Cloudera license. Thanks Shashi
Re: Hive parquet vs Vertica vs Impala
Edward, Thanks for your reply. Can you please tell me the query performance of Hive-parquet against Vertica. Can Hive -parquet match against Vertica's retrieval performance, as I have been told Vertica is also compressed columnar format and is fast? What if I query against some 50 millions of rows , which one will be faster? And moreover is Impala open source ? In some blogs I have seen Impala as open source but in some it says Impala as Cloudera proprietary engine. Ultimately, I want to use Hive -parquet but need to show that it is better than Vertica, a few microseconds here and there would be fine. I don't have access to Vertica. Thanks shashi On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hive is the only system that can store and query xml directly, with the help of different serde's or input formats. Impala and Vertical have more standard schema systems that do not support Collections like List, Map, Struct or nested collections you might need to store and process a complex XML document. Parquet (A storage format that works with Hive and Impala can support List,Map, Structs) but he the Impala engine can not access these at the moment. Last I checked impala refuses to read tables that have one of these elements ( instead of skipping them). It sounds like you want to do one of a few things: 1) Normalize your xml into a table and then you can use Vertica, Hive, or Imapa 2) Write your data using using an Parquet (to handle nested objects ) and Hive to query it.(Hopefully then when Impala adds collection support you can switch over. But mostly you need to do more research. Edward On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can someone help me with insights into Hive with parquet vs Vertica comparison. I need to store large xml data into one these database so please help me with query performance. Is Impala opensource and can we use it without Cloudera license. Thanks Shashi
Re: Hive parquet vs Vertica vs Impala
Shashi, Your questions are too broad, and you are asking questions that are impossible to answer. Q. What is faster X or Y?. A. This depends on countless variables and can not be answered. For one example even databases that are very similar in nature like mysql/postgres might execute a query a different way based on it's query planner or even the characteristics of the data. How can you show if a query is faster then vertica if you do not have access vertica to prove it? I understand some of what you are trying to determine, but you should really attempt to install these things and build a prototype to determine what is the best fit for your application. This will grow your understanding of the systems, help you ask better questions, and potentially give you the ability to answer those questions yourself and make better decisions. The right way to ask this question might be Hello, I have loaded 50Million rows of data into hive and I am running this query 'select X, from bla bla'. My vertica instances runs this query in X seconds and hive runs this in Y seconds. Can this be optimized further? The software license for Impala is included here: https://github.com/cloudera/Impala/blob/master/LICENSE.txt Edward On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Edward, Thanks for your reply. Can you please tell me the query performance of Hive-parquet against Vertica. Can Hive -parquet match against Vertica's retrieval performance, as I have been told Vertica is also compressed columnar format and is fast? What if I query against some 50 millions of rows , which one will be faster? And moreover is Impala open source ? In some blogs I have seen Impala as open source but in some it says Impala as Cloudera proprietary engine. Ultimately, I want to use Hive -parquet but need to show that it is better than Vertica, a few microseconds here and there would be fine. I don't have access to Vertica. Thanks shashi On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hive is the only system that can store and query xml directly, with the help of different serde's or input formats. Impala and Vertical have more standard schema systems that do not support Collections like List, Map, Struct or nested collections you might need to store and process a complex XML document. Parquet (A storage format that works with Hive and Impala can support List,Map, Structs) but he the Impala engine can not access these at the moment. Last I checked impala refuses to read tables that have one of these elements ( instead of skipping them). It sounds like you want to do one of a few things: 1) Normalize your xml into a table and then you can use Vertica, Hive, or Imapa 2) Write your data using using an Parquet (to handle nested objects ) and Hive to query it.(Hopefully then when Impala adds collection support you can switch over. But mostly you need to do more research. Edward On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can someone help me with insights into Hive with parquet vs Vertica comparison. I need to store large xml data into one these database so please help me with query performance. Is Impala opensource and can we use it without Cloudera license. Thanks Shashi
Re: Hive parquet vs Vertica vs Impala
Sorry Edward, I mentioned that I didn't have access to vertica , but yes I was given vertica query retrieval time . Query is kind of select a.x, b.y from t as a , t1 as b where a.id = b.id etc and the schema for those tables required for the join were given. A subset of the data was given and I just need to find a open source framework like Hive and simulate it by storing and executing it. Yes, you are right by saying that it is impossible and I do agree, but unfortunately, I have to come up with this difficult task. The hard part is that the rows in vertica will not have similar number of rows in Hive , it is only a subset , but it seems query time will be derived based on the rows in hive compared to rows in vertica through some calculation. I only wanted to know how does vertica compare with Hive in similar settings in case if someone has done some benchmarking, it need not be my case. Some links to some blogs would certainly help. But thanks anyway for your reply Thanks shashi On Sun, Jan 4, 2015 at 2:21 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Shashi, Your questions are too broad, and you are asking questions that are impossible to answer. Q. What is faster X or Y?. A. This depends on countless variables and can not be answered. For one example even databases that are very similar in nature like mysql/postgres might execute a query a different way based on it's query planner or even the characteristics of the data. How can you show if a query is faster then vertica if you do not have access vertica to prove it? I understand some of what you are trying to determine, but you should really attempt to install these things and build a prototype to determine what is the best fit for your application. This will grow your understanding of the systems, help you ask better questions, and potentially give you the ability to answer those questions yourself and make better decisions. The right way to ask this question might be Hello, I have loaded 50Million rows of data into hive and I am running this query 'select X, from bla bla'. My vertica instances runs this query in X seconds and hive runs this in Y seconds. Can this be optimized further? The software license for Impala is included here: https://github.com/cloudera/Impala/blob/master/LICENSE.txt Edward On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Edward, Thanks for your reply. Can you please tell me the query performance of Hive-parquet against Vertica. Can Hive -parquet match against Vertica's retrieval performance, as I have been told Vertica is also compressed columnar format and is fast? What if I query against some 50 millions of rows , which one will be faster? And moreover is Impala open source ? In some blogs I have seen Impala as open source but in some it says Impala as Cloudera proprietary engine. Ultimately, I want to use Hive -parquet but need to show that it is better than Vertica, a few microseconds here and there would be fine. I don't have access to Vertica. Thanks shashi On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hive is the only system that can store and query xml directly, with the help of different serde's or input formats. Impala and Vertical have more standard schema systems that do not support Collections like List, Map, Struct or nested collections you might need to store and process a complex XML document. Parquet (A storage format that works with Hive and Impala can support List,Map, Structs) but he the Impala engine can not access these at the moment. Last I checked impala refuses to read tables that have one of these elements ( instead of skipping them). It sounds like you want to do one of a few things: 1) Normalize your xml into a table and then you can use Vertica, Hive, or Imapa 2) Write your data using using an Parquet (to handle nested objects ) and Hive to query it.(Hopefully then when Impala adds collection support you can switch over. But mostly you need to do more research. Edward On Sat, Jan 3, 2015 at 2:15 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can someone help me with insights into Hive with parquet vs Vertica comparison. I need to store large xml data into one these database so please help me with query performance. Is Impala opensource and can we use it without Cloudera license. Thanks Shashi
Re: Hive parquet vs Vertica vs Impala
how does vertica compare with Hive in similar settings Vertical in hive do no have similar settings. Vertica is a columnar MPP analytic database. Hive is an SQL on hadoop platform. Depending on your usage patterns you can use these things interchangeably but not all. Vertica can do low latency 1 second. select queries based on PROJECTIONS which are something like materialized views. Hive is a batch processing system and not built for low latency queries. Hive use cases typically chop up very large datasets and produce other large data sets, or smaller datasets which are put into online systems for low latency analysis. Even if you use one of the columnar style formats like Parquet and the faster execution engines you are not going to get the type of experience you get from vertica. Vertical to Mysql. They both do queries, but vertica does not do primary key enforcement and is non-optimal for small row at a time insertions. Certain datasets mysql queries faster some Vertical will query faster. Basically hive and vertica are very different even though they are both considered data warehouses. On Sat, Jan 3, 2015 at 4:30 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Sorry Edward, I mentioned that I didn't have access to vertica , but yes I was given vertica query retrieval time . Query is kind of select a.x, b.y from t as a , t1 as b where a.id = b.id etc and the schema for those tables required for the join were given. A subset of the data was given and I just need to find a open source framework like Hive and simulate it by storing and executing it. Yes, you are right by saying that it is impossible and I do agree, but unfortunately, I have to come up with this difficult task. The hard part is that the rows in vertica will not have similar number of rows in Hive , it is only a subset , but it seems query time will be derived based on the rows in hive compared to rows in vertica through some calculation. I only wanted to know how does vertica compare with Hive in similar settings in case if someone has done some benchmarking, it need not be my case. Some links to some blogs would certainly help. But thanks anyway for your reply Thanks shashi On Sun, Jan 4, 2015 at 2:21 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Shashi, Your questions are too broad, and you are asking questions that are impossible to answer. Q. What is faster X or Y?. A. This depends on countless variables and can not be answered. For one example even databases that are very similar in nature like mysql/postgres might execute a query a different way based on it's query planner or even the characteristics of the data. How can you show if a query is faster then vertica if you do not have access vertica to prove it? I understand some of what you are trying to determine, but you should really attempt to install these things and build a prototype to determine what is the best fit for your application. This will grow your understanding of the systems, help you ask better questions, and potentially give you the ability to answer those questions yourself and make better decisions. The right way to ask this question might be Hello, I have loaded 50Million rows of data into hive and I am running this query 'select X, from bla bla'. My vertica instances runs this query in X seconds and hive runs this in Y seconds. Can this be optimized further? The software license for Impala is included here: https://github.com/cloudera/Impala/blob/master/LICENSE.txt Edward On Sat, Jan 3, 2015 at 3:29 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Edward, Thanks for your reply. Can you please tell me the query performance of Hive-parquet against Vertica. Can Hive -parquet match against Vertica's retrieval performance, as I have been told Vertica is also compressed columnar format and is fast? What if I query against some 50 millions of rows , which one will be faster? And moreover is Impala open source ? In some blogs I have seen Impala as open source but in some it says Impala as Cloudera proprietary engine. Ultimately, I want to use Hive -parquet but need to show that it is better than Vertica, a few microseconds here and there would be fine. I don't have access to Vertica. Thanks shashi On Sun, Jan 4, 2015 at 1:07 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hive is the only system that can store and query xml directly, with the help of different serde's or input formats. Impala and Vertical have more standard schema systems that do not support Collections like List, Map, Struct or nested collections you might need to store and process a complex XML document. Parquet (A storage format that works with Hive and Impala can support List,Map, Structs) but he the Impala engine can not access these at the moment. Last I checked impala refuses to read tables that have one of these elements ( instead of skipping them). It sounds like you
Re: CREATE FUNCTION: How to automatically load extra jar file?
Hi, A1: Are all of these commands (Step 1-5) from the same Hive CLI prompt? Yes A2: Would you be able to check if such a file exists with the same path, on the local file system? The file does not exist on the local file system. Is there a way to set the another “tmp folder for HIVE? or any suggestions to fix this issue? Thanks !! Arthur On 3 Jan, 2015, at 4:12 am, Jason Dere jd...@hortonworks.com wrote: The point of USING JAR as part of the CREATE FUNCTION statement to try to avoid having to do ADD JAR/aux path stuff to get the UDF to work. Are all of these commands (Step 1-5) from the same Hive CLI prompt? hive CREATE FUNCTION sysdate AS 'com.nexr.platform.hive.udf.UDFSysDate' using JAR 'hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar'; converting to local hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar Added /tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar to class path Added resource: /tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar OK One note, /tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar here should actually be on the local file system, not on HDFS where you were checking in Step 5. During CREATE FUNCTION/query compilation, Hive will make a copy of the source JAR (hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar), copied to a temp location on the local file system where it's used by that Hive session. The location mentioned in the FileNotFoundException (hdfs://tmp/5c658d17-dbeb-4b84-ae8d-ba936404c8bc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar) has a different path than the local copy mentioned during CREATE FUNCTION (/tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar). I'm not really sure why it is a HDFS path here either, but I'm not too familiar with what goes on during the job submission process. But the fact that this HDFS path has the same naming convention as the directory used for downloading resources locally (***_resources) looks a little fishy to me. Would you be able to check if such a file exists with the same path, on the local file system? On Dec 31, 2014, at 5:22 AM, Nirmal Kumar nirmal.ku...@impetus.co.in wrote: Important: HiveQL's ADD JAR operation does not work with HiveServer2 and the Beeline client when Beeline runs on a different host. As an alterntive to ADD JAR, Hive auxiliary path functionality should be used as described below. Refer: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-8-0/Cloudera-Manager-Managing-Clusters/cmmc_hive_udf.html Thanks, -Nirmal From: arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com Sent: Tuesday, December 30, 2014 9:54 PM To: vic0777 Cc: arthur.hk.c...@gmail.com; user@hive.apache.org Subject: Re: CREATE FUNCTION: How to automatically load extra jar file? Thank you. Will this work for hiveserver2 ? Arthur On 30 Dec, 2014, at 2:24 pm, vic0777 vic0...@163.com wrote: You can put it into $HOME/.hiverc like this: ADD JAR full_path_of_the_jar. Then, the file is automatically loaded when Hive is started. Wantao At 2014-12-30 11:01:06, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I am using Hive 0.13.1 on Hadoop 2.4.1, I need to automatically load an extra JAR file to hive for UDF, below are my steps to create the UDF function. I have tried the following but still no luck to get thru. Please help!! Regards Arthur Step 1: (make sure the jar in in HDFS) hive dfs -ls hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar; -rw-r--r-- 3 hadoop hadoop 57388 2014-12-30 10:02hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar Step 2: (drop if function exists) hive drop function sysdate; OK Time taken: 0.013 seconds Step 3: (create function using the jar in HDFS) hive CREATE FUNCTION sysdate AS 'com.nexr.platform.hive.udf.UDFSysDate' using JAR 'hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar'; converting to local hdfs://hadoop/hive/nexr-hive-udf-0.2-SNAPSHOT.jar Added /tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar to class path Added resource: /tmp/69700312-684c-45d3-b27a-0732bb268ddc_resources/nexr-hive-udf-0.2-SNAPSHOT.jar OK Time taken: 0.034 seconds Step 4: (test) hive select sysdate(); Automatically selecting local only mode for query Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in
Re: Hive CLI question
Bals's right, you can execute any shell command in hive-cli by using !shell_command; On Fri, Jan 2, 2015 at 4:30 AM, Bala Krishna Gangisetty b...@altiscale.com wrote: !clear is an another option too. --Bala G. On Thu, Jan 1, 2015 at 12:23 PM, Mohan Krishna mohan.25fe...@gmail.com wrote: Hi Louis I use Ctrl+L as a keyboard short cut to clear Hive screen. Thanks Mohan On Fri, Jan 2, 2015 at 1:41 AM, Louis Vincent Frolio frol...@yahoo.com wrote: Does anyone know how to issue a clear screen at the hive prompt? Is this even possible? I am looking for something similar to system clear in MySQL. Thank you, Louis. -- Best Regards Jeff Zhang